[Databricks on AWS #4] The BOOTSTRAP_TIMEOUT Mystery: Tracing a Databricks Cluster from Data Plane to Control Plane (Transit Gateway + Firewall)

#databricks #aws #networking #terraform

📚 Series: Databricks on AWS (Part 4)

Building a Databricks AI Platform on AWS

RBAC with Function-Role Groups

Compute Governance: Pools, Policies, Clusters

The BOOTSTRAP_TIMEOUT Mystery ← you are here

Fixing It with AWS PrivateLink

How We Structure the Terraform

The EC2 nodes were healthy — 3/3 status checks. The cluster still never started. Here's the 11-minute timeout that sent us tracing packets across three AWS accounts.

Most "my Databricks cluster won't start" posts end with "open port 443." This one didn't — because the firewall did allow 443. The traffic was dying somewhere else, and the only way to find it was to follow a single packet from the cluster node all the way to the Databricks control plane.

If you run Databricks classic compute inside a customer-managed VPC behind a centralized egress (Transit Gateway → inspection firewall), this is the failure mode nobody warns you about.

The setup

A Databricks workspace deployment on AWS, classic compute, customer-managed VPC, secure cluster connectivity (no public IPs on cluster nodes). The new workspace reused an existing "spoke" VPC that egresses through a shared network hub:

cluster node (no public IP)
  → spoke VPC route table: 0.0.0.0/0 → Transit Gateway
  → TGW (shared network-hub account)
  → DMZ VPC → GWLB inspection firewall → NAT → IGW → internet → Databricks control plane

Instance pools applied. Cluster policies applied. Then the cluster itself:

$ atlantis apply -d .../ws-landing/compute
databricks_cluster.this["shared_small"]: Still creating... [10m20s elapsed]
...
Error: cannot create cluster: failed to reach RUNNING, got TERMINATED:
  Self-bootstrap timed out during launch ... BOOTSTRAP_TIMEOUT

~12 minutes of "Still creating", then TERMINATED.

The symptom that rules out half the internet

I pulled the cluster event log. The line that mattered:

BOOTSTRAP_TIMEOUT: [id: InstanceId(i-xxxx), status: INSTANCE_INITIALIZING, ...]
timed out after 704524 milliseconds. AWS bootstrap diagnostic output could not be fetched.
Please check network connectivity from the data plane to the control plane.
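That millisecond figure is worth converting before you go further, because it tells you which timer fired. A minimal sketch, assuming the event message keeps the "timed out after N milliseconds" wording shown above (the function name is mine, not a Databricks API):

```python
import re
from typing import Optional


def bootstrap_timeout_minutes(event_message: str) -> Optional[float]:
    """Return the timeout reported in a BOOTSTRAP_TIMEOUT message, in minutes.

    Returns None if the message does not contain the expected wording.
    """
    match = re.search(r"timed out after (\d+) milliseconds", event_message)
    if match is None:
        return None
    return int(match.group(1)) / 60_000  # 60,000 ms per minute


message = (
    "timed out after 704524 milliseconds. "
    "AWS bootstrap diagnostic output could not be fetched."
)
print(round(bootstrap_timeout_minutes(message), 1))  # 11.7
```

So the node sat there for a little under 12 minutes before the control plane gave up on it.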

Two things narrowed it down fast:

The EC2 nodes were up and healthy — 3/3 status checks in the AWS console, no public IP (as expected for secure cluster connectivity).
INSTANCE_INITIALIZING + "diagnostic output could not be fetched": the node never established its outbound connection, so the control plane had no channel through which to pull bootstrap logs.

Healthy EC2 + cluster never reaches RUNNING = the node booted but couldn't phone home. Not capacity. Not IAM. Not Terraform. Pure egress: the node never opened its outbound tunnel to the control plane's secure cluster connectivity relay.

Databricks' own error literally says "check network connectivity from the data plane to the control plane." Believe it.
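If you can get a shell on any instance in the same subnet, a plain TCP connect test tells you a lot. Here is a minimal sketch using only the Python standard library. The function name and the result labels are my own, and the hostname in the usage comment is the placeholder used later in this post, not a value to copy as-is:

```python
import socket


def probe_tcp(host: str, port: int, timeout_s: float = 5.0) -> str:
    """Classify a TCP connect attempt as 'open', 'refused', 'timeout', or 'error: ...'."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return "open"
    except socket.timeout:
        # No SYN-ACK and no reset: consistent with something silently dropping packets.
        return "timeout"
    except ConnectionRefusedError:
        # Something answered with a reset, so packets are at least making the round trip.
        return "refused"
    except OSError as exc:
        # DNS failures, unreachable networks, and similar.
        return f"error: {exc}"


# Usage, run from an instance in the workspace subnet (hostname is a placeholder):
#   probe_tcp("tunnel.<region>.cloud.databricks.com", 443)
```

A "timeout" from the new subnet next to an "open" from a subnet that already works points at something in the path eating the traffic rather than at the destination.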

Why "just use VPC endpoints" doesn't fix this

The first instinct (and our security team's) was: route everything through VPC endpoints, no internet. That's correct for AWS services — S3, STS, Kinesis can all ride the AWS backbone via gateway/interface endpoints.

But the thing that's actually blocked is the Databricks control plane and the secure cluster connectivity relay. Those are Databricks-owned infrastructure, not an AWS service — there is no com.amazonaws... endpoint for them. Your options are exactly two:

allow outbound to the control plane (egress firewall), or
AWS PrivateLink to Databricks (backbone, but you still explicitly wire it).

There is no configuration where the node simply doesn't talk to the control plane. With secure cluster connectivity, the node has no public IP and no open inbound ports — so the control plane cannot initiate inward. The node has to go outbound — block that path and the cluster simply can't come up.

Tracing the path

Routing first. I walked every hop and confirmed each route table — including the new workspace CIDR — was correct end to end:

hop	what to check	result
spoke subnet RT	`0.0.0.0/0 → TGW`	✅
TGW route table	default → DMZ VPC; return → spoke (propagated)	✅
DMZ ingress subnet	`0.0.0.0/0 → GWLB endpoint` (inspection)	✅
DMZ GWLB subnet	`0.0.0.0/0 → NAT`; spoke CIDR → TGW (return)	✅
DMZ public subnet	`0.0.0.0/0 → IGW`; spoke CIDR → GWLB (return)	✅

Every route — outbound and return — was present, and the new CIDR was wired identically to the existing workspaces that worked. So routing wasn't it.

That left exactly one thing in the path that isn't a route: the inspection firewall behind the Gateway Load Balancer. The traffic reached the firewall and got dropped there.
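The hop-by-hop check in the table above is really just longest-prefix matching applied to each route table in turn. A rough sketch of that logic, with loud caveats: every CIDR, IP, and target name below is a made-up illustration (the control plane IP is from a documentation range), and in practice you would fill the tables from your own route table exports:

```python
import ipaddress
from typing import Optional


def lookup_route(route_table: dict, dest_ip: str) -> Optional[str]:
    """Return the target of the most specific route matching dest_ip, or None."""
    dest = ipaddress.ip_address(dest_ip)
    best = None
    for cidr, target in route_table.items():
        net = ipaddress.ip_network(cidr)
        if dest in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, target)
    return best[1] if best else None


# Hypothetical values, for illustration only.
SPOKE_CIDR = "10.20.0.0/22"
NODE_IP = "10.20.1.15"
CONTROL_PLANE_IP = "198.51.100.10"

route_tables = {
    "spoke subnet": {SPOKE_CIDR: "local", "0.0.0.0/0": "tgw"},
    "tgw": {"0.0.0.0/0": "dmz-vpc", SPOKE_CIDR: "spoke-vpc"},
    "dmz ingress subnet": {"0.0.0.0/0": "gwlb-endpoint"},
    "dmz gwlb subnet": {"0.0.0.0/0": "nat", SPOKE_CIDR: "tgw"},
    "dmz public subnet": {"0.0.0.0/0": "igw", SPOKE_CIDR: "gwlb-endpoint"},
}

for name, table in route_tables.items():
    outbound = lookup_route(table, CONTROL_PLANE_IP)
    inbound = lookup_route(table, NODE_IP)
    print(f"{name}: outbound -> {outbound}, return -> {inbound}")
```

If any hop returns None, or sends return traffic for the spoke CIDR out the default route, you have found a routing problem. Ours returned sensible answers at every hop, which is what pushed the investigation to the firewall.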

What actually broke

The DMZ runs a centralized egress firewall (think Palo Alto / appliance behind a GWLB). It allow-lists by source CIDR and destination. The existing workspaces lived in older CIDR ranges that were already in the allow policy. Our new workspace CIDR was not — so its packets to the Databricks control plane hit the firewall and were silently dropped.
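This is a check you can run before the first cluster ever launches. A minimal sketch, assuming you can export the firewall's allow-listed source CIDRs as a plain list; the function name and all CIDRs here are hypothetical:

```python
import ipaddress


def uncovered_sources(workspace_cidrs, allowed_source_cidrs):
    """Return workspace CIDRs that are not fully inside any allow-listed source CIDR."""
    allowed = [ipaddress.ip_network(c) for c in allowed_source_cidrs]
    missing = []
    for cidr in workspace_cidrs:
        net = ipaddress.ip_network(cidr)
        # subnet_of() raises TypeError across IP versions, so compare versions first.
        covered = any(net.version == a.version and net.subnet_of(a) for a in allowed)
        if not covered:
            missing.append(cidr)
    return missing


# Hypothetical values: two older ranges in the policy, one new workspace range.
allowed = ["10.10.0.0/16", "10.11.0.0/16"]
workspaces = ["10.10.4.0/22", "10.20.0.0/22"]
print(uncovered_sources(workspaces, allowed))  # ['10.20.0.0/22']
```

Anything this returns is a range whose packets will reach the firewall with no matching allow rule.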

Healthy EC2, perfect routing, and a firewall that quietly eats the SYN. That's why the node sat in INSTANCE_INITIALIZING for 11 minutes and the control plane never heard from it.

The confirmation is in the firewall logs: filter by the new source CIDR and you see DENY entries to the Databricks relay on 443.
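If your firewall can export logs, the same filter is easy to script. A sketch only: the record shape (dicts with src, dst, dport, action keys) is an assumption for illustration, not the format of any particular firewall, and the sample records are invented:

```python
import ipaddress


def denied_from(records, source_cidr, dest_port=443):
    """Return log records from source_cidr that were denied on dest_port."""
    src_net = ipaddress.ip_network(source_cidr)
    return [
        r for r in records
        if r["action"].lower() == "deny"
        and r["dport"] == dest_port
        and ipaddress.ip_address(r["src"]) in src_net
    ]


# Invented sample records, hypothetical field names.
records = [
    {"src": "10.10.4.25", "dst": "198.51.100.10", "dport": 443, "action": "allow"},
    {"src": "10.20.1.15", "dst": "198.51.100.10", "dport": 443, "action": "DENY"},
    {"src": "10.20.1.15", "dst": "198.51.100.20", "dport": 3306, "action": "deny"},
]

for entry in denied_from(records, "10.20.0.0/22"):
    print(entry["src"], "->", entry["dst"], entry["dport"], entry["action"])
```

Map the field names to whatever your appliance actually emits; the point is to filter on the new source CIDR first, then on action and port.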

The Databricks outbound you actually need

For the data plane to bootstrap, the node needs to reach the destinations below. Hostnames and CIDRs are region-specific and shown here with a <region> placeholder; check the docs for yours.

destination	port	why
`tunnel.<region>.cloud.databricks.com` (control plane CIDR, e.g. `3.x.x.x/28`)	443	secure cluster connectivity relay (the one that was blocked)
control plane / web app	443	registration
regional metastore (RDS)	3306	cluster launch
S3 / STS / Kinesis (regional)	443	runtime, credentials, logs — put these on VPC endpoints

One Palo Alto gotcha worth its own line: if you do SSL forward-proxy decryption, exclude the Databricks domains. The relay uses certificate pinning; decrypt it and it breaks even with the allow rule in place.

Takeaways

BOOTSTRAP_TIMEOUT + healthy EC2 + "diagnostic output could not be fetched" = data-plane → control-plane egress is blocked, full stop. Don't go hunting IAM or Terraform.
Secure cluster connectivity means the node has no inbound path — it must egress to the relay. That egress is not optional.
VPC endpoints solve S3/STS/Kinesis, not the control plane/relay. Those are Databricks-owned; allow them or use PrivateLink.
When you add a new workspace CIDR behind a centralized firewall, the firewall policy is the thing everyone forgets. New CIDR ≠ automatically allowed.

The clean long-term fix is to take the control-plane traffic off the internet entirely with AWS PrivateLink — without touching the existing VPC. That's Part 5.

Next: Fixing it with AWS PrivateLink — control-plane over the backbone, zero new subnets, zero routing changes.