
Emin Mammadov

Setting up Ray on GKE: How I spent a week optimising Docker pulls

I spent a week debugging slow Ray cluster starts on GKE. The culprit was a region mismatch that isn't obvious from the docs.

We've been running Ray on GKE (with Anyscale) for over a year on the AI Platform team at Geotab. As self-hosted LLM workloads grow, Ray is one of the tools that makes scaling them practical. Introducing Ray and making it a go-to platform for multiple teams has been a rewarding but challenging path. One issue I kept running into: slow Ray cluster spawn times. Here's where the time actually went, and what helped.

1. GKE node provisioning: 2-3 minutes

When Ray's autoscaler asks for a new node, GKE has to allocate a VM, boot the OS, register the kubelet, and join the cluster. GPU nodes add another 30-50 seconds for driver install. We treated this as a baseline cost - no point optimizing anything else until the node exists. That recently changed a bit: GCP introduced GKE Active Buffer, which aims to minimize that time. I haven't tested it yet, but it's on the list.
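
If you want to sanity-check that baseline on your own cluster, one rough way (a sketch; kubectl pointed at the GKE cluster, node name is a placeholder) is to compare when the node object was created with when it first reported Ready:

# When was the node object created?
kubectl get node <node_name> -o jsonpath='{.metadata.creationTimestamp}'; echo

# When did it flip to Ready? The gap is roughly the provisioning cost.
kubectl get node <node_name> \
   -o jsonpath='{.status.conditions[?(@.type=="Ready")].lastTransitionTime}'; echo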

2. Image pull: 10+ minutes (and where I lost a week)

Ray + ML container images are big. An LLM-flavored image hits 10-15 GB easily; even a classic CV image with PyTorch lands at 13 GB+. Pulling that fresh on every new node took 15-20 minutes.
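
If you want to know how much data a new node actually has to download, the compressed layer sizes are what matter, not the size reported by a local image listing. A rough way to sum them (a sketch; assumes a single-architecture image, a recent Docker CLI, and jq installed; image name is a placeholder):

# Sum the compressed layer sizes for one image, in GB
docker manifest inspect <image>:<tag> | jq '[.layers[].size] | add / 1e9'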

GKE Image Streaming is supposed to fix this: containers can start before the full image has been pulled. However, even after enabling it, pulls were still occasionally taking 20+ minutes.

What made this brutal to debug:

  • It didn't fail consistently. Users didn't always report it.
  • Anyscale assigns new pod names on each restart, so by the time I went looking, the original pod was gone and pulling logs at the pod level was impossible.
  • The log volume is high. Without precise timestamps, finding the relevant entries is painful.

The detail not obvious from the docs: Image Streaming requires your Artifact Registry repo to be in the same region as your GKE nodes. A cluster in us-central1 with a repo in the us multi-region doesn't get streaming; it silently falls back to a normal pull.

The very first step is to ensure that Image Streaming is actually enabled:

gcloud container clusters describe <cluster_name> \
   --location=<control_plane_location> \
   --flatten "nodePoolDefaults.nodeConfigDefaults"

Look for:

gcfsConfig:
  enabled: true
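
If it's not enabled, my understanding from the GKE docs is that it can be turned on at the cluster level with something like the command below (nodes created before the change may not pick it up, so existing Ray node pools might need recycling):

gcloud container clusters update <cluster_name> \
   --location=<control_plane_location> \
   --enable-image-streaming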

To verify it's actually engaging on a specific node, this Cloud Logging query was what cracked the case for me:

resource.type="k8s_node"
resource.labels.node_name="<name_of_node>"
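
The same query can also be run from the terminal, which makes it easier to narrow a time window once you roughly know when the slow pull happened (a sketch; node name is a placeholder):

# Pull recent node-level logs for one node; adjust --freshness to the window
# of the slow pull, then grep for image pull / streaming related entries.
gcloud logging read \
   'resource.type="k8s_node" AND resource.labels.node_name="<name_of_node>"' \
   --freshness=2h --order=desc --limit=200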

The logs showed Image Streaming was enabled but not engaging, which led me to the regional requirement.
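
Once you suspect a mismatch, checking where things live is quick. A sketch of how I'd compare the two (cluster and repo names are placeholders):

# Where does the cluster run?
gcloud container clusters describe <cluster_name> \
   --location=<control_plane_location> \
   --format="value(location)"

# Where do the Artifact Registry repos live? A multi-regional repo shows up
# under locations/us (or eu/asia) rather than a specific region.
gcloud artifacts repositories list --format="value(name)"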

3. Disk speed on the nodes themselves

Once images do pull, they have to land on disk. We were using HDD-backed nodes. Switching to SSD cut Docker load time by ~30% and brought total spawn time from 15-20 minutes down to 5-6.

Unglamorous, but worth checking. If you're on HDD, you're paying for it on every cold start.
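
A rough way to check what your node pool is using, and what an SSD-backed pool looks like (a sketch; names and sizes are placeholders, and pd-ssd costs more per GB):

# pd-standard means HDD-backed boot disks
gcloud container node-pools describe <pool_name> \
   --cluster=<cluster_name> --location=<control_plane_location> \
   --format="value(config.diskType, config.diskSizeGb)"

# New pools can use SSD boot disks instead
gcloud container node-pools create <new_pool_name> \
   --cluster=<cluster_name> --location=<control_plane_location> \
   --disk-type=pd-ssd --disk-size=200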

TLDR:

If your Ray cluster spawns feel slow on GKE, the diagnostic order I'd suggest:

  1. Confirm Image Streaming is actually engaging (don't trust "enabled" - check logs).
  2. Verify your Artifact Registry region matches your cluster region.
  3. Check what disk type your node pool is using.
