DEV Community

Manikandan T

Posted on • Originally published at Medium

The GPU Cold Starts Nobody Warns You About: Autoscaling LLM Inference on Kubernetes

The Problem

When an LLM inference pod scales on Kubernetes, it doesn't start in seconds like a stateless CPU service. It moves through a sequential chain of bottlenecks (node provisioning, container image pull, model weight loading, and torch.compile warm-up) that can stall Time-to-First-Token (TTFT) for 10+ minutes.

Each phase blocks the next. Optimizing one shifts the bottleneck downstream - you must address every phase.

Modern inference containers (vLLM + PyTorch + CUDA + NCCL) are 10–20GB. Model weights for production models (Llama-3 70B, DeepSeek-V3) exceed 130GB. This is the reality of GPU autoscaling - every cold start moves tens of gigabytes before generating a single token.

This post covers practical solutions for each phase, issues I hit during implementation, and cloud-native alternatives across providers.


Phase 1: Node Provisioning

Karpenter Over Cluster Autoscaler

If you're still on Cluster Autoscaler for inference workloads, switch to Karpenter. The key differences that matter for GPU scaling:

  • Event-driven - reacts to pending pods as soon as they appear (after a short batching window) vs CA's 10s+ polling loop
  • Direct cloud API - bypasses ASGs/VMSS entirely, selects instance types dynamically
  • Workload-aware - evaluates GPU, memory, CPU, taints, affinity constraints together and picks the optimal instance

For inference specifically, the important Karpenter configuration is restricting your NodePool to a consistent GPU family. This matters downstream for compile cache sharing (covered in Phase 4).

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-inference
spec:
  template:
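    spec:
      nodeClassRef:  # required in karpenter.sh/v1; name matches the EC2NodeClass shown in Phase 3
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-nodeclass
      requirements: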
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["p4d.24xlarge"]  # A100 instances - consistent GPU architecture
      taints:
        - key: nvidia.com/gpu
          value: "true"
          effect: NoSchedule
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 5m

Scaling Strategy: Scale When N Pods Are Running

Instead of warm pools with pause pods, configure your HPA/KEDA to trigger node provisioning proactively when existing pods approach saturation. The idea: when you have N pods running and load reaches a threshold, scale to N+1 before all pods are saturated.

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-scaler
spec:
  scaleTargetRef:
    name: vllm-inference
  minReplicaCount: 1
  maxReplicaCount: 8
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus:9090
        metricName: vllm_pending_requests
        query: sum(vllm:num_requests_waiting)
        threshold: "10"  # Scale when queue depth exceeds threshold

This ensures Karpenter sees a pending pod early, begins provisioning, and the new pod is ready before existing pods are overwhelmed. You never scale from zero - you scale from N to N+1.
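To see how a queue-depth threshold of 10 turns into replica counts, here is a minimal sketch. It assumes the Prometheus trigger is treated as an average-value target, so the autoscaler asks for roughly ceil(total queue depth / threshold) replicas, clamped to the min and max. The function name is hypothetical, and real HPA behavior also includes tolerance and stabilization windows that this ignores.

```python
import math


def desired_replicas(queue_depth: float, threshold: float,
                     min_replicas: int = 1, max_replicas: int = 8) -> int:
    """Approximate replica count for an average-value target (assumption, see lead-in)."""
    raw = math.ceil(queue_depth / threshold)
    # Clamp to the minReplicaCount / maxReplicaCount bounds from the ScaledObject.
    return max(min_replicas, min(max_replicas, raw))


for depth in (4, 10, 11, 35, 200):
    print(depth, desired_replicas(depth, threshold=10))
```

With these inputs, a total of 11 waiting requests is already enough to ask for a second replica, which is what gives Karpenter its head start.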

The hard floor remains: VM allocation + OS boot + GPU driver initialization + kubelet registration = 90–120 seconds for GPU instances. GPU driver init is the hidden cost - NVIDIA driver + device enumeration adds 20–30s that CPU instances don't pay. The only way past this floor is having the node already running.

Cloud-Specific Provisioning


Phase 2: Container Image Pull

The Problem

A vLLM container image is 10–20GB. When Karpenter provisions a fresh node, downloading from DockerHub/ECR/ACR saturates NAT gateways. Multiple nodes pulling simultaneously = ImagePullBackOff errors and 3-5 minute delays.

Why Lazy Pulling (eStargz/SOCI) Fails for GPU Workloads

Lazy pulling restructures images to allow on-demand file extraction via FUSE. The container "starts" immediately and fetches files as needed via HTTP Range Requests.

Why this is catastrophic for ML inference: Python ML runtimes import thousands of shared objects sequentially (torch, libcudart, triton, etc.). Each uncached import = synchronous HTTP round-trip through FUSE. The result: pull time drops dramatically, but application readiness (time to first successful /health response) regresses badly - often by an order of magnitude or more. The latency shifts from a one-time bulk download to thousands of small, serialized network round-trips spread across the Python import sequence.

The registry becomes a runtime dependency. Every import torch call blocks on network I/O. Do not use lazy pulling for LLM inference.
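A back-of-the-envelope sketch of why this happens. The read count and round-trip time below are placeholders I picked for illustration, not measurements; the point is only that serialized round trips multiply, while a bulk pull is bounded by bandwidth.

```python
def bulk_pull_seconds(image_gb: float, bandwidth_gbps: float) -> float:
    """One-time sequential download: size in gigabits divided by link speed."""
    return image_gb * 8 / bandwidth_gbps


def lazy_import_seconds(uncached_reads: int, round_trip_ms: float) -> float:
    """Serialized on-demand reads: each one waits for a full round trip."""
    return uncached_reads * round_trip_ms / 1000


# Placeholder inputs, not measurements.
print(bulk_pull_seconds(image_gb=15, bandwidth_gbps=10))           # 12.0
print(lazy_import_seconds(uncached_reads=5000, round_trip_ms=30))  # 150.0
```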

Spegel: P2P Image Distribution

Spegel is the practical P2P solution for GPU clusters - stateless, lightweight, zero control plane overhead.

How it works:

  1. DaemonSet on every node (must tolerate GPU taints)
  2. Advertises SHA256 layer digests via Kademlia DHT
  3. containerd configured to route pulls through localhost mirror
  4. New nodes query DHT, stream layers from peers over internal VPC network
  5. 404 fallback to external registry if layer not found in cluster

helm upgrade --install spegel oci://ghcr.io/spegel-org/helm-charts/spegel \
  --namespace spegel --create-namespace \
  --version v0.7.1 \
  --set "spegel.mirrorResolveTimeout=5s" \
  --set "spegel.mirrorResolveRetries=5" \
  --set "tolerations[0].key=nvidia.com/gpu" \
  --set "tolerations[0].operator=Exists" \
  --set "tolerations[0].effect=NoSchedule"

Critical requirement: For Spegel to work, at least one node must already have the image cached (this is why you scale from N to N+1, not from zero).

Compatibility:
Refer - https://spegel.dev/docs/getting-started/#compatibility

Issue: Default mirrorResolveTimeout Is Too Aggressive
Spegel's default mirrorResolveTimeout is 20ms. This is extremely tight: Kademlia DHT lookups that exceed 20ms fall back to the upstream registry, even when a peer has the layer. This explains why you might see ~90% hit rates instead of near-100%. Increasing to 5s with 5 retries gives the P2P network enough time to resolve peers.

Issue: containerd 2.1 Breaks Spegel Silently
This was the most time-consuming debugging issue. If you're running AL2023, Ubuntu 24.04, or any OS with containerd 2.1, there are three breaking defaults:

1. use_local_image_pull = false (default)
containerd 2.1 routes all pulls through a new transfer service (io.containerd.transfer.v1). This transfer service does NOT honor registry mirrors in hosts.toml. Spegel is silently bypassed - every pull goes to the external registry regardless of mirror configuration.

2. discard_unpacked_layers = true (the default on EKS-optimized AMIs; upstream containerd defaults to false)
With this setting, containerd discards compressed layers after extraction. Spegel needs preserved layers to serve them to peers. Without them, the P2P network fails silently.

3. Registry-specific hosts.toml overrides _default/hosts.toml
If you create docker.io/hosts.toml in your node userData, it overrides Spegel's _default/hosts.toml mirror configuration. Spegel's init container creates the _default/hosts.toml - do NOT create registry-specific host files.

The fix - containerd overrides in node userData:

# Example: EC2NodeClass userData for AL2023
userData: |
  MIME-Version: 1.0
  Content-Type: multipart/mixed; boundary="BOUNDARY"

  --BOUNDARY
  Content-Type: application/node.eks.aws

  ---
  apiVersion: node.eks.aws/v1alpha1
  kind: NodeConfig
  spec:
    containerd:
      config: |
        [plugins."io.containerd.cri.v1.images".registry]
          config_path = "/etc/containerd/certs.d"
        [plugins."io.containerd.cri.v1.images"]
          discard_unpacked_layers = false
          use_local_image_pull = true
  --BOUNDARY--
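Because these failures are silent, it can help to check the rendered containerd config on a node before trusting Spegel hit rates. The sketch below is a hypothetical pre-flight helper, not part of Spegel or containerd. It scans config text for the settings discussed above with a simple line match rather than a full TOML parse, so it ignores which table a key sits in.

```python
import re

# Settings this post relies on for Spegel under containerd 2.1.
REQUIRED = {
    "discard_unpacked_layers": "false",
    "use_local_image_pull": "true",
}


def spegel_config_problems(config_text: str) -> list[str]:
    """Return human-readable problems found in a containerd config snippet."""
    found = {}
    for line in config_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        match = re.match(r"^([A-Za-z_]+)\s*=\s*(.+)$", line)
        if match:
            found[match.group(1)] = match.group(2).strip()
    problems = []
    for key, wanted in REQUIRED.items():
        actual = found.get(key)
        if actual != wanted:
            problems.append(f"{key} is {actual or 'unset'}, expected {wanted}")
    if "config_path" not in found:
        problems.append("config_path is unset, so hosts.toml mirrors are not read")
    return problems
```

On a node you would feed it the contents of the active containerd config file; an empty list means the three settings match what the fix above sets.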

Cloud-Native Image Solutions

These outperform P2P for most use cases:

Verdict: Cloud-native streaming > Spegel P2P > Lazy pulling. Use Spegel for multi-cloud or when cloud-native options are unavailable.


Phase 3: Model Weight Loading

The Math

T_transfer = Payload_Size / Bandwidth
130 GB over 10 Gbps = ~104 seconds (theoretical minimum)

Real-world transfers from S3/HuggingFace Hub: 3–5 minutes due to TCP overhead and rate limiting.
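The same arithmetic as a small sketch, plus the link efficiency implied by the 3–5 minute real-world figure. The function names are mine; the inputs are the numbers quoted above.

```python
def transfer_seconds(payload_gb: float, bandwidth_gbps: float) -> float:
    """Ideal transfer time: payload in gigabits divided by link speed in Gbps."""
    return payload_gb * 8 / bandwidth_gbps


def implied_efficiency(ideal_seconds: float, observed_seconds: float) -> float:
    """Fraction of wire speed actually achieved."""
    return ideal_seconds / observed_seconds


ideal = transfer_seconds(130, 10)
print(ideal)  # 104.0
for observed in (180, 300):  # 3 and 5 minutes
    print(observed, round(implied_efficiency(ideal, observed), 2))
```

In other words, a 3–5 minute download over a 10 Gbps path corresponds to using roughly a third to a bit over half of the link.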

Shared Storage (PVC) for Pre-Downloaded Weights
The standard pattern: pre-download weights to a ReadWriteMany PVC, mount it in inference pods.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache-pvc
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: efs-sc  # Or Azure Files / Filestore
  resources:
    requests:
      storage: 200Gi

Pre-download via Job:

apiVersion: batch/v1
kind: Job
metadata:
  name: model-download
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: downloader
          image: python:3.11-slim
          command: ["sh", "-c"]
          args:
            - |
              pip install "huggingface_hub[hf_xet]"
              huggingface-cli download meta-llama/Meta-Llama-3-70B-Instruct \
                --local-dir /shared/models/Llama-3-70B-Instruct
          env:
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-token
                  key: token
          volumeMounts:
            - name: shared-cache
              mountPath: /shared
      volumes:
        - name: shared-cache
          persistentVolumeClaim:
            claimName: model-cache-pvc

Limitation: Shared filesystems (EFS/NFS/Azure Files) have IOPS bottlenecks when many nodes read 130GB concurrently. EFS elastic mode delivers ~130 MB/s - that's 15x slower than local NVMe.

Node-Local NVMe: The Performance Path

A100/H100 instances (p4d/p5 on AWS, ND-series on Azure, A2/A3 on GKE) include physically attached NVMe disks delivering ~2,000+ MB/s reads. Use them.

NVMe Setup: Approaches by Environment
Option 1: Karpenter instanceStorePolicy (recommended for EKS)
Karpenter's EC2NodeClass supports instanceStorePolicy: RAID0, which auto-formats all NVMe instance store devices as RAID0 and mounts them to /mnt/k8s-disks/0. Kubelet uses this as ephemeral storage, so emptyDir volumes are automatically NVMe-backed. No userData script, no privileged pods, and no manual device detection.

apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: gpu-nodeclass
spec:
  amiSelectorTerms:
    - alias: al2023@latest
  instanceStorePolicy: RAID0
  # ... other fields (role, subnets, security groups, etc.)

This is the cleanest approach for Karpenter - one line in the EC2NodeClass spec. The NVMe is ready before any pod schedules, and using emptyDir instead of hostPath means no hardcoded mount paths and no Pod Security Standards concerns.

Option 2: userData/cloud-init
For environments without Karpenter (self-managed ASGs, other provisioners), format and mount NVMe during node boot via userData:

# EC2NodeClass userData - formats NVMe at boot
userData: |
  MIME-Version: 1.0
  Content-Type: multipart/mixed; boundary="BOUNDARY"

  --BOUNDARY
  Content-Type: text/x-shellscript

  #!/bin/bash
  # Format and mount instance store NVMe (skip boot volume nvme0)
  DEVICES=$(ls /dev/nvme*n1 2>/dev/null | grep -v nvme0)
  if [ -n "$DEVICES" ]; then
    DEVICE_COUNT=$(echo "$DEVICES" | wc -l)
    if [ "$DEVICE_COUNT" -gt 1 ]; then
      dnf install -y mdadm || yum install -y mdadm || apt-get install -y mdadm
      echo "y" | mdadm --create /dev/md0 --level=0 \
        --raid-devices=$DEVICE_COUNT $DEVICES
      mkfs.ext4 -F /dev/md0
      mkdir -p /mnt/fast-disks
      mount /dev/md0 /mnt/fast-disks
    else
      mkfs.ext4 -F $DEVICES
      mkdir -p /mnt/fast-disks
      mount $DEVICES /mnt/fast-disks
    fi
    chmod 777 /mnt/fast-disks
  fi

  --BOUNDARY
  Content-Type: application/node.eks.aws
  # ... rest of NodeConfig (containerd overrides etc.)
  --BOUNDARY--

Note: AL2023 uses dnf, not yum. This approach still avoids privileged DaemonSets and Pod Security Standards violations.

Option 3: Cloud-managed NVMe provisioning
Azure Container Storage auto-provisions NVMe RAID0 StoragePools on ND-series VMs (similar to Karpenter's instanceStorePolicy). GKE local SSDs can be configured via node pool settings.

Option 4: Privileged DaemonSet (development/testing only)
A DaemonSet with privileged: true and hostPID: true can format NVMe drives post-boot. However, this is typically blocked in production by Pod Security Standards (Restricted/Baseline), OPA/Gatekeeper/Kyverno policies, and compliance requirements (SOC2, PCI-DSS). Only use this in development clusters where security policies are relaxed.

InitContainer: PVC → NVMe Sync
Once NVMe is available (via any method above), use an initContainer to copy the pre-downloaded model from the shared PVC to NVMe-backed storage:

initContainers:
  - name: model-sync
    image: alpine:latest
    command: ["sh", "-c"]
    args:
      - |
        echo "Syncing model from shared PVC to NVMe-backed storage..."
        mkdir -p /nvme/models  # destination parent does not exist on a fresh volume
        cp -r /shared/models/Llama-3-70B-Instruct /nvme/models/
        echo "Sync complete."
    volumeMounts:
      - name: shared-cache
        mountPath: /shared
        readOnly: true
      - name: nvme-storage
        mountPath: /nvme

With instanceStorePolicy: RAID0, mount the NVMe-backed volume as emptyDir: {} (kubelet places it on the NVMe mount automatically). With the userData approach, use hostPath: { path: /mnt/fast-disks }. The emptyDir approach is preferred because it avoids hardcoded paths, works with Pod Security Standards, and Kubernetes manages the lifecycle.

Each pod pays the PVC to NVMe copy cost: ~60–90s for 130GB if EFS sustains roughly 1.5–2 GB/s, but closer to 17 minutes at the ~130 MB/s figure quoted above, so measure what your filesystem actually delivers. With emptyDir, each pod copies independently (emptyDir is per-pod), but the copy from EFS to NVMe is still far faster than reading directly from EFS during inference.
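The copy cost is very sensitive to which throughput figure applies. A quick sketch using the two EFS numbers that appear in this post (~130 MB/s and ~1.5 GB/s) next to the local NVMe figure, in decimal units; the function name is mine.

```python
def copy_seconds(size_gb: float, throughput_mb_s: float) -> float:
    """Time to move size_gb at a sustained throughput, in decimal units."""
    return size_gb * 1000 / throughput_mb_s


for label, mb_s in (("EFS at ~130 MB/s", 130),
                    ("EFS at ~1.5 GB/s", 1500),
                    ("local NVMe at ~2 GB/s", 2000)):
    secs = copy_seconds(130, mb_s)
    print(f"{label}: {secs:.0f}s")
```

Only the ~1.5 GB/s case lands inside the 60–90s window, so it is worth measuring sustained read throughput from a real node before budgeting the initContainer time.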

Cloud-Specific Model Storage

GKE's Hyperdisk ML is notable: a single pre-populated volume can be attached read-only to up to 2,500 nodes, which eliminates all multi-node download redundancy.

NVIDIA ModelExpress / NIXL (Experimental)

Note: I have not tested this personally. The following is from NVIDIA documentation and community reports. Including it because it represents the theoretical fastest path for weight distribution in multi-node GPU clusters.

For environments with RDMA/InfiniBand interconnects (multi-node A100/H100 clusters), NVIDIA ModelExpress enables P2P weight distribution:

  • New worker communicates with metadata server (Redis sidecar)
  • Locates active GPU worker running the same model
  • Streams tensors directly from active worker's GPU memory via RDMA
  • Zero storage dependency for scale-out

# UCX transport configuration for NIXL
UCX_TLS=rc_x,rc,dc_x,dc,cuda_copy
UCX_RNDV_SCHEME=get_zcopy
MODEL_EXPRESS_NO_SHARED_STORAGE=1  # gRPC fallback when shared storage unavailable

This is the fastest possible weight distribution - active GPU memory to new GPU memory over InfiniBand. Relevant for large TP deployments on H100 clusters with EFA/InfiniBand interconnects.


Phase 4: fastsafetensors and Weight Loading Optimization

Standard Loading Path

The CPU RAM acts as a bounce buffer - data passes through without computation, purely as a transfer intermediary. For 130GB of weights on A100 with 80GB VRAM, this means multiple sequential PCIe transfers with CPU orchestration. Multi-minute load times.

fastsafetensors: Faster Weight Loading

fastsafetensors provides two loading paths:

1. GDS path: On GDS-optimized distributed filesystems (Lustre, WekaFS), it uses NVIDIA GPUDirect Storage to DMA directly from storage to GPU VRAM, bypassing CPU entirely. Performance: 4.8x to 7.5x speedup over standard loading.

2. POSIX I/O path (nogds): On local NVMe/ext4 or when GDS drivers are unavailable, it falls back to an optimized POSIX I/O path. This is still significantly faster than standard loading when reading from NVMe (~2,000+ MB/s) vs shared filesystems like EFS (~130 MB/s).

The key insight: the biggest performance gain comes from NVMe vs shared filesystem, not from GDS bypassing the CPU. Moving model weights from EFS to local NVMe is a ~15x bandwidth improvement regardless of whether GDS is active.

Enable in vLLM:

args:
  - "--load-format"
  - "fastsafetensors"
env:
  - name: USE_FASTSAFETENSOR
    value: "true"

GDS: When It Matters and When It Doesn't
GDS (GPUDirect Storage) enables direct DMA from storage to GPU VRAM, bypassing CPU bounce buffers. But GDS is a filesystem-level optimization, not just a driver. It requires a GDS-optimized filesystem to function:

GDS-optimized filesystems: Lustre (FSx for Lustre), WekaFS, GPFS - these support the cuFile API natively. On these, fastsafetensors delivers 4.8–7.5x speedup via true DMA.

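Local NVMe/ext4: In my setup this did not get a true GDS path. Even with nvidia_fs.ko loaded, GDS runs in compatibility mode (CPU bounce buffer, no faster than POSIX I/O) or doesn't engage at all, and fastsafetensors detects this and falls back to its nogds path. NVIDIA's GDS documentation does list ext4 and XFS on local NVMe as supported configurations, so confirm what your own nodes report rather than assuming either way.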

How to check GDS status on your node:

# Check if nvidia-gds package is installed
dpkg -l | grep nvidia-gds    # Debian/Ubuntu
rpm -qa | grep nvidia-gds     # RHEL/AL2023
# Check if GDS kernel module is loaded
lsmod | grep nvidia_fs
# From inside a container, check vLLM's detection
# Look for this in logs:
# "GDS not enabled, setting nogds=True"  ← GDS NOT available
# No such message = GDS is active

GDS Compatibility by Instance:
GDS support depends on the driver stack (nvidia-fs kernel module, MOFED/OFED drivers) and critically, the storage filesystem. Any modern NVIDIA datacenter GPU can technically do GDS if the correct drivers are installed AND the filesystem supports the cuFile API.
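If you want the kernel-module check inside a Python health script instead of shelling out to lsmod, a minimal sketch is below. It reads /proc/modules, where the first field of each line is the module name. The function names are mine, and a loaded module only tells you the driver is present, not that GDS engages on your filesystem.

```python
from pathlib import Path


def module_loaded(name: str, proc_modules_text: str) -> bool:
    """Return True if `name` is listed as a module in /proc/modules content."""
    return any(line.split()[0] == name
               for line in proc_modules_text.splitlines() if line.strip())


def nvidia_fs_loaded(path: str = "/proc/modules") -> bool:
    """Check the running kernel; False if /proc is unavailable (non-Linux, restricted container)."""
    try:
        return module_loaded("nvidia_fs", Path(path).read_text())
    except OSError:
        return False


print("nvidia_fs loaded:", nvidia_fs_loaded())
```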

Issue I hit: On g6 instances (L4, testing a smaller model), vLLM logged GDS not enabled, setting nogds=True. The AL2023 GPU AMI does not include nvidia-gds. But even installing GDS would not have helped here - the model was on local NVMe/ext4, which is not a GDS-optimized filesystem. The real fix was moving model weights from EFS to NVMe, not installing GDS drivers.

On datacenter instances (p4d/p5/ND-series/A3) the MOFED + nvidia-fs stack is pre-installed. fastsafetensors delivers its full 4.8–7.5x GDS speedup only when reading from a GDS-optimized distributed filesystem (e.g., FSx for Lustre). When reading from local NVMe/ext4 on these same instances, the GDS bypass does not engage - the speedup comes from NVMe bandwidth, not GDS.

NUMA Binding for Multi-Socket A100/H100 Servers
On p4d (8x A100) and p5 (8x H100), the machine has multiple NUMA domains. Without binding, Python workers may drift across CPU sockets, causing cross-NUMA memory access during Tensor Parallel sharding.
The standard approach is to wrap vLLM with numactl. Separately from NUMA binding, multi-GPU setups should also set the worker start method:

env:
  - name: VLLM_WORKER_MULTIPROC_METHOD
    value: "spawn"  # Required for multi-GPU - default 'fork' causes issues with CUDA contexts

For explicit NUMA pinning, wrap the entrypoint with numactl:

command: ["numactl", "--cpunodebind=0", "--membind=0", "python3", "-m", "vllm.entrypoints.openai.api_server"]

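As written, this binds the whole vLLM process tree, including every tensor-parallel worker, to NUMA node 0. That stops workers from drifting across sockets, but it does not place each worker next to its own GPU: on a two-socket machine, the GPUs attached to the other socket still see cross-NUMA traffic. Pinning each worker to the CPU cores and memory closest to its GPU's PCIe lanes requires per-worker binding. Either way, this primarily affects steady-state throughput rather than startup latency, but prevents throughput degradation under TP configurations.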

Note: Some vLLM versions expose a --enable-numa flag. Verify availability in your target version - the NUMA interface has changed across releases. The VLLM_WORKER_MULTIPROC_METHOD=spawn env var is the stable requirement for multi-GPU setups.
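For per-GPU pinning, the kernel already exposes which NUMA node each PCI device belongs to. The sketch below reads that from sysfs and builds a numactl prefix for one worker. Both helper names are hypothetical, the bus ID must be in the lowercase sysfs form (for example 0000:10:1c.0), and how the prefix is attached to individual vLLM workers depends on the vLLM version and is not shown.

```python
from pathlib import Path


def gpu_numa_node(pci_bus_id: str, sysfs_root: str = "/sys/bus/pci/devices") -> int:
    """Read the NUMA node the kernel reports for a PCI device (-1 if unknown)."""
    try:
        return int((Path(sysfs_root) / pci_bus_id / "numa_node").read_text().strip())
    except (OSError, ValueError):
        return -1


def numactl_prefix(numa_node: int) -> list[str]:
    """Build a numactl prefix for one worker; empty when the node is unknown."""
    if numa_node < 0:
        return []
    return ["numactl", f"--cpunodebind={numa_node}", f"--membind={numa_node}"]


# Example: a worker whose GPU sits on NUMA node 1.
print(numactl_prefix(1) + ["python3", "worker.py"])  # worker.py is a placeholder
```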


Phase 5: torch.compile Cache Persistence

The Problem
After weights are loaded, vLLM uses torch.compile to compile the model's forward pass and then captures CUDA graphs for the decode loop. TorchInductor generates low-level device kernels optimized for the specific GPU architecture.

Default FULL_AND_PIECEWISE mode captures:

  • Monolithic decode graphs for uniform sequence lengths
  • Segmented piecewise graphs for variable prefill dimensions

This compilation is strictly serial, CPU-bound. Every new pod pays this penalty independently. Compilation time scales with model size and sequence length range: a 7B model on a single GPU compiles in ~20–30s, while a 70B model across 8x A100 with large context ranges can take 2–5 minutes.

The Fix: Shared Compile Cache on RWX Storage
Redirect the compile cache to shared storage. First pod compiles, all subsequent pods reuse.

env:
  - name: VLLM_CACHE_ROOT
    value: "/shared/compile_cache"  # Points to RWX PVC (EFS/Azure Files/Filestore)

Cache structure:
/shared/compile_cache/
  torch_compile_cache/
    <hash>/                         # model config + PyTorch version + GPU compute capability
      rank_0_0/
        backbone/
          transformed_code.py       # Compiled Python code
          computation_graph.py      # Graph structure
          inductor_cache/           # Final compiled kernels

The cached artifacts are environment-specific - they're safe to reuse only across pods sharing the same GPU architecture, CUDA, PyTorch, and vLLM version. vLLM derives this by hashing a long list of config and PyTorch factors, and the underlying Inductor kernels are compiled to architecture-specific cubins - so the binding to a specific GPU is real even though it isn't a single tidy "compute capability" field. First pod writes the cache; subsequent pods in the same environment hit and load directly. This is vendor-confirmed: vLLM's docs state the compiled artifacts can be reused across machines with the same environment, and explicitly recommend generating the cache once and sharing it among instances for autoscaling.
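To make the hash-based reuse concrete, here is an illustration of the idea, not vLLM's actual hashing code. The factor names and values are invented for the example; the only point is that any change in an environment factor yields a different directory name, and identical environments agree.

```python
import hashlib
import json


def cache_key(factors: dict) -> str:
    """Hash a set of environment factors into a short, stable directory name."""
    canonical = json.dumps(factors, sort_keys=True)  # key order must not matter
    return hashlib.sha256(canonical.encode()).hexdigest()[:10]


# Invented factor values, for illustration only.
a100 = {"model": "example-7b", "torch": "2.x", "vllm": "0.x", "gpu_capability": "8.0"}
h100 = dict(a100, gpu_capability="9.0")

print(cache_key(a100) == cache_key(dict(a100)))  # True: same environment, same directory
print(cache_key(a100) == cache_key(h100))        # False: different GPU architecture
```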

Result from implementation (7B model, single L4 GPU, max_model_len=12000):

Log evidence of a successful hit:

That's a 60% reduction on a 7B model from a single environment variable. For larger models (70B on 8x A100), fresh compilation can take 2–5 minutes - the cache hit savings scale proportionally, typically reducing to 10–20s of cache loading.

Issue: GPU Architecture Specificity

The compile cache is effectively keyed by GPU architecture, so nodes with different compute capabilities resolve to different cache directories:

A100 (compute 8.0) → cache hash: 8d22bdd77e
H100 (compute 9.0) → cache hash: f04cb94f7b

An A100 cache CANNOT be reused on an H100 node. If your NodePool allows mixed GPU types, pods on a different architecture face full recompilation even though a cache already exists.

Solution: Restrict NodePool to a single GPU family:

requirements:
  - key: node.kubernetes.io/instance-type
    operator: In
    values: ["p4d.24xlarge"]  # All nodes = A100, same compute capability

This is why Phase 1 matters - consistent GPU architecture enables compile cache sharing cluster-wide.

Compilation Mode Reference


Complete Optimized Deployment
Putting it all together - vLLM on A100 with all optimizations:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-optimized
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-optimized
  template:
    metadata:
      labels:
        app: vllm-optimized
    spec:
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      initContainers:
        - name: model-sync
          image: alpine:latest
          command: ["sh", "-c"]
          args:
            - |
              echo "Copying model from shared PVC to NVMe-backed emptyDir..."
              mkdir -p /nvme/models  # destination parent does not exist in a fresh emptyDir
              cp -r /shared/models/Llama-3-70B-Instruct /nvme/models/
          volumeMounts:
            - name: shared-cache
              mountPath: /shared
              readOnly: true
            - name: nvme-storage
              mountPath: /nvme
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          command: ["numactl", "--cpunodebind=0", "--membind=0",
                    "python3", "-m", "vllm.entrypoints.openai.api_server"]
          args:
            - "--model"
            - "/nvme/models/Llama-3-70B-Instruct"
            - "--tensor-parallel-size"
            - "8"
            - "--load-format"
            - "fastsafetensors"
            - "--gpu-memory-utilization"
            - "0.90"
            - "--dtype"
            - "bfloat16"
            - "--host"
            - "0.0.0.0"
            - "--port"
            - "8000"
          env:
            - name: USE_FASTSAFETENSOR
              value: "true"
            - name: VLLM_WORKER_MULTIPROC_METHOD
              value: "spawn"
            - name: VLLM_CACHE_ROOT
              value: "/shared/compile_cache"
          resources:
            requests:
              nvidia.com/gpu: "8"
              cpu: "90"  # below the node's 96 vCPUs so kubelet reservations and DaemonSets still fit
              memory: "1024Gi"
            limits:
              nvidia.com/gpu: "8"
          volumeMounts:
            - name: nvme-storage
              mountPath: /nvme
              readOnly: true
            - name: shared-cache
              mountPath: /shared
          startupProbe:
            httpGet:
              path: /health
              port: 8000
            failureThreshold: 120
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            failureThreshold: 3
            periodSeconds: 5
      volumes:
        - name: nvme-storage
          emptyDir: {}  # NVMe-backed via instanceStorePolicy: RAID0
        - name: shared-cache
          persistentVolumeClaim:
            claimName: model-cache-pvc

Note: The nvme-storage volume uses emptyDir: {} which is automatically NVMe-backed when the EC2NodeClass has instanceStorePolicy: RAID0. For non-Karpenter environments using userData-formatted NVMe, replace with hostPath: { path: /mnt/fast-disks }.


Cloud-Native Reference Architectures

AWS EKS

Karpenter (EC2NodeClass) → p4d.24xlarge (8x A100)
  → Bottlerocket + EBS Snapshot (zero-second image pull)
  → EFS PVC (compile cache, RWX)
  → NVMe hostPath (model weights, PCIe speed)
  → vLLM + fastsafetensors + VLLM_CACHE_ROOT

Azure AKS

NAP (AKSNodeClass) → ND A100 v4
  → ACR Artifact Streaming (~5s image availability)
  → Azure Container Storage (NVMe RAID0, auto-provisioned)
  → Azure Files PVC (compile cache, RWX)
  → vLLM + fastsafetensors + VLLM_CACHE_ROOT

Google GKE

ComputeClasses (NAP) → A2/A3 (A100/H100)
  → GKE Image Streaming (transparent remote mount)
  → Hyperdisk ML (model weights, ReadOnlyMany, up to 2,500 nodes)
  → GCS/Filestore PVC (compile cache, RWX)
  → vLLM + fastsafetensors + VLLM_CACHE_ROOT

Comparison


Expected Timings (Theoretical)

With all optimizations on A100 infrastructure (p4d.24xlarge), scaling from N to N+1 pods where N >= 1:

Notes on these numbers:

  • Node provisioning 90–120s includes: EC2/VM API call (~5s) + instance allocation (~15–30s) + OS boot (~15s) + NVIDIA driver initialization (~20–30s) + kubelet registration + node ready (~10–15s). GPU driver init is the hidden cost most people underestimate - it adds 20–30s that CPU instances don't pay.
  • Spegel P2P 30–60s: A 15–20GB image over VPC internal networking (p4d.24xlarge has 400 Gbps of network bandwidth). Spegel uses HTTP-based transfer over TCP, not RDMA - real throughput is well below wire speed. DHT lookups and multi-peer coordination add overhead. Pin to v0.7.1 and tune mirrorResolveTimeout to 5s for reliable hit rates.
  • Cloud-native streaming 5–15s: EBS Snapshot = data pre-attached at boot (near-instant). ACR/GKE streaming = remote filesystem mount with on-demand paging (container starts before full download).
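  • PVC → NVMe sync 60–90s: 130GB at EFS elastic throughput (~1.5 GB/s burst) is about 87s. Shared filesystem read speed is the bottleneck here; at the ~130 MB/s figure quoted earlier the same copy would take about 17 minutes. On Azure with premium NVMe StoragePool, or with S3/Blob direct download, this can be faster.
  • Weight loading 15–25s: 130GB / 8 GPUs = ~16GB per GPU. PCIe Gen4 x16 = 32 GB/s per GPU theoretical, so PCIe is not the limit; reading 130GB in 15–25s needs roughly 5–9 GB/s from storage, which implies striping across several NVMe devices (RAID0) rather than a single ~2 GB/s device. With safetensors metadata parsing overhead, effective throughput is ~40–60% of theoretical. On local NVMe/ext4, fastsafetensors uses POSIX I/O (not GDS) - GDS only engages on distributed filesystems like Lustre.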
  • Compile cache load 10–20s for 70B: Loading pre-compiled graphs from shared storage for all TP ranks. Larger models have more compilation units to deserialize.
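Since the phases run sequentially, the end-to-end estimate is just the sum of the per-phase ranges in the notes above. A small sketch, with both image-pull options; the dictionary and function names are mine, the numbers are the ones listed.

```python
# Per-phase ranges in seconds, taken from the notes above.
PHASES = {
    "node provisioning": (90, 120),
    "PVC -> NVMe sync": (60, 90),
    "weight loading": (15, 25),
    "compile cache load": (10, 20),
}
IMAGE_PULL = {"Spegel P2P": (30, 60), "cloud-native streaming": (5, 15)}


def total_range(image_option: str) -> tuple[int, int]:
    """Sum best-case and worst-case seconds across all sequential phases."""
    ranges = list(PHASES.values()) + [IMAGE_PULL[image_option]]
    return sum(lo for lo, _ in ranges), sum(hi for _, hi in ranges)


for option in IMAGE_PULL:
    lo, hi = total_range(option)
    print(f"{option}: {lo}-{hi}s")
```

That puts an N to N+1 scale-up at roughly 3 to 5 minutes end to end, with node provisioning and the PVC → NVMe sync as the two largest terms.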

Lessons Learned

Node Provisioning

  • Restrict NodePool to a single GPU instance type - this enables compile cache sharing across nodes
  • Karpenter's security groups on provisioned nodes may differ from managed node groups - cross-SG ingress rules are needed for pod networking
  • Set consolidateAfter appropriately - too aggressive and you lose nodes that could serve the next scale event

Image Pull (Spegel / P2P)

  • containerd 2.1 on AL2023/Ubuntu 24.04 defaults use_local_image_pull=false - Spegel is silently bypassed
  • discard_unpacked_layers=true breaks P2P layer serving - must be explicitly overridden
  • Registry-specific docker.io/hosts.toml overrides Spegel's _default/hosts.toml - do not create registry-specific host files in userData
  • Spegel requires at least one existing node with the cached image - useless for true scale-from-zero, design around scale-from-N
  • Default mirrorResolveTimeout of 20ms is too aggressive - Kademlia DHT lookups exceeding this fall back to upstream. Tune to 5s with 5 retries for better hit rates
  • Large base layers (PyTorch/CUDA) may not have distribution source labels in the image manifest, causing DHT lookup failures - these fall back to registry pulls

Model Weight Loading

  • EFS/NFS throughput (~130 MB/s) is slower than local NVMe (~2000 MB/s) - use NVMe for weight reads, shared PVC for pre-download and cache sharing. Throughput may vary based on file system used.
  • With emptyDir, the initContainer PVC → NVMe copy runs for every pod; it becomes a one-time cost per node only if you use hostPath and have the initContainer skip the copy when the model directory already exists
  • For 130GB+ models, the PVC → NVMe copy takes 60–90s at best - budget for it on the first boot of every node
  • Use Karpenter's instanceStorePolicy: RAID0 for NVMe provisioning (preferred) or userData/cloud-init as a fallback. Avoid privileged DaemonSets.

fastsafetensors / GDS

  • The primary speedup is NVMe vs shared filesystem (~15x bandwidth) - not GDS bypassing the CPU
  • nogds=True in vLLM logs is expected and not a problem on local NVMe/ext4 - fastsafetensors' POSIX I/O path is still fast
  • GDS only provides its 4.8–7.5x speedup on GDS-optimized distributed filesystems (Lustre, WekaFS, GPFS) - not on local ext4/XFS
  • Even with nvidia-gds installed and nvidia_fs loaded, GDS runs in compatibility mode on ext4 (CPU bounce buffer, same as POSIX)
  • Datacenter instances (p4d/p5/ND-series/A3) ship with MOFED + nvidia-fs pre-installed - but GDS only engages when paired with a GDS-optimized filesystem
  • For local NVMe setups, fastsafetensors is still worth using - its optimized POSIX I/O path is efficient on NVMe

torch.compile Cache

  • Highest ROI optimization - one env var (VLLM_CACHE_ROOT), trivial to implement, 60%+ compile time reduction
  • Cache is GPU-architecture-specific - A100 cache (8.0) != H100 cache (9.0), never mix in same NodePool
  • First pod after a PyTorch version upgrade or model config change will recompile - cache invalidation is hash-based
  • The shared PVC must be ReadWriteMany - first pod writes, subsequent pods read

General

  • Solve every phase - optimizing one just exposes the next bottleneck
  • Cloud-native solutions (image streaming, Hyperdisk ML) give the biggest wins with least operational complexity
  • The physical limit is PCIe bandwidth + VM boot time - everything else is software-solvable
  • Scale from N >= 1, not from zero - this enables Spegel P2P, warm NVMe caches, and avoids the worst-case cold start

Repository
The infrastructure code, Kubernetes manifests, and deployment configurations referenced in this post are available in the companion repository:
GitHub Repository - GPU Autoscaling Accelerator


This post synthesizes research from multiple design iterations and hands-on implementation on EKS with Karpenter, Spegel, fastsafetensors, and persistent torch.compile caching.
