The Problem
When an LLM inference pod scales out on Kubernetes, it doesn't come up in seconds like a stateless web service. It moves through a sequential chain of bottlenecks that can stall Time-to-First-Token (TTFT) for 10+ minutes.
Each phase blocks the next. Optimizing one shifts the bottleneck downstream - you must address every phase.
Modern inference containers (vLLM + PyTorch + CUDA + NCCL) are 10–20GB. Model weights for production models (Llama-3 70B, DeepSeek-V3) exceed 130GB. This is the reality of GPU autoscaling - every cold start moves tens of gigabytes before generating a single token.
This post covers practical solutions for each phase, issues I hit during implementation, and cloud-native alternatives across providers.
Phase 1: Node Provisioning
Karpenter Over Cluster Autoscaler
If you're still on Cluster Autoscaler for inference workloads, switch to Karpenter. The key differences that matter for GPU scaling:
- Event-driven - reacts to pending pods in milliseconds vs CA's 10s+ polling loop
- Direct cloud API - bypasses ASGs/VMSS entirely, selects instance types dynamically
- Workload-aware - evaluates GPU, memory, CPU, taints, affinity constraints together and picks the optimal instance
For inference specifically, the important Karpenter configuration is restricting your NodePool to a consistent GPU family. This matters downstream for compile cache sharing (covered in Phase 4).
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-inference
spec:
  template:
    spec:
      nodeClassRef:  # required in karpenter.sh/v1; points at the EC2NodeClass shown in Phase 3
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-nodeclass
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["p4d.24xlarge"]  # A100 instances - consistent GPU architecture
      taints:
        - key: nvidia.com/gpu
          value: "true"
          effect: NoSchedule
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 5m
Scaling Strategy: Scale When N Pods Are Running
Instead of warm pools with pause pods, configure your HPA/KEDA to trigger node provisioning proactively when existing pods approach saturation. The idea: when you have N pods running and load reaches a threshold, scale to N+1 before all pods are saturated.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-scaler
spec:
  scaleTargetRef:
    name: vllm-inference
  minReplicaCount: 1
  maxReplicaCount: 8
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus:9090
        metricName: vllm_pending_requests
        query: sum(vllm:num_requests_waiting)
        threshold: "10"  # Scale when queue depth exceeds threshold
This ensures Karpenter sees a pending pod early, begins provisioning, and the new pod is ready before existing pods are overwhelmed. You never scale from zero - you scale from N to N+1.
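To make the N to N+1 behaviour concrete, here is a minimal Python sketch of the replica arithmetic. It assumes the Prometheus trigger is evaluated as an average-value target (total metric divided by the threshold, rounded up). The function name and clamping are illustrative, and the real HPA also applies tolerances and stabilization windows that are left out here.

```python
import math

def desired_replicas(total_waiting: float, threshold: float,
                     min_replicas: int = 1, max_replicas: int = 8) -> int:
    """Simplified average-value scaling: ceil(total / threshold), clamped."""
    if threshold <= 0:
        raise ValueError("threshold must be positive")
    raw = math.ceil(total_waiting / threshold)
    return max(min_replicas, min(max_replicas, raw))

# Queue depths summed across all pods, with the threshold of 10 used above.
for waiting in (0, 8, 25, 200):
    print(waiting, "->", desired_replicas(waiting, threshold=10))
```

With a threshold of 10, a total queue of 25 asks for 3 replicas, and the replica count never drops below `minReplicaCount`, which is what keeps the cluster out of the scale-from-zero case.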
The hard floor remains: VM allocation + OS boot + GPU driver initialization + kubelet registration = 90–120 seconds for GPU instances. GPU driver init is the hidden cost - NVIDIA driver + device enumeration adds 20–30s that CPU instances don't pay. The only way past this floor is having the node already running.
Cloud-Specific Provisioning
Phase 2: Container Image Pull
The Problem
A vLLM container image is 10–20GB. When Karpenter provisions a fresh node, downloading from DockerHub/ECR/ACR saturates NAT gateways. Multiple nodes pulling simultaneously = ImagePullBackOff errors and 3-5 minute delays.
Why Lazy Pulling (eStargz/SOCI) Fails for GPU Workloads
Lazy pulling restructures images to allow on-demand file extraction via FUSE. The container "starts" immediately and fetches files as needed via HTTP Range Requests.
Why this is catastrophic for ML inference: Python ML runtimes import thousands of shared objects sequentially (torch, libcudart, triton, etc.). Each uncached import = synchronous HTTP round-trip through FUSE. The result: pull time drops dramatically, but application readiness (time to first successful /health response) regresses badly - often by an order of magnitude or more. The latency shifts from a one-time bulk download to thousands of small, serialized network round-trips spread across the Python import sequence.
The registry becomes a runtime dependency. Every import torch call blocks on network I/O. Do not use lazy pulling for LLM inference.
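A rough back-of-the-envelope model shows why serialized round-trips dominate. The file count and per-fetch latency below are illustrative assumptions, not measurements from this post.

```python
def lazy_import_penalty_s(uncached_files: int, round_trip_ms: int) -> float:
    """Total blocking time when every uncached file costs one serialized round-trip."""
    return uncached_files * round_trip_ms / 1000

# Hypothetical numbers: 5,000 files touched during import, 20 ms per FUSE-backed fetch.
print(lazy_import_penalty_s(5000, 20))  # 100.0 seconds spent waiting on the network
```

Because the fetches happen one after another inside the import sequence, the penalty grows linearly with the number of files rather than with image size.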
Spegel: P2P Image Distribution
Spegel is the practical P2P solution for GPU clusters - stateless, lightweight, zero control plane overhead.
How it works:
- DaemonSet on every node (must tolerate GPU taints)
- Advertises SHA256 layer digests via Kademlia DHT
- containerd configured to route pulls through localhost mirror
- New nodes query DHT, stream layers from peers over internal VPC network
- 404 fallback to external registry if layer not found in cluster
helm upgrade --install spegel oci://ghcr.io/spegel-org/helm-charts/spegel \
--namespace spegel --create-namespace \
--version v0.7.1 \
--set "spegel.mirrorResolveTimeout=5s" \
--set "spegel.mirrorResolveRetries=5" \
--set "tolerations[0].key=nvidia.com/gpu" \
--set "tolerations[0].operator=Exists" \
--set "tolerations[0].effect=NoSchedule"
Critical requirement: For Spegel to work, at least one node must already have the image cached (this is why you scale from N to N+1, not from zero).
Compatibility:
See https://spegel.dev/docs/getting-started/#compatibility
Issue: Default mirrorResolveTimeout Is Too Aggressive
Spegel's default mirrorResolveTimeout is 20ms. This is extremely tight: Kademlia DHT lookups that exceed 20ms fall back to the upstream registry, even when a peer has the layer. This explains why you might see ~90% hit rates instead of near-100%. Increasing it to 5s with 5 retries gives the P2P network enough time to resolve peers.
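The timeout-then-fallback behaviour can be modelled with a toy simulation. This is a simplified sketch, not Spegel's implementation: the function names are made up for illustration and the lookup latencies are hypothetical.

```python
def resolve_source(lookup_ms: float, timeout_ms: float, peer_has_layer: bool) -> str:
    """Return where a layer is pulled from under a simple timeout-then-fallback rule."""
    if peer_has_layer and lookup_ms <= timeout_ms:
        return "peer"
    return "upstream"

def hit_rate(lookups_ms: list[float], timeout_ms: float) -> float:
    """Fraction of layers served by peers, assuming every layer is cached somewhere."""
    hits = sum(resolve_source(ms, timeout_ms, True) == "peer" for ms in lookups_ms)
    return hits / len(lookups_ms)

# Hypothetical DHT lookup latencies in milliseconds for ten layers.
samples = [3, 5, 8, 12, 15, 18, 19, 20, 45, 9]
print(hit_rate(samples, timeout_ms=20))    # 0.9
print(hit_rate(samples, timeout_ms=5000))  # 1.0
```

A single slow lookup is enough to send that layer to the upstream registry, so a small tail of slow lookups caps the hit rate no matter how many peers hold the image.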
Issue: containerd 2.1 Breaks Spegel Silently
This was the most time-consuming debugging issue. If you're running AL2023, Ubuntu 24.04, or any OS with containerd 2.1, there are three breaking defaults:
1. use_local_image_pull = false (default)
containerd 2.1 routes all pulls through a new transfer service (io.containerd.transfer.v1). This transfer service does NOT honor registry mirrors in hosts.toml. Spegel is silently bypassed - every pull goes to the external registry regardless of mirror configuration.
2. discard_unpacked_layers = true (default)
containerd discards compressed layers after extraction. Spegel needs preserved layers to serve them to peers. Without them, the P2P network fails silently.
3. Registry-specific hosts.toml overrides _default/hosts.toml
If you create docker.io/hosts.toml in your node userData, it overrides Spegel's _default/hosts.toml mirror configuration. Spegel's init container creates the _default/hosts.toml - do NOT create registry-specific host files.
The fix - containerd overrides in node userData:
# Example: EC2NodeClass userData for AL2023
userData: |
  MIME-Version: 1.0
  Content-Type: multipart/mixed; boundary="BOUNDARY"

  --BOUNDARY
  Content-Type: application/node.eks.aws

  ---
  apiVersion: node.eks.aws/v1alpha1
  kind: NodeConfig
  spec:
    containerd:
      config: |
        [plugins."io.containerd.cri.v1.images".registry]
        config_path = "/etc/containerd/certs.d"
        [plugins."io.containerd.cri.v1.images"]
        discard_unpacked_layers = false
        use_local_image_pull = true
  --BOUNDARY--
Cloud-Native Image Solutions
These outperform P2P for most use cases:
Verdict: Cloud-native streaming > Spegel P2P > Lazy pulling. Use Spegel for multi-cloud or when cloud-native options are unavailable.
Phase 3: Model Weight Loading
The Math
T_transfer = Payload_Size / Bandwidth
130 GB over 10 Gbps = ~104 seconds (theoretical minimum)
Real-world transfers from S3/HuggingFace Hub: 3–5 minutes due to TCP overhead and rate limiting.
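The formula above is easy to check in a few lines of Python. This is a minimal sketch of the ideal case only: it uses decimal gigabytes and ignores protocol overhead and rate limiting.

```python
def transfer_seconds(payload_gb: float, bandwidth_gbps: float) -> float:
    """Ideal transfer time: payload in gigabytes, link speed in gigabits per second."""
    return payload_gb * 8 / bandwidth_gbps

print(transfer_seconds(130, 10))   # 104.0
print(transfer_seconds(130, 100))  # 10.4
```

The factor of 8 converts bytes to bits. Forgetting it is the usual reason an estimate comes out eight times too optimistic.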
Shared Storage (PVC) for Pre-Downloaded Weights
The standard pattern: pre-download weights to a ReadWriteMany PVC, mount it in inference pods.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache-pvc
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: efs-sc  # Or Azure Files / Filestore
  resources:
    requests:
      storage: 200Gi
Pre-download via Job:
apiVersion: batch/v1
kind: Job
metadata:
  name: model-download
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: downloader
          image: python:3.11-slim
          command: ["sh", "-c"]
          args:
            - |
              # Quote the extras so the shell does not treat the brackets as a glob.
              pip install "huggingface_hub[hf_xet]"
              # Hub repo ID for Llama 3 70B Instruct; confirm the exact ID for your model.
              huggingface-cli download meta-llama/Meta-Llama-3-70B-Instruct \
                --local-dir /shared/models/Llama-3-70B-Instruct
          env:
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-token
                  key: token
          volumeMounts:
            - name: shared-cache
              mountPath: /shared
      volumes:
        - name: shared-cache
          persistentVolumeClaim:
            claimName: model-cache-pvc
Limitation: Shared filesystems (EFS/NFS/Azure Files) have IOPS bottlenecks when many nodes read 130GB concurrently. EFS elastic mode delivers ~130 MB/s - that's 15x slower than local NVMe.
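To put the throughput gap in terms of wall-clock time, here is a small sketch using the two figures quoted in this post (~130 MB/s for the shared filesystem, ~2,000 MB/s for local NVMe). It assumes sustained sequential reads and decimal units.

```python
def read_seconds(size_gb: float, throughput_mb_s: float) -> float:
    """Time to read size_gb (decimal GB) at a sustained throughput in MB/s."""
    return size_gb * 1000 / throughput_mb_s

efs = read_seconds(130, 130)    # shared filesystem figure quoted above
nvme = read_seconds(130, 2000)  # local NVMe figure quoted in this post
print(f"EFS:  {efs:.0f} s")          # EFS:  1000 s
print(f"NVMe: {nvme:.0f} s")         # NVMe: 65 s
print(f"ratio: {efs / nvme:.1f}x")   # ratio: 15.4x
```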
Node-Local NVMe: The Performance Path
A100/H100 instances (p4d/p5 on AWS, ND-series on Azure, A2/A3 on Google Cloud) include physically attached NVMe disks delivering ~2,000+ MB/s reads. Use them.
NVMe Setup: Approaches by Environment
Option 1: Karpenter instanceStorePolicy (recommended for EKS)
Karpenter's EC2NodeClass supports instanceStorePolicy: RAID0, which auto-formats all NVMe instance store devices as RAID0 and mounts them to /mnt/k8s-disks/0. Kubelet uses this as ephemeral storage, so emptyDir volumes are automatically NVMe-backed. No userData script, no privileged pods, and no manual device detection.
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: gpu-nodeclass
spec:
  amiSelectorTerms:
    - alias: al2023@latest
  instanceStorePolicy: RAID0
  # ... other fields (role, subnets, security groups, etc.)
This is the cleanest approach for Karpenter: one line in the EC2NodeClass spec. The NVMe is ready before any pod schedules, and using emptyDir instead of hostPath means no hardcoded mount paths and no Pod Security Standards concerns.
Option 2: userData/cloud-init
For environments without Karpenter (self-managed ASGs, other provisioners), format and mount NVMe during node boot via userData:
# EC2NodeClass userData - formats NVMe at boot
userData: |
  MIME-Version: 1.0
  Content-Type: multipart/mixed; boundary="BOUNDARY"

  --BOUNDARY
  Content-Type: text/x-shellscript

  #!/bin/bash
  # Format and mount instance store NVMe (skip boot volume nvme0).
  # Assumes the root EBS volume enumerates as nvme0 and no other EBS volumes are attached:
  # extra EBS volumes also show up as /dev/nvme*n1 and would be wiped by this script.
  DEVICES=$(ls /dev/nvme*n1 2>/dev/null | grep -v nvme0)
  if [ -n "$DEVICES" ]; then
    DEVICE_COUNT=$(echo "$DEVICES" | wc -l)
    if [ "$DEVICE_COUNT" -gt 1 ]; then
      yum install -y mdadm || apt-get install -y mdadm
      echo "y" | mdadm --create /dev/md0 --level=0 \
        --raid-devices=$DEVICE_COUNT $DEVICES
      mkfs.ext4 -F /dev/md0
      mkdir -p /mnt/fast-disks
      mount /dev/md0 /mnt/fast-disks
    else
      mkfs.ext4 -F $DEVICES
      mkdir -p /mnt/fast-disks
      mount $DEVICES /mnt/fast-disks
    fi
    chmod 777 /mnt/fast-disks
  fi

  --BOUNDARY
  Content-Type: application/node.eks.aws

  # ... rest of NodeConfig (containerd overrides etc.)
  --BOUNDARY--
Note: AL2023 uses dnf rather than yum, so adjust the install line for your AMI. This approach still avoids privileged DaemonSets and Pod Security Standards violations.
Option 3: Cloud-managed NVMe provisioning
Azure Container Storage auto-provisions NVMe RAID0 StoragePools on ND-series VMs (similar to Karpenter's instanceStorePolicy). GKE local SSDs can be configured via node pool settings.
Option 4: Privileged DaemonSet (development/testing only)
A DaemonSet with privileged: true and hostPID: true can format NVMe drives post-boot. However, this is typically blocked in production by Pod Security Standards (Restricted/Baseline), OPA/Gatekeeper/Kyverno policies, and compliance requirements (SOC2, PCI-DSS). Only use this in development clusters where security policies are relaxed.
InitContainer: PVC → NVMe Sync
Once NVMe is available (via any method above), use an initContainer to copy the pre-downloaded model from the shared PVC to NVMe-backed storage:
initContainers:
  - name: model-sync
    image: alpine:latest
    command: ["sh", "-c"]
    args:
      - |
        echo "Syncing model from shared PVC to NVMe-backed storage..."
        mkdir -p /nvme/models  # destination parent must exist, or cp flattens the model into /nvme/models
        cp -r /shared/models/Llama-3-70B-Instruct /nvme/models/
        echo "Sync complete."
    volumeMounts:
      - name: shared-cache
        mountPath: /shared
        readOnly: true
      - name: nvme-storage
        mountPath: /nvme
With instanceStorePolicy: RAID0, mount the NVMe-backed volume as emptyDir: {} (kubelet places it on the NVMe mount automatically). With the userData approach, use hostPath: { path: /mnt/fast-disks }. The emptyDir approach is preferred because it avoids hardcoded paths, works with Pod Security Standards, and Kubernetes manages the lifecycle.
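Each pod pays the PVC → NVMe copy cost. The ~60–90s figure for 130GB assumes the shared filesystem sustains roughly 1.5–2 GB/s; at the ~130 MB/s quoted earlier, the same copy takes about 17 minutes, so measure your own filesystem. With emptyDir, each pod copies independently (emptyDir is per-pod), but copying once to NVMe is still preferable to reading directly from EFS during inference.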
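When the destination survives across pods (the hostPath variant), the copy can be made idempotent so that only the first pod on a node pays for it. The following is a minimal Python sketch of that idea, not the initContainer used in this post; the marker file name is hypothetical. With emptyDir the destination always starts empty, so the skip branch never triggers.

```python
import shutil
import tempfile
from pathlib import Path

MARKER = ".sync-complete"  # hypothetical marker file name

def sync_model(src: Path, dst: Path) -> bool:
    """Copy src to dst unless a previous copy finished. Returns True if it copied."""
    if (dst / MARKER).exists():
        return False              # an earlier pod already completed the copy
    if dst.exists():
        shutil.rmtree(dst)        # discard a partial copy from an interrupted run
    dst.parent.mkdir(parents=True, exist_ok=True)
    shutil.copytree(src, dst)
    (dst / MARKER).touch()        # written last, so it only exists after a full copy
    return True

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        src = Path(root) / "shared" / "model"
        src.mkdir(parents=True)
        (src / "weights.safetensors").write_bytes(b"\0" * 16)
        dst = Path(root) / "nvme" / "models" / "model"
        print(sync_model(src, dst))  # True: first call copies
        print(sync_model(src, dst))  # False: second call skips
```

Writing the marker last matters: a pod killed mid-copy leaves no marker, so the next pod starts over instead of serving truncated weights.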
Cloud-Specific Model Storage
GKE's Hyperdisk ML is notable: a single pre-populated volume mounted read-only across 2,500 pods eliminates all multi-node download redundancy.
NVIDIA ModelExpress / NIXL (Experimental)
Note: I have not tested this personally. The following is from NVIDIA documentation and community reports. Including it because it represents the theoretical fastest path for weight distribution in multi-node GPU clusters.
For environments with RDMA/InfiniBand interconnects (multi-node A100/H100 clusters), NVIDIA ModelExpress enables P2P weight distribution:
- New worker communicates with metadata server (Redis sidecar)
- Locates active GPU worker running the same model
- Streams tensors directly from active worker's GPU memory via RDMA
- Zero storage dependency for scale-out
# UCX transport configuration for NIXL
UCX_TLS=rc_x,rc,dc_x,dc,cuda_copy
UCX_RNDV_SCHEME=get_zcopy
MODEL_EXPRESS_NO_SHARED_STORAGE=1 # gRPC fallback when shared storage unavailable
This is the fastest possible weight distribution - active GPU memory to new GPU memory over InfiniBand. Relevant for large TP deployments on H100 clusters with EFA/InfiniBand interconnects.
Phase 4: fastsafetensors and Weight Loading Optimization
Standard Loading Path
The CPU RAM acts as a bounce buffer - data passes through without computation, purely as a transfer intermediary. For 130GB of weights on A100 with 80GB VRAM, this means multiple sequential PCIe transfers with CPU orchestration. Multi-minute load times.
fastsafetensors: Faster Weight Loading
fastsafetensors provides two loading paths:
1. GDS path: On GDS-optimized distributed filesystems (Lustre, WekaFS), it uses NVIDIA GPUDirect Storage to DMA directly from storage to GPU VRAM, bypassing CPU entirely. Performance: 4.8x to 7.5x speedup over standard loading.
2. POSIX I/O path (nogds): On local NVMe/ext4 or when GDS drivers are unavailable, it falls back to an optimized POSIX I/O path. This is still significantly faster than standard loading when reading from NVMe (~2,000+ MB/s) vs shared filesystems like EFS (~130 MB/s).
The key insight: the biggest performance gain comes from NVMe vs shared filesystem, not from GDS bypassing the CPU. Moving model weights from EFS to local NVMe is a ~15x bandwidth improvement regardless of whether GDS is active.
Enable in vLLM:
args:
  - "--load-format"
  - "fastsafetensors"
env:
  - name: USE_FASTSAFETENSOR
    value: "true"
GDS: When It Matters and When It Doesn't
GDS (GPUDirect Storage) enables direct DMA from storage to GPU VRAM, bypassing CPU bounce buffers. But GDS is a filesystem-level optimization, not just a driver. It requires a GDS-optimized filesystem to function:
GDS-optimized filesystems: Lustre (FSx for Lustre), WekaFS, GPFS. These support the cuFile API natively. On these, fastsafetensors delivers 4.8–7.5x speedup via true DMA.
Local NVMe/ext4: Not GDS-optimized. Even with nvidia_fs.ko loaded, GDS runs in compatibility mode (CPU bounce buffer, no faster than POSIX I/O) or doesn't engage at all. fastsafetensors detects this and falls back to its nogds path.
How to check GDS status on your node:
# Check if nvidia-gds package is installed
dpkg -l | grep nvidia-gds # Debian/Ubuntu
rpm -qa | grep nvidia-gds # RHEL/AL2023
# Check if GDS kernel module is loaded
lsmod | grep nvidia_fs
# From inside a container, check vLLM's detection
# Look for this in logs:
# "GDS not enabled, setting nogds=True" ← GDS NOT available
# No such message = GDS is active
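If you would rather run the kernel-module check from a Python health script than from a shell, the same information is available in /proc/modules. This is a small sketch under that assumption; the helper name is illustrative. It only tells you the module is loaded, not that the filesystem is GDS-capable.

```python
from pathlib import Path

def module_loaded(proc_modules_text: str, name: str = "nvidia_fs") -> bool:
    """True if `name` appears as a module name in /proc/modules-style text."""
    return any(line.split()[0] == name
               for line in proc_modules_text.splitlines() if line.strip())

if __name__ == "__main__":
    modules = Path("/proc/modules")
    if modules.exists():  # Linux only
        print("nvidia_fs loaded:", module_loaded(modules.read_text()))
```

The match is on the first whitespace-separated field, so a module such as `nvidia` does not produce a false positive for `nvidia_fs`.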
GDS Compatibility by Instance:
GDS support depends on the driver stack (nvidia-fs kernel module, MOFED/OFED drivers) and critically, the storage filesystem. Any modern NVIDIA datacenter GPU can technically do GDS if the correct drivers are installed AND the filesystem supports the cuFile API.
Issue I hit: On g6 instances (L4, testing a smaller model), vLLM logged GDS not enabled, setting nogds=True. The AL2023 GPU AMI does not include nvidia-gds. But even installing GDS would not have helped here: the model was on local NVMe/ext4, which is not a GDS-optimized filesystem. The real fix was moving model weights from EFS to NVMe, not installing GDS drivers.
On datacenter instances (p4d/p5/ND-series/A3) the MOFED + nvidia-fs stack is pre-installed. fastsafetensors delivers its full 4.8–7.5x GDS speedup only when reading from a GDS-optimized distributed filesystem (e.g., FSx for Lustre). When reading from local NVMe/ext4 on these same instances, the GDS bypass does not engage; the speedup comes from NVMe bandwidth, not GDS.
NUMA Binding for Multi-Socket A100/H100 Servers
On p4d (8x A100) and p5 (8x H100), the machine has multiple NUMA domains. Without binding, Python workers may drift across CPU sockets, causing cross-NUMA memory access during Tensor Parallel sharding.
The standard approach is to wrap vLLM with numactl. Separately from NUMA placement, multi-GPU setups should set the worker start method to spawn:
env:
  - name: VLLM_WORKER_MULTIPROC_METHOD
    value: "spawn"  # Required for multi-GPU - default 'fork' causes issues with CUDA contexts
For explicit NUMA pinning, wrap the entrypoint with numactl:
command: ["numactl", "--cpunodebind=0", "--membind=0", "python3", "-m", "vllm.entrypoints.openai.api_server"]
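As written, this binds the whole vLLM process tree (all TP workers) to the CPU cores and memory of NUMA node 0. That is the right locality only for GPUs attached to that socket; with TP spanning both sockets, check the GPU-to-NUMA mapping with nvidia-smi topo -m and bind per worker if your launcher allows it. It also assumes numactl is installed in the container image. NUMA binding primarily affects steady-state throughput rather than startup latency, but prevents throughput degradation under TP configurations.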
Note: Some vLLM versions expose a --enable-numa flag. Verify availability in your target version - the NUMA interface has changed across releases. The VLLM_WORKER_MULTIPROC_METHOD=spawn env var is the stable requirement for multi-GPU setups.
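To find out which NUMA node a given GPU sits on, Linux exposes a numa_node file per PCI device under sysfs. The sketch below reads it and builds the matching numactl prefix. The helper names and the PCI address are hypothetical, and the demo runs against a fake sysfs tree so it works on any machine.

```python
import tempfile
from pathlib import Path

def gpu_numa_node(pci_addr: str, sysfs_root: str = "/sys/bus/pci/devices") -> int:
    """Read the NUMA node of a PCI device; the kernel reports -1 when it is unknown."""
    return int((Path(sysfs_root) / pci_addr / "numa_node").read_text().strip())

def numactl_prefix(node: int) -> list[str]:
    """Build the numactl arguments used above; no binding if the node is unknown."""
    if node < 0:
        return []
    return ["numactl", f"--cpunodebind={node}", f"--membind={node}"]

# Demo against a fake sysfs tree (hypothetical PCI address).
with tempfile.TemporaryDirectory() as fake_sysfs:
    dev = Path(fake_sysfs) / "0000:10:1c.0"
    dev.mkdir()
    (dev / "numa_node").write_text("1\n")
    node = gpu_numa_node("0000:10:1c.0", sysfs_root=fake_sysfs)
    print(numactl_prefix(node))  # ['numactl', '--cpunodebind=1', '--membind=1']
```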
Phase 5: torch.compile Cache Persistence
The Problem
After weights are loaded, vLLM uses torch.compile to trace CUDA graphs for the decode loop. TorchInductor generates low-level device kernels optimized for the specific GPU architecture.
Default FULL_AND_PIECEWISE mode captures:
- Monolithic decode graphs for uniform sequence lengths
- Segmented piecewise graphs for variable prefill dimensions
This compilation is strictly serial, CPU-bound. Every new pod pays this penalty independently. Compilation time scales with model size and sequence length range: a 7B model on a single GPU compiles in ~20–30s, while a 70B model across 8x A100 with large context ranges can take 2–5 minutes.
The Fix: Shared Compile Cache on RWX Storage
Redirect the compile cache to shared storage. First pod compiles, all subsequent pods reuse.
env:
  - name: VLLM_CACHE_ROOT
    value: "/shared/compile_cache"  # Points to RWX PVC (EFS/Azure Files/Filestore)
Cache structure:
/shared/compile_cache/
  torch_compile_cache/
    <hash>/                      # model config + PyTorch version + GPU compute capability
      rank_0_0/
        backbone/
          transformed_code.py    # Compiled Python code
          computation_graph.py   # Graph structure
        inductor_cache/          # Final compiled kernels
The cached artifacts are environment-specific - they're safe to reuse only across pods sharing the same GPU architecture, CUDA, PyTorch, and vLLM version. vLLM derives this by hashing a long list of config and PyTorch factors, and the underlying Inductor kernels are compiled to architecture-specific cubins - so the binding to a specific GPU is real even though it isn't a single tidy "compute capability" field. First pod writes the cache; subsequent pods in the same environment hit and load directly. This is vendor-confirmed: vLLM's docs state the compiled artifacts can be reused across machines with the same environment, and explicitly recommend generating the cache once and sharing it among instances for autoscaling.
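The idea of an environment-derived cache key can be illustrated with a short sketch. This is not vLLM's actual hashing code: the factor names, values, and key length below are assumptions chosen only to show why two environments that differ in one factor never share a cache directory.

```python
import hashlib
import json

def cache_key(factors: dict[str, str], length: int = 10) -> str:
    """Hash a set of environment factors into a short, order-independent key."""
    canonical = json.dumps(factors, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()[:length]

# Hypothetical factor names and values, for illustration only.
a100 = {"model": "Llama-3-70B-Instruct", "torch": "2.x", "vllm": "0.x", "gpu_arch": "sm_80"}
h100 = {**a100, "gpu_arch": "sm_90"}

print(cache_key(a100) == cache_key(dict(reversed(list(a100.items())))))  # True
print(cache_key(a100) == cache_key(h100))                                # False
```

Any change to a factor produces a different key, which is why a version upgrade shows up as a cache miss rather than as a stale hit.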
Result from implementation (7B model, single L4 GPU, max_model_len=12000):
Log evidence of a successful hit:
That's a 60% reduction on a 7B model from a single environment variable. For larger models (70B on 8x A100), fresh compilation can take 2–5 minutes - the cache hit savings scale proportionally, typically reducing to 10–20s of cache loading.
Issue: GPU Architecture Specificity
The compile cache hash includes GPU compute capability:
A100 (compute 8.0) → cache hash: 8d22bdd77e
H100 (compute 9.0) → cache hash: f04cb94f7b
An A100 cache CANNOT be reused on an H100 node. If your NodePool allows mixed GPU types, pods on different architectures face full recompilation despite cache existing.
Solution: Restrict NodePool to a single GPU family:
requirements:
  - key: node.kubernetes.io/instance-type
    operator: In
    values: ["p4d.24xlarge"]  # All nodes = A100, same compute capability
This is why Phase 1 matters - consistent GPU architecture enables compile cache sharing cluster-wide.
Compilation Mode Reference
Complete Optimized Deployment
Putting it all together - vLLM on A100 with all optimizations:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-optimized
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-optimized
  template:
    metadata:
      labels:
        app: vllm-optimized
    spec:
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      initContainers:
        - name: model-sync
          image: alpine:latest
          command: ["sh", "-c"]
          args:
            - |
              echo "Copying model from shared PVC to NVMe-backed emptyDir..."
              mkdir -p /nvme/models  # destination parent must exist before cp
              cp -r /shared/models/Llama-3-70B-Instruct /nvme/models/
          volumeMounts:
            - name: shared-cache
              mountPath: /shared
              readOnly: true
            - name: nvme-storage
              mountPath: /nvme
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          # numactl must be present in the image; this binds all workers to NUMA node 0
          command: ["numactl", "--cpunodebind=0", "--membind=0",
                    "python3", "-m", "vllm.entrypoints.openai.api_server"]
          args:
            - "--model"
            - "/nvme/models/Llama-3-70B-Instruct"
            - "--tensor-parallel-size"
            - "8"
            - "--load-format"
            - "fastsafetensors"
            - "--gpu-memory-utilization"
            - "0.90"
            - "--dtype"
            - "bfloat16"
            - "--host"
            - "0.0.0.0"
            - "--port"
            - "8000"
          env:
            - name: USE_FASTSAFETENSOR
              value: "true"
            - name: VLLM_WORKER_MULTIPROC_METHOD
              value: "spawn"
            - name: VLLM_CACHE_ROOT
              value: "/shared/compile_cache"
          resources:
            requests:
              nvidia.com/gpu: "8"
              cpu: "88"  # p4d.24xlarge has 96 vCPUs; requesting all 96 leaves no room for kubelet and DaemonSets
              memory: "1024Gi"
            limits:
              nvidia.com/gpu: "8"
          volumeMounts:
            - name: nvme-storage
              mountPath: /nvme
              readOnly: true
            - name: shared-cache
              mountPath: /shared
          startupProbe:
            httpGet:
              path: /health
              port: 8000
            failureThreshold: 120
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            failureThreshold: 3
            periodSeconds: 5
      volumes:
        - name: nvme-storage
          emptyDir: {}  # NVMe-backed via instanceStorePolicy: RAID0
        - name: shared-cache
          persistentVolumeClaim:
            claimName: model-cache-pvc
Note: The nvme-storage volume uses emptyDir: {} which is automatically NVMe-backed when the EC2NodeClass has instanceStorePolicy: RAID0. For non-Karpenter environments using userData-formatted NVMe, replace with hostPath: { path: /mnt/fast-disks }.
Cloud-Native Reference Architectures
AWS EKS
Karpenter (EC2NodeClass) → p4d.24xlarge (8x A100)
→ Bottlerocket + EBS Snapshot (zero-second image pull)
→ EFS PVC (compile cache, RWX)
→ NVMe hostPath (model weights, PCIe speed)
→ vLLM + fastsafetensors + VLLM_CACHE_ROOT
Azure AKS
NAP (AKSNodeClass) → ND A100 v4
→ ACR Artifact Streaming (~5s image availability)
→ Azure Container Storage (NVMe RAID0, auto-provisioned)
→ Azure Files PVC (compile cache, RWX)
→ vLLM + fastsafetensors + VLLM_CACHE_ROOT
Google GKE
ComputeClasses (NAP) → A2/A3 (A100/H100)
→ GKE Image Streaming (transparent remote mount)
→ Hyperdisk ML (model weights, ReadOnlyMany, 2500 pods)
→ GCS/Filestore PVC (compile cache, RWX)
→ vLLM + fastsafetensors + VLLM_CACHE_ROOT
Comparison
Expected Timings (Theoretical)
With all optimizations on A100 infrastructure (p4d.24xlarge), scaling from N to N+1 pods where N >= 1:
Notes on these numbers:
- Node provisioning 90–120s includes: EC2/VM API call (~5s) + instance allocation (~15–30s) + OS boot (~15s) + NVIDIA driver initialization (~20–30s) + kubelet registration + node ready (~10–15s). GPU driver init is the hidden cost most people underestimate - it adds 20–30s that CPU instances don't pay.
- Spegel P2P 30–60s: A 15–20GB image over VPC internal networking (p4d has 100 Gbps baseline). Spegel uses HTTP-based transfer over TCP, not RDMA - real throughput is well below wire speed. DHT lookups and multi-peer coordination add overhead. Pin to v0.7.1 and tune mirrorResolveTimeout to 5s for reliable hit rates.
- Cloud-native streaming 5–15s: EBS Snapshot = data pre-attached at boot (near-instant). ACR/GKE streaming = remote filesystem mount with on-demand paging (container starts before full download).
- PVC → NVMe sync 60–90s: 130GB at EFS elastic throughput (~1.5 GB/s burst). Shared filesystem read speed is the bottleneck here. On Azure with premium NVMe StoragePool, or with S3/Blob direct download, this can be faster.
- Weight loading 15–25s: 130GB / 8 GPUs = ~16GB per GPU. PCIe Gen4 x16 = 32 GB/s per GPU theoretical. With safetensors metadata parsing overhead, effective throughput is ~40–60% of theoretical. On local NVMe/ext4, fastsafetensors uses POSIX I/O (not GDS); GDS only engages on distributed filesystems like Lustre.
- Compile cache load 10–20s for 70B: Loading pre-compiled graphs from shared storage for all TP ranks. Larger models have more compilation units to deserialize.
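Adding up the per-phase ranges from the notes above gives a rough end-to-end window. This sketch assumes the phases run strictly one after another, as described throughout this post, and simply sums the best and worst cases.

```python
# Per-phase ranges in seconds, taken from the notes above.
PHASES = {
    "node provisioning": (90, 120),
    "image pull": (30, 60),          # Spegel P2P
    "PVC -> NVMe sync": (60, 90),
    "weight loading": (15, 25),
    "compile cache load": (10, 20),
}

def total_range(phases: dict[str, tuple[int, int]]) -> tuple[int, int]:
    """Sum best-case and worst-case durations of strictly sequential phases."""
    return (sum(lo for lo, _ in phases.values()),
            sum(hi for _, hi in phases.values()))

print(total_range(PHASES))     # (205, 315)

# Same chain with cloud-native image streaming (5-15s) instead of Spegel.
streaming = {**PHASES, "image pull": (5, 15)}
print(total_range(streaming))  # (180, 270)
```

Node provisioning is the largest single term in both variants, which is why keeping a node already running is the only way under the floor.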
Lessons Learned
Node Provisioning
- Restrict NodePool to a single GPU instance type - this enables compile cache sharing across nodes
- Karpenter's security groups on provisioned nodes may differ from managed node groups - cross-SG ingress rules are needed for pod networking
- Set consolidateAfter appropriately - too aggressive and you lose nodes that could serve the next scale event
Image Pull (Spegel / P2P)
- containerd 2.1 on AL2023/Ubuntu 24.04 defaults use_local_image_pull=false - Spegel is silently bypassed
- discard_unpacked_layers=true breaks P2P layer serving - must be explicitly overridden
- Registry-specific docker.io/hosts.toml overrides Spegel's _default/hosts.toml - do not create registry-specific host files in userData
- Spegel requires at least one existing node with the cached image - useless for true scale-from-zero, design around scale-from-N
- Default mirrorResolveTimeout of 20ms is too aggressive: Kademlia DHT lookups exceeding this fall back to upstream. Tune to 5s with 5 retries for better hit rates
- Large base layers (PyTorch/CUDA) may not have distribution source labels in the image manifest, causing DHT lookup failures - these fall back to registry pulls
Model Weight Loading
- EFS/NFS throughput (~130 MB/s) is slower than local NVMe (~2000 MB/s) - use NVMe for weight reads, shared PVC for pre-download and cache sharing. Throughput may vary based on file system used.
- With hostPath-backed NVMe and a copy step that skips when the model is already present, the initContainer PVC → NVMe copy is a one-time cost per node. With emptyDir, or with the plain cp shown above, every pod repeats it
- For 130GB+ models, PVC → NVMe copy takes 60–90s - this is unavoidable on first boot
- Use Karpenter's instanceStorePolicy: RAID0 for NVMe provisioning (preferred) or userData/cloud-init as a fallback. Avoid privileged DaemonSets.
fastsafetensors / GDS
- The primary speedup is NVMe vs shared filesystem (~15x bandwidth), not GDS bypassing the CPU
- nogds=True in vLLM logs is expected and not a problem on local NVMe/ext4 - fastsafetensors' POSIX I/O path is still fast
- GDS only provides its 4.8–7.5x speedup on GDS-optimized distributed filesystems (Lustre, WekaFS, GPFS), not on local ext4/XFS
- Even with nvidia-gds installed and nvidia_fs loaded, GDS runs in compatibility mode on ext4 (CPU bounce buffer, same as POSIX)
- Datacenter instances (p4d/p5/ND-series/A3) ship with MOFED + nvidia-fs pre-installed, but GDS only engages when paired with a GDS-optimized filesystem
- For local NVMe setups, fastsafetensors is still worth using - its optimized POSIX I/O path is efficient on NVMe
torch.compile Cache
- Highest ROI optimization - one env var (VLLM_CACHE_ROOT), trivial to implement, 60%+ compile time reduction
- Cache is GPU-architecture-specific - A100 cache (8.0) != H100 cache (9.0), never mix in same NodePool
- First pod after a PyTorch version upgrade or model config change will recompile - cache invalidation is hash-based
- The shared PVC must be ReadWriteMany - first pod writes, subsequent pods read
General
- Solve every phase - optimizing one just exposes the next bottleneck
- Cloud-native solutions (image streaming, Hyperdisk ML) give the biggest wins with least operational complexity
- The physical limit is PCIe bandwidth + VM boot time - everything else is software-solvable
- Scale from N >= 1, not from zero - this enables Spegel P2P, warm NVMe caches, and avoids the worst-case cold start
Repository
The infrastructure code, Kubernetes manifests, and deployment configurations referenced in this post are available in the companion repository:
GitHub Repository - GPU Autoscaling Accelerator
This post synthesizes research from multiple design iterations and hands-on implementation on EKS with Karpenter, Spegel, fastsafetensors, and persistent torch.compile caching.













