Daya Shankar
Cold Starts, Model Loading, and Their Impact on Latency SLAs

Cold start latency breaks SLAs because “pod is Running” isn’t “model is ready.” On Kubernetes with vLLM, cold start includes image pulls, weight downloads, tensor loads into GPU memory, and warm-up work (often CUDA-graph capture). These events are rare but huge, so they dominate p95/p99—especially when you scale from zero.

The on-call version of this problem

Bridge: SLAs die on tails, and cold starts are tail generators.

You deploy a new vLLM revision. HPA scales up. Pods come up fast. Traffic shifts. p50 looks fine. p99 explodes.

Nothing “crashed.” You just routed users onto instances still doing model loading and warm-up. That’s not a bug. That’s physics plus orchestration.

If you run strict SLAs on a GPU fleet, you need to treat cold start like a first-class SLI.

What “cold start” actually contains for vLLM on Kubernetes

Bridge: Break the chain into phases so you can measure and fix the slowest link.

Cold start is not one thing. It’s a pipeline:

DIAGRAM 1 — Cold start timeline (what you must budget)

Scale event
     |
     v
[1] Image pull ---> [2] Container start ---> [3] Model fetch ---> [4] Weight load ---> [5] Warm-up ---> Ready
       |                    |                      |                     |                   |
    Registry           Python init            Network/FS          Disk->RAM->GPU       Graph/caches

The phase that usually dominates: model storage path

Bridge: If weights sit on slow shared storage, everything else is noise.

vLLM calls this out bluntly: loading large models from shared/network filesystems can be slow, and it’s better to store the model on local disk when possible. It also warns that CPU memory pressure can trigger swapping and slow the OS down. 

Translation:

  • If your weights live on a slow network filesystem, you built a cold-start machine.
  • If you swap while loading weights, you built a cold-start machine that also hurts neighbors.
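The swap failure mode is cheap to check before it bites. A minimal pre-flight sketch — the `fits_in_ram` helper is illustrative, not anything vLLM ships; in real use you would read MemAvailable from the node and compare it to your weight size:

```shell
#!/bin/sh
# Pre-flight sketch: will N GiB of model weights fit in available RAM,
# or does loading them risk pushing the node into swap?
# (Illustrative helper, not a vLLM feature. Second arg overrides the
# /proc/meminfo lookup so the check is deterministic in tests/CI.)
fits_in_ram() {
  weights_kib=$(( $1 * 1024 * 1024 ))   # GiB -> KiB
  avail_kib=${2:-$(awk '/MemAvailable/ {print $2}' /proc/meminfo)}
  if [ "$avail_kib" -gt "$weights_kib" ]; then
    echo "ok: ${1}GiB fits in ${avail_kib}KiB available"
  else
    echo "swap-risk: ${1}GiB vs ${avail_kib}KiB available"
  fi
}

fits_in_ram 140 $(( 64 * 1024 * 1024 ))   # 140GiB weights vs 64GiB RAM -> swap-risk
```

Run it in an initContainer or a node preflight job and alert on `swap-risk` before the scheduler places a model that cannot be loaded without swapping.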

Warm-up is real work, not “nice to have”

Bridge: If you don’t pre-warm, the first user request becomes your warm-up job.

vLLM provides tooling specifically to benchmark cold vs warm startup, including model loading and compilation/cache operations.
If vLLM ships a benchmark for startup, that’s your sign: startup cost matters.

Why L40S changes the tuning you should do

Bridge: PCIe-only GPUs expose bad data paths immediately.

On NVIDIA L40S, you’re on PCIe Gen4 x16 (64 GB/s bidirectional). There is no NVLink and no MIG support.

What this means for cold starts:

  • Host↔GPU traffic rides PCIe. Extra copies hurt.
  • You can’t “hide” cold starts by slicing a big GPU into tiny always-warm MIG slices.
  • Your operational levers are boring: caching, warm replicas, and reducing churn.

SLA math: cold starts don’t ruin averages, they ruin p95/p99

Bridge: You can’t hand-wave tails with “but it’s rare.”

Cold starts are low-frequency, high-impact latency events. That’s exactly what percentiles punish.

If you allow scale-to-zero, the first request after an idle period is all but guaranteed a cold start. Knative documents scale-to-zero as a feature and exposes config to enable or disable it.
Knative also documents a scale-down delay specifically to keep containers around for a configurable window and avoid cold-start penalties.

Even if you don’t use Knative, the principle holds:

  • Every time you delete a pod, you re-pay model load.
  • Every time you scale to zero, you guarantee a cold start on the next burst.
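If you do run Knative, the fix is a per-revision annotation. A hedged sketch — service name and image are placeholders; `min-scale` and `scale-down-delay` are the documented Knative autoscaling annotations:

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: vllm-ksvc                  # placeholder name
spec:
  template:
    metadata:
      annotations:
        # Keep at least one replica warm: disables scale-to-zero for this service.
        autoscaling.knative.dev/min-scale: "1"
        # If you must allow zero, at least delay the scale-down.
        autoscaling.knative.dev/scale-down-delay: "15m"
    spec:
      containers:
        - image: your-registry/vllm-serving:TAG
```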

Fix cold start latency by attacking three things

Bridge: You reduce cold starts by moving fewer bytes, repeating less work, and avoiding scale-to-zero surprises.

1) Cache model artifacts on the node (prefer local disk)

Bridge: Put the bytes next to the GPU node or pay for network + FS latency on every churn event.

vLLM recommends local disk when shared filesystems are slow.
So do this:

  • Mirror model artifacts to a controlled location (object store, internal registry, or artifact repo).
  • Cache on node-local SSD/NVMe where possible.
  • Point vLLM/HF cache directories at that local path.

Practical rule for SREs: don’t download weights from the public internet in the hot path. vLLM itself recommends downloading first (via huggingface-cli) and passing the local path to isolate issues. 
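One way to wire that into the pod itself is an initContainer that populates the node-local cache before the server starts. A sketch — the fetcher image, model id, and paths are placeholders, and in production the source should be your own mirror rather than the public hub:

```yaml
# Fragment of a Deployment pod spec (assumes a "model-cache" volume exists).
initContainers:
  - name: fetch-model
    image: your-registry/model-fetcher:TAG   # placeholder: image with huggingface-cli
    command: ["sh", "-c"]
    args:
      - |
        # Idempotent: skip the download when the node cache is already warm.
        if [ ! -f /models/hf/my-model/config.json ]; then
          huggingface-cli download my-org/my-model --local-dir /models/hf/my-model
        fi
    volumeMounts:
      - name: model-cache
        mountPath: /models
```

The serving container then gets `MODEL_PATH=/models/hf/my-model` and never touches the network on the hot path.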

2) Pre-pull images on GPU nodes

Bridge: Image pulls are pure waste during a scale event.

Use a DaemonSet that pins to GPU nodes and pulls your serving image. Keep it dumb.

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: vllm-prepull
  namespace: kube-system
spec:
  selector:
    matchLabels: { app: vllm-prepull }
  template:
    metadata:
      labels: { app: vllm-prepull }
    spec:
      nodeSelector:
        accelerator: nvidia
      tolerations:
        - key: "accelerator"
          operator: "Equal"
          value: "nvidia"
          effect: "NoSchedule"
      containers:
        - name: sleep
          image: your-registry/vllm-serving:TAG
          command: ["sh", "-c", "sleep 365000"]
          resources:
            requests: { cpu: "10m", memory: "32Mi" }
            limits: { cpu: "50m", memory: "64Mi" }

3) Keep a warm floor for SLA-bound services

Bridge: If your SLA can’t tolerate cold starts, don’t scale to zero.

Set:

  • min replicas > 0 (HPA floor or Deployment replicas)
  • a “warm pool” per model
  • separate burst capacity if you need it

Scale-to-zero is a cost tool. It is not an SLA tool. Knative’s own docs bake in knobs to control that behavior for a reason. 
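With plain Kubernetes HPA, the warm floor is just `minReplicas`. A sketch — names, replica counts, and the CPU target are placeholders; swap in a real load signal (queue depth, tokens/s) if you export one:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-serve
  namespace: inference
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-serve
  minReplicas: 2      # warm floor: the service never fully cold-starts
  maxReplicas: 6      # burst capacity
  metrics:
    - type: Resource
      resource:
        name: cpu     # crude proxy; a custom inference metric is better
        target:
          type: Utilization
          averageUtilization: 70
```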

Two diagrams that match how this should be deployed

Bridge: These are the shapes that keep p99 sane without paying for always-on waste.

DIAGRAM 2 — Reference architecture (Kubernetes + vLLM on L40S)

                (traffic)
                    |
                    v
              Ingress / LB
                    |
             +------+------+
             | vLLM Service|  (stable endpoint)
             +------+------+
                    |
     +--------------+------------------+
     | Warm pool (minReplicas > 0)     |
     |  - GPU nodeSelector + taints    |
     |  - readiness gates warm-up      |
     +--------------+------------------+
                    |
     +--------------+------------------+
     | Node-local cache (NVMe/SSD)     |
     |  - model weights cached per node|
     |  - image layers pre-pulled      |
     +--------------+------------------+
                    |
           Object store mirror
         (weights/configs, pinned)

Reference deployment YAML (vLLM on L40S with readiness gating + node-local cache)

Bridge: This is the “copy/paste then edit” block you can review in PRs.

This example does three things:

  • pins onto GPU nodes
  • caches model files on a node-local path
  • gates readiness until a warm-up completes

Important: This assumes you control the container entrypoint and can add a small wrapper script. That’s the cleanest way to tie readiness to “model is hot.”

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-serve
  namespace: inference
spec:
  replicas: 2  # warm floor for SLA
  selector:
    matchLabels:
      app: vllm-serve
  template:
    metadata:
      labels:
        app: vllm-serve
    spec:
      nodeSelector:
        accelerator: nvidia
        gpu: l40s
      tolerations:
        - key: "accelerator"
          operator: "Equal"
          value: "nvidia"
          effect: "NoSchedule"
      containers:
        - name: vllm
          image: your-registry/vllm-serving:TAG
          ports:
            - containerPort: 8000
          resources:
            requests:
              cpu: "4"
              memory: "24Gi"
              nvidia.com/gpu: "1"
            limits:
              cpu: "8"
              memory: "32Gi"
              nvidia.com/gpu: "1"
          env:
            - name: HF_HOME
              value: /models/hf
            - name: MODEL_PATH
              value: /models/hf/my-model  # pre-downloaded or mirrored
          command: ["/bin/bash", "-lc"]
          args:
            - |
              set -euo pipefail
              rm -f /tmp/ready

              # Start vLLM in the background
              vllm serve "${MODEL_PATH}" \
                --host 0.0.0.0 --port 8000 \
                --dtype auto \
                --max-model-len 8192 \
                --tensor-parallel-size 1 \
                2>&1 | tee /var/log/vllm.log &
              VLLM_PID=$!

              # Wait for the server socket, then trigger a warm-up request.
              # Replace the warm-up call with your own internal probe if needed.
              for i in {1..120}; do
                (echo > /dev/tcp/127.0.0.1/8000) >/dev/null 2>&1 && break
                sleep 1
              done

              # Minimal warm-up: hit the health endpoint once. Swap in your
              # own client here — ideally a small real generation request.
              curl -sf http://127.0.0.1:8000/health >/dev/null || true

              # Mark ready only after warm-up.
              touch /tmp/ready

              # Keep foreground
              wait $VLLM_PID
          readinessProbe:
            exec:
              command: ["/bin/sh", "-c", "test -f /tmp/ready"]
            periodSeconds: 2
            timeoutSeconds: 1
            failureThreshold: 30
          startupProbe:
            exec:
              command: ["/bin/sh", "-c", "test -f /var/log/vllm.log"]
            periodSeconds: 2
            timeoutSeconds: 1
            failureThreshold: 300  # allow long first load
          volumeMounts:
            - name: model-cache
              mountPath: /models
      volumes:
        - name: model-cache
          hostPath:
            path: /var/lib/model-cache
            type: DirectoryOrCreate

Notes for senior SREs

  • hostPath is powerful and dangerous. In a managed Kubernetes environment, you may prefer node-local ephemeral SSD mounts that the platform team controls, or a LocalPV setup with strict node affinity.
  • Set replicas to your SLA floor. Use HPA for burst, but don’t let it go to zero if p99 matters.
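If hostPath is off the table, a local PersistentVolume keeps the same “bytes live on this node” property under platform control. A sketch — PV name, capacity, storage class, path, and node name are placeholders; the `nodeAffinity` block is required by the Kubernetes local volume spec:

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: model-cache-gpu-node-a     # one PV per GPU node (placeholder name)
spec:
  capacity:
    storage: 500Gi
  accessModes: ["ReadWriteOnce"]
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-nvme
  local:
    path: /var/lib/model-cache     # node-local NVMe mount
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values: ["gpu-node-a"]   # placeholder node name
```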

Measure it like an SRE: phase timings and startup benchmarks

Bridge: You can’t improve what you can’t attribute.

Use vLLM’s startup benchmark tooling

Bridge: Benchmark cold vs warm startup and block regressions.

vLLM ships a startup benchmark module to measure cold and warm startup times, including model loading and compilation/cache operations. 

Run it against:

  • your container image
  • your model
  • your storage backend
  • your L40S node class

Then fail CI when cold start time regresses.
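The gate itself can stay trivial. A sketch — the function name and budget are illustrative, and the measured number would come from whatever startup benchmark or pod-event timing you run upstream in the pipeline:

```shell
#!/bin/sh
# Hypothetical CI gate: fail the pipeline when measured cold start exceeds
# a budget. The numbers here are hard-coded for illustration; in CI,
# "measured" comes from your startup benchmark run.
check_cold_start_budget() {
  measured_s=$1
  budget_s=$2
  if [ "$measured_s" -gt "$budget_s" ]; then
    echo "FAIL: cold start ${measured_s}s exceeds budget ${budget_s}s"
    return 1
  fi
  echo "OK: cold start ${measured_s}s within budget ${budget_s}s"
}

check_cold_start_budget 95 120
check_cold_start_budget 180 120 || echo "would block the merge"
```

The nonzero return is the whole point: wire it into the pipeline step so a regression fails the build instead of paging on-call after deploy.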

Log phase timestamps

Bridge: Turn “it’s slow” into numbers you can grep.

Log these timestamps per pod:

  • image pulled (node event)
  • process start
  • model fetch complete
  • weights loaded
  • warm-up complete
  • readiness true

Then build histograms:

  • cold_start_seconds{phase="fetch"}
  • cold_start_seconds{phase="load"}
  • cold_start_seconds{phase="warmup"}

This tells you where to spend effort.
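The cheapest place to get those timestamps is the entrypoint wrapper itself. A sketch — the log format and phase names are conventions you pick, not anything vLLM emits; a log pipeline or sidecar then diffs consecutive timestamps into the histograms above:

```shell
#!/bin/sh
# Sketch: emit one structured, greppable log line per cold-start phase
# from the entrypoint wrapper. Downstream tooling diffs consecutive
# timestamps into cold_start_seconds{phase=...}.
log_phase() {
  echo "cold_start_phase=$1 t=$(date +%s)"
}

log_phase process_start
# ... fetch model to the local cache ...
log_phase fetch_complete
# ... vllm loads weights to GPU ...
log_phase weights_loaded
# ... warm-up request ...
log_phase warmup_complete
```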

Managed Kubernetes: what it helps, what it doesn’t

Bridge: Managed Kubernetes runs the plumbing. You still own the SLA path.

Managed Kubernetes can:

  • keep control plane stable
  • manage node lifecycle and autoscaler hygiene
  • standardize storage classes and node pools

It will not:

  • pick your cache strategy
  • keep models warm for your SLA
  • prevent you from scaling to zero and cold-starting on every burst

On AceCloud managed Kubernetes, the clean play is: dedicated GPU node pools for vLLM, pre-pulled images, weights cached on node-local storage, warm floors for SLA services, and warm-up scripted into readiness. Keep your cold path measured and boring.

Checklist for PR reviews

Bridge: This is the short list that prevents “p99 spikes after deploy.”

  • Model artifacts are local or cached. Not pulled from the public internet at runtime. 
  • GPU node pools are pinned (L40S), tainted, and isolated.
  • Image pre-pull exists on GPU nodes.
  • Readiness gates on “model is hot,” not “process started.”
  • Warm floor exists (min replicas > 0) for SLA services. 
  • Cold vs warm startup is benchmarked in CI. 

