daniel jeong

Posted on Apr 1 • Originally published at manoit.co.kr

Complete Guide to llm-d CNCF Sandbox — Kubernetes-Native Distributed LLM Inference

#kubernetes #ai #devops #cloudnative

At KubeCon Europe 2026 in Amsterdam, IBM Research, Red Hat, and Google Cloud jointly donated llm-d to the CNCF as a Sandbox project. Backed by founding partners including NVIDIA, CoreWeave, AMD, Cisco, Hugging Face, Intel, Lambda, and Mistral AI, llm-d is a distributed inference framework designed to run large language model (LLM) inference at production scale on Kubernetes.

If you've served models with vLLM or managed inference endpoints with KServe, you've likely felt the gap: vLLM is powerful but hits scaling walls as a single Pod, while KServe provides high-level abstractions but lacks inference-aware routing. llm-d fills exactly this gap as a middleware layer, delivering Disaggregated Serving, hierarchical KV Cache offloading, and prefix-cache-aware routing — all Kubernetes-native.

The Three Bottlenecks llm-d Solves

Running LLM inference in production consistently hits three core bottlenecks:

Bottleneck	Problem	llm-d Solution
Resource Imbalance	Prefill (prompt processing) is GPU compute-intensive; Decode (token generation) is memory bandwidth-intensive — running both in the same Pod caps GPU utilization at 40–60%	Disaggregated Serving — Separate Prefill/Decode into independent Pod Pools with independent scaling
KV Cache Waste	Repeated computation of identical system prompts; cache stored only in expensive GPU HBM; cache hit rates plummet in multi-tenant environments	Hierarchical KV Cache Offloading — GPU HBM → CPU DRAM → NVMe tiering + Prefix Caching
Routing Inefficiency	Standard Kubernetes Service uses round-robin/random routing — ignoring cache state, model loading status, and GPU topology	Endpoint Picker (EPP) — Prefix-cache-aware routing maximizes cache hit rates

Architecture Deep Dive

The core design philosophy of llm-d is "middleware between the inference engine (vLLM) and the orchestration layer (KServe)". It leverages vLLM's high-performance inference kernels while adding distributed scaling and intelligent routing as Kubernetes-native capabilities.

1. Gateway API Inference Extension (GAIE) & Endpoint Picker

llm-d implements the Kubernetes Gateway API Inference Extension (GAIE). Instead of default round-robin Service routing, the Endpoint Picker (EPP) computes a prefix hash for each request's prompt and routes it to the Pod that already has that prefix cached. In multi-tenant SaaS environments sharing the same system prompt, this maximizes KV Cache hit rates.

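# InferencePool CRD: llm-d routing configuration
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: llm-pool
  namespace: llm-serving
spec:
  targetPortNumber: 8000   # port the vLLM containers listen on
  selector:
    app: vllm-llm-d        # must match the labels on the model server Pods
  # In v1alpha2, extensionRef sits directly under spec (no wrapper key).
  extensionRef:
    name: llm-d-epp        # Service that fronts the Endpoint Picker
    group: ""
    kind: Service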
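
The prefix-aware selection described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the EPP's actual scoring code: the block size, the chained SHA-256 hashing, the tie-break on load, and the names `prefix_block_hashes` and `pick_endpoint` are all assumptions made for the example.

```python
import hashlib
from typing import Dict, List, Set

BLOCK_SIZE = 16  # tokens per hashed block (assumed value for this sketch)


def prefix_block_hashes(tokens: List[int], block_size: int = BLOCK_SIZE) -> List[str]:
    """Chain-hash full blocks so each hash identifies the whole prefix up to it."""
    hashes: List[str] = []
    parent = ""
    full_len = len(tokens) - len(tokens) % block_size  # ignore the trailing partial block
    for start in range(0, full_len, block_size):
        block = tokens[start:start + block_size]
        payload = parent + ":" + ",".join(map(str, block))
        parent = hashlib.sha256(payload.encode()).hexdigest()
        hashes.append(parent)
    return hashes


def pick_endpoint(tokens: List[int],
                  pod_cache: Dict[str, Set[str]],
                  pod_load: Dict[str, int]) -> str:
    """Pick the pod with the longest cached prefix; break ties by lowest load."""
    hashes = prefix_block_hashes(tokens)

    def matched_blocks(pod: str) -> int:
        count = 0
        for h in hashes:
            if h not in pod_cache[pod]:
                break
            count += 1
        return count

    return max(pod_cache, key=lambda pod: (matched_blocks(pod), -pod_load[pod]))


# Hypothetical token ids: a 64-token shared system prompt plus a short user turn.
system_prompt = list(range(64))
request = system_prompt + [900, 901, 902]
pod_cache = {"pod-a": set(), "pod-b": set(prefix_block_hashes(system_prompt))}
pod_load = {"pod-a": 0, "pod-b": 5}
print(pick_endpoint(request, pod_cache, pod_load))  # pod-b
```

Even though `pod-b` is busier, it wins because it already holds all four prefix blocks. A production scorer would weigh cache affinity against queue depth instead of always preferring the longest match.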

2. Disaggregated Serving: Prefill/Decode Separation

LLM inference consists of two distinct phases. Prefill processes the entire input prompt at once to build the KV Cache — GPU compute-intensive. Decode reads the KV Cache to generate tokens one by one — memory bandwidth-bound. These phases have completely different hardware requirements, so running them in the same Pod wastes resources.
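
A rough back-of-the-envelope model shows why the two phases stress different resources. The sketch below assumes BF16 weights, about 2 FLOPs per parameter per token, and published H100 SXM peak figures. It ignores attention FLOPs, KV Cache reads, and tensor parallelism, so treat the numbers as orders of magnitude rather than predictions.

```python
from typing import Tuple

PARAMS = 32e9          # approximate parameter count of a 32B model
BYTES_PER_PARAM = 2    # BF16 weights (assumption)
PEAK_FLOPS = 989e12    # H100 SXM dense BF16 peak in FLOP/s (vendor figure)
PEAK_BW = 3.35e12      # H100 SXM HBM3 bandwidth in bytes/s (vendor figure)


def phase_times(tokens_per_pass: int) -> Tuple[float, float]:
    """Return (compute_seconds, memory_seconds) for one forward pass."""
    flops = 2 * PARAMS * tokens_per_pass       # ~2 FLOPs per parameter per token
    weight_bytes = PARAMS * BYTES_PER_PARAM    # weights are streamed once per pass
    return flops / PEAK_FLOPS, weight_bytes / PEAK_BW


def bound(tokens_per_pass: int) -> str:
    """Name the resource that dominates the pass under this simplified model."""
    compute_s, memory_s = phase_times(tokens_per_pass)
    return "compute" if compute_s > memory_s else "memory"


print(bound(2048))  # prefill of a 2048-token prompt -> compute
print(bound(1))     # decode of a single token       -> memory
```

Under these assumptions the crossover sits near PEAK_FLOPS / PEAK_BW, about 295 tokens per pass. That is why a long-prompt prefill saturates compute while single-token decode steps mostly wait on memory.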

llm-d separates these phases into independent Pod Pools. The Prefill Pool runs on nodes with high GPU compute performance, while the Decode Pool runs on nodes with wide memory bandwidth — each auto-scaling independently.

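# Prefill Pool: GPU compute-optimized nodes
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-prefill
  namespace: llm-serving
spec:
  replicas: 4
  selector:                  # required by apps/v1; must match the template labels
    matchLabels:
      app: vllm-llm-d
      llm-d/role: prefill
  template:
    metadata:
      labels:
        app: vllm-llm-d      # matched by the InferencePool selector
        llm-d/role: prefill
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:v0.8.5   # Qwen3 needs vLLM 0.8.5 or later
        args:
        - --model=Qwen/Qwen3-32B
        - --tensor-parallel-size=2
        - --enable-prefix-caching
        # No shell runs here, so the JSON is passed as its own argument
        # instead of being shell-quoted inside one string.
        # Other connector fields may be needed depending on the vLLM version.
        - --kv-transfer-config
        - '{"kv_connector":"PyNcclConnector","kv_role":"kv_producer"}'
        resources:
          limits:
            nvidia.com/gpu: "2"
      nodeSelector:
        llm-d/role: prefill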
---
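# Decode Pool: memory bandwidth-optimized nodes
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-decode
  namespace: llm-serving
spec:
  replicas: 8
  selector:                  # required by apps/v1; must match the template labels
    matchLabels:
      app: vllm-llm-d
      llm-d/role: decode
  template:
    metadata:
      labels:
        app: vllm-llm-d      # matched by the InferencePool selector
        llm-d/role: decode
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:v0.8.5   # Qwen3 needs vLLM 0.8.5 or later
        args:
        - --model=Qwen/Qwen3-32B
        - --tensor-parallel-size=2
        - --enable-prefix-caching
        # No shell runs here, so the JSON is passed as its own argument
        # instead of being shell-quoted inside one string.
        # Other connector fields may be needed depending on the vLLM version.
        - --kv-transfer-config
        - '{"kv_connector":"PyNcclConnector","kv_role":"kv_consumer"}'
        resources:
          limits:
            nvidia.com/gpu: "2"
      nodeSelector:
        llm-d/role: decode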

3. Hierarchical KV Cache Offloading

In LLM inference, the KV Cache lives in GPU HBM — the most expensive and capacity-limited memory. For models with long context windows (128K+ tokens), the KV Cache can consume most of the GPU memory.
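
To put a number on that, the KV Cache footprint can be estimated from the model shape. The sketch below assumes a Qwen3-32B-like configuration (64 layers, 8 KV heads with grouped-query attention, head dimension 128) and a 2-byte cache dtype. Check the model's published config before relying on these values.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, dtype_bytes: int = 2) -> int:
    """Bytes of KV Cache one token occupies across all layers."""
    # The factor 2 covers the separate key and value tensors in every layer.
    return 2 * layers * kv_heads * head_dim * dtype_bytes


def kv_cache_gib(context_tokens: int, **model_shape: int) -> float:
    """KV Cache size in GiB for a single sequence of the given length."""
    return context_tokens * kv_bytes_per_token(**model_shape) / 2**30


# Assumed model shape, used only for illustration.
ASSUMED_MODEL = dict(layers=64, kv_heads=8, head_dim=128)

print(kv_bytes_per_token(**ASSUMED_MODEL))        # 262144 bytes (256 KiB) per token
print(kv_cache_gib(128 * 1024, **ASSUMED_MODEL))  # 32.0 GiB for one 128K-token sequence
```

Under these assumptions a single 128K-token sequence needs about 32 GiB of cache on top of the model weights, which is why a handful of long sessions can exhaust HBM.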

llm-d's Hierarchical KV Cache Offloading uses a 3-tier memory hierarchy:

Cache Tier	Storage	Access Latency	Use Case
L1 Hot	GPU HBM (H100: 80GB)	< 1μs	Currently active inference sessions
L2 Warm	CPU DRAM (512GB–2TB)	10–50μs	Recently used cache, Prefix Cache
L3 Cold	NVMe SSD (multi-TB)	100–500μs	Cold cache, long context history

Combined with Prefix Caching, this dramatically reduces redundant computation in multi-tenant environments sharing system prompts.
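
The tiering idea can be modeled as a chain of LRU caches in which an evicted block is demoted one level instead of being discarded. The class below is a toy model under that assumption. The source does not state llm-d's eviction policy, and `TieredKVCache` is a hypothetical name, not part of llm-d or vLLM.

```python
from collections import OrderedDict
from typing import List, Optional, Tuple


class TieredKVCache:
    """Toy three-tier LRU: index 0 = GPU HBM, 1 = CPU DRAM, 2 = NVMe."""

    def __init__(self, capacities: List[int]):
        self.capacities = capacities                      # max blocks per tier
        self.tiers = [OrderedDict() for _ in capacities]

    def _insert(self, level: int, key: str, value: bytes) -> None:
        if level >= len(self.tiers):
            return  # fell off the coldest tier: the block is dropped
        tier = self.tiers[level]
        tier[key] = value
        tier.move_to_end(key)
        while len(tier) > self.capacities[level]:
            old_key, old_value = tier.popitem(last=False)  # least recently used
            self._insert(level + 1, old_key, old_value)    # demote one tier down

    def put(self, key: str, value: bytes) -> None:
        for tier in self.tiers:
            tier.pop(key, None)  # keep each block in at most one tier
        self._insert(0, key, value)

    def get(self, key: str) -> Optional[Tuple[int, bytes]]:
        """Return (tier_found, value) and promote the block to tier 0, or None on a miss."""
        for level, tier in enumerate(self.tiers):
            if key in tier:
                value = tier.pop(key)
                self._insert(0, key, value)
                return level, value
        return None


cache = TieredKVCache([1, 1, 1])   # one block per tier keeps the trace readable
for name in ["a", "b", "c", "d"]:
    cache.put(name, name.encode())
print(cache.get("a"))  # None: "a" was pushed past the NVMe tier
print(cache.get("b"))  # (2, b'b'): found on the coldest tier, then promoted
```

A hit on a lower tier still costs a transfer back to HBM, but that is typically far cheaper than recomputing the prefill for the same tokens.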

4. LeaderWorkerSet (LWS) and Multi-Node Expert Parallelism

For Mixture of Experts (MoE) models or models with hundreds of billions of parameters that don't fit in a single node's GPU memory, llm-d uses Kubernetes LeaderWorkerSet (LWS) primitives to orchestrate tensor parallelism and Expert Parallelism across multiple nodes.
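
As a rough picture of what Expert Parallelism distributes, the sketch below maps experts to ranks in contiguous slices and ranks to nodes. It does not use the LWS API. The function names, the even-split requirement, and the example sizes (256 experts, 4 nodes with 8 GPUs each) are assumptions chosen for illustration.

```python
from typing import Dict, List


def place_experts(num_experts: int, nodes: int, gpus_per_node: int) -> Dict[int, List[int]]:
    """Map each expert-parallel rank to a contiguous slice of expert ids."""
    world_size = nodes * gpus_per_node
    if num_experts % world_size != 0:
        raise ValueError("this sketch assumes experts divide evenly across ranks")
    per_rank = num_experts // world_size
    return {
        rank: list(range(rank * per_rank, (rank + 1) * per_rank))
        for rank in range(world_size)
    }


def node_of(rank: int, gpus_per_node: int) -> int:
    """Node index hosting a rank when ranks are packed node by node."""
    return rank // gpus_per_node


placement = place_experts(num_experts=256, nodes=4, gpus_per_node=8)
print(len(placement))              # 32 ranks
print(placement[12])               # [96, 97, 98, 99, 100, 101, 102, 103]
print(node_of(12, gpus_per_node=8))  # 1
```

Because a token's chosen experts can live on any node, every MoE layer involves cross-node traffic, which is where the multi-node orchestration and fast interconnect matter.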

Performance Benchmarks: v0.5 Results with Qwen3-32B

Official benchmarks from llm-d v0.5 testing Qwen3-32B with 8 vLLM Pods on 16 NVIDIA H100 GPUs:

Metric	Baseline K8s Service	llm-d v0.5	Improvement
TTFT (Time to First Token)	P99 hundreds of ms	Near-Zero Latency	Significant
Throughput	Baseline	~120,000 tokens/sec	Linear scaling
GPU Utilization	40–60%	80%+	~2x
KV Cache Hit Rate	Low (random routing)	EPP Prefix-Aware	Major improvement

The Prefix Caching + EPP combination particularly shines in multi-tenant SaaS scenarios. When thousands of concurrent users share the same system prompt, routing each request to a Pod that already holds the prefix cache lets prefill skip the shared prefix, so TTFT drops to near zero.
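
The effect of a prefix hit on TTFT can be illustrated with simple arithmetic. The per-token prefill cost and the prompt sizes below are made-up illustrative values, not measurements from the llm-d benchmarks, and the model ignores queueing and network time.

```python
def estimated_prefill_ms(prompt_tokens: int, cached_prefix_tokens: int, ms_per_token: float) -> float:
    """Prefill cost scales with the tokens that are not already in the KV Cache."""
    uncached = max(prompt_tokens - cached_prefix_tokens, 0)
    return uncached * ms_per_token


def expected_prefill_ms(hit_rate: float, hit_ms: float, miss_ms: float) -> float:
    """Average prefill time for a given prefix-cache hit rate."""
    return hit_rate * hit_ms + (1.0 - hit_rate) * miss_ms


# Hypothetical workload: 4000-token shared system prompt + 50-token user message.
MS_PER_TOKEN = 0.125  # assumed prefill cost per token
miss = estimated_prefill_ms(4050, 0, MS_PER_TOKEN)     # no cached prefix
hit = estimated_prefill_ms(4050, 4000, MS_PER_TOKEN)   # system prompt already cached
print(miss, hit)  # 506.25 6.25
print(expected_prefill_ms(0.9, hit, miss))
```

With these placeholder numbers, a hit cuts prefill work by roughly 80x, so raising the hit rate through routing matters more than shaving per-token cost.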

v0.5 Key Features Summary

Feature	Description	Use Case
Hierarchical KV Offloading	GPU → CPU → NVMe 3-tier cache	128K+ long context, multi-session
Cache-Aware LoRA Routing	Routes requests to Pods with the correct LoRA adapter	Per-customer fine-tuned model serving
Resilient Networking (UCCL)	UCCL-based transport for high-speed GPU interconnect	Multi-node tensor parallelism
Scale-to-Zero Autoscaling	Scales Pod count to 0 when there is no traffic	Cost optimization (nights/weekends)
Wide Expert Parallelism	Distributes MoE Experts across multiple nodes	Mixtral, DeepSeek MoE models
Open Benchmarking	Standardized, reproducible benchmark framework	Hardware/config comparison

Hardware Agnosticism

A core design principle of llm-d is vendor neutrality. It supports NVIDIA H100/A100, AMD MI300X, Intel Gaudi, and Google TPU v5. You can even run heterogeneous accelerator configurations, for example a Prefill Pool on NVIDIA H100 for compute and a Decode Pool on AMD MI300X for memory bandwidth, managed declaratively with Kubernetes nodeSelector and Dynamic Resource Allocation (DRA).

Quick Start Guide

# 1. Prerequisites
kubectl version --client   # v1.30+ recommended
helm version               # v3.12+
nvidia-smi                 # GPU driver check

# 2. Install Gateway API CRDs
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.2.1/standard-install.yaml

# 3. Install Gateway API Inference Extension
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/releases/download/v0.3.0/manifests.yaml

# 4. Deploy llm-d via Helm
helm repo add llm-d https://llm-d.github.io/llm-d-deployer
helm repo update

helm install llm-d llm-d/llm-d \
  --namespace llm-serving \
  --create-namespace \
  --set model.name=Qwen/Qwen3-32B \
  --set prefill.replicas=2 \
  --set decode.replicas=4 \
  --set gpu.type=nvidia-h100 \
  --set autoscaling.enabled=true \
  --set autoscaling.scaleToZero=true

# 5. Verify deployment
kubectl get pods -n llm-serving -w
kubectl get inferencepool -n llm-serving

llm-d vs Existing Solutions

Feature	vLLM Standalone	KServe + vLLM	llm-d + vLLM
Disaggregated Serving	Not supported	Not supported	Prefill/Decode independent Pools
KV Cache Tiering	GPU HBM only	GPU HBM only	GPU → CPU → NVMe
Routing	Single Pod	Round-robin	Prefix-Cache-Aware EPP
Multi-Node Parallel	Manual setup	Limited	LWS + NCCL/UCCL native
LoRA Routing	Single Pod only	Not supported	Cache-Aware LoRA routing
Scale-to-Zero	Not supported	Requires Knative	Native support
Hardware	NVIDIA-centric	NVIDIA-centric	NVIDIA, AMD, Intel, TPU

Practical Considerations

Be aware this is a CNCF Sandbox project. Sandbox is CNCF's early-stage designation, meaning production stability is not yet fully validated. Features are evolving rapidly, and breaking API changes may occur. Test thoroughly in staging before production deployment.

Coexistence with KServe. llm-d complements rather than replaces KServe. KServe handles model lifecycle (deployment, rollout, canary), while llm-d handles inference-specific routing and cache optimization as a layered architecture.

Monitoring and benchmarking. Use llm-d's Open Benchmarking framework to quantitatively compare TTFT, TPOT, throughput, and KV Cache utilization before and after adoption.
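
For reference, the latency metrics named above can be derived from per-token arrival timestamps. The helper below only illustrates the definitions. It is not the Open Benchmarking framework's API, and `latency_metrics` is a hypothetical name.

```python
from typing import Dict, List


def latency_metrics(request_sent: float, token_times: List[float]) -> Dict[str, float]:
    """Compute TTFT, TPOT, and per-request throughput from timestamps in seconds."""
    if not token_times:
        raise ValueError("need at least one generated token")
    ttft = token_times[0] - request_sent
    decode_span = token_times[-1] - token_times[0]
    # TPOT averages the gaps between consecutive output tokens after the first one.
    tpot = decode_span / (len(token_times) - 1) if len(token_times) > 1 else 0.0
    throughput = len(token_times) / (token_times[-1] - request_sent)
    return {"ttft_s": ttft, "tpot_s": tpot, "tokens_per_s": throughput}


# Five tokens: the first arrives after 0.5 s, then one every 0.25 s.
metrics = latency_metrics(0.0, [0.5, 0.75, 1.0, 1.25, 1.5])
print(metrics["ttft_s"], metrics["tpot_s"])  # 0.5 0.25
```

Comparing these values over the same request mix before and after enabling llm-d keeps the evaluation tied to user-visible latency instead of only aggregate throughput.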

Conclusion

llm-d joining the CNCF Sandbox marks a significant milestone: Kubernetes is evolving into the de facto standard for AI inference infrastructure. When IBM, Red Hat, and Google donate their framework to CNCF, and NVIDIA, AMD, and Intel all join as partners, the industry consensus is clear.

If you're already using vLLM or KServe, llm-d is a natural extension of your stack — boosting GPU utilization with Disaggregated Serving, reducing TTFT with prefix-cache-aware routing, and optimizing memory costs with Hierarchical KV Cache. Given its Sandbox status, we recommend validating quantitative gains through Open Benchmarking in staging before production rollout.


This article was written with AI assistance (Claude) and reviewed by the ManoIT editorial team. Technical facts were cross-verified against official documentation.
