daniel jeong

Posted on • Originally published at manoit.co.kr

Complete Guide to llm-d CNCF Sandbox — Kubernetes-Native Distributed LLM Inference Framework

At KubeCon Europe 2026 in Amsterdam, IBM Research, Red Hat, and Google Cloud jointly donated llm-d to the CNCF as a Sandbox project. Backed by founding partners including NVIDIA, CoreWeave, AMD, Cisco, Hugging Face, Intel, Lambda, and Mistral AI, llm-d is a distributed inference framework designed to run large language model (LLM) inference at production scale on Kubernetes.

If you've served models with vLLM or managed inference endpoints with KServe, you've likely felt the gap: vLLM is powerful but hits scaling walls as a single Pod, while KServe provides high-level abstractions but lacks inference-aware routing. llm-d fills exactly this gap as a middleware layer, delivering Disaggregated Serving, hierarchical KV Cache offloading, and prefix-cache-aware routing — all Kubernetes-native.

The Three Bottlenecks llm-d Solves

Running LLM inference in production consistently hits three core bottlenecks:

| Bottleneck | Problem | llm-d Solution |
| --- | --- | --- |
| Resource Imbalance | Prefill (prompt processing) is GPU compute-intensive; Decode (token generation) is memory bandwidth-intensive. Running both in the same Pod caps GPU utilization at 40–60% | Disaggregated Serving: separate Prefill/Decode into independent Pod Pools with independent scaling |
| KV Cache Waste | Repeated computation of identical system prompts; cache stored only in expensive GPU HBM; cache hit rates plummet in multi-tenant environments | Hierarchical KV Cache Offloading: GPU HBM → CPU DRAM → NVMe tiering + Prefix Caching |
| Routing Inefficiency | Standard Kubernetes Service uses round-robin/random routing, ignoring cache state, model loading status, and GPU topology | Endpoint Picker (EPP): prefix-cache-aware routing maximizes cache hit rates |

Architecture Deep Dive

The core design philosophy of llm-d is "middleware between the inference engine (vLLM) and the orchestration layer (KServe)". It leverages vLLM's high-performance inference kernels while adding distributed scaling and intelligent routing as Kubernetes-native capabilities.

[Figure: llm-d architecture for Kubernetes-native distributed LLM inference]

1. Gateway API Inference Extension (GAIE) & Endpoint Picker

llm-d implements the Kubernetes Gateway API Inference Extension (GAIE). Instead of default round-robin Service routing, the Endpoint Picker (EPP) computes a prefix hash for each request's prompt and routes it to the Pod that already has that prefix cached. In multi-tenant SaaS environments sharing the same system prompt, this maximizes KV Cache hit rates.

# InferencePool CRD — llm-d routing configuration
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: llm-pool
  namespace: llm-serving
spec:
  targetPortNumber: 8000
  selector:
    app: vllm-llm-d
  extensionRef:              # in v1alpha2 this field sits directly under spec
    name: llm-d-epp
    group: ""
    kind: Service
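To make the routing idea concrete, here is a minimal Python sketch of prefix-aware endpoint selection. It is not the EPP's actual implementation: the block size, the chained SHA-256 hashing, the load tie-break, and every name in it (`prefix_block_hashes`, `pick_endpoint`, `pod_caches`, `pod_load`) are assumptions made for illustration.

```python
import hashlib

BLOCK_SIZE = 16  # tokens per cache block (assumed for illustration)


def prefix_block_hashes(token_ids, block_size=BLOCK_SIZE):
    """Chain-hash full blocks so each hash identifies the whole prefix up to that block."""
    hashes = []
    parent = b""
    full = len(token_ids) - len(token_ids) % block_size
    for start in range(0, full, block_size):
        block = token_ids[start:start + block_size]
        digest = hashlib.sha256(parent + repr(block).encode()).digest()
        hashes.append(digest)
        parent = digest
    return hashes


def pick_endpoint(token_ids, pod_caches, pod_load):
    """Return the pod with the longest cached prefix; break ties by lowest load."""
    request_hashes = prefix_block_hashes(token_ids)

    def matched_blocks(pod):
        count = 0
        for block_hash in request_hashes:
            if block_hash not in pod_caches[pod]:
                break
            count += 1
        return count

    return max(pod_caches, key=lambda pod: (matched_blocks(pod), -pod_load[pod]))


system_prompt = list(range(64))            # 64 shared "system prompt" tokens
user_a = system_prompt + [1001, 1002]
user_b = system_prompt + [2001, 2002, 2003]

pod_caches = {
    "pod-0": set(),
    "pod-1": set(prefix_block_hashes(user_a)),  # pod-1 already served user A
}
pod_load = {"pod-0": 0, "pod-1": 3}

chosen = pick_endpoint(user_b, pod_caches, pod_load)
print(chosen)  # pod-1: it holds the shared system-prompt blocks despite higher load
```

Chaining each block hash to its parent means a match on block N implies a match on every earlier block, so counting consecutive hits gives the length of the reusable prefix.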

2. Disaggregated Serving: Prefill/Decode Separation

LLM inference consists of two distinct phases. Prefill processes the entire input prompt at once to build the KV Cache — GPU compute-intensive. Decode reads the KV Cache to generate tokens one by one — memory bandwidth-bound. These phases have completely different hardware requirements, so running them in the same Pod wastes resources.
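A rough roofline estimate shows why the two phases land on opposite sides of the hardware. The sketch below is a back-of-the-envelope calculation, not a benchmark: the parameter count, the H100 peak figures, and the simplification that only weight reads matter (attention and KV Cache traffic are ignored) are all assumptions.

```python
# Rough roofline estimate: is a phase compute-bound or memory-bound?
# All hardware and model numbers below are approximate assumptions.

PARAMS = 32e9             # ~32B parameters
BYTES_PER_PARAM = 2       # bf16 weights
PEAK_FLOPS = 989e12       # approx. H100 SXM dense bf16 peak, FLOP/s
PEAK_BANDWIDTH = 3.35e12  # approx. H100 SXM HBM3 bandwidth, bytes/s

# FLOPs per byte at which the GPU moves from memory-bound to compute-bound
ridge_point = PEAK_FLOPS / PEAK_BANDWIDTH


def arithmetic_intensity(tokens_per_pass):
    """FLOPs per byte of weights read in one forward pass.

    Each token costs about 2 FLOPs per parameter, while the weights are
    read from HBM once per pass regardless of how many tokens share it.
    """
    flops = 2 * PARAMS * tokens_per_pass
    bytes_read = PARAMS * BYTES_PER_PARAM
    return flops / bytes_read


def classify(tokens_per_pass):
    """Compare a pass's arithmetic intensity against the ridge point."""
    if arithmetic_intensity(tokens_per_pass) > ridge_point:
        return "compute-bound"
    return "memory-bound"


prefill = classify(4096)  # one 4096-token prompt processed in a single pass
decode = classify(32)     # 32 sequences, each generating one token per pass

print(f"ridge point ~{ridge_point:.0f} FLOPs/byte")
print("prefill:", prefill)
print("decode:", decode)
```

Under these assumptions the intensity equals the number of tokens per pass, so a long prompt sits far above the ridge point while a decode step needs a batch of several hundred sequences before compute becomes the limit.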

llm-d separates these phases into independent Pod Pools. The Prefill Pool runs on nodes with high GPU compute performance, while the Decode Pool runs on nodes with wide memory bandwidth — each auto-scaling independently.

# Prefill Pool — GPU compute-optimized nodes
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-prefill
  namespace: llm-serving
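spec:
  replicas: 4
  selector:                  # required for apps/v1 Deployments
    matchLabels:
      app: vllm-llm-d
      llm-d/role: prefill
  template:
    metadata:
      labels:
        app: vllm-llm-d      # matches the InferencePool selector
        llm-d/role: prefill
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:v0.8.5   # Qwen recommends vLLM 0.8.5 or later for Qwen3
        args:
        - --model=Qwen/Qwen3-32B
        - --tensor-parallel-size=2
        - --enable-prefix-caching
        # No shell parses these args, so quote the whole item in YAML rather than only the JSON
        - '--kv-transfer-config={"kv_connector":"PyNcclConnector","kv_role":"kv_producer"}'
        resources:
          limits:
            nvidia.com/gpu: "2"
      nodeSelector:
        llm-d/role: prefill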
---
# Decode Pool — memory bandwidth-optimized nodes
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-decode
  namespace: llm-serving
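spec:
  replicas: 8
  selector:                  # required for apps/v1 Deployments
    matchLabels:
      app: vllm-llm-d
      llm-d/role: decode
  template:
    metadata:
      labels:
        app: vllm-llm-d      # matches the InferencePool selector
        llm-d/role: decode
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:v0.8.5   # Qwen recommends vLLM 0.8.5 or later for Qwen3
        args:
        - --model=Qwen/Qwen3-32B
        - --tensor-parallel-size=2
        - --enable-prefix-caching
        # No shell parses these args, so quote the whole item in YAML rather than only the JSON
        - '--kv-transfer-config={"kv_connector":"PyNcclConnector","kv_role":"kv_consumer"}'
        resources:
          limits:
            nvidia.com/gpu: "2"
      nodeSelector:
        llm-d/role: decode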

3. Hierarchical KV Cache Offloading

In LLM inference, the KV Cache lives in GPU HBM — the most expensive and capacity-limited memory. For models with long context windows (128K+ tokens), the KV Cache can consume most of the GPU memory.
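A quick calculation shows how fast the KV Cache grows with context length. The sketch below uses the standard size formula; the layer count, KV head count, and head dimension are assumed values for a Qwen3-32B-like model with grouped-query attention, so check them against the model's config before relying on the result.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    """KV Cache size for one sequence: a key and a value vector per layer, per KV head, per token."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


# Assumed shape: 64 layers, 8 KV heads (GQA), head_dim 128, bf16 values
per_token = kv_cache_bytes(1, layers=64, kv_heads=8, head_dim=128)
full_context = kv_cache_bytes(128 * 1024, layers=64, kv_heads=8, head_dim=128)

print(per_token // 1024, "KiB per token")                   # 256 KiB per token
print(full_context / 1024**3, "GiB for one 128K sequence")  # 32.0 GiB for one 128K sequence
```

With these assumed dimensions, a single 128K-token sequence takes about 32 GiB, which is a large share of an 80 GB H100 before model weights are counted.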

llm-d's Hierarchical KV Cache Offloading uses a 3-tier memory hierarchy:

| Cache Tier | Storage | Access Latency | Use Case |
| --- | --- | --- | --- |
| L1 Hot | GPU HBM (H100: 80GB) | < 1μs | Currently active inference sessions |
| L2 Warm | CPU DRAM (512GB–2TB) | 10–50μs | Recently used cache, Prefix Cache |
| L3 Cold | NVMe SSD (multi-TB) | 100–500μs | Cold cache, long context history |

Combined with Prefix Caching, this dramatically reduces redundant computation in multi-tenant environments sharing system prompts.
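The tiering behaviour can be pictured as a chain of LRU caches where eviction from one tier demotes an entry to the next. The sketch below is a toy model only: the class name `TieredKVCache`, the entry-count capacities, and the promote-on-hit policy are assumptions for illustration and do not describe llm-d's actual offloading logic.

```python
from collections import OrderedDict


class TieredKVCache:
    """Toy three-tier LRU cache: entries evicted from one tier are demoted to the next."""

    def __init__(self, capacities):
        # capacities: mapping of tier name -> max entries, fastest tier first
        self.tiers = [(name, cap, OrderedDict()) for name, cap in capacities.items()]

    def _insert(self, level, key, value):
        if level >= len(self.tiers):
            return  # fell off the slowest tier: the entry is dropped
        _, cap, store = self.tiers[level]
        store[key] = value
        store.move_to_end(key)
        if len(store) > cap:
            old_key, old_value = store.popitem(last=False)  # least recently used
            self._insert(level + 1, old_key, old_value)

    def put(self, key, value):
        """Store an entry in the fastest tier, replacing any older copy."""
        for _, _, store in self.tiers:
            store.pop(key, None)
        self._insert(0, key, value)

    def get(self, key):
        """Return (value, tier_name) and promote the entry to the fastest tier, or None on a miss."""
        for name, _, store in self.tiers:
            if key in store:
                value = store.pop(key)
                self._insert(0, key, value)
                return value, name
        return None


cache = TieredKVCache({"gpu_hbm": 2, "cpu_dram": 2, "nvme": 2})
for prefix in ["a", "b", "c", "d", "e"]:
    cache.put(prefix, f"kv-blocks-for-{prefix}")

first = cache.get("a")   # oldest entry has been demoted all the way to NVMe
second = cache.get("a")  # the first hit promoted it back to GPU HBM
print(first[1], second[1])  # nvme gpu_hbm
```

A hit in a slower tier costs more latency than a hit in HBM but still avoids recomputing the prefill for that prefix.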

4. LeaderWorkerSet (LWS) and Multi-Node Expert Parallelism

For Mixture of Experts (MoE) models or models with hundreds of billions of parameters that don't fit in a single node's GPU memory, llm-d uses Kubernetes LeaderWorkerSet (LWS) primitives to orchestrate tensor parallelism and Expert Parallelism across multiple nodes.
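As a simple picture of Expert Parallelism, the sketch below splits a model's experts into contiguous groups, one per GPU rank. The function names and the contiguous-block policy are assumptions for illustration; the placement strategy llm-d and vLLM actually use is not described here.

```python
def place_experts(num_experts, num_ranks):
    """Split experts into contiguous, near-equal groups, one group per GPU rank."""
    base, extra = divmod(num_experts, num_ranks)
    placement = {}
    start = 0
    for rank in range(num_ranks):
        size = base + (1 if rank < extra else 0)
        placement[rank] = list(range(start, start + size))
        start += size
    return placement


def rank_for_expert(expert_id, placement):
    """Look up which rank hosts a given expert, as a router would when dispatching a token."""
    for rank, experts in placement.items():
        if expert_id in experts:
            return rank
    raise KeyError(expert_id)


# Example: 128 experts over 16 ranks (for instance 2 nodes with 8 GPUs each)
placement = place_experts(num_experts=128, num_ranks=16)
print(len(placement[0]), rank_for_expert(9, placement))  # 8 1
```

Each rank then holds only its own experts' weights, and tokens are sent to whichever rank hosts the expert the router selected.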

Performance Benchmarks: v0.5 Results with Qwen3-32B

Official benchmarks from llm-d v0.5 testing Qwen3-32B with 8 vLLM Pods on 16 NVIDIA H100 GPUs:

| Metric | Baseline K8s Service | llm-d v0.5 | Improvement |
| --- | --- | --- | --- |
| TTFT (Time to First Token) | P99 hundreds of ms | Near-Zero Latency | Significant |
| Throughput | Baseline | ~120,000 tokens/sec | Linear scaling |
| GPU Utilization | 40–60% | 80%+ | ~2x |
| KV Cache Hit Rate | Low (random routing) | EPP Prefix-Aware | Major improvement |

The Prefix Caching + EPP combination particularly shines in multi-tenant SaaS scenarios. When thousands of concurrent users share the same system prompt, requests are routed to Pods that already hold the prefix cache, so most of the prefill work is skipped and TTFT drops to near zero.

v0.5 Key Features Summary

| Feature | Description | Use Case |
| --- | --- | --- |
| Hierarchical KV Offloading | GPU → CPU → NVMe 3-tier cache | 128K+ long context, multi-session |
| Cache-Aware LoRA Routing | Routes requests to Pods with the correct LoRA adapter | Per-customer fine-tuned model serving |
| Resilient Networking (UCCL) | UCCL-based high-speed GPU interconnect | Multi-node tensor parallelism |
| Scale-to-Zero Autoscaling | Scales Pod count to 0 when no traffic | Cost optimization (nights/weekends) |
| Wide Expert Parallelism | Distributes MoE Experts across multiple nodes | Mixtral, DeepSeek MoE models |
| Open Benchmarking | Standardized, reproducible benchmark framework | Hardware/config comparison |

Hardware Agnosticism

A core design principle of llm-d is vendor neutrality. It supports NVIDIA H100/A100, AMD MI300X, Intel Gaudi, and Google TPU v5. You can even run heterogeneous accelerator configurations, for example a Prefill Pool on NVIDIA H100 for compute and a Decode Pool on AMD MI300X for memory bandwidth, managed declaratively with Kubernetes nodeSelector and Dynamic Resource Allocation (DRA).

Quick Start Guide

# 1. Prerequisites
kubectl version --client   # v1.30+ recommended
helm version               # v3.12+
nvidia-smi                 # GPU driver check

# 2. Install Gateway API CRDs
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.2.1/standard-install.yaml

# 3. Install Gateway API Inference Extension
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/releases/download/v0.3.0/manifests.yaml

# 4. Deploy llm-d via Helm
helm repo add llm-d https://llm-d.github.io/llm-d-deployer
helm repo update

helm install llm-d llm-d/llm-d \
  --namespace llm-serving \
  --create-namespace \
  --set model.name=Qwen/Qwen3-32B \
  --set prefill.replicas=2 \
  --set decode.replicas=4 \
  --set gpu.type=nvidia-h100 \
  --set autoscaling.enabled=true \
  --set autoscaling.scaleToZero=true

# 5. Verify deployment
kubectl get pods -n llm-serving -w
kubectl get inferencepool -n llm-serving
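Once the Pods are ready, a small client can serve as a smoke test. The sketch below assumes the Gateway forwards to vLLM's OpenAI-compatible `/v1/completions` endpoint and that it is reachable at `http://localhost:8080`, for example through a port-forward; that address and the helper names are hypothetical.

```python
import json
import urllib.request

GATEWAY_URL = "http://localhost:8080"  # hypothetical: e.g. a port-forward to your Gateway


def build_completion_request(prompt, model="Qwen/Qwen3-32B", max_tokens=32):
    """Build a POST request for the OpenAI-compatible /v1/completions endpoint."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        f"{GATEWAY_URL}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt):
    """Send the request and return the generated text of the first choice."""
    request = build_completion_request(prompt)
    with urllib.request.urlopen(request, timeout=60) as response:
        body = json.load(response)
    return body["choices"][0]["text"]
```

Calling `complete("Kubernetes is")` against a running deployment should return a short continuation; a connection error usually means the Gateway address or port-forward is not set up yet.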

llm-d vs Existing Solutions

| Feature | vLLM Standalone | KServe + vLLM | llm-d + vLLM |
| --- | --- | --- | --- |
| Disaggregated Serving | Not supported | Not supported | Prefill/Decode independent Pools |
| KV Cache Tiering | GPU HBM only | GPU HBM only | GPU → CPU → NVMe |
| Routing | Single Pod | Round-robin | Prefix-Cache-Aware EPP |
| Multi-Node Parallel | Manual setup | Limited | LWS + NCCL/UCCL native |
| LoRA Routing | Single Pod only | Not supported | Cache-Aware LoRA routing |
| Scale-to-Zero | Not supported | Requires Knative | Native support |
| Hardware | NVIDIA-centric | NVIDIA-centric | NVIDIA, AMD, Intel, TPU |

Practical Considerations

Be aware this is a CNCF Sandbox project. Sandbox is CNCF's early-stage designation, meaning production stability is not yet fully validated. Features are evolving rapidly, and breaking API changes may occur. Test thoroughly in staging before production deployment.

Coexistence with KServe. llm-d complements rather than replaces KServe. KServe handles the model lifecycle (deployment, rollout, canary), while llm-d handles inference-specific routing and cache optimization, so the two stack as layers of one architecture.

Monitoring and benchmarking. Use llm-d's Open Benchmarking framework to quantitatively compare TTFT, TPOT, throughput, and KV Cache utilization before and after adoption.
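If you want to sanity-check these metrics yourself, they can be derived from per-token arrival timestamps of a streamed response. The sketch below is independent of llm-d's benchmarking framework; the function name and the exact definitions (TPOT as the mean gap between output tokens after the first) are assumptions that follow common usage.

```python
def latency_metrics(request_sent, token_times):
    """Compute TTFT, TPOT, and output token rate from timestamps (seconds) of one streamed response."""
    if not token_times:
        raise ValueError("no tokens received")
    count = len(token_times)
    ttft = token_times[0] - request_sent
    # TPOT: mean gap between consecutive output tokens after the first one
    tpot = (token_times[-1] - token_times[0]) / (count - 1) if count > 1 else 0.0
    total = token_times[-1] - request_sent
    return {"ttft_s": ttft, "tpot_s": tpot, "tokens_per_s": count / total}


# Example: request sent at t=0, five tokens arriving 50 ms apart after a 250 ms wait
metrics = latency_metrics(request_sent=0.0, token_times=[0.25, 0.30, 0.35, 0.40, 0.45])
print({name: round(value, 3) for name, value in metrics.items()})
```

Collecting these per request before and after adoption, and comparing percentiles rather than averages, gives a fairer picture of what prefix-aware routing changes.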

Conclusion

llm-d joining the CNCF Sandbox marks a significant milestone: Kubernetes is evolving into the de facto standard for AI inference infrastructure. When IBM, Red Hat, and Google donate their framework to CNCF, and NVIDIA, AMD, and Intel all join as partners, the industry consensus is clear.

If you're already using vLLM or KServe, llm-d is a natural extension of your stack — boosting GPU utilization with Disaggregated Serving, reducing TTFT with prefix-cache-aware routing, and optimizing memory costs with Hierarchical KV Cache. Given its Sandbox status, we recommend validating quantitative gains through Open Benchmarking in staging before production rollout.


This article was written with AI assistance (Claude) and reviewed by the ManoIT editorial team. Technical facts were cross-verified against official documentation.

