Delafosse Olivier

Originally published at coreprose.com

Red Hat's llm-d Joins CNCF: Kubernetes-Native LLM Inference at Scale


Red Hat’s contribution of llm-d to the CNCF Sandbox makes Kubernetes a first-class platform for LLM inference, not just a “good enough” runtime.[1]

By treating accelerators, topology, and KV cache as programmable resources, llm-d turns existing Kubernetes clusters into shared AI fabrics instead of isolated inference stacks.[4][7]

💡 Key idea: llm-d makes LLM inference a cloud native workload governed by open standards and CNCF processes, not vendor-specific systems.[1]

1. Why llm-d Matters for Kubernetes and CNCF

llm-d’s CNCF Sandbox status anchors LLM inference in neutral, open governance similar to Kubernetes itself.[1]

  • Ensures APIs, patterns, and scheduling semantics evolve under Linux Foundation stewardship.

  • Reduces lock-in risk versus proprietary inference platforms.

The breadth of the project’s founding and growth underscores its vendor neutrality:

  • Launched in May 2025 by Red Hat, Google Cloud, IBM Research, CoreWeave, and NVIDIA.[1]

  • Expanded to AMD, Cisco, Hugging Face, Intel, Lambda, Mistral AI, and universities.[1][10]

  • Signals alignment on a shared, Kubernetes-native inference approach.

Strategic shift: Designed for “any model, any accelerator, any cloud,” targeting heterogeneous, multi-cloud clusters with GPUs, TPUs, and custom ASICs.[1][3]

llm-d is:

  • A vehicle to evolve Kubernetes into state-of-the-art AI infrastructure.[1]

  • Focused on production serving: performance per dollar, multi-tenancy, and SLOs.[7][9]

  • Aimed at platform/DevOps teams, not just researchers.

💼 Section takeaway: With llm-d in CNCF, Kubernetes becomes the default place to standardize LLM serving, scheduling, and optimization across vendors and clouds.

2. Core Architecture: Distributed Inference Built for Kubernetes

llm-d provides a Kubernetes-native architecture for distributed inference, built on vLLM plus an inference scheduler, cache-aware routing, and disaggregated serving.[2][7] It embeds into Kubernetes rather than replacing it.

Disaggregated prefill and decode

Inference is split into two phases:

  • Prefill: Compute-heavy, builds KV cache for input tokens.

  • Decode: Memory-bandwidth-bound, consumes KV cache to generate tokens.[8]

llm-d can run these on different replicas and accelerator types, so GPUs are used where they matter instead of over-provisioning every pod.[3][8]

```mermaid
flowchart LR
A[Client Request] --> B[Inference Gateway]
B --> C[Prefill Servers]
C --> D[KV Cache Store]
D --> E[Decode Servers]
E --> F[Response Stream]
style C fill:#22c55e,color:#fff
style E fill:#0ea5e9,color:#fff
```

📊 Architecture insight: Disaggregation replaces “one big GPU per pod” with a tunable pipeline per phase, workload, and accelerator.[3][8]
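
As a conceptual sketch of the split (plain Python, not llm-d code; the pool names and round-robin assignment are invented for illustration), each request flows through a compute-optimized prefill pool and a bandwidth-optimized decode pool, with the KV cache handed off between them:

```python
from dataclasses import dataclass

@dataclass
class Request:
    rid: str
    prompt_tokens: int
    max_new_tokens: int

# Hypothetical replica pools; in a real cluster these would be separate
# Kubernetes deployments, possibly on different accelerator types.
PREFILL_POOL = ["prefill-0", "prefill-1"]           # compute-optimized
DECODE_POOL = ["decode-0", "decode-1", "decode-2"]  # bandwidth-optimized

def plan(req: Request, kv_locations: dict) -> list:
    """Assign the two phases of one request to separate replica pools."""
    # Prefill is compute-heavy: it processes all prompt tokens at once
    # and materializes the KV cache.
    prefill_node = PREFILL_POOL[req.prompt_tokens % len(PREFILL_POOL)]
    # Record where the KV cache now lives so decode can fetch it.
    kv_locations[req.rid] = prefill_node
    # Decode is memory-bandwidth-bound: it streams tokens one at a time,
    # reading the cache built during prefill.
    decode_node = DECODE_POOL[req.max_new_tokens % len(DECODE_POOL)]
    return [("prefill", prefill_node), ("decode", decode_node)]
```

The point of the sketch is that prefill and decode capacity can be scaled and placed independently, which is exactly what a single monolithic replica cannot do.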

Integration with Inference Gateway

llm-d integrates with the Kubernetes Inference Gateway (IGW):[2][7]

  • Applications call a stable gateway API.

  • Platform teams optimize routing, placement, and scaling internally.

  • Models, policies, and accelerator layouts can change without touching app code.

Topology-aware scheduling

The scheduler understands:

  • GPU peer-to-peer connectivity

  • NUMA layout and local memory bandwidth

  • Network fabrics and cross-node bandwidth[3][10]

Using this topology, llm-d:

  • Routes requests to meet latency SLOs at lowest cost.

  • Avoids naive balancing by CPU or generic utilization.[3][10]
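
A minimal sketch of that idea (illustrative Python, not the actual llm-d scheduler; latency and cost figures are assumed inputs): among candidate nodes, pick the cheapest one whose predicted latency still meets the SLO, and fall back to the fastest node when none qualifies:

```python
def place(candidates: list, slo_ms: float) -> str:
    """candidates: list of (node, predicted_latency_ms, cost_per_hour)."""
    feasible = [c for c in candidates if c[1] <= slo_ms]
    if not feasible:
        # No node meets the SLO: degrade gracefully to the fastest one.
        return min(candidates, key=lambda c: c[1])[0]
    # Meet the SLO at the lowest cost, rather than balancing on
    # CPU or generic utilization.
    return min(feasible, key=lambda c: c[2])[0]
```

Note how this differs from utilization-based balancing: a cheaper, slower node wins whenever it is fast enough, which is the cost lever topology awareness unlocks.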

Guides and Helm recipes provide “well-lit paths” for deploying llm-d across tens or hundreds of nodes, single- or multi-tenant.[9]

⚠️ Section takeaway: llm-d makes inference architecture a native Kubernetes concern, combining vLLM, IGW, and topology-aware scheduling into a reproducible stack.

3. Performance and Cost Optimizations for Enterprise LLMs

llm-d focuses on levers that determine whether LLMs are economically viable at scale.

KV cache aware routing

KV cache aware routing sends follow-up or similar prompts to cache-warm nodes, avoiding repeated prefill work.[2][7]

  • Especially valuable for multi-step prompts, agents, and RAG.

  • Reduces tail latency and jitter.

```mermaid
flowchart LR
A[New Prompt] --> B{Cache Hit?}
B -- Yes --> C[Route to Warm Node]
B -- No --> D[Route to Any Node]
C --> E[Low Latency Response]
D --> F[Prefill + Decode]
style C fill:#22c55e,color:#fff
style F fill:#f59e0b,color:#fff
```

📊 Practical effect: Users see better latency from cache-warm routing and higher GPU utilization by assigning accelerators to specific pipeline stages instead of cloning full stacks per replica.[2]
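
A toy cache-aware router makes the mechanism concrete (illustrative only, not llm-d's actual algorithm): each node advertises the prompt prefixes whose KV cache it still holds, and the router prefers a cache-warm node over the least-loaded cold one:

```python
def route(prompt: str, cache_index: dict, load: dict) -> str:
    """cache_index: node -> set of cached prompt prefixes.
    load: node -> current queue depth."""
    warm = [node for node, prefixes in cache_index.items()
            if any(prompt.startswith(p) for p in prefixes)]
    if warm:
        # Cache hit: reuse the warm node to skip repeated prefill,
        # breaking ties toward the shortest queue.
        return min(warm, key=lambda n: load[n])
    # Cache miss: fall back to plain least-loaded routing.
    return min(load, key=load.get)
```

Multi-step prompts, agents, and RAG benefit most because their follow-up requests share long prefixes with earlier ones, so the hit rate stays high.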

Disaggregated serving and workload-aware scheduling

Separating prefill and decode lets llm-d:

  • Reduce duplicate model state replication.[2][8]

  • Assign hardware by workload shape (short chat, long-context, large batch).[3][8]

It also improves:

  • Cost per request via fewer fully replicated servers.[2][8]

  • Time-to-first-token (TTFT) with prefill-optimized nodes.

  • Time-per-output-token (TPOT) via stable decode pipelines.[9]
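
TTFT and TPOT are standard serving metrics; the toy formulas below (plain Python, not llm-d instrumentation) show how the two phases contribute to end-to-end latency, and why prefill-optimized nodes help TTFT while stable decode pipelines help TPOT:

```python
def ttft(queue_s: float, prefill_s: float, first_token_s: float) -> float:
    """Time to first token: queueing, prefill, and one decode step."""
    return queue_s + prefill_s + first_token_s

def tpot(decode_s: float, output_tokens: int) -> float:
    """Average time per output token after the first."""
    return decode_s / max(output_tokens - 1, 1)

def end_to_end(queue_s: float, prefill_s: float,
               first_token_s: float, decode_s: float) -> float:
    """Total request latency: TTFT plus the remaining decode time."""
    return queue_s + prefill_s + first_token_s + decode_s
```

Because the terms are additive, shaving prefill time only moves TTFT, while smoothing decode throughput only moves TPOT, which is why the two phases deserve separately tuned hardware.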

llm-d is tuned for:

  • Long-running multi-step prompts

  • Retrieval-augmented generation

  • Agentic workflows[6][7]

These high-value enterprise patterns stress cache management and scheduling.

Vendors like Mistral AI note that next-gen models (e.g., Mixture of Experts) require robust KV cache management and disaggregated serving—exactly llm-d’s focus.[1]

💡 Section takeaway: llm-d exposes cache locality and phase-aware scheduling as explicit controls, turning raw accelerator capacity into better latency and lower cost for real workloads.

4. Multi-Accelerator and Topology-Aware Inference

The same mechanisms also let llm-d treat heterogeneous hardware as one programmable pool. Modern clusters often mix:

  • High-end GPUs for interactive chat

  • Memory-rich accelerators for long-context reasoning

  • Custom ASICs/TPUs for batch or offline jobs[3]

llm-d offers:

  • A unified recipe and scheduler that understands accelerator classes.

  • Hardware selection based on workload pattern, not manual guesswork.[3]

Topology and interconnect awareness

llm-d surfaces interconnect details—from NUMA layouts to network fabrics and GPU peer-to-peer bandwidth—so communication-heavy workloads land where overhead is minimized.[3][10]

Expressed via Kubernetes primitives:

  • Node labels/taints for accelerator type and topology

  • Affinity/anti-affinity and scheduling constraints

  • Standard observability for monitoring hot paths[3][9]

```mermaid
flowchart TB
A[Workload Type] --> B{Chat}
A --> C{Long Context}
A --> D{Batch}
B --> E[Low-latency GPUs]
C --> F[High-memory Nodes]
D --> G[Cost-optimized ASICs]
style E fill:#22c55e,color:#fff
style F fill:#0ea5e9,color:#fff
style G fill:#f59e0b,color:#fff
```

📊 Planning aid: Platform teams get a practical scorecard for mixing accelerators by workload—chat, long-context, batch—rather than guessing hardware purchases and placement.[3]
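
The chat / long-context / batch split can be sketched as a simple selection policy (illustrative Python; the pool names and the 32K-token threshold are invented, not llm-d defaults):

```python
# Hypothetical accelerator pools, keyed by workload pattern.
POOLS = {
    "chat": "low-latency-gpus",
    "long_context": "high-memory-nodes",
    "batch": "cost-optimized-asics",
}

def select_pool(workload: str, context_tokens: int) -> str:
    """Map a workload pattern to an accelerator class."""
    # Long prompts override the nominal class: they need memory
    # headroom for the KV cache more than raw compute.
    if context_tokens > 32_000:
        return POOLS["long_context"]
    # Unknown patterns default to the cheapest batch tier.
    return POOLS.get(workload, POOLS["batch"])
```

In a cluster, the same mapping would be expressed with node labels and affinity rules rather than application code; the point is that the decision becomes declarative instead of manual.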

This multi-accelerator strategy aligns with industry trends: GPU and CPU vendors back llm-d so their hardware participates in a standardized, open inference stack.[1][10]

Section takeaway: llm-d turns heterogeneous hardware and complex topology into declarative scheduling inputs, enabling portable, vendor-neutral AI fabrics.

5. Adoption Path: From First Cluster to Production Platform

llm-d pairs advanced capabilities with a realistic adoption path.

From quickstart to optimized platforms

Official guides and Helm charts provide:

  • Tested, benchmarked recipes for high-performance deployments.[9]

  • Requirements: only basic Kubernetes familiarity.

Targets:

  • Single-model deployments across tens/hundreds of nodes

  • Multi-tenant model-as-a-service platforms sharing deployments[9]

The “well-lit path” includes curated configs for:

  • Intelligent inference scheduling

  • Prefill/decode disaggregation

  • KV cache aware routing[9]

```mermaid
flowchart LR
A[Quickstart Cluster] --> B[Intelligent Scheduling]
B --> C[Prefill/Decode Split]
C --> D[KV Cache Tests]
D --> E[Multi-tenant Platform]
style A fill:#e5e7eb
style E fill:#22c55e,color:#fff
```

Red Hat’s guidance helps teams:

  • Validate KV cache aware routing.

  • Measure latency and cost improvements against their own workloads.[7][8]

Community-driven evolution

Cloud Native FM discussions with Red Hat engineers frame llm-d as:

  • A practical toolset that strengthens Kubernetes for enterprise LLM inference, not a silver bullet.[2]

  • A CNCF Sandbox project inviting contributions from operators, vendors, and researchers.[1][7]

This ensures llm-d tracks rapid shifts in:

  • Model architectures

  • Accelerator types

  • Workload patterns

💼 Section takeaway: With opinionated docs, Helm recipes, and open governance, llm-d offers a low-friction path from first experiment to production-grade, multi-tenant LLM platforms.

Conclusion: Turning Kubernetes into an AI Fabric

By contributing llm-d to CNCF, Red Hat and partners are defining a Kubernetes-native, vendor-neutral standard for distributed LLM inference across accelerators, topologies, and clouds.[1][3][7]

Platform teams can manage GPUs, KV caches, and cluster fabric as programmable resources within the same ecosystem that standardized containers and microservices.

Call to action:
Platform teams should:

  • Pilot llm-d using official guides and Helm recipes.[9]

  • Benchmark KV cache aware routing and disaggregated serving against current stacks.[8]

  • Engage with the CNCF llm-d community to influence features and roadmap as generative AI evolves.[2]

Early adopters will help shape—and benefit from—the next generation of cloud native AI infrastructure.

Sources & References (6)

1. Welcome llm-d to the CNCF: Evolving Kubernetes into SOTA AI infrastructure (CNCF blog, March 24, 2026, by Carlos Costa, IBM Research; Clayton Coleman, Google; and Rob Shaw, Red Hat)

2. How to Run LLMs on Kubernetes with llm-d: A Distributed Inference Stack (Saim Safder, Cloud Native FM on LinkedIn)

3. llm-d: Multi-Accelerator LLM Inference on Kubernetes (Erwan Gallen, Red Hat)

4. Getting started with llm-d for distributed AI inference (Cedric Clyburn and Philip Hayes, August 19, 2025)

5. Guides | llm-d (official llm-d guides: tested, benchmarked recipes and Helm charts for production deployments)

6. Deploying llm-d in Kubernetes: The Future of Distributed AI Inference at Scale