DEV Community

Anil Kurmi

LLM-D Launches: Kubernetes-Native Distributed Inference


It's Tuesday afternoon. An SRE at a mid-sized fintech is staring at a P90 latency dashboard that just flipped from a calm 0.5 seconds to an ugly 8 seconds. Same GPU fleet. Same model. No traffic spike. Every pod shows 40% utilization. The on-call channel is a blizzard of "rolling back?" messages.

The actual bug: customer A's 6,000-token system prompt was sitting warm in HBM. Customer B arrived, the scheduler promoted B's prefix into HBM, and A's cache got evicted down to DRAM. The next time A came back, the router — blind to where A's prefix had actually gone — sent the request to a pod that now had to pull the prefix from a slower tier. P90 went 16× off a cliff while the capacity graph stayed flat.

This is the "cache partition cascade." It's the exact bug the llm-d project, announced this week as a CNCF Sandbox project, is built to eliminate. And it's why, if you understand it, your per-token bill can swing by as much as 10×.

5-Minute Skim

What changed: llm-d — a Kubernetes-native distributed inference stack — landed in the CNCF Sandbox backed by Google Cloud, Red Hat, IBM, NVIDIA, CoreWeave, AMD, Cisco, Hugging Face, Intel, Lambda, and Mistral AI. The v0.5 release validated 3.1k tokens/sec per B200 on decode and 50k tokens/sec on a 16×16 B200 topology.

Default recommendation: If you run self-hosted vLLM at scale and your workloads share long prefixes (support bots, ads ranking, legal Q&A, agents), adopt llm-d. If you do one-shot inference with unique prompts, stay on vanilla vLLM — the disaggregation overhead won't pay for itself.

What breaks: The naive "one-pod-per-replica" vLLM deployment. Cache-hit economics completely dominate; if you aren't measuring prefix-cache hit rate per tenant, you are flying blind. Also breaks: any mental model where "more GPUs = lower latency." llm-d showed a 57× TTFT improvement with the same 16 H100s.

Key trade-off: llm-d gives you 25-70% higher throughput and 10× cheaper cached tokens ($0.30 vs $3.00 per million) — but you inherit a scheduler, a multi-tier KV cache, and a transport layer (NIXL/UCCL) you now have to operate. Managed services like Bedrock hide all of that; you pay for the hiding.

Why did this hit the wires this week?

Two things converged. First, llm-d formally entered the CNCF Sandbox on April 13 with a coalition that spans every major compute supplier — hyperscalers, chip vendors, neocloud operators, model labs. That's unusual. Kubernetes itself didn't launch with that kind of cross-vendor consensus.

Second, the economic pressure became impossible to ignore. Meta published two pieces this week — "Capacity Efficiency at Meta" on April 16 and "KernelEvolve" on April 2 — describing AI agents that claw back hundreds of megawatts of capacity from existing fleets through automated infrastructure optimization. KernelEvolve alone reported a 60% throughput gain on the Andromeda ads model. When Meta's own ML infrastructure team is sending agents to rewrite CUDA kernels, the industry message is clear: inference is now a capacity-efficiency problem, not a model-quality problem.

AMD's MLPerf 6.0 results dropped in the same window — the MI355X posted 1.08-1.2× uplift, and for the first time competitive inference numbers exist outside the NVIDIA stack. A Kubernetes-native, hardware-neutral control plane suddenly has much bigger stakes.

What does llm-d actually do?

Three moves, each non-obvious, each compounding.

Move one: disaggregate prefill from decode. A transformer inference request has two phases. Prefill processes the input prompt in parallel — it's compute-bound and loves fat GPUs. Decode generates tokens one at a time — it's memory-bandwidth-bound and wastes compute. Running them on the same pod means your decode phase starves a prefill-optimized GPU, or your prefill phase bottlenecks a decode-optimized one. llm-d splits them onto separate pools: prefill pods (typically 8) and decode pods (typically 16), connected via a high-speed transport.
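As a toy model of that split, here's a minimal Python sketch. The `Pod` class and `serve_request` helper are illustrative, not llm-d's API; the point is that each phase gets its own pool and the KV state is handed across.

```python
from dataclasses import dataclass

@dataclass
class Pod:
    name: str
    busy_tokens: int = 0  # crude load signal for placement

def pick_least_loaded(pool):
    """Least-loaded placement within a phase-specialized pool."""
    return min(pool, key=lambda p: p.busy_tokens)

def serve_request(prompt_tokens, max_new_tokens, prefill_pool, decode_pool):
    # Phase 1: prefill is compute-bound -> runs on the prefill pool.
    p = pick_least_loaded(prefill_pool)
    p.busy_tokens += prompt_tokens
    # The KV state produced by prefill is handed to a decode pod
    # (over NIXL in llm-d; here it's just a return value).
    kv_state = {"owner": p.name, "tokens": prompt_tokens}
    # Phase 2: decode is memory-bandwidth-bound -> runs on the decode pool.
    d = pick_least_loaded(decode_pool)
    d.busy_tokens += max_new_tokens
    return kv_state["owner"], d.name

# The article's typical topology: 8 prefill pods, 16 decode pods.
prefill = [Pod(f"prefill-{i}") for i in range(8)]
decode = [Pod(f"decode-{i}") for i in range(16)]
owner, decoder = serve_request(6000, 500, prefill, decode)
```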

Move two: multi-tier KV cache. Every token you generate attends over every previous token; caching those attention keys and values — the "KV cache" — avoids recomputing them. For a 6K-token prompt, that cache is hundreds of megabytes per request. llm-d stores it across a hierarchy: HBM (fastest, scarcest) → DRAM (10× cheaper, 5× slower) → NVMe (100× cheaper, 50× slower) → distributed storage. The NIXL protocol moves cache blocks between tiers on demand. Cache hits in HBM cost you $0.30 per million tokens. Misses that fetch from cold storage cost $3.00. Same model, same request — 10× cost delta driven entirely by where the prefix lives.
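To see what the tiering does to your bill, you can compute a blended $/M-token cost from your tier mix. The HBM-hit and cold-miss prices are the article's figures; the DRAM and NVMe numbers are illustrative assumptions for the intermediate tiers.

```python
# Blended cost per million tokens, given where prefixes are served from.
# "hbm" and "miss" prices come from the article; "dram" and "nvme" are
# assumed intermediates for illustration.
TIER_COST_PER_M = {
    "hbm": 0.30,   # warm hit in GPU memory
    "dram": 0.60,  # assumed
    "nvme": 1.50,  # assumed
    "miss": 3.00,  # fetch from cold storage / full recompute
}

def blended_cost(tier_mix):
    """tier_mix: fraction of traffic served from each tier (sums to 1.0)."""
    assert abs(sum(tier_mix.values()) - 1.0) < 1e-9
    return sum(TIER_COST_PER_M[t] * f for t, f in tier_mix.items())

# 70% HBM hits, 20% DRAM hits, 10% outright misses:
cost = blended_cost({"hbm": 0.7, "dram": 0.2, "nvme": 0.0, "miss": 0.1})
```

Even a modest miss fraction dominates the blend, which is why hit rate is the number to watch.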

Move three: scheduler-aware routing via Kubernetes Gateway API. The scheduler doesn't just know which pod is healthy. It knows which pod holds which prefix in which tier. When a request arrives with a known prefix, it routes to the pod that already has the KV cache warm. When no pod does, it routes to minimize transfer cost. The Gateway API integration means this is a first-class Kubernetes concept, not a sidecar hack.
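One common way to key this kind of prefix-aware routing — a sketch of the general technique, not necessarily llm-d's exact scheme — is cumulative block hashing: hash fixed-size token blocks so that two requests sharing a prefix produce identical hash chains the scheduler can look up. The block size of 16 is an assumption.

```python
import hashlib

BLOCK = 16  # tokens per block; illustrative, not llm-d's actual size

def prefix_block_hashes(token_ids):
    """Cumulative hash per full block: shared prefixes -> shared hash chains."""
    hashes, h = [], hashlib.sha256()
    for i in range(0, len(token_ids) - len(token_ids) % BLOCK, BLOCK):
        h.update(str(token_ids[i:i + BLOCK]).encode("utf-8"))
        hashes.append(h.copy().hexdigest())
    return hashes

a = prefix_block_hashes(list(range(48)))         # 3 full blocks
b = prefix_block_hashes(list(range(48)) + [99])  # same prefix, one extra token
shared = sum(1 for x, y in zip(a, b) if x == y)  # blocks the scheduler can reuse
```

The scheduler then maps each block hash to (pod, tier) and routes to the deepest shared match.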

Underneath, llm-d still runs vLLM — PagedAttention, continuous batching, OpenAI-compatible API. It's not a replacement. It's the control plane vLLM always needed.

Five nodes, one story: the gateway sees the request, picks a prefill pod with (or near) the warm cache, hands the KV state to a decode pod over NIXL, and tiers inactive cache to cheaper memory. No node does two jobs.

What do real deployments look like?

Meta Capacity Efficiency (April 16). Meta deployed unified AI agents across its fleet that analyze traces, propose kernel rewrites, and re-partition workloads. The reported recovery: hundreds of megawatts. Not a model improvement — a scheduling and kernel-fusion improvement on existing silicon. This is the same philosophy llm-d exposes to the rest of us: the gains live in the scheduler and the memory hierarchy, not the chip.

Meta KernelEvolve (April 2). A "ranking engineer agent" that optimizes CUDA kernels for the Andromeda ads model. 60% throughput gain. Meta's takeaway: human engineers can't explore the kernel search space fast enough, and the kernels evolve faster than the model does. For llm-d users, the corollary is that you want a control plane that can swap kernels and routing rules without a redeploy. llm-d's Kubernetes-native design lets you do exactly that via CRD updates.

DeepSeek-V3 in production. Running on H200 with vLLM plus Wide-EP (wide expert parallelism), DeepSeek reported 2.2k tokens/sec per H200 and a 40% per-token latency reduction. The Wide-EP trick — spreading MoE experts across many GPUs — only works with a scheduler that understands which expert lives where. That is exactly what llm-d formalizes.
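A toy sketch of the Wide-EP bookkeeping, assuming simple round-robin placement (the real policy is more sophisticated): the point is only that routing a token requires an explicit expert-to-GPU table, which is exactly the kind of state a scheduler must own.

```python
# Toy wide-expert-parallelism placement: spread MoE experts across GPUs,
# then route each token to the GPU hosting its expert. Round-robin
# placement and these function names are illustrative assumptions.

def place_experts(num_experts, num_gpus):
    """Round-robin expert -> GPU placement table."""
    return {e: e % num_gpus for e in range(num_experts)}

def route_token(expert_id, placement):
    """A Wide-EP-aware scheduler must know which expert lives where."""
    return placement[expert_id]

placement = place_experts(num_experts=64, num_gpus=16)
gpu = route_token(expert_id=37, placement=placement)  # 37 % 16 -> GPU 5
```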

AWS disaggregated inference. AWS published a post on April 15 introducing disaggregated inference on EKS powered by llm-d. Same primitives, different cloud. The coalition is real.

The Cache Partition Cascade

Here's the war story in full, because the numbers matter.

An enterprise customer running llm-d v0.4 — before the fix described below — deployed 8 prefill pods and 16 decode pods on 16 H100s. Workload: multi-tenant customer support. Average context: 6K tokens of system prompt plus ~500 tokens of conversation history. Classic cache-hit workload.

Monday, 14:00. Customer A's 6K prefix fills HBM on prefill pod #3. TTFT for A: 540ms. Beautiful.

Monday, 14:12. Customer B arrives. B's prefix is different but similar in size. The scheduler, correctly, promotes B into HBM on pod #3 — B is active, A has gone quiet. A's KV cache is evicted down to DRAM.

Monday, 14:14. A sends a follow-up. Here's the bug: the scheduler routed A's follow-up to pod #3 because the prefix hash still pointed there. But pod #3 no longer had A's cache in HBM — it was two tiers down. The pod had to fetch the KV blocks back over NIXL, rebuild the attention state, and only then start decoding. TTFT for A's follow-up: 8.6 seconds. 16× degradation.

Meanwhile, the GPU utilization graph stayed at a comfortable 40%. The SLO breached. Capacity planning said everything was fine.

The v0.5 fix (shipped April 2026) does three things:

  • Tier-aware cache routing. The scheduler now tracks which tier holds each prefix, not just which pod.
  • Inline cost function. HBM hit beats DRAM hit beats miss-plus-fetch. The scheduler scores candidates on expected latency, not just locality.
  • UCCL-based transport HA. The NIXL fallback path no longer stalls when a peer pod is evicting; it fails over to a replica tier.
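The second item, the inline cost function, can be sketched as follows. The tier latencies and the `route_v04`/`route_v05` names are illustrative assumptions, not llm-d code; the contrast is that the fixed scheduler scores candidates on expected fetch latency instead of blindly following a possibly stale prefix hash.

```python
TIER_FETCH_MS = {"hbm": 0, "dram": 40, "nvme": 400}  # assumed latencies
MISS_MS = 800  # assumed cost of rebuilding the prefix from scratch

def route_v04(prefix_hash, pods):
    """Pre-fix behavior: follow the prefix hash, blind to which tier holds it."""
    for pod in pods:
        if prefix_hash in pod["caches"]:
            return pod["name"]
    return pods[0]["name"]

def route_v05(prefix_hash, pods):
    """Post-fix behavior: score on expected latency, so an HBM hit
    beats a DRAM hit beats a miss-plus-fetch."""
    def expected_ms(pod):
        tier = pod["caches"].get(prefix_hash)  # None = cache miss
        return TIER_FETCH_MS[tier] if tier else MISS_MS
    return min(pods, key=expected_ms)["name"]

# Customer A's prefix was evicted to DRAM on pod-3, while pod-5 holds a
# warm HBM copy after a NIXL transfer:
pods = [
    {"name": "pod-3", "caches": {"custA": "dram"}},
    {"name": "pod-5", "caches": {"custA": "hbm"}},
]
stale = route_v04("custA", pods)  # follows the hash to pod-3
fixed = route_v05("custA", pods)  # picks the warm HBM copy on pod-5
```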

Post-fix, the same workload's P90 dropped to 620ms under identical tenant churn.

Lesson: in disaggregated inference, your scheduler's world-model of the cache is the system. Lie to it — or let it go stale — and no amount of GPU capacity saves you.

How does llm-d compare to Ray Serve, Modal, and Bedrock?

I've seen teams pick each. Here's how the debate actually runs.

llm-d vs Ray Serve. Ray Serve is a general-purpose Python serving framework — it can host anything callable. That generality is the cost. Ray has no native concept of prefill/decode split, no KV-cache tiering, no prefix-aware routing. You can build those on top, and plenty of teams have, but you're building the llm-d feature set by hand. If your workload is LLM-dominated, llm-d starts you 18 months ahead. If you're serving a zoo of ML models — rankers, embeddings, a few LLMs — Ray stays competitive because the LLM isn't the only customer.

llm-d vs Modal. Modal's pitch is per-second billing and zero ops. That's seductive until you realize inference traffic is rarely bursty enough to benefit. Customer support bots, ads serving, legal Q&A — these run a steady baseline 24/7. Modal's economics collapse above 50 concurrent users because you're paying a premium for elasticity you aren't using. Modal remains excellent for experimentation, nightly eval jobs, and genuinely bursty workloads (batch document processing, overnight agents). For steady-state production serving, llm-d on reserved capacity wins on pure $/token.

llm-d vs AWS Bedrock. Bedrock hides everything — no scheduler to tune, no KV cache to partition, no pods to patch. You pay a roughly 2-3× premium over self-hosted llm-d on equivalent hardware. For teams without a dedicated ML infra function, that premium is cheap. For teams burning >$100K/month on inference, llm-d pays back the ops cost in weeks. The split point is roughly where you'd hire a dedicated ML infra engineer anyway.

The honest answer: llm-d wins when (a) you have cache-reusable workloads, (b) you have the operational muscle to run Kubernetes plus a specialized control plane, and (c) your token volume makes the hiring math work. Below that threshold, managed services aren't stupid — they're correct.

When should you adopt, and when should you skip?

Adopt if:

  • Your prefix-cache hit rate (measure it today on vanilla vLLM) is above 30%. Support bots, ads, agents, and RAG systems routinely hit 60-80%.
  • Your average context is over 2K tokens. Cache tiering only earns its keep when the cached state is worth paging.
  • You run at least 8 GPUs in a single inference fleet. Below that, the disaggregation overhead dominates.
  • You already run Kubernetes in production. llm-d assumes you're fluent with CRDs, Gateway API, and pod-level networking.

Skip if:

  • Your workloads are one-shot — every prompt is unique. Cache tiering is dead weight; stick with vLLM's built-in scheduling.
  • You have fewer than 4 GPUs. The orchestration cost exceeds the throughput gain.
  • You don't have an on-call team that understands GPU memory hierarchies. When the cache cascade hits, you need someone who knows what NIXL is.
  • You're on pre-H100 hardware. The cache-tier bandwidth assumptions don't hold.

A middle path: run llm-d as a pilot on one workload — preferably your highest cache-hit workload — for a quarter before committing. v0.5 is stable, but the operational playbook is still being written in public.
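For what it's worth, the adopt/skip thresholds above collapse into a quick go/no-go check. The function and its signature are mine, not part of llm-d; the thresholds mirror the checklist.

```python
def should_pilot_llm_d(cache_hit_rate, avg_context_tokens, gpu_count,
                       runs_kubernetes, pre_h100_hardware):
    """Returns (go, blockers). Thresholds mirror the adopt/skip checklist."""
    blockers = []
    if cache_hit_rate < 0.30:
        blockers.append("prefix-cache hit rate below 30%")
    if avg_context_tokens < 2000:
        blockers.append("average context under 2K tokens")
    if gpu_count < 8:
        blockers.append("fewer than 8 GPUs in the inference fleet")
    if not runs_kubernetes:
        blockers.append("no Kubernetes in production")
    if pre_h100_hardware:
        blockers.append("pre-H100 hardware")
    return len(blockers) == 0, blockers

# A 16xH100 support-bot fleet with a 65% hit rate clears every bar:
go, blockers = should_pilot_llm_d(0.65, 6500, 16, True, False)
```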

Actionable takeaways

  1. Measure prefix-cache hit rate per tenant this week. If you're on vLLM, this is a Prometheus scrape away. It's the single number that predicts your llm-d ROI.
  2. Alert on cache-tier residency, not just GPU utilization. The cache cascade was invisible on GPU graphs. Build a dashboard for HBM/DRAM/NVMe occupancy and eviction rate.
  3. Separate prefill and decode traffic in your load tests. If you test with a single request type, you'll miss the disaggregation economics entirely.
  4. Budget for NVIDIA BlueField-4 (H2 2026). NVIDIA's CMX platform extends the cache hierarchy to 4 tiers with 5× sustained TPS on long-context agentic workloads. If your roadmap includes 100K+ context agents, plan the hardware refresh now.
  5. Pilot llm-d on one high-cache-hit workload this quarter. Don't rip-and-replace. Prove the economics on one tenant, then expand.
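A sketch of takeaway 1, computing per-tenant hit rate from Prometheus-style counters. The metric names below are hypothetical placeholders — check your vLLM version's `/metrics` endpoint for its actual prefix-cache counter names.

```python
import re

# Sample Prometheus text exposition with hypothetical counter names:
SCRAPE = """\
cache_query_total{tenant="a"} 1200
cache_hit_total{tenant="a"} 930
cache_query_total{tenant="b"} 800
cache_hit_total{tenant="b"} 120
"""

def hit_rates(scrape_text):
    """Per-tenant hit rate = hits / queries, parsed from a raw scrape."""
    counters = {}
    for m in re.finditer(r'(\w+)\{tenant="(\w+)"\} (\d+)', scrape_text):
        counters[(m.group(1), m.group(2))] = int(m.group(3))
    tenants = sorted({t for (_, t) in counters})
    return {t: counters[("cache_hit_total", t)] /
               counters[("cache_query_total", t)] for t in tenants}

rates = hit_rates(SCRAPE)  # tenant "a" clears the 30% adoption bar; "b" doesn't
```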

Deep Dive Resources

Sources & Attribution

  • Google Cloud Blog, "Enhancing vLLM for distributed inference with llm-d," April 2026
  • Meta Engineering Blog, "Capacity Efficiency at Meta," April 16, 2026
  • Meta Engineering Blog, "KernelEvolve," April 2, 2026
  • AWS ML Blog, "Introducing disaggregated inference on AWS powered by llm-d," April 2026
  • NVIDIA Developer Blog, "Introducing NVIDIA BlueField-4," April 2026
  • llm-d GitHub repository, v0.5 release notes
  • MLPerf Inference 6.0 results, April 2026
  • DeepSeek-V3 production deployment reports, April 2026
