Saim Safdar

Introduction to llm-d: Open-source Kubernetes-native Framework for Distributed LLM Inference | Ep 140 #cloudnativefm

I recently had a great conversation with the Red Hat team about llm-d, a new open-source effort that’s starting to tackle a problem we’re seeing more and more in production ML stacks: inference workloads becoming monolithic, heavy, and hard to scale.

A few highlights from the discussion and why llm-d matters:

Inspiration: llm-d draws a lot of inspiration from projects like vLLM, which optimized inference on everything from laptops to DGX clusters (caching, speculative decoding, distribution).

The problem: Today we often run inference as one big container (model + runtime + observability + config + pipelines). When you scale, you end up copying too much state across nodes, which is inefficient and brittle.

The idea: Treat the model and its runtime as disaggregated, first-class components inside Kubernetes. Break the container into parts (cache, prefill/decode, GPU-bound work, CPU-bound work) and let the platform place and scale each piece independently.
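To make that concrete, here is a minimal Python sketch of the prefill/decode split. This is not llm-d's actual API; the worker classes and the cache handle are hypothetical. The point is that prefill (building the KV cache from the prompt, compute-heavy) and decode (streaming tokens from that cache, memory-bandwidth-bound) can be separate components that the platform places and scales independently.

```python
# Illustrative sketch only: PrefillWorker, DecodeWorker, and KVCacheHandle are
# hypothetical names, not llm-d's API. The idea is that prefill and decode are
# separate services Kubernetes can place and scale independently.
from dataclasses import dataclass


@dataclass
class KVCacheHandle:
    """Reference to a prompt's KV cache, e.g. held in a shared cache tier."""
    prompt_hash: str
    location: str  # which node or cache pool holds the entries


class PrefillWorker:
    """Compute-bound: runs once per prompt to build the KV cache."""

    def __init__(self, node: str):
        self.node = node

    def prefill(self, prompt: str) -> KVCacheHandle:
        prompt_hash = hex(hash(prompt) & 0xFFFFFFFF)
        # ... run the model's forward pass over the prompt here ...
        return KVCacheHandle(prompt_hash=prompt_hash, location=self.node)


class DecodeWorker:
    """Memory-bandwidth-bound: streams tokens using an existing KV cache."""

    def decode(self, handle: KVCacheHandle, max_tokens: int = 4) -> list[str]:
        # ... fetch the cache from handle.location and generate tokens ...
        return [f"token{i}" for i in range(max_tokens)]


if __name__ == "__main__":
    handle = PrefillWorker(node="gpu-node-a").prefill("Explain llm-d in one line.")
    print(DecodeWorker().decode(handle))
```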

Why it’s promising: cache-aware routing and componentized serving let you avoid unnecessary duplication, match workloads to the right resources (GPU vs CPU), and enable smarter scaling across clusters, which can reduce cost and improve responsiveness.
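Here is a minimal sketch of what cache-aware routing means, assuming a hypothetical router that remembers which replica last served each prompt prefix. Real schedulers use richer signals (actual cache occupancy, queue depth, KV-cache size); this only shows the core idea of sending a request to the replica most likely to already hold its KV cache.

```python
# Minimal, hypothetical cache-aware router: route by prompt-prefix hash so
# repeated prefixes (shared system prompts, multi-turn chats) land on a
# replica with a warm KV cache; fall back to least-loaded on a miss.
import hashlib
from collections import defaultdict


class CacheAwareRouter:
    def __init__(self, replicas: list[str]):
        self.replicas = replicas
        self.load = defaultdict(int)            # in-flight requests per replica
        self.prefix_owner: dict[str, str] = {}  # prefix hash -> replica

    @staticmethod
    def _prefix_key(prompt: str, prefix_chars: int = 256) -> str:
        return hashlib.sha256(prompt[:prefix_chars].encode()).hexdigest()

    def route(self, prompt: str) -> str:
        key = self._prefix_key(prompt)
        replica = self.prefix_owner.get(key)
        if replica is None:
            # Cache miss: pick the least-loaded replica and remember the mapping.
            replica = min(self.replicas, key=lambda r: self.load[r])
            self.prefix_owner[key] = replica
        self.load[replica] += 1
        return replica


if __name__ == "__main__":
    router = CacheAwareRouter(["decode-0", "decode-1"])
    print(router.route("You are a helpful assistant. Summarize ..."))
    print(router.route("You are a helpful assistant. Summarize ..."))  # same replica
```

The payoff is that work already done for a shared prefix doesn't get recomputed on a cold replica, which is exactly the duplication the disaggregated design is trying to avoid.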

The opportunity: If you’re building ML infra or platform capabilities, this opens a path to far more efficient inference at scale, especially as model sizes continue to grow.

llm-d is still early, but it’s a practical, infrastructure-first approach to a real industry pain point. I’ll share a clip of our conversation and a short explainer diagram — would love to hear from folks building inference at scale:

Question: How are you scaling inference today? Monolithic instances, autoscaling replicas, or something disaggregated? Drop a comment; I’d love to compare notes.
