LLMKube started as a Kubernetes operator for llama.cpp. You define a Model, define an InferenceService, and the controller handles GPU scheduling, health probes, model downloads, and Prometheus metrics. It works well for GGUF models.
But llama.cpp isn't the only inference engine. vLLM has PagedAttention. TGI has continuous batching. PersonaPlex does real-time voice AI. Triton serves multi-framework models. Locking the operator to one runtime limits what you can deploy.
v0.6.0 changes that with pluggable runtime backends.
## The Problem
Before v0.6.0, the controller's `constructDeployment()` was hardcoded to llama.cpp. Container name, image, command-line args, health probes, model provisioning: everything assumed llama.cpp. If you wanted to deploy vLLM, you had to create a Kubernetes Deployment by hand, outside of LLMKube.
## The Fix
A `RuntimeBackend` interface that each inference engine implements:
```go
type RuntimeBackend interface {
	ContainerName() string
	DefaultImage() string
	DefaultPort() int32
	BuildArgs(isvc, model, modelPath, port) []string
	BuildProbes(port) (startup, liveness, readiness)
	NeedsModelInit() bool
}
```
The controller calls `resolveBackend(isvc)` based on the `runtime` field in the CRD, then delegates all container configuration to the backend. llama.cpp is the default. New runtimes register in a simple switch statement.
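To make that concrete, here's a minimal sketch of what a `resolveBackend` switch can look like. The backend type names, the llama.cpp port, and the simplified interface are illustrative assumptions, not LLMKube's actual source; the vLLM port 8000 matches the probe configuration described later in this post.

```go
package main

import "fmt"

// RuntimeBackend is a simplified stand-in for the operator's interface.
type RuntimeBackend interface {
	ContainerName() string
	DefaultPort() int32
}

// llamaCppBackend and vllmBackend are illustrative stand-ins, not the
// operator's real types. The llama.cpp port here is an assumption.
type llamaCppBackend struct{}

func (llamaCppBackend) ContainerName() string { return "llama-cpp" }
func (llamaCppBackend) DefaultPort() int32    { return 8080 }

type vllmBackend struct{}

func (vllmBackend) ContainerName() string { return "vllm" }
func (vllmBackend) DefaultPort() int32    { return 8000 }

// resolveBackend maps the CRD's runtime field to a backend; llama.cpp is
// the default when the field is empty.
func resolveBackend(runtime string) (RuntimeBackend, error) {
	switch runtime {
	case "", "llamacpp":
		return llamaCppBackend{}, nil
	case "vllm":
		return vllmBackend{}, nil
	default:
		return nil, fmt.Errorf("unknown runtime %q", runtime)
	}
}

func main() {
	b, _ := resolveBackend("vllm")
	fmt.Println(b.ContainerName(), b.DefaultPort())
}
```

Adding a runtime is then a new `case` plus a struct that satisfies the interface; the controller never needs to know engine-specific details.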
## Testing It: PersonaPlex on Kubernetes
To prove the architecture works, I deployed NVIDIA's PersonaPlex on my home lab. PersonaPlex is a 7B speech-to-speech model based on Moshi. It listens and talks at the same time. Sub-300ms latency for interruptions. Completely different from llama.cpp: PyTorch runtime, WebSocket-based health checks, model downloaded via HuggingFace token.
The InferenceService CRD:
```yaml
apiVersion: inference.llmkube.dev/v1alpha1
kind: InferenceService
metadata:
  name: personaplex
  namespace: voice-ai
spec:
  modelRef: personaplex-7b
  runtime: personaplex
  image: registry.defilan.net/personaplex:7b-v1-4bit-cuda13
  personaPlexConfig:
    quantize4Bit: true
    hfTokenSecretRef:
      name: hf-token
      key: HF_TOKEN
  endpoint:
    port: 8998
    type: NodePort
  resources:
    gpu: 1
    memory: "32Gi"
```
`kubectl apply` and it's running. The controller:
- Sets the container command to `python -m moshi.server` (via the PersonaPlex backend's `CommandBuilder`)
- Configures TCP socket probes on port 8998 (PersonaPlex uses WebSockets, not HTTP `/health`)
- Injects `HF_TOKEN` from a Kubernetes Secret and sets the `NO_TORCH_COMPILE` env var
- Skips the model download init container (the model downloads at startup via HF Hub)
- Requests 1 GPU with 32Gi memory
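As a sketch of the first bullet, a per-runtime `CommandBuilder` can be as small as a function that returns the container command. Only `python -m moshi.server` comes from this post; the `--port` flag is an assumption about `moshi.server`'s CLI, and the function name is hypothetical.

```go
package main

import "fmt"

// personaPlexCommand sketches what the PersonaPlex backend's CommandBuilder
// might return. "python -m moshi.server" is from the post; passing the port
// as a flag is an illustrative assumption, not confirmed moshi CLI behavior.
func personaPlexCommand(port int32) []string {
	return []string{"python", "-m", "moshi.server", "--port", fmt.Sprint(port)}
}

func main() {
	fmt.Println(personaPlexCommand(8998))
}
```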
The result: real-time voice conversation running on a single RTX 5060 Ti, managed by the same operator that handles my llama.cpp text inference.
## Built-in vLLM Runtime
vLLM is probably the most requested inference engine in the Kubernetes ecosystem. v0.6.0 ships it as a first-class runtime with a typed `VLLMConfig`:
```yaml
apiVersion: inference.llmkube.dev/v1alpha1
kind: InferenceService
metadata:
  name: vllm-tinyllama
spec:
  modelRef: tinyllama-1b
  runtime: vllm
  image: vllm/vllm-openai:cu130-nightly
  skipModelInit: true
  vllmConfig:
    maxModelLen: 2048
    dtype: float16
    hfTokenSecretRef:
      name: hf-token
      key: HF_TOKEN
  resources:
    gpu: 1
    memory: "8Gi"
```
The controller generates the right args (`--model`, `--tensor-parallel-size`, `--max-model-len`, `--quantization`, `--dtype`), configures HTTP `/health` probes on port 8000, and injects `HF_TOKEN` from a Secret. I tested this on my cluster with TinyLlama-1.1B and got a working OpenAI-compatible endpoint in under two minutes.
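The config-to-args translation is mechanical; here's a sketch of how the vLLM backend might build its argument list. The function and struct field names are assumptions modeled on the post; the flags themselves (`--model`, `--tensor-parallel-size`, `--max-model-len`, `--dtype`, `--quantization`) are the ones named above and are real vLLM server flags.

```go
package main

import "fmt"

// VLLMConfig mirrors the typed config fields mentioned in the post;
// exact field names in the CRD may differ.
type VLLMConfig struct {
	MaxModelLen  int
	DType        string
	Quantization string
}

// buildVLLMArgs sketches how the controller could translate VLLMConfig into
// vLLM CLI flags, emitting optional flags only when set.
func buildVLLMArgs(model string, tensorParallel int, cfg VLLMConfig) []string {
	args := []string{"--model", model, "--tensor-parallel-size", fmt.Sprint(tensorParallel)}
	if cfg.MaxModelLen > 0 {
		args = append(args, "--max-model-len", fmt.Sprint(cfg.MaxModelLen))
	}
	if cfg.Quantization != "" {
		args = append(args, "--quantization", cfg.Quantization)
	}
	if cfg.DType != "" {
		args = append(args, "--dtype", cfg.DType)
	}
	return args
}

func main() {
	fmt.Println(buildVLLMArgs("TinyLlama/TinyLlama-1.1B-Chat-v1.0", 1,
		VLLMConfig{MaxModelLen: 2048, DType: "float16"}))
}
```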
## Built-in TGI Runtime
HuggingFace's Text Generation Inference also ships as a built-in runtime. TGI downloads models directly from HuggingFace Hub, so `skipModelInit` isn't even needed. The `TGIConfig` supports quantization methods (bitsandbytes, gptq, awq, eetq), max token limits, and dtype.
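For symmetry with the vLLM example, a TGI InferenceService might look like the sketch below. The `tgiConfig` field names, model name, image tag, and resource sizes are my guesses modeled on the vLLM example and the `TGIConfig` features named above; check the CRD reference for the exact schema.

```yaml
apiVersion: inference.llmkube.dev/v1alpha1
kind: InferenceService
metadata:
  name: tgi-mistral
spec:
  modelRef: mistral-7b
  runtime: tgi
  image: ghcr.io/huggingface/text-generation-inference:latest
  tgiConfig:
    quantize: bitsandbytes
    maxTotalTokens: 4096
    hfTokenSecretRef:
      name: hf-token
      key: HF_TOKEN
  resources:
    gpu: 1
    memory: "16Gi"
```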
## The Generic Runtime
Not every inference engine needs first-class support. The generic runtime lets you deploy any container:
```yaml
spec:
  runtime: generic
  image: my-custom-server:latest
  command: ["/app/serve"]
  args: ["--port", "8080"]
  containerPort: 8080
  skipModelInit: true
  probeOverrides:
    startup:
      tcpSocket:
        port: 8080
      failureThreshold: 60
```
You provide the image, args, probes, and env. The controller handles GPU scheduling, service creation, and lifecycle management.
## Per-Runtime Autoscaling
Each runtime defines its default HPA metric via the `HPAMetricProvider` interface. When you enable autoscaling without specifying a metric, the controller picks the right one for your runtime:
- llama.cpp: `llamacpp:requests_processing`
- vLLM: `vllm:num_requests_running`
- TGI: `tgi:queue_size`
No more hardcoded metric names.
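The optional-interface pattern makes this clean: the controller type-asserts the backend against `HPAMetricProvider` and falls back when it isn't implemented. A sketch, with the metric strings taken from the list above and the fallback choice being my assumption:

```go
package main

import "fmt"

// HPAMetricProvider is the optional interface a backend can implement to
// supply its default autoscaling metric.
type HPAMetricProvider interface {
	DefaultHPAMetric() string
}

// vllmBackend is an illustrative stand-in; its metric string is from the post.
type vllmBackend struct{}

func (vllmBackend) DefaultHPAMetric() string { return "vllm:num_requests_running" }

// defaultMetricFor type-asserts the backend; falling back to the llama.cpp
// metric for non-implementers is an assumption about the controller's logic.
func defaultMetricFor(backend any) string {
	if p, ok := backend.(HPAMetricProvider); ok {
		return p.DefaultHPAMetric()
	}
	return "llamacpp:requests_processing"
}

func main() {
	fmt.Println(defaultMetricFor(vllmBackend{}))
}
```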
## Adding Your Own Runtime
`docs/adding-a-runtime.md` documents the full process: implement the `RuntimeBackend` interface, optionally add `CommandBuilder`, `EnvBuilder`, or `HPAMetricProvider`, register in the switch statement, add your CRD config struct, and run `make manifests generate`. The pattern is established with five working examples.
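The shape of a new backend is small. This skeleton uses a simplified stand-in for the interface (the real `BuildArgs`/`BuildProbes` signatures take CRD types and return Kubernetes probe objects), and an Ollama backend is purely hypothetical here, with illustrative defaults:

```go
package main

import "fmt"

// Simplified stand-in for the operator's RuntimeBackend interface.
type RuntimeBackend interface {
	ContainerName() string
	DefaultImage() string
	DefaultPort() int32
	BuildArgs(modelPath string, port int32) []string
	NeedsModelInit() bool
}

// ollamaBackend is a hypothetical new runtime showing the shape of an
// implementation; image, args, and defaults are illustrative assumptions.
type ollamaBackend struct{}

func (ollamaBackend) ContainerName() string { return "ollama" }
func (ollamaBackend) DefaultImage() string  { return "ollama/ollama:latest" }
func (ollamaBackend) DefaultPort() int32    { return 11434 }
func (ollamaBackend) NeedsModelInit() bool  { return false }
func (ollamaBackend) BuildArgs(modelPath string, port int32) []string {
	return []string{"serve"}
}

func main() {
	var b RuntimeBackend = ollamaBackend{}
	fmt.Println(b.ContainerName(), b.DefaultPort())
}
```

After the struct exists, registration is one new `case` in the resolver switch plus the CRD config struct and regenerated manifests.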
## Everything Else in v0.6.0
- CUDA 13 default image for RTX 50-series and Qwen3.5 support
- Custom GPU layer splits for multi-GPU sharding
- Helm image registry/repository separation for air-gapped deployments
- Grafana inference metrics dashboard (tokens/sec, queue depth, KV cache, reconcile health)
- `imagePullSecrets` on InferenceService for private registries
- HPA autoscaling for InferenceService
## What's Next
Triton Inference Server and Ollama as built-in runtimes. Better Model controller support for non-GGUF formats (HuggingFace repo IDs as sources). And potentially Kubernetes-native voice AI pipelines combining PersonaPlex with LLMKube-managed reasoning models.