LLMKube started as a Kubernetes operator for llama.cpp. You define a Model, define an InferenceService, and the controller handles GPU scheduling, health probes, model downloads, and Prometheus metrics. It works well for GGUF models.
But llama.cpp isn't the only inference engine. vLLM has PagedAttention. TGI has continuous batching. PersonaPlex does real-time voice AI. Triton serves multi-framework models. Locking the operator to one runtime limits what you can deploy.
v0.6.0 changes that with pluggable runtime backends.
## The Problem
Before v0.6.0, the controller's `constructDeployment()` was hardcoded to llama.cpp. Container name, image, command-line args, health probes, model provisioning: everything assumed llama.cpp. If you wanted to deploy vLLM, you had to create a Kubernetes Deployment by hand, outside of LLMKube.
## The Fix
A `RuntimeBackend` interface that each inference engine implements:
```go
type RuntimeBackend interface {
	ContainerName() string
	DefaultImage() string
	DefaultPort() int32
	BuildArgs(isvc, model, modelPath, port) []string
	BuildProbes(port) (startup, liveness, readiness)
	NeedsModelInit() bool
}
```
The controller calls `resolveBackend(isvc)` based on the `runtime` field in the CRD, then delegates all container configuration to the backend. llama.cpp is the default. New runtimes register in a simple switch statement.
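To make that concrete, here's a minimal sketch of what a `resolveBackend` switch can look like. The backend type names, the llama.cpp port, and the simplified interface are illustrative assumptions, not LLMKube's actual source; the vLLM port 8000 matches the probe configuration described later in this post.

```go
package main

import "fmt"

// RuntimeBackend is a simplified stand-in for the operator's interface.
type RuntimeBackend interface {
	ContainerName() string
	DefaultPort() int32
}

// llamaCppBackend and vllmBackend are illustrative stand-ins, not the
// operator's real types. The llama.cpp port here is an assumption.
type llamaCppBackend struct{}

func (llamaCppBackend) ContainerName() string { return "llama-cpp" }
func (llamaCppBackend) DefaultPort() int32    { return 8080 }

type vllmBackend struct{}

func (vllmBackend) ContainerName() string { return "vllm" }
func (vllmBackend) DefaultPort() int32    { return 8000 }

// resolveBackend maps the CRD's runtime field to a backend; llama.cpp is
// the default when the field is empty.
func resolveBackend(runtime string) (RuntimeBackend, error) {
	switch runtime {
	case "", "llamacpp":
		return llamaCppBackend{}, nil
	case "vllm":
		return vllmBackend{}, nil
	default:
		return nil, fmt.Errorf("unknown runtime %q", runtime)
	}
}

func main() {
	b, _ := resolveBackend("vllm")
	fmt.Println(b.ContainerName(), b.DefaultPort())
}
```

Adding a runtime is then a new `case` plus a struct that satisfies the interface; the controller never needs to know engine-specific details.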
## Testing It: PersonaPlex on Kubernetes
To prove the architecture works, I deployed NVIDIA's PersonaPlex on my home lab. PersonaPlex is a 7B speech-to-speech model based on Moshi. It listens and talks at the same time. Sub-300ms latency for interruptions. Completely different from llama.cpp: PyTorch runtime, WebSocket-based health checks, model downloaded via HuggingFace token.
The InferenceService CRD:
```yaml
apiVersion: inference.llmkube.dev/v1alpha1
kind: InferenceService
metadata:
  name: personaplex
  namespace: voice-ai
spec:
  modelRef: personaplex-7b
  runtime: personaplex
  image: registry.defilan.net/personaplex:7b-v1-4bit-cuda13
  personaPlexConfig:
    quantize4Bit: true
    hfTokenSecretRef:
      name: hf-token
      key: HF_TOKEN
  endpoint:
    port: 8998
    type: NodePort
  resources:
    gpu: 1
    memory: "32Gi"
```
`kubectl apply` and it's running. The controller:
- Sets the container command to `python -m moshi.server` (via the PersonaPlex backend's `CommandBuilder`)
- Configures TCP socket probes on port 8998 (PersonaPlex uses WebSockets, not HTTP `/health`)
- Injects `HF_TOKEN` from a Kubernetes Secret and sets the `NO_TORCH_COMPILE` env var
- Skips the model download init container (the model downloads at startup via HF Hub)
- Requests 1 GPU with 32Gi memory
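As a sketch of the first bullet, a per-runtime `CommandBuilder` can be as small as a function that returns the container command. Only `python -m moshi.server` comes from this post; the `--port` flag is an assumption about `moshi.server`'s CLI, and the function name is hypothetical.

```go
package main

import "fmt"

// personaPlexCommand sketches what the PersonaPlex backend's CommandBuilder
// might return. "python -m moshi.server" is from the post; passing the port
// as a flag is an illustrative assumption, not confirmed moshi CLI behavior.
func personaPlexCommand(port int32) []string {
	return []string{"python", "-m", "moshi.server", "--port", fmt.Sprint(port)}
}

func main() {
	fmt.Println(personaPlexCommand(8998))
}
```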
The result: real-time voice conversation running on a single RTX 5060 Ti, managed by the same operator that handles my llama.cpp text inference.
## Built-in vLLM Runtime
vLLM is probably the most requested inference engine in the Kubernetes ecosystem. v0.6.0 ships it as a first-class runtime with a typed `VLLMConfig`:
```yaml
apiVersion: inference.llmkube.dev/v1alpha1
kind: InferenceService
metadata:
  name: vllm-tinyllama
spec:
  modelRef: tinyllama-1b
  runtime: vllm
  image: vllm/vllm-openai:cu130-nightly
  skipModelInit: true
  vllmConfig:
    maxModelLen: 2048
    dtype: float16
    hfTokenSecretRef:
      name: hf-token
      key: HF_TOKEN
  resources:
    gpu: 1
    memory: "8Gi"
```
The controller generates the right args (`--model`, `--tensor-parallel-size`, `--max-model-len`, `--quantization`, `--dtype`), configures HTTP `/health` probes on port 8000, and injects `HF_TOKEN` from a Secret. I tested this on my cluster with TinyLlama-1.1B and got a working OpenAI-compatible endpoint in under two minutes.
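The config-to-args translation is mechanical; here's a sketch of how the vLLM backend might build its argument list. The function and struct field names are assumptions modeled on the post; the flags themselves (`--model`, `--tensor-parallel-size`, `--max-model-len`, `--dtype`, `--quantization`) are the ones named above and are real vLLM server flags.

```go
package main

import "fmt"

// VLLMConfig mirrors the typed config fields mentioned in the post;
// exact field names in the CRD may differ.
type VLLMConfig struct {
	MaxModelLen  int
	DType        string
	Quantization string
}

// buildVLLMArgs sketches how the controller could translate VLLMConfig into
// vLLM CLI flags, emitting optional flags only when set.
func buildVLLMArgs(model string, tensorParallel int, cfg VLLMConfig) []string {
	args := []string{"--model", model, "--tensor-parallel-size", fmt.Sprint(tensorParallel)}
	if cfg.MaxModelLen > 0 {
		args = append(args, "--max-model-len", fmt.Sprint(cfg.MaxModelLen))
	}
	if cfg.Quantization != "" {
		args = append(args, "--quantization", cfg.Quantization)
	}
	if cfg.DType != "" {
		args = append(args, "--dtype", cfg.DType)
	}
	return args
}

func main() {
	fmt.Println(buildVLLMArgs("TinyLlama/TinyLlama-1.1B-Chat-v1.0", 1,
		VLLMConfig{MaxModelLen: 2048, DType: "float16"}))
}
```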
## Built-in TGI Runtime
HuggingFace's Text Generation Inference also ships as a built-in runtime. TGI downloads models directly from HuggingFace Hub, so `skipModelInit` isn't even needed. The `TGIConfig` supports quantization methods (bitsandbytes, gptq, awq, eetq), max token limits, and dtype.
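For symmetry with the vLLM example, a TGI InferenceService might look like the sketch below. The `tgiConfig` field names, model name, image tag, and resource sizes are my guesses modeled on the vLLM example and the `TGIConfig` features named above; check the CRD reference for the exact schema.

```yaml
apiVersion: inference.llmkube.dev/v1alpha1
kind: InferenceService
metadata:
  name: tgi-mistral
spec:
  modelRef: mistral-7b
  runtime: tgi
  image: ghcr.io/huggingface/text-generation-inference:latest
  tgiConfig:
    quantize: bitsandbytes
    maxTotalTokens: 4096
    hfTokenSecretRef:
      name: hf-token
      key: HF_TOKEN
  resources:
    gpu: 1
    memory: "16Gi"
```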
## The Generic Runtime
Not every inference engine needs first-class support. The generic runtime lets you deploy any container:
```yaml
spec:
  runtime: generic
  image: my-custom-server:latest
  command: ["/app/serve"]
  args: ["--port", "8080"]
  containerPort: 8080
  skipModelInit: true
  probeOverrides:
    startup:
      tcpSocket:
        port: 8080
      failureThreshold: 60
```
You provide the image, args, probes, and env. The controller handles GPU scheduling, service creation, and lifecycle management.
## Per-Runtime Autoscaling
Each runtime defines its default HPA metric via the `HPAMetricProvider` interface. When you enable autoscaling without specifying a metric, the controller picks the right one for your runtime:
- llama.cpp: `llamacpp:requests_processing`
- vLLM: `vllm:num_requests_running`
- TGI: `tgi:queue_size`
No more hardcoded metric names.
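The optional-interface pattern makes this clean: the controller type-asserts the backend against `HPAMetricProvider` and falls back when it isn't implemented. A sketch, with the metric strings taken from the list above and the fallback choice being my assumption:

```go
package main

import "fmt"

// HPAMetricProvider is the optional interface a backend can implement to
// supply its default autoscaling metric.
type HPAMetricProvider interface {
	DefaultHPAMetric() string
}

// vllmBackend is an illustrative stand-in; its metric string is from the post.
type vllmBackend struct{}

func (vllmBackend) DefaultHPAMetric() string { return "vllm:num_requests_running" }

// defaultMetricFor type-asserts the backend; falling back to the llama.cpp
// metric for non-implementers is an assumption about the controller's logic.
func defaultMetricFor(backend any) string {
	if p, ok := backend.(HPAMetricProvider); ok {
		return p.DefaultHPAMetric()
	}
	return "llamacpp:requests_processing"
}

func main() {
	fmt.Println(defaultMetricFor(vllmBackend{}))
}
```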
## Adding Your Own Runtime
`docs/adding-a-runtime.md` documents the full process: implement the `RuntimeBackend` interface, optionally add `CommandBuilder`, `EnvBuilder`, or `HPAMetricProvider`, register in the switch statement, add your CRD config struct, and run `make manifests generate`. The pattern is established with five working examples.
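The shape of a new backend is small. This skeleton uses a simplified stand-in for the interface (the real `BuildArgs`/`BuildProbes` signatures take CRD types and return Kubernetes probe objects), and an Ollama backend is purely hypothetical here, with illustrative defaults:

```go
package main

import "fmt"

// Simplified stand-in for the operator's RuntimeBackend interface.
type RuntimeBackend interface {
	ContainerName() string
	DefaultImage() string
	DefaultPort() int32
	BuildArgs(modelPath string, port int32) []string
	NeedsModelInit() bool
}

// ollamaBackend is a hypothetical new runtime showing the shape of an
// implementation; image, args, and defaults are illustrative assumptions.
type ollamaBackend struct{}

func (ollamaBackend) ContainerName() string { return "ollama" }
func (ollamaBackend) DefaultImage() string  { return "ollama/ollama:latest" }
func (ollamaBackend) DefaultPort() int32    { return 11434 }
func (ollamaBackend) NeedsModelInit() bool  { return false }
func (ollamaBackend) BuildArgs(modelPath string, port int32) []string {
	return []string{"serve"}
}

func main() {
	var b RuntimeBackend = ollamaBackend{}
	fmt.Println(b.ContainerName(), b.DefaultPort())
}
```

After the struct exists, registration is one new `case` in the resolver switch plus the CRD config struct and regenerated manifests.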
## Everything Else in v0.6.0
- CUDA 13 default image for RTX 50-series and Qwen3.5 support
- Custom GPU layer splits for multi-GPU sharding
- Helm image registry/repository separation for air-gapped deployments
- Grafana inference metrics dashboard (tokens/sec, queue depth, KV cache, reconcile health)
- `imagePullSecrets` on InferenceService for private registries
- HPA autoscaling for InferenceService
## What's Next
Triton Inference Server and Ollama as built-in runtimes. Better Model controller support for non-GGUF formats (HuggingFace repo IDs as sources). And potentially Kubernetes-native voice AI pipelines combining PersonaPlex with LLMKube-managed reasoning models.