Christopher Maher
llama.cpp on Kubernetes: The Guide I Wish Existed

It started at my kitchen table.

I was spending an evening on my laptop, fascinated by how LLMs actually work under the hood. Not the API calls, not the chat interfaces, but the actual inference process. I installed Ollama on my Mac, pulled a model, and within a few hours I was completely hooked.

If you've done this yourself, you know the feeling. A language model running on your own hardware. No API keys, no usage limits, no data leaving your network. Just you and the model.

Ollama made it easy to get started, but I quickly wanted to understand what was happening underneath. That led me to llama.cpp, which Ollama uses under the hood, and that's where things really clicked. I could see exactly how the model was being loaded, how layers were offloaded to the GPU, how the inference loop worked. I went from curious to obsessed pretty quickly.

But then the questions started piling up.

How do I serve this to my team? How do I run multiple models? What happens when I want to use the NVIDIA GPUs on my Linux server AND the Metal GPU on my Mac? How do I monitor it? How do I manage model versions?

I come from a DevOps background, so my brain immediately went to Kubernetes. I figured someone had already built this. And while there are some incredible tools out there (Ollama for single-machine use, vLLM for high-throughput NVIDIA clusters), nothing quite did what I wanted: a Kubernetes operator that treats LLM inference as a first-class workload across heterogeneous hardware, including Apple Silicon.

So I started building LLMKube, an open-source Kubernetes operator for running LLMs with llama.cpp. I'm a big believer in open source, and I wanted this to be open source from day one. The best infrastructure tools are built by communities, not individuals. This guide is everything I've learned along the way.

What We're Building Toward

By the end of this guide, you'll understand how to:

  • Run llama.cpp on Kubernetes with proper lifecycle management
  • Deploy models with a single command or a two-resource YAML
  • Use NVIDIA GPUs with CUDA acceleration
  • Use Apple Silicon Macs as GPU inference nodes in your cluster
  • Split models across multiple GPUs for larger models
  • Monitor everything with Prometheus and Grafana

If you just want to try it out quickly, skip ahead to the hands-on quickstart.

The Problem with "Just Run llama.cpp"

llama.cpp is an outstanding project. It runs on virtually any hardware, supports dozens of model architectures, and the GGUF format has become the standard for local inference. If you need to run one model on one machine, llama.cpp with llama-server is honestly all you need.

The challenges show up when you want to operationalize it:

Model lifecycle. You need to download models, verify their integrity, cache them so pods don't re-download 30GB files on every restart, and keep track of what's deployed where.

GPU scheduling. If you have multiple models competing for limited GPU memory, you need something smarter than "first pod wins." Priority queues, memory budgets, and graceful handling of GPU contention all matter when you have real workloads.

Heterogeneous hardware. This is the big one. Apple Silicon's Metal GPU can't be accessed from inside a container. Every Kubernetes-based LLM tool I found either ignored Macs entirely or ran them in CPU-only mode, which throws away the best part of the hardware. If you have a Mac Studio with an M4 Max sitting on your desk and a Linux server with NVIDIA GPUs in your closet, you shouldn't have to choose between them.

Observability. If you're already running Prometheus and Grafana (and if you're running Kubernetes, you probably are), you want inference metrics in the same stack as everything else. Tokens per second, prompt processing time, GPU utilization, model load times, all in one place.
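To make the GPU scheduling point above concrete: this is not LLMKube's implementation, just a minimal Python sketch of the idea, with all names invented for illustration. Queue models by priority and admit them only while a shared VRAM budget holds, deferring (rather than failing) whatever doesn't fit.

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class PendingModel:
    # Lower number = higher priority; heapq pops the smallest first.
    priority: int
    name: str = field(compare=False)
    vram_gb: float = field(compare=False)

def admit(pending, budget_gb):
    """Admit queued models in priority order while VRAM budget remains.

    Models that don't fit stay deferred instead of failing outright,
    which is exactly the "first pod wins" behavior we want to avoid.
    """
    heapq.heapify(pending)
    admitted, deferred = [], []
    free = budget_gb
    while pending:
        m = heapq.heappop(pending)
        if m.vram_gb <= free:
            free -= m.vram_gb
            admitted.append(m.name)
        else:
            deferred.append(m.name)
    return admitted, deferred, free

queue = [
    PendingModel(priority=0, name="qwen3-32b", vram_gb=20.0),
    PendingModel(priority=1, name="llama-3-8b", vram_gb=6.0),
    PendingModel(priority=2, name="phi-4-mini", vram_gb=3.0),
]
admitted, deferred, free = admit(queue, budget_gb=32.0)
print(admitted, deferred, free)
# ['qwen3-32b', 'llama-3-8b', 'phi-4-mini'] [] 3.0
```

A real scheduler also has to handle preemption and per-GPU placement, but the budget-plus-priority core is the part that replaces "first pod wins."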

How LLMKube Approaches This

LLMKube adds two Custom Resource Definitions to your Kubernetes cluster:

Model defines what you want to run: the GGUF source URL, quantization level, GPU requirements, and hardware preferences.

InferenceService defines how you want to run it: replicas, resource limits, endpoint configuration, and which Model to reference.

The operator watches these resources and handles everything in between: downloading the model, creating deployments, configuring health checks, setting up llama-server with the right flags, exposing an OpenAI-compatible API, and cleaning up when you delete resources.

```yaml
apiVersion: inference.llmkube.dev/v1alpha1
kind: Model
metadata:
  name: llama-3-8b
spec:
  source: https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf
  format: gguf
  quantization: Q4_K_M
  hardware:
    accelerator: cuda
    gpu:
      count: 1
---
apiVersion: inference.llmkube.dev/v1alpha1
kind: InferenceService
metadata:
  name: llama-3-8b
spec:
  modelRef: llama-3-8b
  replicas: 1
  resources:
    cpu: "2"
    memory: "4Gi"
```

That's it. The operator takes it from there.

My Actual Setup

I want to be transparent about the hardware I run this on, because I think it's important for people to see that you don't need datacenter-grade equipment to make this work.

Shadowstack is my primary inference server. It's a desktop PC I built specifically for this:

  • AMD Ryzen 9 7900X (12 cores / 24 threads)
  • 64GB DDR5-6000
  • 2x NVIDIA RTX 5060 Ti (16GB VRAM each, 32GB total)
  • Samsung 990 Pro 1TB NVMe
  • Running MicroK8s as a single-node Kubernetes cluster

Mac Studio (M4 Max, 36GB unified memory) runs the Metal Agent, which lets Kubernetes orchestrate llama-server natively on macOS with full Metal GPU access.

Mac Mini handles other orchestration workloads.

On Shadowstack, I run Qwen3 32B with the model split across both 5060 Tis using llama.cpp's multi-GPU layer split. On the Mac Studio, I run Qwen 30B-A3B (a mixture-of-experts model that fits comfortably in 36GB of unified memory). Both are managed by the same LLMKube operator, using the same CRDs, visible through the same monitoring stack.

Is 36GB of unified memory on the Mac Studio less than I wish I had? Sure. But it still runs a 30B MoE model for real workloads, and that's the point. You work with the hardware you have.

The Metal Agent: Running Apple Silicon in Your Cluster

This is the part that gets me the most excited, and the part that I haven't seen anyone else solve.

Here's the core problem: Apple Silicon GPUs use Metal, not CUDA. Metal isn't accessible from inside a Docker container. So if you put a Mac in your Kubernetes cluster and deploy a pod to it, that pod can only use the CPU. Your M4 Max's GPU sits idle.

The Metal Agent works around this by inverting the typical Kubernetes model. Instead of running inference inside a container, the Metal Agent runs as a native macOS daemon that:

  1. Watches the Kubernetes API for InferenceService resources with accelerator: metal
  2. Spawns llama-server natively on macOS with full Metal GPU access
  3. Registers the endpoint back into Kubernetes so other services can route to it

From the perspective of any other service in your cluster, the model running on your Mac looks like any other Kubernetes-managed endpoint. You can hit the same OpenAI-compatible API, the same health checks work, the same Prometheus metrics are exposed.

```bash
# On your Mac
brew install llama.cpp
llmkube-metal-agent --host-ip 192.168.1.x

# From anywhere in the cluster
llmkube deploy qwen-30b-a3b --accelerator metal
```

The same CRD that deploys a model on NVIDIA with CUDA deploys on Apple Silicon with Metal. Just change accelerator: cuda to accelerator: metal.
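For reference, here's what that change looks like in a Model spec. This mirrors the CUDA example earlier; the model name and source URL below are placeholders for illustration, not real catalog entries.

```yaml
apiVersion: inference.llmkube.dev/v1alpha1
kind: Model
metadata:
  name: qwen-30b-a3b
spec:
  source: https://example.com/qwen-30b-a3b-q4_k_m.gguf  # placeholder URL
  format: gguf
  quantization: Q4_K_M
  hardware:
    accelerator: metal   # the only change from the CUDA example
```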

Multi-GPU: Splitting Models Across Cards

If you want to run models larger than what fits on a single GPU, llama.cpp can split a model across multiple GPUs on the same node (layer-wise by default, or row-wise for true tensor parallelism). LLMKube automates this through the GPU sharding spec.

On my Shadowstack box, Qwen3 32B (quantized to Q4_K_M, roughly 20GB) gets split across both 5060 Tis. Each GPU handles a portion of the model's layers, and llama.cpp coordinates the inference across both cards.

```yaml
spec:
  hardware:
    accelerator: cuda
    gpu:
      count: 2
      sharding:
        strategy: layer
```

The operator automatically calculates the tensor split ratios and passes the right flags to llama-server. On the dual 5060 Ti setup, I see consistent ~53 tokens/second for 3-8B models and solid performance on the 32B model with the split.
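The operator does that calculation for you, but the mechanics are simple enough to sketch. The following is an illustration, not LLMKube's code: given per-GPU VRAM sizes, it builds the llama-server flags (`--split-mode`, `--tensor-split`, `-ngl`) that control the layer split. llama-server treats `--tensor-split` values as proportions, so raw VRAM sizes work directly as ratios.

```python
def tensor_split_flags(vram_gb, n_layers=64):
    """Split a model's layers across GPUs proportionally to their VRAM.

    llama-server normalizes the --tensor-split proportions itself,
    so raw VRAM sizes serve as the ratios directly.
    """
    ratios = ",".join(str(g) for g in vram_gb)
    return [
        "--split-mode", "layer",   # contiguous block of layers per GPU
        "--tensor-split", ratios,
        "-ngl", str(n_layers),     # offload all layers to the GPUs
    ]

# Two identical 16GB cards -> an even split
print(tensor_split_flags([16, 16]))
# ['--split-mode', 'layer', '--tensor-split', '16,16', '-ngl', '64']
```

With mismatched cards (say a 24GB and an 8GB), the same proportional scheme keeps each card's share of layers in line with its memory.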

Hands-On: Try It in 10 Minutes

You don't need my hardware to try this. Here's the quickest path from zero to running inference on Kubernetes.

Prerequisites

  • A Kubernetes cluster (Minikube, kind, K3s, or any managed cluster)
  • kubectl configured
  • Helm 3

Install LLMKube

```bash
# Install the CLI
brew install defilantech/tap/llmkube

# Add the Helm repo and install the operator
helm repo add llmkube https://defilantech.github.io/LLMKube
helm install llmkube llmkube/llmkube \
  --namespace llmkube-system \
  --create-namespace
```

Deploy Your First Model

```bash
# Deploy Phi-4 Mini (3.8B params, from the built-in catalog)
llmkube deploy phi-4-mini
```

That single command creates both the Model and InferenceService resources. The operator downloads the GGUF file, spins up a pod with llama-server, and exposes an OpenAI-compatible API. You can also deploy any GGUF model by providing a --source URL pointing to HuggingFace or any HTTP endpoint.

Query It

```bash
# Port-forward and test
kubectl port-forward svc/phi-4-mini 8080:8080 &

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "What is Kubernetes in one sentence?"}
    ],
    "max_tokens": 100
  }'
```

Use It With the OpenAI SDK

Since the API is OpenAI-compatible, you can point any OpenAI SDK client at it:

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="phi-4-mini",
    messages=[{"role": "user", "content": "Hello!"}]
)

print(response.choices[0].message.content)
```

This works with LangChain, LlamaIndex, and anything else that speaks the OpenAI API.

Add GPU Acceleration

If you have an NVIDIA GPU available in your cluster:

```bash
llmkube deploy llama-3.1-8b --gpu --gpu-count 1
```

The difference is dramatic. On an NVIDIA L4 in GKE, prompt processing goes from 29 tok/s (CPU) to 1,026 tok/s (GPU). Token generation jumps from 4.6 tok/s to 64 tok/s. That's roughly a 14x speedup on generation and 35x on prompt processing.

Air-Gapped Deployments

Early in my career, I worked in medical IT. That experience gave me an appreciation for environments where data simply cannot leave the network. Healthcare, defense, finance, government: these industries have strict compliance requirements that make cloud-hosted AI a non-starter.

LLMKube supports air-gapped deployment through PVC-based model sources with SHA256 integrity verification:

```yaml
spec:
  source: pvc://model-storage/models/llama-3-8b-q4.gguf
  sha256: a1b2c3d4e5f6...
```

You stage models to a PersistentVolumeClaim, provide the checksum, and the operator verifies integrity before deploying. No outbound network calls, no container registry pulls at runtime, no data leaving your network.
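The verification step itself amounts to streaming the file through SHA-256. Here's a minimal Python sketch (not the operator's code) that checks a staged GGUF against the pinned digest, reading in chunks so a 30GB model never has to fit in memory.

```python
import hashlib

def verify_model(path: str, expected_sha256: str, chunk_size: int = 1 << 20) -> bool:
    """Stream the GGUF file through SHA-256 and compare to the pinned digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest() == expected_sha256.lower()

# Usage: refuse to serve a model whose checksum doesn't match the spec.
# if not verify_model("/models/llama-3-8b-q4.gguf", spec_sha256):
#     raise RuntimeError("model failed integrity check; not deploying")
```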

This is an area where I think llama.cpp really shines for Kubernetes deployments. The GGUF format is a single file. There's no Python dependency tree, no model sharding across dozens of files, no runtime downloads of tokenizers. You put one file on a PVC, point a CRD at it, and you're running.

Where LLMKube Fits (and Where It Doesn't)

I want to be honest about this, because there are great tools in this space and picking the right one matters.

If you need maximum throughput for high-concurrency workloads (50+ simultaneous users), use vLLM or SGLang. They use PagedAttention, continuous batching, and other optimizations that llama.cpp doesn't have. At scale, vLLM delivers significantly higher request throughput. That's just the reality.

If you just need to run one model on one machine, use Ollama. It's simpler, it's elegant, and it handles the single-machine case better than a Kubernetes operator ever will.

LLMKube is for the space in between. You have a Kubernetes cluster. You have a mix of hardware (maybe NVIDIA GPUs, maybe Apple Silicon, maybe both). You want Kubernetes-native lifecycle management with CRDs, GitOps workflows, and your inference metrics in the same Prometheus/Grafana stack as everything else. You care about air-gapped deployments, GPU scheduling, and model versioning. You're serving a team or a set of internal workloads, not a public-facing API with thousands of concurrent users.

If that sounds like your situation, LLMKube might be what you're looking for. If it doesn't, I genuinely hope one of the other tools solves your problem. We all benefit from this ecosystem getting better.

What's Next

LLMKube is open source (Apache 2.0) and actively developed. Some things I'm excited about on the roadmap:

  • Edge deployment support for lightweight Kubernetes distributions like K3s and MicroK8s
  • AMD GPU support (ROCm) with a community contributor already testing on Framework hardware with a Ryzen AI Max+ 395
  • llmkube chat for testing models directly from the CLI without needing curl

I'll be honest about one thing that comes up a lot: multi-node distributed inference. llama.cpp has an RPC backend that can split a model across machines over ethernet, and I've been watching it closely. The reality is that over consumer networking (1GbE, 2.5GbE), the performance hit from network round-trips makes it marginal for interactive use. Jeff Geerling tested a four-node Framework cluster and got 0.7 tok/s on Llama 405B. The tech is improving, but today my advice is to scale vertically first. Get a bigger GPU or more unified memory before trying to split across machines. If the RPC backend matures to the point where it's genuinely usable over ethernet, LLMKube will support it, but I'm not going to promise something that isn't ready.

If any of this is interesting to you, I'd love to hear from you. The project is at github.com/defilantech/LLMKube, and we have a Discord where I hang out and talk about this stuff regularly.

If you hit issues, open a GitHub issue. If you want to contribute, check the issues labeled good-first-issue. And if you just want to say hi, that's cool too.

Thanks for reading. I hope this saves you some of the time I spent figuring all this out.
