<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Christopher Maher</title>
    <description>The latest articles on DEV Community by Christopher Maher (@defilan).</description>
    <link>https://dev.to/defilan</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3828578%2Fd03de6fc-1dcb-419b-b336-0d9c7d86f7cc.jpeg</url>
      <title>DEV Community: Christopher Maher</title>
      <link>https://dev.to/defilan</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/defilan"/>
    <language>en</language>
    <item>
      <title>LLMKube Now Deploys Any Inference Engine, Not Just llama.cpp</title>
      <dc:creator>Christopher Maher</dc:creator>
      <pubDate>Wed, 08 Apr 2026 01:03:15 +0000</pubDate>
      <link>https://dev.to/defilan/llmkube-now-deploys-any-inference-engine-not-just-llamacpp-fpm</link>
      <guid>https://dev.to/defilan/llmkube-now-deploys-any-inference-engine-not-just-llamacpp-fpm</guid>
      <description>&lt;p&gt;LLMKube started as a Kubernetes operator for llama.cpp. You define a Model, define an InferenceService, and the controller handles GPU scheduling, health probes, model downloads, and Prometheus metrics. It works well for GGUF models.&lt;/p&gt;

&lt;p&gt;But llama.cpp isn't the only inference engine. vLLM has PagedAttention. TGI has continuous batching. PersonaPlex does real-time voice AI. Triton serves multi-framework models. Locking the operator to one runtime limits what you can deploy.&lt;/p&gt;

&lt;p&gt;v0.6.0 changes that with pluggable runtime backends.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;Before v0.6.0, the controller's &lt;code&gt;constructDeployment()&lt;/code&gt; was hardcoded to llama.cpp. Container name, image, command-line args, health probes, model provisioning: everything assumed llama.cpp. If you wanted to deploy vLLM, you had to hand-write a Kubernetes Deployment outside of LLMKube.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Fix
&lt;/h2&gt;

&lt;p&gt;A &lt;code&gt;RuntimeBackend&lt;/code&gt; interface that each inference engine implements:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;RuntimeBackend&lt;/span&gt; &lt;span class="k"&gt;interface&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;ContainerName&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;
    &lt;span class="n"&gt;DefaultImage&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;
    &lt;span class="n"&gt;DefaultPort&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="kt"&gt;int32&lt;/span&gt;
    &lt;span class="n"&gt;BuildArgs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;isvc&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;modelPath&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;port&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;
    &lt;span class="n"&gt;BuildProbes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;port&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;startup&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;liveness&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;readiness&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;NeedsModelInit&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="kt"&gt;bool&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The controller calls &lt;code&gt;resolveBackend(isvc)&lt;/code&gt; based on the &lt;code&gt;runtime&lt;/code&gt; field in the CRD, then delegates all container configuration to the backend. llama.cpp is the default. New runtimes register in a simple switch statement.&lt;/p&gt;
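
&lt;p&gt;The dispatch can be sketched roughly like this. This is an illustrative stand-in, not LLMKube's actual code: the backend type names are invented, and the interface is trimmed to two methods.&lt;/p&gt;

```go
package main

import "fmt"

// Trimmed-down version of the RuntimeBackend idea, for illustration only.
type RuntimeBackend interface {
	ContainerName() string
	DefaultImage() string
}

type llamaCppBackend struct{}

func (llamaCppBackend) ContainerName() string { return "llama-server" }
func (llamaCppBackend) DefaultImage() string {
	return "ghcr.io/ggml-org/llama.cpp:server-cuda"
}

type vllmBackend struct{}

func (vllmBackend) ContainerName() string { return "vllm" }
func (vllmBackend) DefaultImage() string  { return "vllm/vllm-openai:latest" }

// resolveBackend dispatches on the CRD's runtime field; an empty or
// unrecognized value falls through to the llama.cpp default.
func resolveBackend(runtime string) RuntimeBackend {
	switch runtime {
	case "vllm":
		return vllmBackend{}
	default:
		return llamaCppBackend{}
	}
}

func main() {
	fmt.Println(resolveBackend("").ContainerName())     // llama-server
	fmt.Println(resolveBackend("vllm").ContainerName()) // vllm
}
```

&lt;p&gt;The appeal of the switch over a dynamic registry is that adding a runtime is a one-line, compile-checked change.&lt;/p&gt;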

&lt;h2&gt;
  
  
  Testing It: PersonaPlex on Kubernetes
&lt;/h2&gt;

&lt;p&gt;To prove the architecture works, I deployed NVIDIA's PersonaPlex on my home lab. PersonaPlex is a 7B speech-to-speech model based on Moshi. It listens and talks at the same time. Sub-300ms latency for interruptions. Completely different from llama.cpp: PyTorch runtime, WebSocket-based health checks, model downloaded via HuggingFace token.&lt;/p&gt;

&lt;p&gt;The InferenceService CRD:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;inference.llmkube.dev/v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;InferenceService&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;personaplex&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;voice-ai&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;modelRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;personaplex-7b&lt;/span&gt;
  &lt;span class="na"&gt;runtime&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;personaplex&lt;/span&gt;
  &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;registry.defilan.net/personaplex:7b-v1-4bit-cuda13&lt;/span&gt;
  &lt;span class="na"&gt;personaPlexConfig&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;quantize4Bit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="na"&gt;hfTokenSecretRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;hf-token&lt;/span&gt;
      &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;HF_TOKEN&lt;/span&gt;
  &lt;span class="na"&gt;endpoint&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8998&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;NodePort&lt;/span&gt;
  &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;gpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
    &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;32Gi"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;kubectl apply&lt;/code&gt; and it's running. The controller:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sets the container command to &lt;code&gt;python -m moshi.server&lt;/code&gt; (via the PersonaPlex backend's &lt;code&gt;CommandBuilder&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Configures TCP socket probes on port 8998 (PersonaPlex uses WebSockets, not HTTP /health)&lt;/li&gt;
&lt;li&gt;Injects &lt;code&gt;HF_TOKEN&lt;/code&gt; from a Kubernetes Secret and &lt;code&gt;NO_TORCH_COMPILE&lt;/code&gt; env var&lt;/li&gt;
&lt;li&gt;Skips the model download init container (model downloads at startup via HF Hub)&lt;/li&gt;
&lt;li&gt;Requests 1 GPU with 32Gi memory&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The result: real-time voice conversation running on a single RTX 5060 Ti, managed by the same operator that handles my llama.cpp text inference.&lt;/p&gt;

&lt;h2&gt;
  
  
  Built-in vLLM Runtime
&lt;/h2&gt;

&lt;p&gt;vLLM is probably the most requested inference engine in the Kubernetes ecosystem. v0.6.0 ships it as a first-class runtime with typed &lt;code&gt;VLLMConfig&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;inference.llmkube.dev/v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;InferenceService&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;vllm-tinyllama&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;modelRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;tinyllama-1b&lt;/span&gt;
  &lt;span class="na"&gt;runtime&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;vllm&lt;/span&gt;
  &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;vllm/vllm-openai:cu130-nightly&lt;/span&gt;
  &lt;span class="na"&gt;skipModelInit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;vllmConfig&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;maxModelLen&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2048&lt;/span&gt;
    &lt;span class="na"&gt;dtype&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;float16&lt;/span&gt;
    &lt;span class="na"&gt;hfTokenSecretRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;hf-token&lt;/span&gt;
      &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;HF_TOKEN&lt;/span&gt;
  &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;gpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
    &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8Gi"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The controller generates the right args (&lt;code&gt;--model&lt;/code&gt;, &lt;code&gt;--tensor-parallel-size&lt;/code&gt;, &lt;code&gt;--max-model-len&lt;/code&gt;, &lt;code&gt;--quantization&lt;/code&gt;, &lt;code&gt;--dtype&lt;/code&gt;), configures HTTP &lt;code&gt;/health&lt;/code&gt; probes on port 8000, and injects HF_TOKEN from a Secret. I tested this on my cluster with TinyLlama-1.1B and got a working OpenAI-compatible endpoint in under two minutes.&lt;/p&gt;
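
&lt;p&gt;As a rough sketch of that arg generation: the flag names below come from the post, but the function shape, parameter names, and the ordering are my assumptions rather than LLMKube's actual implementation.&lt;/p&gt;

```go
package main

import (
	"fmt"
	"strconv"
)

// buildVLLMArgs assembles vLLM server flags from the typed config fields.
// Only flags mentioned in the post are emitted; the structure is guesswork.
func buildVLLMArgs(model string, maxModelLen int, dtype, quantization string, tensorParallel int) []string {
	args := []string{"--model", model}
	if tensorParallel > 1 {
		args = append(args, "--tensor-parallel-size", strconv.Itoa(tensorParallel))
	}
	if maxModelLen > 0 {
		args = append(args, "--max-model-len", strconv.Itoa(maxModelLen))
	}
	if quantization != "" {
		args = append(args, "--quantization", quantization)
	}
	if dtype != "" {
		args = append(args, "--dtype", dtype)
	}
	return args
}

func main() {
	// Roughly what the vllm-tinyllama example above would produce.
	fmt.Println(buildVLLMArgs("TinyLlama/TinyLlama-1.1B-Chat-v1.0", 2048, "float16", "", 1))
}
```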

&lt;h2&gt;
  
  
  Built-in TGI Runtime
&lt;/h2&gt;

&lt;p&gt;HuggingFace's Text Generation Inference also ships as a built-in runtime. TGI downloads models directly from HuggingFace Hub, so &lt;code&gt;skipModelInit&lt;/code&gt; isn't even needed. The &lt;code&gt;TGIConfig&lt;/code&gt; supports quantization methods (bitsandbytes, gptq, awq, eetq), max token limits, and dtype.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Generic Runtime
&lt;/h2&gt;

&lt;p&gt;Not every inference engine needs first-class support. The &lt;code&gt;generic&lt;/code&gt; runtime lets you deploy any container:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;runtime&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;generic&lt;/span&gt;
  &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-custom-server:latest&lt;/span&gt;
  &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/app/serve"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--port"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8080"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="na"&gt;containerPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8080&lt;/span&gt;
  &lt;span class="na"&gt;skipModelInit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;probeOverrides&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;startup&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;tcpSocket&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8080&lt;/span&gt;
      &lt;span class="na"&gt;failureThreshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;60&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You provide the image, args, probes, and env. The controller handles GPU scheduling, service creation, and lifecycle management.&lt;/p&gt;

&lt;h2&gt;
  
  
  Per-Runtime Autoscaling
&lt;/h2&gt;

&lt;p&gt;Each runtime defines its default HPA metric via the &lt;code&gt;HPAMetricProvider&lt;/code&gt; interface. When you enable autoscaling without specifying a metric, the controller picks the right one for your runtime:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;llama.cpp: &lt;code&gt;llamacpp:requests_processing&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;vLLM: &lt;code&gt;vllm:num_requests_running&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;TGI: &lt;code&gt;tgi:queue_size&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No more hardcoded metric names.&lt;/p&gt;
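
&lt;p&gt;The fallback amounts to a per-runtime default. A sketch, using the metric names from the list above; in the real operator this lives on each backend via &lt;code&gt;HPAMetricProvider&lt;/code&gt; rather than in one function:&lt;/p&gt;

```go
package main

import "fmt"

// defaultHPAMetric returns the autoscaling metric used when an
// InferenceService enables HPA without naming one. Illustrative stand-in.
func defaultHPAMetric(runtime string) string {
	switch runtime {
	case "vllm":
		return "vllm:num_requests_running"
	case "tgi":
		return "tgi:queue_size"
	default:
		return "llamacpp:requests_processing"
	}
}

func main() {
	fmt.Println(defaultHPAMetric("vllm")) // vllm:num_requests_running
	fmt.Println(defaultHPAMetric(""))     // llamacpp:requests_processing
}
```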

&lt;h2&gt;
  
  
  Adding Your Own Runtime
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;docs/adding-a-runtime.md&lt;/code&gt; documents the full process: implement the &lt;code&gt;RuntimeBackend&lt;/code&gt; interface, optionally add &lt;code&gt;CommandBuilder&lt;/code&gt;, &lt;code&gt;EnvBuilder&lt;/code&gt;, or &lt;code&gt;HPAMetricProvider&lt;/code&gt;, register in the switch statement, add your CRD config struct, and run &lt;code&gt;make manifests generate&lt;/code&gt;. The pattern is established with five working examples.&lt;/p&gt;
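
&lt;p&gt;To make the shape concrete, here is what a hypothetical backend for a new runtime might look like. Ollama is used as the example since it's on the roadmap, but this is not shipped LLMKube code, and &lt;code&gt;Probe&lt;/code&gt; is a placeholder for the Kubernetes probe types the real interface returns.&lt;/p&gt;

```go
package main

import "fmt"

// Probe stands in for the corev1 probe objects the real interface uses.
type Probe struct {
	Port int32
}

// ollamaBackend is a hypothetical RuntimeBackend implementation.
type ollamaBackend struct{}

func (ollamaBackend) ContainerName() string { return "ollama" }
func (ollamaBackend) DefaultImage() string  { return "ollama/ollama:latest" }

// 11434 is Ollama's default server port.
func (ollamaBackend) DefaultPort() int32 { return 11434 }

func (ollamaBackend) BuildArgs(modelPath string, port int32) []string {
	return []string{"serve"}
}

func (ollamaBackend) BuildProbes(port int32) (startup, liveness, readiness Probe) {
	p := Probe{Port: port}
	return p, p, p
}

// Ollama pulls models itself, so the operator's download init container
// can be skipped.
func (ollamaBackend) NeedsModelInit() bool { return false }

func main() {
	b := ollamaBackend{}
	fmt.Println(b.ContainerName(), b.DefaultPort())
}
```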

&lt;h2&gt;
  
  
  Everything Else in v0.6.0
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;CUDA 13 default image for RTX 50-series and Qwen3.5 support&lt;/li&gt;
&lt;li&gt;Custom GPU layer splits for multi-GPU sharding&lt;/li&gt;
&lt;li&gt;Helm image registry/repository separation for air-gapped deployments&lt;/li&gt;
&lt;li&gt;Grafana inference metrics dashboard (tokens/sec, queue depth, KV cache, reconcile health)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;imagePullSecrets&lt;/code&gt; on InferenceService for private registries&lt;/li&gt;
&lt;li&gt;HPA autoscaling for InferenceService&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;Triton Inference Server and Ollama as built-in runtimes. Better Model controller support for non-GGUF formats (HuggingFace repo IDs as sources). And potentially Kubernetes-native voice AI pipelines combining PersonaPlex with LLMKube-managed reasoning models.&lt;/p&gt;

&lt;p&gt;GitHub: &lt;a href="https://github.com/defilantech/llmkube" rel="noopener noreferrer"&gt;https://github.com/defilantech/llmkube&lt;/a&gt;&lt;/p&gt;

</description>
      <category>llm</category>
      <category>opensource</category>
      <category>ai</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>I tested speculative decoding on my home GPU cluster. Here's why it didn't help.</title>
      <dc:creator>Christopher Maher</dc:creator>
      <pubDate>Mon, 06 Apr 2026 03:51:51 +0000</pubDate>
      <link>https://dev.to/defilan/i-tested-speculative-decoding-on-my-home-gpu-cluster-heres-why-it-didnt-help-3ej6</link>
      <guid>https://dev.to/defilan/i-tested-speculative-decoding-on-my-home-gpu-cluster-heres-why-it-didnt-help-3ej6</guid>
      <description>&lt;p&gt;I spent Saturday night testing n-gram speculative decoding on consumer GPUs. The claim: speculative decoding can speed up LLM inference by 2-3x by predicting future tokens and verifying them in parallel.&lt;/p&gt;

&lt;p&gt;I wanted to see if that holds up on real hardware running diverse workloads. For the most part, it doesn't. But the journey was worth it, and I caught a benchmarking pitfall that I think a lot of people are falling into.&lt;/p&gt;

&lt;h2&gt;
  
  
  The setup
&lt;/h2&gt;

&lt;p&gt;My home lab runs Kubernetes on a machine called Shadowstack. Two NVIDIA RTX 5060 Ti GPUs (16GB VRAM each, 32GB total). I use LLMKube, an open source K8s operator I built, to manage LLM inference workloads with llama.cpp.&lt;/p&gt;

&lt;p&gt;For this test I deployed two models:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Gemma 4 26B-A4B&lt;/strong&gt;: Google's Mixture of Experts model. 26B total params but only ~4B active per token. Runs at 88 tok/s on my setup.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Qwen3-32B&lt;/strong&gt;: A dense 32B model. All parameters active per token. Runs at 20 tok/s.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both running Q4_K_M quantization, flash attention enabled, 8K context, split across both GPUs.&lt;/p&gt;

&lt;p&gt;Quick note on why the MoE model is so much faster: Gemma 4 only activates a fraction of its parameters per token, so there's way less weight data to read from VRAM on each forward pass. MoE routing overhead eats into some of that advantage, but it's still a huge win on bandwidth-constrained hardware.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I tested
&lt;/h2&gt;

&lt;p&gt;llama.cpp has built-in n-gram speculative decoding. No draft model needed, you just pass a few flags:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nt"&gt;--spec-type&lt;/span&gt; ngram-mod
&lt;span class="nt"&gt;--draft-max&lt;/span&gt; 64
&lt;span class="nt"&gt;--draft-min&lt;/span&gt; 48
&lt;span class="nt"&gt;--spec-ngram-size-n&lt;/span&gt; 24
&lt;span class="nt"&gt;--spec-ngram-size-m&lt;/span&gt; 48
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;How it works: llama.cpp builds an n-gram lookup table from the recent context (both the input prompt and generated output so far). When it spots a pattern it's seen before, it speculatively drafts the next several tokens and verifies them in a single forward pass. If the predictions are right, you get multiple tokens for the cost of one.&lt;/p&gt;

&lt;p&gt;Important: this is specifically n-gram speculative decoding, not draft-model approaches like EAGLE-3 or Medusa. Those use a separate trained model to generate speculations. N-gram lookup is simpler and doesn't require any extra model files.&lt;/p&gt;

&lt;p&gt;With LLMKube, switching between configs is just updating the &lt;code&gt;extraArgs&lt;/code&gt; field in the InferenceService CRD and letting the operator restart the pod:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;modelRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gemma4-26b-a4b&lt;/span&gt;
  &lt;span class="na"&gt;extraArgs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--spec-type"&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ngram-mod"&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--draft-max"&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;64"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I tested two variants: &lt;code&gt;ngram-simple&lt;/code&gt; (basic lookup) and &lt;code&gt;ngram-mod&lt;/code&gt; (the variant recommended for MoE models in the llama.cpp docs).&lt;/p&gt;

&lt;h2&gt;
  
  
  The result that fooled me
&lt;/h2&gt;

&lt;p&gt;My first test ran the same prompt 10 times in a row. The numbers looked incredible:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Run&lt;/th&gt;
&lt;th&gt;tok/s&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1 (cold)&lt;/td&gt;
&lt;td&gt;88.3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;105.7&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;112.4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;186.4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;336.5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;419.5&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Almost 5x speedup by run 10. I was ready to write a very different article.&lt;/p&gt;

&lt;p&gt;Then I ran 8 different prompts. Code generation, API design, Go functions, bash scripts, technical explanations. Real variety.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Prompt&lt;/th&gt;
&lt;th&gt;Baseline (tok/s)&lt;/th&gt;
&lt;th&gt;+ ngram-mod (tok/s)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;BST implementation&lt;/td&gt;
&lt;td&gt;88.3&lt;/td&gt;
&lt;td&gt;94.2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;K8s operator explanation&lt;/td&gt;
&lt;td&gt;88.3&lt;/td&gt;
&lt;td&gt;88.3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPU monitoring script&lt;/td&gt;
&lt;td&gt;88.3&lt;/td&gt;
&lt;td&gt;87.6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;REST API design&lt;/td&gt;
&lt;td&gt;88.3&lt;/td&gt;
&lt;td&gt;88.2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GGUF parser in Go&lt;/td&gt;
&lt;td&gt;88.3&lt;/td&gt;
&lt;td&gt;88.2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Parallelism explainer&lt;/td&gt;
&lt;td&gt;88.3&lt;/td&gt;
&lt;td&gt;88.1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Benchmark script&lt;/td&gt;
&lt;td&gt;88.2&lt;/td&gt;
&lt;td&gt;88.2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Helm chart design&lt;/td&gt;
&lt;td&gt;88.1&lt;/td&gt;
&lt;td&gt;88.2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Median&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;88.3&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;88.2&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Zero improvement. The 419 tok/s "speedup" was the n-gram cache memorizing repeated output patterns. With diverse prompts, there's nothing useful to cache.&lt;/p&gt;

&lt;h2&gt;
  
  
  Same story on the dense model
&lt;/h2&gt;

&lt;p&gt;Qwen3-32B showed the same pattern. 20.4 tok/s baseline, 20.6 tok/s with ngram-simple. Within measurement noise.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Baseline&lt;/th&gt;
&lt;th&gt;+ ngram-simple&lt;/th&gt;
&lt;th&gt;+ ngram-mod&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Gemma 4 26B&lt;/td&gt;
&lt;td&gt;MoE&lt;/td&gt;
&lt;td&gt;88.3&lt;/td&gt;
&lt;td&gt;87.2 (-1.2%)&lt;/td&gt;
&lt;td&gt;88.2 (0%)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen3-32B&lt;/td&gt;
&lt;td&gt;Dense&lt;/td&gt;
&lt;td&gt;20.4&lt;/td&gt;
&lt;td&gt;20.6 (+1%)&lt;/td&gt;
&lt;td&gt;not tested&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Why it doesn't help on these GPUs
&lt;/h2&gt;

&lt;p&gt;The bottleneck on RTX 5060 Ti is memory bandwidth, not compute. Every token requires reading model weights from VRAM. Speculative decoding tries to batch multiple verification steps together, but when you're already saturating the memory bus during single-token generation, there's not enough idle compute for the speculative verification to pay for itself.&lt;/p&gt;

&lt;p&gt;This is different from high-end datacenter GPUs (A100, H100) where the compute-to-memory bandwidth ratio is much higher. An H100 has roughly 3,350 GB/s memory bandwidth but nearly 2,000 TFLOPS of FP16 compute. That ratio means there's genuine idle compute at small batch sizes that speculative decoding can exploit. Consumer GPUs don't have that same headroom.&lt;/p&gt;
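
&lt;p&gt;A back-of-envelope version of that argument. All inputs here are my assumptions for illustration (the bandwidth figure and bytes-read-per-token depend on the exact card, quantization, and model), but the shape of the calculation is the point.&lt;/p&gt;

```go
package main

import "fmt"

// Decode throughput is roughly capped by how fast the active weights can
// be streamed from VRAM: ceiling = bandwidth / bytes-read-per-token.
func ceilingTokS(bandwidthGBs, gbPerToken float64) float64 {
	return bandwidthGBs / gbPerToken
}

func main() {
	// Assumed ~448 GB/s for one RTX 5060 Ti. ~4B active params at
	// roughly 4.5 bits/param (Q4_K_M) reads ~2.3 GB per token.
	fmt.Printf("MoE ceiling: ~%.0f tok/s\n", ceilingTokS(448, 2.3))
	// A dense 32B model at the same quant reads ~18 GB per token.
	fmt.Printf("dense ceiling: ~%.0f tok/s\n", ceilingTokS(448, 18))
}
```

&lt;p&gt;Under those assumptions the ceilings land near ~195 and ~25 tok/s, and the observed 88 and 20 tok/s sit below them: the memory bus is doing the work, and there's little idle compute for speculative verification to borrow.&lt;/p&gt;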

&lt;p&gt;For MoE models specifically, there's an additional wrinkle. Each speculative token in a verification batch may activate different experts, which means more expert weight blocks need to be read. This reduces the batching advantage that speculative decoding relies on in dense models, where weight reads stay roughly constant regardless of batch size.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Caveat:&lt;/strong&gt; there are scenarios where n-gram spec decoding can help even on consumer hardware. If your model is partially CPU-offloaded (doesn't fit in VRAM), the PCIe bandwidth bottleneck is severe enough that speculative batching can provide real gains. And for highly repetitive or templated outputs (think structured JSON, boilerplate code), the n-gram cache hit rate goes way up. My testing focused on single-user inference with fully VRAM-resident models and diverse prompts.&lt;/p&gt;

&lt;h2&gt;
  
  
  What about EAGLE-3?
&lt;/h2&gt;

&lt;p&gt;I originally wanted to test EAGLE-3, which uses a trained draft head instead of n-gram lookup. Three problems:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;No EAGLE-3 draft model exists for Gemma 4 (no one has trained one)&lt;/li&gt;
&lt;li&gt;The llama.cpp EAGLE-3 PR (#18039) is still open and in draft as of April 5, 2026&lt;/li&gt;
&lt;li&gt;The PR's own benchmarks show MoE models getting roughly 0.89-1.06x on certain prompts, with some actually slower due to the expert activation overhead during batch verification&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Even with a trained draft head, the fundamental bandwidth constraint on consumer GPUs would remain.&lt;/p&gt;

&lt;h2&gt;
  
  
  What actually helps on consumer GPUs
&lt;/h2&gt;

&lt;p&gt;If you're running local LLMs on consumer hardware, here's what actually moves the needle:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Flash attention&lt;/strong&gt;: Already standard, significant memory savings&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;KV cache quantization&lt;/strong&gt;: q4_0 or q8_0 reduces cache memory pressure without meaningful quality loss&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MoE over dense&lt;/strong&gt;: Gemma 4 activates ~4B parameters per token vs Qwen3-32B's 32B. That's the primary driver of the throughput difference, though MoE routing overhead means the speedup isn't a clean 8x ratio.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-GPU split&lt;/strong&gt;: Doubles your available memory bandwidth, which is the actual bottleneck&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context size tuning&lt;/strong&gt;: Smaller context = less KV cache = more VRAM headroom&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The benchmarking lesson
&lt;/h2&gt;

&lt;p&gt;The biggest takeaway wasn't about speculative decoding. It was about benchmark methodology.&lt;/p&gt;

&lt;p&gt;If I'd only tested with repeated prompts, I would have reported a 4.75x speedup and been completely wrong. The n-gram cache is doing something real, but only in a narrow scenario where outputs are highly repetitive or templated. For interactive chat, coding assistance, or any workload with diverse inputs, it provides no benefit on this hardware.&lt;/p&gt;

&lt;p&gt;Be skeptical of speculative decoding benchmarks that don't disclose their prompt diversity. And if you see someone reporting huge n-gram gains, check if they're running the same prompt over and over.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try it yourself
&lt;/h2&gt;

&lt;p&gt;Everything I tested runs on Kubernetes via LLMKube. The InferenceService CRD's &lt;code&gt;extraArgs&lt;/code&gt; field makes it trivial to swap between configs without touching your deployment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;inference.llmkube.dev/v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;InferenceService&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gemma4-spec-bench&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;modelRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gemma4-26b-a4b&lt;/span&gt;
  &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ghcr.io/ggml-org/llama.cpp:server-cuda&lt;/span&gt;
  &lt;span class="na"&gt;contextSize&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8192&lt;/span&gt;
  &lt;span class="na"&gt;flashAttention&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;extraArgs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--spec-type"&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ngram-mod"&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--draft-max"&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;64"&lt;/span&gt;
  &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;gpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;LLMKube is open source, Apache 2.0: &lt;a href="https://github.com/defilantech/llmkube" rel="noopener noreferrer"&gt;github.com/defilantech/llmkube&lt;/a&gt;&lt;/p&gt;

</description>
      <category>llm</category>
      <category>kubernetes</category>
      <category>gpu</category>
      <category>ai</category>
    </item>
    <item>
      <title>Google Released Gemma 4 Yesterday. I Had It Fixing Real Bugs by Lunch.</title>
      <dc:creator>Christopher Maher</dc:creator>
      <pubDate>Fri, 03 Apr 2026 16:34:48 +0000</pubDate>
      <link>https://dev.to/defilan/google-released-gemma-4-yesterday-i-had-it-fixing-real-bugs-by-lunch-cp0</link>
      <guid>https://dev.to/defilan/google-released-gemma-4-yesterday-i-had-it-fixing-real-bugs-by-lunch-cp0</guid>
      <description>&lt;p&gt;Google released Gemma 4 yesterday. By the time I went to bed, I had it deployed on my home lab, running real coding benchmarks at 96 tokens per second.&lt;/p&gt;

&lt;p&gt;The catch: no official llama.cpp image supported the &lt;code&gt;gemma4&lt;/code&gt; architecture yet. The stock CUDA images crash with &lt;code&gt;unknown model architecture: 'gemma4'&lt;/code&gt;. So I built it from source, on the same Kubernetes cluster that serves inference.&lt;/p&gt;

&lt;p&gt;This post is about what it took to go from "model dropped" to "running in production" in about two hours on consumer hardware.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Setup
&lt;/h2&gt;

&lt;p&gt;My home inference server (I call it ShadowStack):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;2x NVIDIA RTX 5060 Ti (16GB each, 32GB total VRAM)&lt;/li&gt;
&lt;li&gt;AMD Ryzen 9 7900X, 64GB DDR5&lt;/li&gt;
&lt;li&gt;Ubuntu 24.04, MicroK8s&lt;/li&gt;
&lt;li&gt;NVIDIA driver 590.48.01 (CUDA 13.1)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Everything is managed by &lt;a href="https://github.com/defilantech/llmkube" rel="noopener noreferrer"&gt;LLMKube&lt;/a&gt;, a Kubernetes operator I built for running llama.cpp inference. One CRD to define the model, one CRD to define the service, the operator handles the rest.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1: The Architecture Problem
&lt;/h2&gt;

&lt;p&gt;On my first attempt, I tried the &lt;code&gt;server-cuda13&lt;/code&gt; image (the CUDA 13 build of llama.cpp):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;llama_model_load: error loading model: error loading model architecture: unknown model architecture: 'gemma4'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Gemma 4 architecture hadn't shipped in any released llama.cpp build yet; support existed only at HEAD.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 2: Build From HEAD On-Cluster
&lt;/h2&gt;

&lt;p&gt;I have a Kaniko build pipeline on the cluster from a previous project (TurboQuant benchmarking). I wrote a Dockerfile that clones llama.cpp HEAD and builds with CUDA targeting SM 86 (Ampere) and SM 120 (Blackwell):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;nvidia/cuda:12.8.0-devel-ubuntu24.04&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;builder&lt;/span&gt;

&lt;span class="k"&gt;RUN &lt;/span&gt;git clone &lt;span class="nt"&gt;--depth&lt;/span&gt; 1 https://github.com/ggml-org/llama.cpp.git /build/llama.cpp

&lt;span class="k"&gt;WORKDIR&lt;/span&gt;&lt;span class="s"&gt; /build/llama.cpp&lt;/span&gt;

&lt;span class="k"&gt;RUN &lt;/span&gt;&lt;span class="nb"&gt;ln&lt;/span&gt; &lt;span class="nt"&gt;-s&lt;/span&gt; /usr/local/cuda/lib64/stubs/libcuda.so &lt;span class="se"&gt;\
&lt;/span&gt;          /usr/local/cuda/lib64/stubs/libcuda.so.1
&lt;span class="k"&gt;ENV&lt;/span&gt;&lt;span class="s"&gt; LIBRARY_PATH=/usr/local/cuda/lib64/stubs:${LIBRARY_PATH}&lt;/span&gt;

&lt;span class="k"&gt;RUN &lt;/span&gt;cmake &lt;span class="nt"&gt;-B&lt;/span&gt; build &lt;span class="se"&gt;\
&lt;/span&gt;    &lt;span class="nt"&gt;-DGGML_CUDA&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ON &lt;span class="se"&gt;\
&lt;/span&gt;    &lt;span class="nt"&gt;-DCMAKE_CUDA_ARCHITECTURES&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"86;120"&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;    &lt;span class="nt"&gt;-DCMAKE_BUILD_TYPE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;Release &lt;span class="se"&gt;\
&lt;/span&gt;    &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; cmake &lt;span class="nt"&gt;--build&lt;/span&gt; build &lt;span class="nt"&gt;--target&lt;/span&gt; llama-server &lt;span class="nt"&gt;-j&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;nproc&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A Kaniko Job on the cluster built this in about 15 minutes and pushed it to my local container registry. The same cluster that runs inference also builds its own inference server. No external CI needed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 3: Deploy
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;llmkube deploy gemma4-26b &lt;span class="nt"&gt;--gpu&lt;/span&gt; &lt;span class="nt"&gt;--accelerator&lt;/span&gt; cuda &lt;span class="nt"&gt;--gpu-count&lt;/span&gt; 2 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--source&lt;/span&gt; https://huggingface.co/Trilogix1/gemma-4-26B-A4B-it-GGUF/resolve/main/gemma-4-26B-A4B-it-Q4_K_M.gguf &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--image&lt;/span&gt; registry.defilan.net/llama-server-latest:gemma4 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--flash-attn&lt;/span&gt; &lt;span class="nt"&gt;--jinja&lt;/span&gt; &lt;span class="nt"&gt;--context&lt;/span&gt; 32768
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The model is 15.6 GB at Q4_K_M. With both GPUs, that leaves about 16 GB for KV cache. Plenty for 32K context.&lt;/p&gt;
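&lt;p&gt;The budget math, for the record (a quick sketch using the figures above):&lt;/p&gt;

```python
# VRAM budget for the dual-GPU Gemma 4 deploy above.
# Figures are from the post; everything left over goes to
# KV cache plus runtime overhead.
total_vram_gb = 2 * 16.0      # 2x RTX 5060 Ti
model_gb = 15.6               # Gemma 4 26B-A4B-it at Q4_K_M
headroom_gb = total_vram_gb - model_gb
print(f"Headroom for KV cache and overhead: {headroom_gb:.1f} GB")
```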

&lt;p&gt;The operator downloaded the model, created the Deployment with the right GPU flags, set up health probes, and exposed an OpenAI-compatible endpoint. From the deploy command to the first inference request was about 3 minutes (mostly model download time).&lt;/p&gt;

&lt;h2&gt;
  
  
  The Numbers
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Single Request
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Generation&lt;/td&gt;
&lt;td&gt;96 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Prompt processing&lt;/td&gt;
&lt;td&gt;128 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Model size (Q4_K_M)&lt;/td&gt;
&lt;td&gt;15.6 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Active parameters per token&lt;/td&gt;
&lt;td&gt;4B (MoE)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Under Load (4 concurrent workers, 2 minutes)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Aggregate throughput&lt;/td&gt;
&lt;td&gt;170 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total requests&lt;/td&gt;
&lt;td&gt;110&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Error rate&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;P50 latency&lt;/td&gt;
&lt;td&gt;~2s per request&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For context, the generic benchmarks floating around say Gemma 4 26B-A4B "exceeds 40 tok/s on consumer hardware." I'm seeing 96 tok/s on a single request and 170 tok/s aggregate under concurrent load. The dual-GPU split and the MoE architecture (only 4B parameters active per token) make this model surprisingly fast.&lt;/p&gt;

&lt;h2&gt;
  
  
  Real Coding Benchmarks
&lt;/h2&gt;

&lt;p&gt;I didn't just run "hello world" tests. I fed it actual bug reports from my own project and asked it to generate fixes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Bug: GPU Rolling Update Deadlock
&lt;/h3&gt;

&lt;p&gt;The issue: Kubernetes rolling updates deadlock on GPU workloads because the new pod can't schedule (old pod holds GPUs) and the old pod won't terminate (waiting for new pod to be Ready).&lt;/p&gt;

&lt;p&gt;Gemma 4's response: correctly identified that GPU workloads should use &lt;code&gt;Recreate&lt;/code&gt; strategy instead of &lt;code&gt;RollingUpdate&lt;/code&gt;, with a conditional check on GPU count. Showed the chain-of-thought reasoning, considered edge cases, and verified against the pattern before outputting.&lt;/p&gt;

&lt;p&gt;Time: 10.6 seconds for a 1024-token response including the full reasoning chain.&lt;/p&gt;

&lt;h3&gt;
  
  
  Bug: Stale Endpoints After Deletion
&lt;/h3&gt;

&lt;p&gt;The issue: deleting an InferenceService leaves orphaned Kubernetes Endpoints.&lt;/p&gt;

&lt;p&gt;Gemma 4's response: generated a complete &lt;code&gt;UnregisterEndpoint&lt;/code&gt; method with DNS name sanitization, Service and Endpoints deletion, &lt;code&gt;NotFound&lt;/code&gt; error handling, and logging. Production-quality Go code on the first try.&lt;/p&gt;

&lt;p&gt;Time: 11.1 seconds.&lt;/p&gt;

&lt;h3&gt;
  
  
  Code Generation: Ginkgo BDD Tests
&lt;/h3&gt;

&lt;p&gt;I asked it to write tests following an existing pattern in the codebase. It generated 4 correct test cases with &lt;code&gt;BeforeEach&lt;/code&gt; setup, proper assertions, and the right Gomega matchers. Used &lt;code&gt;ContainElements&lt;/code&gt; for present checks and &lt;code&gt;NotTo(ContainElement())&lt;/code&gt; for absent checks, matching the exact conventions from the rest of the test suite.&lt;/p&gt;

&lt;p&gt;Time: 12.3 seconds.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Actually Means
&lt;/h2&gt;

&lt;p&gt;I'm not claiming Gemma 4 replaces Claude or GPT-4. It doesn't. The reasoning is shallower on complex multi-step problems, and it occasionally cuts off mid-response at the token limit.&lt;/p&gt;

&lt;p&gt;What I am claiming: the gap between "Google releases a new model" and "it's running on your hardware fixing real bugs" has shrunk to hours, not weeks. The pieces are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;GGUF quantization appears on HuggingFace within hours of a model release&lt;/li&gt;
&lt;li&gt;llama.cpp HEAD usually has architecture support on day one (the tokenizer and template fixes were already committed)&lt;/li&gt;
&lt;li&gt;Kaniko or similar tools let you build from source on-cluster without a separate CI pipeline&lt;/li&gt;
&lt;li&gt;A Kubernetes operator (in my case, LLMKube) lets you deploy with one command and get health checks, metrics, and an OpenAI-compatible API&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is the same workflow regardless of whether the model is Gemma 4, Qwen3.5, Llama, or whatever ships next week. The infrastructure is model-agnostic.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Hardware Math
&lt;/h2&gt;

&lt;p&gt;This entire setup cost about $2,400:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;2x RTX 5060 Ti: ~$800&lt;/li&gt;
&lt;li&gt;Ryzen 9 7900X + motherboard + RAM + SSD + case + PSU: ~$1,600&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Running 24/7, the system draws about 50-60W idle and 500-600W under full inference load. At $0.12/kWh, that's roughly $30-50/month in electricity under heavy sustained use; a mostly-idle month runs closer to $5.&lt;/p&gt;
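&lt;p&gt;The estimate is easy to reproduce (a minimal sketch; the idle-vs-loaded duty cycle is the assumption that moves the answer):&lt;/p&gt;

```python
# Monthly electricity cost at $0.12/kWh, 24/7 operation.
# Idle and loaded draws are the figures stated in the post.
def monthly_usd(watts, rate_per_kwh=0.12, hours=24 * 30):
    return watts / 1000 * hours * rate_per_kwh

idle = monthly_usd(55)      # mostly idle: about $4.75/month
loaded = monthly_usd(550)   # sustained load: about $47.52/month
```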

&lt;p&gt;Compare to API costs: at OpenAI's pricing for a comparable model, 110 requests in 2 minutes would cost roughly $5-10. Scale that to continuous use and the hardware pays for itself in a month or two.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;p&gt;LLMKube is open source (Apache 2.0): &lt;a href="https://github.com/defilantech/llmkube" rel="noopener noreferrer"&gt;github.com/defilantech/llmkube&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you have a GPU and a Kubernetes cluster (even a single-node K3s or MicroK8s), you can deploy any GGUF model with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;helm &lt;span class="nb"&gt;install &lt;/span&gt;llmkube llmkube/llmkube
llmkube deploy llama-3.1-8b &lt;span class="nt"&gt;--gpu&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For Gemma 4 specifically, you'll need a custom llama.cpp image until the official builds ship with &lt;code&gt;gemma4&lt;/code&gt; architecture support. The Dockerfile above works.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Benchmarks run on April 2, 2026 on ShadowStack (2x RTX 5060 Ti, 32GB VRAM, Blackwell SM 12.0, CUDA 13.1, driver 590.48.01). Gemma 4 26B-A4B-it Q4_K_M via llama.cpp built from HEAD commit f851fa5a.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>llm</category>
      <category>homelab</category>
      <category>ai</category>
    </item>
    <item>
      <title>I Tested TurboQuant KV Cache Compression on Consumer GPUs. Here's What Actually Happened.</title>
      <dc:creator>Christopher Maher</dc:creator>
      <pubDate>Mon, 30 Mar 2026 15:12:24 +0000</pubDate>
      <link>https://dev.to/defilan/i-tested-turboquant-kv-cache-compression-on-consumer-gpus-heres-what-actually-happened-beg</link>
      <guid>https://dev.to/defilan/i-tested-turboquant-kv-cache-compression-on-consumer-gpus-heres-what-actually-happened-beg</guid>
      <description>&lt;p&gt;I spent this weekend testing TurboQuant KV cache compression on my home lab Kubernetes cluster. The paper (ICLR 2026, Google Research) promises up to 4.57x compression of the KV cache with minimal quality loss. That sounded like exactly what I needed. I'm always bumping up against VRAM limits trying to run larger models or longer contexts on consumer hardware.&lt;/p&gt;

&lt;p&gt;Here's what I found: it works, but there are real tradeoffs nobody's talking about yet.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem: KV Cache Eats Your VRAM
&lt;/h2&gt;

&lt;p&gt;If you've run LLMs locally, you know the drill. You load a 32B model that fits in 20GB of VRAM, set the context to 32K, and suddenly you're at 28GB. The model weights didn't change. It's the KV cache growing linearly with context length.&lt;/p&gt;

&lt;p&gt;For every token in the context, the model stores key and value vectors for every attention head at every layer. In FP16, that adds up fast. A 32B model at 32K context can burn through 8+ GB of VRAM just for the KV cache.&lt;/p&gt;

&lt;p&gt;TurboQuant's approach is to apply a Walsh-Hadamard Transform (WHT) rotation to KV cache vectors before quantizing them to 3 bits. The rotation "gaussianizes" the distribution, making scalar quantization much more effective. The result is TQ3_0: roughly 3.5 bits per element (3-bit values plus quantization scales) instead of 16, for a theoretical 4.57x compression.&lt;/p&gt;
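&lt;p&gt;To see how fast the cache adds up, here's the standard per-token math (a sketch; the layer and head counts below are assumptions for a 32B-class GQA model, not measured from this setup):&lt;/p&gt;

```python
# FP16 KV cache: 2 vectors (K and V) per layer, each kv_heads * head_dim
# elements at 2 bytes each. Shape values are assumptions for illustration.
layers, kv_heads, head_dim = 64, 8, 128
bytes_per_token = 2 * layers * kv_heads * head_dim * 2   # K+V, FP16
ctx = 32 * 1024
fp16_gib = bytes_per_token * ctx / 2**30
tq3_gib = fp16_gib / 4.57               # theoretical TQ3_0 compression
print(f"{fp16_gib:.1f} GiB FP16 vs {tq3_gib:.2f} GiB TQ3_0 at 32K context")
```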

&lt;h2&gt;
  
  
  My Setup
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Hardware&lt;/strong&gt;: ShadowStack, my home inference server&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;2x NVIDIA RTX 5060 Ti (16GB GDDR7 each, 32GB total)&lt;/li&gt;
&lt;li&gt;AMD Ryzen 9 7900X, 64GB DDR5&lt;/li&gt;
&lt;li&gt;Ubuntu 24.04, MicroK8s&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Software&lt;/strong&gt;: &lt;a href="https://github.com/defilantech/llmkube" rel="noopener noreferrer"&gt;LLMKube&lt;/a&gt;, an open-source Kubernetes operator I built for managing llama.cpp inference workloads. It handles model downloads, GPU scheduling, multi-GPU sharding, health probes, and Prometheus metrics through Kubernetes CRDs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TurboQuant build&lt;/strong&gt;: I used the &lt;a href="https://github.com/animehacker/llama-turboquant" rel="noopener noreferrer"&gt;animehacker/llama-turboquant&lt;/a&gt; fork, which has working CUDA kernels for the WHT-based TQ3_0 type. This is a Stage 1 implementation (no QJL residual correction from the full paper). I built it with Kaniko directly on my cluster targeting SM 86 (Ampere) and SM 120 (Blackwell).&lt;/p&gt;

&lt;h3&gt;
  
  
  The Wrapper Entrypoint Pattern
&lt;/h3&gt;

&lt;p&gt;LLMKube's InferenceService CRD doesn't have a &lt;code&gt;--cache-type&lt;/code&gt; flag yet, so I built a custom Docker image with a wrapper entrypoint that injects the TurboQuant flags transparently:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;
&lt;span class="c"&gt;# entrypoint.sh - passes through all LLMKube args, appends TQ flags&lt;/span&gt;
&lt;span class="nv"&gt;TQ_CACHE_TYPE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;TQ_CACHE_TYPE&lt;/span&gt;&lt;span class="k"&gt;:-&lt;/span&gt;&lt;span class="nv"&gt;tq3_0&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="nv"&gt;TQ_ENABLED&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;TQ_ENABLED&lt;/span&gt;&lt;span class="k"&gt;:-&lt;/span&gt;&lt;span class="nv"&gt;true&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;TQ_ENABLED&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"true"&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
    &lt;/span&gt;&lt;span class="nb"&gt;exec &lt;/span&gt;llama-server &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$@&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;--cache-type-k&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;TQ_CACHE_TYPE&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;--cache-type-v&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;TQ_CACHE_TYPE&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="k"&gt;else
    &lt;/span&gt;&lt;span class="nb"&gt;exec &lt;/span&gt;llama-server &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$@&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="k"&gt;fi&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Using &lt;code&gt;exec&lt;/code&gt; is important. It makes llama-server PID 1 so Kubernetes health probes and signal handling work correctly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Benchmark Methodology
&lt;/h2&gt;

&lt;p&gt;Apples-to-apples. Same model weights, same context size, same concurrency. The only variable was the KV cache type (FP16 vs TQ3_0). Flash attention was enabled for all tests.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Throughput test&lt;/strong&gt;: 5 minutes of sustained load at 4 concurrent requests, 8K context.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Context sweep&lt;/strong&gt;: Deploy at each context size (4K through 131K), run a 2-minute stress test, record VRAM via nvidia-smi.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Models tested&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Llama 3.1 8B (Q5_K_M), small model with lots of headroom&lt;/li&gt;
&lt;li&gt;Qwen 2.5 14B (Q5_K_M), medium model that fills one GPU&lt;/li&gt;
&lt;li&gt;Qwen 2.5 32B (Q4_K_M), large model that requires both GPUs&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Results: Throughput
&lt;/h2&gt;

&lt;p&gt;This is where TurboQuant hurts.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Variant&lt;/th&gt;
&lt;th&gt;Gen tok/s&lt;/th&gt;
&lt;th&gt;Prompt tok/s&lt;/th&gt;
&lt;th&gt;Requests (5min)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Llama 8B&lt;/td&gt;
&lt;td&gt;FP16 cache&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;50.0&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;565.5&lt;/td&gt;
&lt;td&gt;771&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Llama 8B&lt;/td&gt;
&lt;td&gt;TQ3_0 cache&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;8.4&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;93.4&lt;/td&gt;
&lt;td&gt;74&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen 14B&lt;/td&gt;
&lt;td&gt;FP16 cache&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;28.1&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;122.0&lt;/td&gt;
&lt;td&gt;128&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen 14B&lt;/td&gt;
&lt;td&gt;TQ3_0 cache&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;5.3&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;63.4&lt;/td&gt;
&lt;td&gt;53&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen 32B&lt;/td&gt;
&lt;td&gt;FP16 cache&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;14.3&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;133.3&lt;/td&gt;
&lt;td&gt;108&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen 32B&lt;/td&gt;
&lt;td&gt;TQ3_0 cache&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;5.5&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;85.5&lt;/td&gt;
&lt;td&gt;53&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Generation throughput dropped 2.6-6x depending on the model, with the smallest model taking the biggest hit. Prompt processing dropped roughly 1.6-6x. This is consistent with what the PR benchmarks showed on CPU, but I expected Blackwell's tensor cores to help more than they did. The animehacker CUDA kernels were optimized for Ampere (SM 86), not Blackwell (SM 120), so there's likely performance left on the table.&lt;/p&gt;
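&lt;p&gt;The slowdown factors fall straight out of the table:&lt;/p&gt;

```python
# Generation slowdown, FP16 cache vs TQ3_0 cache, from the table above.
gen_tok_s = {
    "llama-8b": (50.0, 8.4),
    "qwen-14b": (28.1, 5.3),
    "qwen-32b": (14.3, 5.5),
}
slowdown = {m: round(fp16 / tq, 1) for m, (fp16, tq) in gen_tok_s.items()}
print(slowdown)  # the 32B model takes the smallest hit
```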

&lt;h2&gt;
  
  
  Results: VRAM Usage
&lt;/h2&gt;

&lt;p&gt;This is where it gets interesting.&lt;/p&gt;

&lt;h3&gt;
  
  
  Llama 3.1 8B, Context Sweep
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Context&lt;/th&gt;
&lt;th&gt;FP16 VRAM (total)&lt;/th&gt;
&lt;th&gt;TQ3_0 VRAM (total)&lt;/th&gt;
&lt;th&gt;Savings&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;4K&lt;/td&gt;
&lt;td&gt;6.4 GB&lt;/td&gt;
&lt;td&gt;10.1 GB&lt;/td&gt;
&lt;td&gt;-58% (worse)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8K&lt;/td&gt;
&lt;td&gt;6.9 GB&lt;/td&gt;
&lt;td&gt;14.3 GB&lt;/td&gt;
&lt;td&gt;-107% (worse)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;16K&lt;/td&gt;
&lt;td&gt;8.0 GB&lt;/td&gt;
&lt;td&gt;22.8 GB&lt;/td&gt;
&lt;td&gt;-185% (worse)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;32K&lt;/td&gt;
&lt;td&gt;10.1 GB&lt;/td&gt;
&lt;td&gt;6.9 GB&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;31% better&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;65K&lt;/td&gt;
&lt;td&gt;14.3 GB&lt;/td&gt;
&lt;td&gt;8.4 GB&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;41% better&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;98K&lt;/td&gt;
&lt;td&gt;18.5 GB&lt;/td&gt;
&lt;td&gt;9.8 GB&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;47% better&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;131K&lt;/td&gt;
&lt;td&gt;22.7 GB&lt;/td&gt;
&lt;td&gt;11.2 GB&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;51% better&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Qwen 2.5 14B, Context Sweep
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Context&lt;/th&gt;
&lt;th&gt;FP16 VRAM (total)&lt;/th&gt;
&lt;th&gt;TQ3_0 VRAM (total)&lt;/th&gt;
&lt;th&gt;Savings&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;4K&lt;/td&gt;
&lt;td&gt;11.1 GB&lt;/td&gt;
&lt;td&gt;16.7 GB&lt;/td&gt;
&lt;td&gt;-50% (worse)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8K&lt;/td&gt;
&lt;td&gt;11.9 GB&lt;/td&gt;
&lt;td&gt;23.0 GB&lt;/td&gt;
&lt;td&gt;-93% (worse)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;16K&lt;/td&gt;
&lt;td&gt;13.4 GB&lt;/td&gt;
&lt;td&gt;11.0 GB&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;18% better&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;32K&lt;/td&gt;
&lt;td&gt;16.6 GB&lt;/td&gt;
&lt;td&gt;11.8 GB&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;29% better&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;65K&lt;/td&gt;
&lt;td&gt;22.8 GB&lt;/td&gt;
&lt;td&gt;13.7 GB&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;40% better&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Qwen 2.5 32B, Context Sweep
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Context&lt;/th&gt;
&lt;th&gt;FP16 VRAM (total)&lt;/th&gt;
&lt;th&gt;TQ3_0 VRAM (total)&lt;/th&gt;
&lt;th&gt;Savings&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;2K&lt;/td&gt;
&lt;td&gt;19.9 GB&lt;/td&gt;
&lt;td&gt;23.7 GB&lt;/td&gt;
&lt;td&gt;-19% (worse)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4K&lt;/td&gt;
&lt;td&gt;20.5 GB&lt;/td&gt;
&lt;td&gt;27.9 GB&lt;/td&gt;
&lt;td&gt;-36% (worse)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8K&lt;/td&gt;
&lt;td&gt;21.6 GB&lt;/td&gt;
&lt;td&gt;19.8 GB&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;8% better&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;16K&lt;/td&gt;
&lt;td&gt;23.7 GB&lt;/td&gt;
&lt;td&gt;20.3 GB&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;14% better&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;32K&lt;/td&gt;
&lt;td&gt;28.0 GB&lt;/td&gt;
&lt;td&gt;21.4 GB&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;24% better&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  The Surprise: TQ Uses MORE VRAM at Small Contexts
&lt;/h2&gt;

&lt;p&gt;I wasn't expecting this. At 4K-16K context, TQ3_0 consistently used more VRAM than the FP16 baseline. Sometimes dramatically more. Llama 8B at 16K context used 22.8 GB with TQ vs 8.0 GB with FP16.&lt;/p&gt;

&lt;p&gt;My theory: the WHT rotation machinery has a fixed overhead (lookup tables, rotation matrices, codebooks) that gets allocated regardless of context size. When the KV cache is small, this overhead dwarfs the compression savings. The crossover point where TQ starts winning varies by model:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Llama 8B: around 32K context&lt;/li&gt;
&lt;li&gt;Qwen 14B: around 16K context&lt;/li&gt;
&lt;li&gt;Qwen 32B: around 8K context&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Larger models cross over earlier because their per-token KV cache is larger (more layers, more attention heads), so the compression pays off sooner.&lt;/p&gt;
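&lt;p&gt;A toy model makes the crossover concrete (the per-token and overhead byte figures below are hypothetical, chosen only to show the shape of the effect, not measured):&lt;/p&gt;

```python
# Toy crossover model: TQ3_0 saves a fraction of the per-token KV cache
# but pays a fixed allocation overhead. All byte figures are hypothetical.
def crossover_ctx(kv_bytes_per_token, overhead_bytes, ratio=4.57):
    # Context length where the savings first exceed the fixed overhead.
    saved_per_token = kv_bytes_per_token * (1 - 1 / ratio)
    return overhead_bytes / saved_per_token

small = crossover_ctx(kv_bytes_per_token=128 * 1024, overhead_bytes=3 * 2**30)
large = crossover_ctx(kv_bytes_per_token=256 * 1024, overhead_bytes=3 * 2**30)
# Doubling the per-token cache halves the crossover context,
# which is why larger models break even at shorter contexts.
```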

&lt;h2&gt;
  
  
  When Is TurboQuant Worth It?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Use TQ3_0 when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You need 32K+ context on consumer GPUs&lt;/li&gt;
&lt;li&gt;You're hitting VRAM limits and can't afford more hardware&lt;/li&gt;
&lt;li&gt;Throughput isn't critical (batch processing, RAG with long documents, analysis tasks)&lt;/li&gt;
&lt;li&gt;You're running a large model (32B+) where the crossover point is lower&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Don't use TQ3_0 when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Context is under 16K (you'll actually use more VRAM)&lt;/li&gt;
&lt;li&gt;You need interactive throughput (the 5x penalty makes chat unusable)&lt;/li&gt;
&lt;li&gt;You're on Blackwell and want optimal performance (wait for SM 120-optimized kernels)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The sweet spot in my testing was Qwen 32B at 32K context. Baseline uses 28 GB, which is dangerously close to my 32 GB ceiling. One concurrent request could OOM. TQ drops it to 21.4 GB, leaving over 10 GB of headroom for parallel slots or longer contexts.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;The throughput penalty is the main blocker. The animehacker CUDA kernels use a fused MMVQ approach that avoids dequantization during attention, but the WHT butterfly transform still runs 160 integer ops per element in registers. On Blackwell with its new SM architecture, these kernels likely aren't hitting optimal occupancy.&lt;/p&gt;

&lt;p&gt;Things I'm watching:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/ggml-org/llama.cpp/pull/21089" rel="noopener noreferrer"&gt;PR #21089&lt;/a&gt; on ggml-org/llama.cpp, the only open upstream PR for TurboQuant (CPU-only for now)&lt;/li&gt;
&lt;li&gt;Whether &lt;code&gt;ggerganov&lt;/code&gt; engages with it. If he requests changes rather than closing, it'll eventually land.&lt;/li&gt;
&lt;li&gt;SM 120-optimized CUDA kernels. Blackwell has new instructions that could close the throughput gap.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For LLMKube, I'm planning to add &lt;code&gt;cacheTypeK&lt;/code&gt; and &lt;code&gt;cacheTypeV&lt;/code&gt; fields to the InferenceService CRD so users can configure this without the wrapper entrypoint hack. Also an &lt;code&gt;extraArgs&lt;/code&gt; escape hatch for any llama.cpp flag we don't have a typed field for yet.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It Yourself
&lt;/h2&gt;

&lt;p&gt;All the benchmarking infrastructure is in the &lt;a href="https://github.com/defilantech/llmkube" rel="noopener noreferrer"&gt;LLMKube&lt;/a&gt; repo. The operator is open source (Apache 2.0) and handles the full lifecycle: model downloads, GPU scheduling, multi-GPU sharding, health probes, and Prometheus metrics. If you have a GPU cluster and want to test TurboQuant:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Build the custom image from &lt;code&gt;animehacker/llama-turboquant&lt;/code&gt; with &lt;code&gt;-DGGML_CUDA=ON&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Set &lt;code&gt;spec.image&lt;/code&gt; on your InferenceService to point at it&lt;/li&gt;
&lt;li&gt;The wrapper entrypoint handles the rest&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you run these benchmarks on different hardware (A100, RTX 3090, etc.), I'd love to see the numbers. Drop a comment or find me on the &lt;a href="https://discord.gg/5GavYFPBBr" rel="noopener noreferrer"&gt;LLMKube Discord&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Benchmarks run on 2026-03-30 on ShadowStack (2x RTX 5060 Ti, 32GB VRAM, Blackwell SM 12.0, CUDA 13.0).&lt;/em&gt;&lt;/p&gt;

</description>
      <category>llm</category>
      <category>kubernetes</category>
      <category>gpu</category>
      <category>ai</category>
    </item>
    <item>
      <title>The $0 Problem: Why Every Tool Says Your On-Prem Inference is Free</title>
      <dc:creator>Christopher Maher</dc:creator>
      <pubDate>Mon, 23 Mar 2026 15:49:13 +0000</pubDate>
      <link>https://dev.to/defilan/the-0-problem-why-every-tool-says-your-on-prem-inference-is-free-3mcb</link>
      <guid>https://dev.to/defilan/the-0-problem-why-every-tool-says-your-on-prem-inference-is-free-3mcb</guid>
      <description>&lt;p&gt;If you run LLMs on your own hardware, every cost tracking tool in the ecosystem has the same answer for what it costs: &lt;strong&gt;$0&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;OpenCost sees your GPU pods but has no concept of tokens. LiteLLM tracks tokens per user but hardcodes on-prem cost to zero. Langfuse traces requests but only prices cloud APIs. The FinOps Foundation's own working group explicitly says on-premises AI cost is "outside the scope."&lt;/p&gt;

&lt;p&gt;Meanwhile, your GPUs cost real money. The H100s draw 700 watts each. Your electricity bill is real. The three-year amortization on $280K of hardware is real. But no tool computes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;true cost per token = (hardware amortization + electricity x GPU power draw) / tokens per hour
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We built InferCost to fix this.&lt;/p&gt;

&lt;h3&gt;
  
  
  What InferCost does
&lt;/h3&gt;

&lt;p&gt;InferCost is an open-source Kubernetes operator (Apache 2.0) that computes the true cost of running AI inference on your own hardware. It's a single controller pod. No database, no UI to host. It plugs into Prometheus and Grafana you already run.&lt;/p&gt;

&lt;p&gt;You declare your hardware economics in a CRD:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;finops.infercost.ai/v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;CostProfile&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gpu-cluster&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;hardware&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;gpuModel&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;NVIDIA&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;GeForce&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;RTX&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;5060&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Ti"&lt;/span&gt;
    &lt;span class="na"&gt;gpuCount&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
    &lt;span class="na"&gt;purchasePriceUSD&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;960&lt;/span&gt;
    &lt;span class="na"&gt;amortizationYears&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;
  &lt;span class="na"&gt;electricity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;ratePerKWh&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.08&lt;/span&gt;
    &lt;span class="na"&gt;pueFactor&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1.0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;InferCost reads real-time GPU power draw from DCGM, scrapes token counts from your inference engine (llama.cpp, vLLM), does the math, and tells you what your inference actually costs. Per model. Per team. Per token.&lt;/p&gt;

&lt;h3&gt;
  
  
  What we found on real hardware
&lt;/h3&gt;

&lt;p&gt;We deployed InferCost on a homelab running Qwen3-32B on 2x RTX 5060 Ti GPUs. Here are the real numbers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Hourly infrastructure cost&lt;/strong&gt;: $0.053 (amortization + electricity at actual GPU power draw)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost per million tokens&lt;/strong&gt;: $0.41 under sustained load&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monthly projected&lt;/strong&gt;: $38&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then we compared cost per million tokens against cloud APIs (verified pricing as of March 2026):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Provider&lt;/th&gt;
&lt;th&gt;Cloud Cost&lt;/th&gt;
&lt;th&gt;On-Prem Cost&lt;/th&gt;
&lt;th&gt;Savings&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Claude Opus 4.6&lt;/td&gt;
&lt;td&gt;$9.82&lt;/td&gt;
&lt;td&gt;$0.62&lt;/td&gt;
&lt;td&gt;94%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-5.4&lt;/td&gt;
&lt;td&gt;$5.83&lt;/td&gt;
&lt;td&gt;$0.62&lt;/td&gt;
&lt;td&gt;89%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemini 2.5 Pro&lt;/td&gt;
&lt;td&gt;$3.84&lt;/td&gt;
&lt;td&gt;$0.62&lt;/td&gt;
&lt;td&gt;84%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-5.4-nano&lt;/td&gt;
&lt;td&gt;$0.41&lt;/td&gt;
&lt;td&gt;$0.62&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Cloud 34% cheaper&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;That last row matters. When the cheapest cloud model is actually cheaper than your hardware, InferCost tells you. The point is not to prove on-prem always wins. The point is to give you the real numbers so you can decide.&lt;/p&gt;

&lt;h3&gt;
  
  
  A note on how we calculate cost
&lt;/h3&gt;

&lt;p&gt;The $28/month on-prem number is your total infrastructure cost: hardware amortization plus electricity, running 24/7. Your GPUs cost money whether or not they're serving requests. The $0.41 per million tokens is the marginal cost during active inference (what each token costs when the system is busy).&lt;/p&gt;

&lt;p&gt;The savings comparison uses total infrastructure cost because that's the honest number. If your GPUs sit idle half the time, that idle time still costs you. This is the same logic as any hardware TCO calculation: you amortize the full purchase price, not just the hours you used it.&lt;/p&gt;

&lt;p&gt;This means your actual savings percentage depends on utilization. At high utilization (GPUs busy most of the day), the savings are dramatic. At low utilization, the math shifts toward cloud APIs for cheap models. InferCost shows you both realities so you can make the right call for each workload.&lt;/p&gt;
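
&lt;p&gt;One way to see the utilization effect is a simple break-even calculation (my own sketch, using the $28/month and $0.41 per-million-token figures from above):&lt;/p&gt;

```python
# Illustrative break-even sketch, not part of InferCost: on-prem cost is
# fixed per month, while cloud cost scales linearly with token volume.

def breakeven_tokens_per_month(onprem_monthly_usd, cloud_usd_per_million):
    """Monthly token volume above which on-prem beats the cloud price."""
    return onprem_monthly_usd / cloud_usd_per_million * 1_000_000

# $28/month of hardware + electricity vs. a $0.41 per-million-token cloud model:
tokens = breakeven_tokens_per_month(28, 0.41)
print(f"break-even at roughly {tokens / 1_000_000:.0f}M tokens/month")
```

&lt;p&gt;Below that volume, the cheap cloud model wins; above it, the fixed hardware cost is spread thin enough that on-prem wins.&lt;/p&gt;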

&lt;h3&gt;
  
  
  The CLI
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;brew &lt;span class="nb"&gt;install &lt;/span&gt;defilantech/tap/infercost
&lt;span class="nv"&gt;$ &lt;/span&gt;infercost compare &lt;span class="nt"&gt;--monthly&lt;/span&gt;

PROVIDER    MODEL              CLOUD/MONTH  ON-PREM/MONTH  SAVINGS/MONTH
Anthropic   claude-opus-4-6    &lt;span class="nv"&gt;$409&lt;/span&gt;         &lt;span class="nv"&gt;$28&lt;/span&gt;            &lt;span class="nv"&gt;$381&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;93%&lt;span class="o"&gt;)&lt;/span&gt;
OpenAI      gpt-5.4            &lt;span class="nv"&gt;$242&lt;/span&gt;         &lt;span class="nv"&gt;$28&lt;/span&gt;            &lt;span class="nv"&gt;$214&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;88%&lt;span class="o"&gt;)&lt;/span&gt;
Google      gemini-2.5-pro     &lt;span class="nv"&gt;$159&lt;/span&gt;         &lt;span class="nv"&gt;$28&lt;/span&gt;            &lt;span class="nv"&gt;$131&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;82%&lt;span class="o"&gt;)&lt;/span&gt;
Google      gemini-2.5-flash   &lt;span class="nv"&gt;$40&lt;/span&gt;          &lt;span class="nv"&gt;$28&lt;/span&gt;            &lt;span class="nv"&gt;$12&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;30%&lt;span class="o"&gt;)&lt;/span&gt;
OpenAI      gpt-5.4-nano       &lt;span class="nv"&gt;$20&lt;/span&gt;          &lt;span class="nv"&gt;$28&lt;/span&gt;            -&lt;span class="nv"&gt;$8&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;cloud cheaper&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  What InferCost is NOT
&lt;/h3&gt;

&lt;p&gt;It is not a cloud API cost tracker. If you want to monitor your OpenAI bill, tools like Helicone and LangSmith do that well. InferCost solves a different problem: the cost of running inference on hardware you own, where the economics involve amortization schedules and electricity bills, not API invoices.&lt;/p&gt;

&lt;p&gt;It is also not locked to any specific inference stack. It works with LLMKube, but also with any Kubernetes deployment that runs llama.cpp or vLLM with Prometheus metrics exposed.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why open source
&lt;/h3&gt;

&lt;p&gt;The organizations that need on-prem cost tracking the most (healthcare, defense, finance, government) are the same ones that can't send cost data to a SaaS dashboard. They chose on-prem for data sovereignty. A cost tracking tool that phones home defeats the purpose.&lt;/p&gt;

&lt;p&gt;InferCost runs entirely in your cluster. Your cost data never leaves your infrastructure. Apache 2.0, no telemetry, no cloud dependency.&lt;/p&gt;

&lt;h3&gt;
  
  
  Get started
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install the CLI&lt;/span&gt;
brew &lt;span class="nb"&gt;install &lt;/span&gt;defilantech/tap/infercost

&lt;span class="c"&gt;# Or deploy via Helm&lt;/span&gt;
helm repo add infercost https://defilantech.github.io/infercost
helm &lt;span class="nb"&gt;install &lt;/span&gt;infercost infercost/infercost &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set&lt;/span&gt; dcgm.endpoint&lt;span class="o"&gt;=&lt;/span&gt;http://dcgm-exporter:9400/metrics
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GitHub&lt;/strong&gt;: &lt;a href="https://github.com/defilantech/infercost" rel="noopener noreferrer"&gt;github.com/defilantech/infercost&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Website&lt;/strong&gt;: &lt;a href="https://infercost.ai" rel="noopener noreferrer"&gt;infercost.ai&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Companion project&lt;/strong&gt;: &lt;a href="https://llmkube.com" rel="noopener noreferrer"&gt;LLMKube&lt;/a&gt; (K8s operator for LLM inference)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you're running inference on your own hardware and want to know what it actually costs, give it a try. Issues and PRs welcome.&lt;/p&gt;

</description>
      <category>finops</category>
      <category>ai</category>
      <category>kubernetes</category>
      <category>opensource</category>
    </item>
    <item>
      <title>llama.cpp on Kubernetes: The Guide I Wish Existed</title>
      <dc:creator>Christopher Maher</dc:creator>
      <pubDate>Tue, 17 Mar 2026 06:50:51 +0000</pubDate>
      <link>https://dev.to/defilan/llamacpp-on-kubernetes-the-guide-i-wish-existed-59nm</link>
      <guid>https://dev.to/defilan/llamacpp-on-kubernetes-the-guide-i-wish-existed-59nm</guid>
      <description>&lt;p&gt;It started at my kitchen table.&lt;/p&gt;

&lt;p&gt;I was spending an evening on my laptop, fascinated by how LLMs actually work under the hood. Not the API calls, not the chat interfaces, but the actual inference process. I installed Ollama on my Mac, pulled a model, and within a few hours I was completely hooked.&lt;/p&gt;

&lt;p&gt;If you've done this yourself, you know the feeling. A language model running on your own hardware. No API keys, no usage limits, no data leaving your network. Just you and the model.&lt;/p&gt;

&lt;p&gt;Ollama made it easy to get started, but I quickly wanted to understand what was happening underneath. That led me to llama.cpp, which Ollama uses under the hood, and that's where things really clicked. I could see exactly how the model was being loaded, how layers were offloaded to the GPU, how the inference loop worked. I went from curious to obsessed pretty quickly.&lt;/p&gt;

&lt;p&gt;But then the questions started piling up.&lt;/p&gt;

&lt;p&gt;How do I serve this to my team? How do I run multiple models? What happens when I want to use the NVIDIA GPUs on my Linux server AND the Metal GPU on my Mac? How do I monitor it? How do I manage model versions?&lt;/p&gt;

&lt;p&gt;I come from a DevOps background, so my brain immediately went to Kubernetes. I figured someone had already built this. And while there are some incredible tools out there (Ollama for single-machine use, vLLM for high-throughput NVIDIA clusters), nothing quite did what I wanted: a Kubernetes operator that treats LLM inference as a first-class workload across heterogeneous hardware, including Apple Silicon.&lt;/p&gt;

&lt;p&gt;So I started building &lt;a href="https://github.com/defilantech/LLMKube" rel="noopener noreferrer"&gt;LLMKube&lt;/a&gt;, an open-source Kubernetes operator for running LLMs with llama.cpp. I'm a big believer in open source, and I wanted this to be open source from day one. The best infrastructure tools are built by communities, not individuals. This guide is everything I've learned along the way.&lt;/p&gt;

&lt;h2&gt;
  
  
  What We're Building Toward
&lt;/h2&gt;

&lt;p&gt;By the end of this guide, you'll understand how to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Run llama.cpp on Kubernetes with proper lifecycle management&lt;/li&gt;
&lt;li&gt;Deploy models with a single command or a two-resource YAML&lt;/li&gt;
&lt;li&gt;Use NVIDIA GPUs with CUDA acceleration&lt;/li&gt;
&lt;li&gt;Use Apple Silicon Macs as GPU inference nodes in your cluster&lt;/li&gt;
&lt;li&gt;Split models across multiple GPUs for larger models&lt;/li&gt;
&lt;li&gt;Monitor everything with Prometheus and Grafana&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you just want to try it out quickly, skip ahead to the hands-on quickstart.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem with "Just Run llama.cpp"
&lt;/h2&gt;

&lt;p&gt;llama.cpp is an outstanding project. It runs on virtually any hardware, supports dozens of model architectures, and the GGUF format has become the standard for local inference. If you need to run one model on one machine, llama.cpp with llama-server is honestly all you need.&lt;/p&gt;

&lt;p&gt;The challenges show up when you want to operationalize it:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Model lifecycle.&lt;/strong&gt; You need to download models, verify their integrity, cache them so pods don't re-download 30GB files on every restart, and keep track of what's deployed where.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GPU scheduling.&lt;/strong&gt; If you have multiple models competing for limited GPU memory, you need something smarter than "first pod wins." Priority queues, memory budgets, and graceful handling of GPU contention all matter when you have real workloads.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Heterogeneous hardware.&lt;/strong&gt; This is the big one. Apple Silicon's Metal GPU can't be accessed from inside a container. Every Kubernetes-based LLM tool I found either ignored Macs entirely or ran them in CPU-only mode, which throws away the best part of the hardware. If you have a Mac Studio with an M4 Ultra sitting on your desk and a Linux server with NVIDIA GPUs in your closet, you shouldn't have to choose between them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Observability.&lt;/strong&gt; If you're already running Prometheus and Grafana (and if you're running Kubernetes, you probably are), you want inference metrics in the same stack as everything else. Tokens per second, prompt processing time, GPU utilization, model load times, all in one place.&lt;/p&gt;

&lt;h2&gt;
  
  
  How LLMKube Approaches This
&lt;/h2&gt;

&lt;p&gt;LLMKube adds two Custom Resource Definitions to your Kubernetes cluster:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Model&lt;/strong&gt; defines what you want to run: the GGUF source URL, quantization level, GPU requirements, and hardware preferences.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;InferenceService&lt;/strong&gt; defines how you want to run it: replicas, resource limits, endpoint configuration, and which Model to reference.&lt;/p&gt;

&lt;p&gt;The operator watches these resources and handles everything in between: downloading the model, creating deployments, configuring health checks, setting up llama-server with the right flags, exposing an OpenAI-compatible API, and cleaning up when you delete resources.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;inference.llmkube.dev/v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Model&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;llama-3-8b&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;source&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf&lt;/span&gt;
  &lt;span class="na"&gt;format&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gguf&lt;/span&gt;
  &lt;span class="na"&gt;quantization&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Q4_K_M&lt;/span&gt;
  &lt;span class="na"&gt;hardware&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;accelerator&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;cuda&lt;/span&gt;
    &lt;span class="na"&gt;gpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;count&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;inference.llmkube.dev/v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;InferenceService&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;llama-3-8b&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;modelRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;llama-3-8b&lt;/span&gt;
  &lt;span class="na"&gt;replicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
  &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2"&lt;/span&gt;
    &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;4Gi"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. The operator takes it from there.&lt;/p&gt;

&lt;h2&gt;
  
  
  My Actual Setup
&lt;/h2&gt;

&lt;p&gt;I want to be transparent about the hardware I run this on, because I think it's important for people to see that you don't need datacenter-grade equipment to make this work.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Shadowstack&lt;/strong&gt; is my primary inference server. It's a desktop PC I built specifically for this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AMD Ryzen 9 7900X (12 cores / 24 threads)&lt;/li&gt;
&lt;li&gt;64GB DDR5-6000&lt;/li&gt;
&lt;li&gt;2x NVIDIA RTX 5060 Ti (16GB VRAM each, 32GB total)&lt;/li&gt;
&lt;li&gt;Samsung 990 Pro 1TB NVMe&lt;/li&gt;
&lt;li&gt;Running MicroK8s as a single-node Kubernetes cluster&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Mac Studio&lt;/strong&gt; (M4 Ultra, 36GB unified memory) runs the Metal Agent, which lets Kubernetes orchestrate llama-server natively on macOS with full Metal GPU access.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mac Mini&lt;/strong&gt; handles other orchestration workloads.&lt;/p&gt;

&lt;p&gt;On Shadowstack, I run &lt;strong&gt;Qwen3 32B&lt;/strong&gt; with the model split across both 5060 Tis using tensor parallelism. On the Mac Studio, I run &lt;strong&gt;Qwen 30B-A3B&lt;/strong&gt; (a mixture-of-experts model that fits comfortably in 36GB of unified memory). Both are managed by the same LLMKube operator, using the same CRDs, visible through the same monitoring stack.&lt;/p&gt;

&lt;p&gt;Is 36GB of unified memory on the Mac Studio less than I wish I had? Sure. But it still runs a 30B MoE model for real workloads, and that's the point. You work with the hardware you have.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Metal Agent: Running Apple Silicon in Your Cluster
&lt;/h2&gt;

&lt;p&gt;This is the part that gets me the most excited, and the part that I haven't seen anyone else solve.&lt;/p&gt;

&lt;p&gt;Here's the core problem: Apple Silicon GPUs use Metal, not CUDA. Metal isn't accessible from inside a Docker container. So if you put a Mac in your Kubernetes cluster and deploy a pod to it, that pod can only use the CPU. Your M4 Ultra's GPU sits idle.&lt;/p&gt;

&lt;p&gt;The Metal Agent works around this by inverting the typical Kubernetes model. Instead of running inference inside a container, the Metal Agent runs as a native macOS daemon that:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Watches the Kubernetes API for InferenceService resources with &lt;code&gt;accelerator: metal&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Spawns llama-server natively on macOS with full Metal GPU access&lt;/li&gt;
&lt;li&gt;Registers the endpoint back into Kubernetes so other services can route to it&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;From the perspective of any other service in your cluster, the model running on your Mac looks like any other Kubernetes-managed endpoint. You can hit the same OpenAI-compatible API, the same health checks work, the same Prometheus metrics are exposed.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# On your Mac&lt;/span&gt;
brew &lt;span class="nb"&gt;install &lt;/span&gt;llama.cpp
llmkube-metal-agent &lt;span class="nt"&gt;--host-ip&lt;/span&gt; 192.168.1.x

&lt;span class="c"&gt;# From anywhere in the cluster&lt;/span&gt;
llmkube deploy qwen-30b-a3b &lt;span class="nt"&gt;--accelerator&lt;/span&gt; metal
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The same CRD that deploys a model on NVIDIA with CUDA deploys on Apple Silicon with Metal. Just change &lt;code&gt;accelerator: cuda&lt;/code&gt; to &lt;code&gt;accelerator: metal&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Multi-GPU: Splitting Models Across Cards
&lt;/h2&gt;

&lt;p&gt;If you want to run models larger than what fits on a single GPU, llama.cpp supports tensor parallelism across multiple GPUs on the same node. LLMKube automates this through the GPU sharding spec.&lt;/p&gt;

&lt;p&gt;On my Shadowstack box, Qwen3 32B (quantized to Q4_K_M, roughly 20GB) gets split across both 5060 Tis. Each GPU handles a portion of the model's layers, and llama.cpp coordinates the inference across both cards.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;hardware&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;accelerator&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;cuda&lt;/span&gt;
    &lt;span class="na"&gt;gpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;count&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
      &lt;span class="na"&gt;sharding&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;strategy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;layer&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The operator automatically calculates the tensor split ratios and passes the right flags to llama-server. On the dual 5060 Ti setup, I see consistent ~53 tokens/second for 3-8B models and solid performance on the 32B model with the split.&lt;/p&gt;
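
&lt;p&gt;For intuition, the ratio calculation can be sketched like this (a hypothetical simplification, not the operator's actual code; llama-server does accept proportions via its &lt;code&gt;--tensor-split&lt;/code&gt; flag):&lt;/p&gt;

```python
# Hypothetical sketch: derive --tensor-split proportions from per-GPU VRAM.
# Not LLMKube's actual implementation; llama-server's --tensor-split flag is real.

def tensor_split_ratios(vram_gb_per_gpu):
    total = sum(vram_gb_per_gpu)
    return [round(v / total, 2) for v in vram_gb_per_gpu]

def tensor_split_flag(vram_gb_per_gpu):
    ratios = tensor_split_ratios(vram_gb_per_gpu)
    return "--tensor-split " + ",".join(str(r) for r in ratios)

print(tensor_split_flag([16, 16]))  # two identical 16GB cards: "--tensor-split 0.5,0.5"
```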

&lt;h2&gt;
  
  
  Hands-On: Try It in 10 Minutes
&lt;/h2&gt;

&lt;p&gt;You don't need my hardware to try this. Here's the quickest path from zero to running inference on Kubernetes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;A Kubernetes cluster (Minikube, kind, K3s, or any managed cluster)&lt;/li&gt;
&lt;li&gt;kubectl configured&lt;/li&gt;
&lt;li&gt;Helm 3&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Install LLMKube
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install the CLI&lt;/span&gt;
brew &lt;span class="nb"&gt;install &lt;/span&gt;defilantech/tap/llmkube

&lt;span class="c"&gt;# Add the Helm repo and install the operator&lt;/span&gt;
helm repo add llmkube https://defilantech.github.io/LLMKube
helm &lt;span class="nb"&gt;install &lt;/span&gt;llmkube llmkube/llmkube &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--namespace&lt;/span&gt; llmkube-system &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--create-namespace&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Deploy Your First Model
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Deploy Phi-4 Mini (3.8B params, from the built-in catalog)&lt;/span&gt;
llmkube deploy phi-4-mini
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That single command creates both the Model and InferenceService resources. The operator downloads the GGUF file, spins up a pod with llama-server, and exposes an OpenAI-compatible API. You can also deploy any GGUF model by providing a &lt;code&gt;--source&lt;/code&gt; URL pointing to HuggingFace or any HTTP endpoint.&lt;/p&gt;

&lt;h3&gt;
  
  
  Query It
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Port-forward and test&lt;/span&gt;
kubectl port-forward svc/phi-4-mini 8080:8080 &amp;amp;

curl http://localhost:8080/v1/chat/completions &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "messages": [
      {"role": "user", "content": "What is Kubernetes in one sentence?"}
    ],
    "max_tokens": 100
  }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Use It With the OpenAI SDK
&lt;/h3&gt;

&lt;p&gt;Since the API is OpenAI-compatible, you can point any OpenAI SDK client at it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:8080/v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;not-needed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;phi-4-mini&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Hello!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This works with LangChain, LlamaIndex, and anything else that speaks the OpenAI API.&lt;/p&gt;

&lt;h3&gt;
  
  
  Add GPU Acceleration
&lt;/h3&gt;

&lt;p&gt;If you have an NVIDIA GPU available in your cluster:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;llmkube deploy llama-3.1-8b &lt;span class="nt"&gt;--gpu&lt;/span&gt; &lt;span class="nt"&gt;--gpu-count&lt;/span&gt; 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The difference is dramatic. On an NVIDIA L4 in GKE, prompt processing goes from 29 tok/s (CPU) to 1,026 tok/s (GPU). Token generation jumps from 4.6 tok/s to 64 tok/s. That's roughly a 14x speedup on generation and 35x on prompt processing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Air-Gapped Deployments
&lt;/h2&gt;

&lt;p&gt;Early in my career, I worked in medical IT. That experience gave me an appreciation for environments where data simply cannot leave the network. Healthcare, defense, finance, government: these industries have strict compliance requirements that make cloud-hosted AI a non-starter.&lt;/p&gt;

&lt;p&gt;LLMKube supports air-gapped deployment through PVC-based model sources with SHA256 integrity verification:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;source&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pvc://model-storage/models/llama-3-8b-q4.gguf&lt;/span&gt;
  &lt;span class="na"&gt;sha256&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;a1b2c3d4e5f6...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You stage models to a PersistentVolumeClaim, provide the checksum, and the operator verifies integrity before deploying. No outbound network calls, no container registry pulls at runtime, no data leaving your network.&lt;/p&gt;
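
&lt;p&gt;The verification step is conceptually simple. Here's a minimal sketch of that kind of check (my illustration, not LLMKube's actual code):&lt;/p&gt;

```python
# Minimal sketch of a SHA256 integrity check over a staged model file.
# Streams in 1 MiB chunks so a 30GB GGUF never has to fit in memory.

import hashlib

def verify_model(path, expected_sha256):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            h.update(chunk)
    return h.hexdigest() == expected_sha256
```

&lt;p&gt;On a mismatch, the operator can refuse to start the pod, which is exactly the behavior you want when the model file is the only artifact crossing the boundary.&lt;/p&gt;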

&lt;p&gt;This is an area where I think llama.cpp really shines for Kubernetes deployments. The GGUF format is a single file. There's no Python dependency tree, no model sharding across dozens of files, no runtime downloads of tokenizers. You put one file on a PVC, point a CRD at it, and you're running.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where LLMKube Fits (and Where It Doesn't)
&lt;/h2&gt;

&lt;p&gt;I want to be honest about this, because there are great tools in this space and picking the right one matters.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If you need maximum throughput for high-concurrency workloads (50+ simultaneous users), use vLLM or SGLang.&lt;/strong&gt; They use PagedAttention, continuous batching, and other optimizations that llama.cpp doesn't have. At scale, vLLM delivers significantly higher request throughput. That's just the reality.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If you just need to run one model on one machine, use Ollama.&lt;/strong&gt; It's simpler, it's elegant, and it handles the single-machine case better than a Kubernetes operator ever will.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LLMKube is for the space in between.&lt;/strong&gt; You have a Kubernetes cluster. You have a mix of hardware (maybe NVIDIA GPUs, maybe Apple Silicon, maybe both). You want Kubernetes-native lifecycle management with CRDs, GitOps workflows, and your inference metrics in the same Prometheus/Grafana stack as everything else. You care about air-gapped deployments, GPU scheduling, and model versioning. You're serving a team or a set of internal workloads, not a public-facing API with thousands of concurrent users.&lt;/p&gt;

&lt;p&gt;If that sounds like your situation, LLMKube might be what you're looking for. If it doesn't, I genuinely hope one of the other tools solves your problem. We all benefit from this ecosystem getting better.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;LLMKube is open source (Apache 2.0) and actively developed. Some things I'm excited about on the roadmap:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Edge deployment support&lt;/strong&gt; for lightweight Kubernetes distributions like K3s and MicroK8s&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AMD GPU support (ROCm)&lt;/strong&gt; with a community contributor already testing on Framework hardware with a Ryzen AI Max+ 395&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;llmkube chat&lt;/code&gt;&lt;/strong&gt; for testing models directly from the CLI without needing curl&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I'll be honest about one thing that comes up a lot: multi-node distributed inference. llama.cpp has an RPC backend that can split a model across machines over ethernet, and I've been watching it closely. The reality is that over consumer networking (1GbE, 2.5GbE), the performance hit from network round-trips makes it marginal for interactive use. Jeff Geerling tested a four-node Framework cluster and got 0.7 tok/s on Llama 405B. The tech is improving, but today my advice is to scale vertically first. Get a bigger GPU or more unified memory before trying to split across machines. If the RPC backend matures to the point where it's genuinely usable over ethernet, LLMKube will support it, but I'm not going to promise something that isn't ready.&lt;/p&gt;

&lt;p&gt;If any of this is interesting to you, I'd love to hear from you. The project is at &lt;a href="https://github.com/defilantech/LLMKube" rel="noopener noreferrer"&gt;github.com/defilantech/LLMKube&lt;/a&gt;, and we have a &lt;a href="https://discord.gg/Ktz85RFHDv" rel="noopener noreferrer"&gt;Discord&lt;/a&gt; where I hang out and talk about this stuff regularly.&lt;/p&gt;

&lt;p&gt;If you hit issues, open a GitHub issue. If you want to contribute, check the issues labeled &lt;code&gt;good-first-issue&lt;/code&gt;. And if you just want to say hi, that's cool too.&lt;/p&gt;

&lt;p&gt;Thanks for reading. I hope this saves you some of the time I spent figuring all this out.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>ai</category>
      <category>opensource</category>
      <category>devops</category>
    </item>
  </channel>
</rss>
