<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Christopher Maher</title>
    <description>The latest articles on DEV Community by Christopher Maher (@defilan).</description>
    <link>https://dev.to/defilan</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3828578%2Fd03de6fc-1dcb-419b-b336-0d9c7d86f7cc.jpeg</url>
      <title>DEV Community: Christopher Maher</title>
      <link>https://dev.to/defilan</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/defilan"/>
    <language>en</language>
    <item>
      <title>TurboQuant on a MacBook Pro: two findings the upstream discussion missed</title>
      <dc:creator>Christopher Maher</dc:creator>
      <pubDate>Tue, 28 Apr 2026 16:38:41 +0000</pubDate>
      <link>https://dev.to/defilan/turboquant-on-a-macbook-pro-two-findings-the-upstream-discussion-missed-5ae7</link>
      <guid>https://dev.to/defilan/turboquant-on-a-macbook-pro-two-findings-the-upstream-discussion-missed-5ae7</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://llmkube.com/blog/turboquant-m5-max-long-context" rel="noopener noreferrer"&gt;llmkube.com/blog/turboquant-m5-max-long-context&lt;/a&gt;. Cross-posted here for the dev.to audience.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A 7-hour overnight bench on an M5 Max, two findings I haven't seen in the upstream community thread, and two PRs back to the LLMKube operator to make TurboQuant a first-class citizen of the InferenceService CRD.&lt;/p&gt;




&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;A TurboQuant-enabled &lt;code&gt;llama-server&lt;/code&gt; on Apple Silicon &lt;strong&gt;runs Qwen3.6-35B-A3B Q8 at up to 1M-token context&lt;/strong&gt; on a 128 GB MacBook Pro M5 Max. Standard &lt;code&gt;f16&lt;/code&gt; KV cache OOMs at 256K. Two findings worth quoting:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;At 128K+ context, the 3-bit KV cache (&lt;code&gt;turbo3&lt;/code&gt;) matches or beats the 8-bit cache (&lt;code&gt;q8_0&lt;/code&gt;) on prompt processing.&lt;/strong&gt; Smaller cache means less memory bandwidth pressure during attention, and the throughput gap that exists at short context flips by ~128K depth.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;turbo3&lt;/code&gt; and &lt;code&gt;turbo4&lt;/code&gt; split by workload phase.&lt;/strong&gt; Long-context &lt;strong&gt;prefill&lt;/strong&gt; favors &lt;code&gt;turbo3&lt;/code&gt; (~27% faster than &lt;code&gt;turbo4&lt;/code&gt; at 256K). Long-context &lt;strong&gt;decode&lt;/strong&gt; favors &lt;code&gt;turbo4&lt;/code&gt; (~11% faster than &lt;code&gt;turbo3&lt;/code&gt; at 256K). They are not interchangeable — different attention bottlenecks dominate during prefill and decode.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We built &lt;a href="https://github.com/TheTom/llama-cpp-turboquant" rel="noopener noreferrer"&gt;TheTom's &lt;code&gt;feature/turboquant-kv-cache&lt;/code&gt; fork of llama.cpp&lt;/a&gt; for Metal, validated on M5 Max, and took two PRs back to LLMKube to make TurboQuant first-class on the InferenceService CRD.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why KV cache, why now
&lt;/h2&gt;

&lt;p&gt;If you're running coding agents locally — single-model or architect+editor combos — the binding constraint isn't model weights. It's KV cache.&lt;/p&gt;

&lt;p&gt;Weights you can quantize once, store on disk, and forget. KV cache is generated &lt;strong&gt;per token of context&lt;/strong&gt; at inference time, sized by the model's depth and head dimensions, and held in working memory the entire session. A 35B-class model with &lt;code&gt;flash-attn&lt;/code&gt; on uses roughly &lt;strong&gt;256 KB of fp16 KV per token&lt;/strong&gt;. That sounds small until you do the multiplication:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Context&lt;/th&gt;
&lt;th&gt;fp16 KV&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;32K&lt;/td&gt;
&lt;td&gt;~8 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;64K&lt;/td&gt;
&lt;td&gt;~16 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;128K&lt;/td&gt;
&lt;td&gt;~32 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;256K&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~64 GB&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;512K&lt;/td&gt;
&lt;td&gt;~128 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1M&lt;/td&gt;
&lt;td&gt;~256 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
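&lt;p&gt;To sanity-check those rows yourself, the arithmetic is just context tokens × 256 KB (a quick sketch using the ~256 KB/token figure above):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# fp16 KV footprint: context tokens x ~256 KB/token (figure quoted above)
for ctx in 32768 65536 131072 262144 524288 1048576; do
  awk -v t="$ctx" 'BEGIN { printf "%8d tokens: %6.1f GB fp16 KV\n", t, t * 256 / 1024 / 1024 }'
done
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;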

&lt;p&gt;A 128 GB MacBook with &lt;code&gt;flash-attn&lt;/code&gt; and &lt;code&gt;mlock&lt;/code&gt; on can fit one 35B model at 128K with f16 KV, just barely. 256K doesn't fit. Co-resident two-model setups (architect + editor) don't fit at all past 64K.&lt;/p&gt;

&lt;p&gt;Standard &lt;code&gt;q8_0&lt;/code&gt; quantization halves the KV footprint with sub-1% perplexity penalty. That gets you to 256K with a single model on the Mac.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TurboQuant&lt;/strong&gt; (&lt;a href="https://research.google/blog/turboquant-redefining-ai-efficiency-with-extreme-compression/" rel="noopener noreferrer"&gt;Google Research, ICLR 2026&lt;/a&gt;, &lt;a href="https://arxiv.org/abs/2504.19874" rel="noopener noreferrer"&gt;arxiv:2504.19874&lt;/a&gt;) compresses further. Randomized Walsh-Hadamard transforms decorrelate KV blocks before scalar quantization, hitting &lt;strong&gt;~3.25 bits per value&lt;/strong&gt; (&lt;code&gt;turbo3&lt;/code&gt;) or &lt;strong&gt;~4.25 bits per value&lt;/strong&gt; (&lt;code&gt;turbo4&lt;/code&gt;) with attention-fidelity loss inside the noise floor of normal sampling variance.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Cache type&lt;/th&gt;
&lt;th&gt;bits/value&lt;/th&gt;
&lt;th&gt;Compression vs fp16&lt;/th&gt;
&lt;th&gt;KV at 256K&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;f16&lt;/td&gt;
&lt;td&gt;16.0&lt;/td&gt;
&lt;td&gt;1.0×&lt;/td&gt;
&lt;td&gt;~64 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;q8_0&lt;/td&gt;
&lt;td&gt;8.0&lt;/td&gt;
&lt;td&gt;2.0×&lt;/td&gt;
&lt;td&gt;~32 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;turbo4&lt;/td&gt;
&lt;td&gt;4.25&lt;/td&gt;
&lt;td&gt;3.8×&lt;/td&gt;
&lt;td&gt;~17 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;turbo3&lt;/td&gt;
&lt;td&gt;3.25&lt;/td&gt;
&lt;td&gt;4.9×&lt;/td&gt;
&lt;td&gt;~13 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
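&lt;p&gt;The compressed rows follow the same arithmetic, scaled by bits-per-value over 16 (a sketch using the bits/value column above):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# KV at 256K per cache type: ~64 GB fp16 scaled by bits/16
for entry in f16:16.0 q8_0:8.0 turbo4:4.25 turbo3:3.25; do
  awk -v e="$entry" 'BEGIN { split(e, a, ":"); printf "%-7s %5.2f bits: %5.1f GB at 256K\n", a[1], a[2], 64 * a[2] / 16 }'
done
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;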

&lt;p&gt;Upstream discussion at &lt;a href="https://github.com/ggml-org/llama.cpp/discussions/20969" rel="noopener noreferrer"&gt;ggml-org/llama.cpp#20969&lt;/a&gt;. It hasn't landed in mainline llama.cpp yet; support is arriving in per-backend forks. &lt;strong&gt;TheTom's fork&lt;/strong&gt; is the Metal-supporting variant.&lt;/p&gt;




&lt;h2&gt;
  
  
  The bench
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;llama-bench&lt;/code&gt; from TheTom's fork build, single Qwen3.6-35B-A3B Q8 model, sweep across cache types and KV-depths.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./build/bin/llama-bench &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-m&lt;/span&gt; Qwen3.6-35B-A3B-Q8_0.gguf &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-ctk&lt;/span&gt; turbo3 &lt;span class="nt"&gt;-ctv&lt;/span&gt; turbo3 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; 0 &lt;span class="nt"&gt;-d&lt;/span&gt; 8192 &lt;span class="nt"&gt;-d&lt;/span&gt; 32768 &lt;span class="nt"&gt;-d&lt;/span&gt; 131072 &lt;span class="nt"&gt;-d&lt;/span&gt; 262144 &lt;span class="nt"&gt;-d&lt;/span&gt; 524288 &lt;span class="nt"&gt;-d&lt;/span&gt; 1048576 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; 512 &lt;span class="nt"&gt;-n&lt;/span&gt; 128 &lt;span class="nt"&gt;-ngl&lt;/span&gt; 99 &lt;span class="nt"&gt;-fa&lt;/span&gt; 1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--threads&lt;/span&gt; 6 &lt;span class="nt"&gt;--batch-size&lt;/span&gt; 2048 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-r&lt;/span&gt; 3 &lt;span class="nt"&gt;-o&lt;/span&gt; md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;-d N&lt;/code&gt; pre-allocates N tokens of KV cache before measuring throughput. Each number is the mean of 3 repetitions. The Metal agent was stopped during the run to keep the memory budget clean. The 1M cell on &lt;code&gt;turbo3&lt;/code&gt; alone took several hours of wall-clock time; the full sweep ran ~7 hours overnight.&lt;/p&gt;




&lt;h2&gt;
  
  
  The numbers
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Generation throughput (tok/s)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Depth&lt;/th&gt;
&lt;th&gt;f16&lt;/th&gt;
&lt;th&gt;q8_0&lt;/th&gt;
&lt;th&gt;turbo3&lt;/th&gt;
&lt;th&gt;turbo4&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;89.4&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;87.4&lt;/td&gt;
&lt;td&gt;79.5&lt;/td&gt;
&lt;td&gt;79.7&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8K&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;84.2&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;79.2&lt;/td&gt;
&lt;td&gt;72.2&lt;/td&gt;
&lt;td&gt;71.2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;32K&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;72.6&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;67.8&lt;/td&gt;
&lt;td&gt;61.5&lt;/td&gt;
&lt;td&gt;61.8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;64K&lt;/td&gt;
&lt;td&gt;60.7&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;128K&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;44.4&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;40.7&lt;/td&gt;
&lt;td&gt;36.0&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;37.7&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;256K&lt;/td&gt;
&lt;td&gt;OOM&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;26.6&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;22.9&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;25.5&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;512K&lt;/td&gt;
&lt;td&gt;OOM&lt;/td&gt;
&lt;td&gt;OOM&lt;/td&gt;
&lt;td&gt;13.3&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;16.0&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1M&lt;/td&gt;
&lt;td&gt;OOM&lt;/td&gt;
&lt;td&gt;OOM&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;6.51&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;OOM&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Prompt processing throughput (tok/s)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Depth&lt;/th&gt;
&lt;th&gt;f16&lt;/th&gt;
&lt;th&gt;q8_0&lt;/th&gt;
&lt;th&gt;turbo3&lt;/th&gt;
&lt;th&gt;turbo4&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2962&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;2948&lt;/td&gt;
&lt;td&gt;2904&lt;/td&gt;
&lt;td&gt;2854&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8K&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2098&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1623&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1653&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1439&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;32K&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1063&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;802&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;784&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;678&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;128K&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;321&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;245&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;253&lt;/strong&gt; ← turbo3 ≥ q8_0&lt;/td&gt;
&lt;td&gt;206&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;256K&lt;/td&gt;
&lt;td&gt;OOM&lt;/td&gt;
&lt;td&gt;124&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;128&lt;/strong&gt; ← turbo3 &amp;gt; q8_0&lt;/td&gt;
&lt;td&gt;101&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;512K&lt;/td&gt;
&lt;td&gt;OOM&lt;/td&gt;
&lt;td&gt;OOM&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;66&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;56&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1M&lt;/td&gt;
&lt;td&gt;OOM&lt;/td&gt;
&lt;td&gt;OOM&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;30.1&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;OOM&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Full grid is final. Bench ran 8h 20m wall-clock.&lt;/p&gt;




&lt;h2&gt;
  
  
  Finding 1: turbo3 beats q8_0 at long context
&lt;/h2&gt;

&lt;p&gt;The framing in the upstream discussion is approximately &lt;em&gt;"turbo3 trades a small (~10%) generation throughput hit for ~2.5× more KV memory headroom."&lt;/em&gt; That's true at short context. At long context, &lt;strong&gt;the trade flips&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;At 128K depth, f16 wins prefill at 321 tok/s, but &lt;strong&gt;turbo3 at 253 tok/s edges out q8_0 at 245 tok/s&lt;/strong&gt;. At 256K (where f16 OOMs), &lt;strong&gt;turbo3 at 128 tok/s beats q8_0 at 124 tok/s&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;What's happening: at 35B-class model size with deep contexts, the GPU spends most of its time during attention reading KV cache from memory rather than computing on it. Smaller cache → less bandwidth pressure → throughput recovers, even though there's more dequantization work per access. The break-even is somewhere between 32K and 128K on M5 Max.&lt;/p&gt;

&lt;p&gt;For coding-agent workloads where context grows monotonically across a session, &lt;strong&gt;this is the regime that matters&lt;/strong&gt;. You're spending most of your tokens at 32K+ depth, not at depth 0.&lt;/p&gt;




&lt;h2&gt;
  
  
  Finding 2: turbo3 and turbo4 split by workload phase
&lt;/h2&gt;

&lt;p&gt;The 25% extra bits per value in &lt;code&gt;turbo4&lt;/code&gt; (4.25 vs 3.25 bits) buys you something specific, and what it buys depends on the phase.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prefill (prompt processing) at long context:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Depth&lt;/th&gt;
&lt;th&gt;turbo3 pp&lt;/th&gt;
&lt;th&gt;turbo4 pp&lt;/th&gt;
&lt;th&gt;turbo3 advantage&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;8K&lt;/td&gt;
&lt;td&gt;1653&lt;/td&gt;
&lt;td&gt;1439&lt;/td&gt;
&lt;td&gt;+15%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;32K&lt;/td&gt;
&lt;td&gt;784&lt;/td&gt;
&lt;td&gt;678&lt;/td&gt;
&lt;td&gt;+16%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;128K&lt;/td&gt;
&lt;td&gt;253&lt;/td&gt;
&lt;td&gt;206&lt;/td&gt;
&lt;td&gt;+23%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;256K&lt;/td&gt;
&lt;td&gt;128&lt;/td&gt;
&lt;td&gt;101&lt;/td&gt;
&lt;td&gt;+27%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;512K&lt;/td&gt;
&lt;td&gt;66&lt;/td&gt;
&lt;td&gt;56&lt;/td&gt;
&lt;td&gt;+18%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Smaller cache means less data to read per attention step; during prefill the GPU pulls huge contiguous batches through attention, and the bandwidth-bound regime favors &lt;code&gt;turbo3&lt;/code&gt; cleanly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Decode (generation) at long context:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Depth&lt;/th&gt;
&lt;th&gt;turbo3 tg&lt;/th&gt;
&lt;th&gt;turbo4 tg&lt;/th&gt;
&lt;th&gt;turbo4 advantage&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;128K&lt;/td&gt;
&lt;td&gt;36.0&lt;/td&gt;
&lt;td&gt;37.7&lt;/td&gt;
&lt;td&gt;+5%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;256K&lt;/td&gt;
&lt;td&gt;22.9&lt;/td&gt;
&lt;td&gt;25.5&lt;/td&gt;
&lt;td&gt;+11%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;512K&lt;/td&gt;
&lt;td&gt;13.3&lt;/td&gt;
&lt;td&gt;16.0&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+20%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;During decode the dequantization overhead per access matters more than total bytes read. &lt;code&gt;turbo4&lt;/code&gt;'s simpler representation (4.25 bits has less complex quantization geometry than 3.25 bits) wins at the per-token attention pass — and the gap &lt;strong&gt;widens&lt;/strong&gt; with depth.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Practical implications by workload:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Workload shape&lt;/th&gt;
&lt;th&gt;Cache type&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Aider/OpenCode coding agents (deep context, lots of generated tokens)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;turbo4&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Wins decode at depth&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RAG-heavy / batch question answering (heavy prefill, short answers)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;turbo3&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Wins prefill at depth&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pure context-window maximization (1M context)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;turbo3&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Only cache type that fits at 1M&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Short-context interactive (≤32K)&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;f16&lt;/code&gt; if it fits, else &lt;code&gt;q8_0&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Both turbos are ~10% slower&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This isn't a framing the upstream community discussion has surfaced clearly. Different bottleneck regimes for different phases, and the right cache type depends on which phase dominates your workload.&lt;/p&gt;
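&lt;p&gt;Translated into server flags, the split looks roughly like this (a sketch; we're assuming the fork's &lt;code&gt;llama-server&lt;/code&gt; accepts the same &lt;code&gt;-ctk&lt;/code&gt;/&lt;code&gt;-ctv&lt;/code&gt; values as its &lt;code&gt;llama-bench&lt;/code&gt;, so check &lt;code&gt;--help&lt;/code&gt; on your build):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Decode-heavy agentic coding at depth: favor turbo4
./build/bin/llama-server -m Qwen3.6-35B-A3B-Q8_0.gguf \
  -c 262144 -ngl 99 -fa 1 -ctk turbo4 -ctv turbo4

# Prefill-heavy RAG / batch question answering: favor turbo3
./build/bin/llama-server -m Qwen3.6-35B-A3B-Q8_0.gguf \
  -c 262144 -ngl 99 -fa 1 -ctk turbo3 -ctv turbo3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;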




&lt;h2&gt;
  
  
  What this enables on a MacBook
&lt;/h2&gt;

&lt;p&gt;Three concrete capabilities:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;256K context for two co-resident coding models.&lt;/strong&gt; turbo3 KV at 256K (~13 GB) plus 37 GB Qwen3.6 weights, alongside Devstral-Small-2-24B at the same context with comparable footprint, totals ~88 GB (arithmetic sketched after this list). Under the 100 GB practical budget.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;1M context for batch / agentic workloads.&lt;/strong&gt; turbo3 KV at 1M is ~52 GB. We measured &lt;strong&gt;30 tok/s prefill, 6.5 tok/s decode at 1M&lt;/strong&gt; on Qwen3.6-35B-A3B Q8. Slow — a 4K-token agent response at 1M context is ~10 minutes wall-clock — but &lt;strong&gt;it works&lt;/strong&gt;. Overnight agentic batches that need the full context window are feasible. As far as we can tell, nobody else has demonstrated this on Apple Silicon yet.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;More headroom for non-attention buffers.&lt;/strong&gt; Cutting KV by 5× makes batch buffers, prefix cache, and draft models for speculative decoding actually composable.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
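&lt;p&gt;The arithmetic behind the two-model budget in item 1, as a sketch (the Devstral weight and KV figures are approximate):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Co-resident budget sketch; Devstral figures are approximate
awk 'BEGIN {
  qwen_w = 37; qwen_kv = 13     # Qwen3.6-35B-A3B Q8 weights + turbo3 KV at 256K (from above)
  dev_w  = 25; dev_kv  = 13     # Devstral-Small-2-24B Q8 weights + comparable KV (approx.)
  printf "co-resident total: %d GB of a ~100 GB practical budget\n", qwen_w + qwen_kv + dev_w + dev_kv
}'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;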




&lt;h2&gt;
  
  
  Caveats
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;TheTom's fork is research-grade.&lt;/strong&gt; Pinned to commit &lt;code&gt;11a241d0d&lt;/code&gt;; rebases needed as upstream moves.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LLMKube's metal-runtime can't drive turbo3/turbo4 yet&lt;/strong&gt; because of &lt;a href="https://github.com/defilantech/LLMKube/issues/349" rel="noopener noreferrer"&gt;#349&lt;/a&gt; and &lt;a href="https://github.com/defilantech/LLMKube/issues/350" rel="noopener noreferrer"&gt;#350&lt;/a&gt;. &lt;a href="https://github.com/defilantech/LLMKube/pull/353" rel="noopener noreferrer"&gt;PR #353&lt;/a&gt; closes #350; #349 is next.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No perplexity numbers in this run.&lt;/strong&gt; Throughput and memory ceilings only. The +1% perplexity penalty for turbo3 in the upstream discussion is on Qwen 3.5 — we'll re-run on Qwen 3.6 in a follow-up.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Single hardware sample.&lt;/strong&gt; M5 Max only. Crossover point and prefill/decode split likely shift with memory bandwidth (614 GB/s on M5 Max) and GPU core count.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  What we contributed back
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/defilantech/LLMKube/pull/351" rel="noopener noreferrer"&gt;LLMKube PR #351&lt;/a&gt;&lt;/strong&gt; (merged): &lt;code&gt;cacheTypeCustomK&lt;/code&gt;/&lt;code&gt;cacheTypeCustomV&lt;/code&gt; on &lt;code&gt;InferenceServiceSpec&lt;/code&gt;. Closes &lt;a href="https://github.com/defilantech/LLMKube/issues/282" rel="noopener noreferrer"&gt;#282&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/defilantech/LLMKube/pull/353" rel="noopener noreferrer"&gt;LLMKube PR #353&lt;/a&gt;&lt;/strong&gt; (open): metal-agent respawns on ISVC spec drift; honors &lt;code&gt;replicas: 0&lt;/code&gt;. Closes &lt;a href="https://github.com/defilantech/LLMKube/issues/350" rel="noopener noreferrer"&gt;#350&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Issues filed:&lt;/strong&gt; &lt;a href="https://github.com/defilantech/LLMKube/issues/349" rel="noopener noreferrer"&gt;#349&lt;/a&gt;, &lt;a href="https://github.com/defilantech/LLMKube/issues/350" rel="noopener noreferrer"&gt;#350&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Comment going to llama.cpp discussion #20969&lt;/strong&gt; with the M5 Max numbers and the prefill/decode split.&lt;/li&gt;
&lt;/ul&gt;
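&lt;p&gt;For a sense of where those fields land, a hypothetical spec snippet (the &lt;code&gt;cacheTypeCustomK&lt;/code&gt;/&lt;code&gt;cacheTypeCustomV&lt;/code&gt; names come from PR #351; the apiVersion and every other field here is illustrative, not the actual LLMKube schema):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Hypothetical: apiVersion and surrounding fields are illustrative only;
# the two cacheTypeCustom* fields are what PR #351 added.
kubectl apply -f - &amp;lt;&amp;lt;'EOF'
apiVersion: llmkube.dev/v1alpha1      # illustrative
kind: InferenceService
metadata:
  name: qwen36-turbo3
spec:
  cacheTypeCustomK: turbo3
  cacheTypeCustomV: turbo3
EOF
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;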




&lt;h2&gt;
  
  
  How to try it yourself
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Build TheTom's fork&lt;/span&gt;
git clone https://github.com/TheTom/llama-cpp-turboquant.git
&lt;span class="nb"&gt;cd &lt;/span&gt;llama-cpp-turboquant
git checkout feature/turboquant-kv-cache
cmake &lt;span class="nt"&gt;-B&lt;/span&gt; build &lt;span class="nt"&gt;-DGGML_METAL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ON &lt;span class="nt"&gt;-DCMAKE_BUILD_TYPE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;Release
cmake &lt;span class="nt"&gt;--build&lt;/span&gt; build &lt;span class="nt"&gt;-j&lt;/span&gt;

&lt;span class="c"&gt;# 2. Run the bench (turbo3 and turbo4 separately to see the split)&lt;/span&gt;
./build/bin/llama-bench &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-m&lt;/span&gt; /path/to/your/model.gguf &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-ctk&lt;/span&gt; turbo3 &lt;span class="nt"&gt;-ctv&lt;/span&gt; turbo3 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; 0 &lt;span class="nt"&gt;-d&lt;/span&gt; 32768 &lt;span class="nt"&gt;-d&lt;/span&gt; 131072 &lt;span class="nt"&gt;-d&lt;/span&gt; 262144 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; 512 &lt;span class="nt"&gt;-n&lt;/span&gt; 128 &lt;span class="nt"&gt;-ngl&lt;/span&gt; 99 &lt;span class="nt"&gt;-fa&lt;/span&gt; 1 &lt;span class="nt"&gt;-r&lt;/span&gt; 3 &lt;span class="nt"&gt;-o&lt;/span&gt; md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Memory ceiling depends on your unified-memory budget; sub-64 GB Macs probably can't reach 256K with a 35B-class model at any cache type. On M3 Pro/Max-class machines, a 13B model at 128K with turbo3 is the more realistic target.&lt;/p&gt;
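&lt;p&gt;To estimate your own ceiling before committing to a long run (a sketch; the 256 KB/token fp16 figure is the 35B-class number quoted earlier and will differ for other models):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Rough max-context estimate: (memory budget - weights) / per-token KV at chosen bits
awk -v budget_gb=100 -v weights_gb=37 -v bits=3.25 'BEGIN {
  kv_kb_per_tok = 256 * bits / 16                    # fp16 is ~256 KB/token for this model
  free_kb = (budget_gb - weights_gb) * 1024 * 1024
  printf "~%.0fK tokens of context headroom\n", free_kb / kv_kb_per_tok / 1024
}'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;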

&lt;p&gt;For NVIDIA: &lt;a href="https://github.com/spiritbuun/llama-cpp-turboquant-cuda" rel="noopener noreferrer"&gt;@spiritbuun's CUDA fork&lt;/a&gt; is the equivalent path.&lt;/p&gt;




&lt;h2&gt;
  
  
  Open invitation
&lt;/h2&gt;

&lt;p&gt;If you have non-M5-Max Apple Silicon (M2 Pro/Max, M3 Ultra, M4 Max) and want to run the same bench, &lt;strong&gt;we want your numbers&lt;/strong&gt;. The crossover point and the prefill/decode split likely shift with memory bandwidth.&lt;/p&gt;

&lt;p&gt;Drop results in &lt;a href="https://github.com/ggml-org/llama.cpp/discussions/20969" rel="noopener noreferrer"&gt;llama.cpp discussion #20969&lt;/a&gt; or open an issue on &lt;a href="https://github.com/defilantech/llmkube" rel="noopener noreferrer"&gt;defilantech/llmkube&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>kubernetes</category>
      <category>opensource</category>
    </item>
    <item>
      <title>62.2% on Aider Polyglot from a MacBook Pro. Then the other model we tried scored 4%. Here's what actually happened, with a working cost loop attached.</title>
      <dc:creator>Christopher Maher</dc:creator>
      <pubDate>Mon, 27 Apr 2026 06:24:59 +0000</pubDate>
      <link>https://dev.to/defilan/628-on-aider-polyglot-from-a-macbook-pro-then-the-other-model-we-tried-scored-4-heres-what-17ed</link>
      <guid>https://dev.to/defilan/628-on-aider-polyglot-from-a-macbook-pro-then-the-other-model-we-tried-scored-4-heres-what-17ed</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://llmkube.com/blog/m5-max-aider-polyglot-and-finops" rel="noopener noreferrer"&gt;llmkube.com/blog/m5-max-aider-polyglot-and-finops&lt;/a&gt;. Cross-posted here for the dev.to audience.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A 24-hour Aider Polyglot run, a follow-up bench that blew up in interesting ways, and a working &lt;code&gt;$/MTok&lt;/code&gt; number from a Kubernetes operator that scrapes Apple Silicon power live. Two open-source PRs landed today to make all of this reproducible on any M-series Mac.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;This is a coding-model benchmark on locally-served weights, plus a FinOps story.&lt;/strong&gt; Every benchmark number traces to results files we can show you. Every cost number traces to a CSV captured by InferCost during the run. The point is the methodology and the tooling; the model rankings are along for the ride.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Qwen3.6-35B-A3B Q8&lt;/strong&gt; (Tongyi Lab, Apache 2.0) hit &lt;strong&gt;62.2% on Aider Polyglot&lt;/strong&gt; (pass_rate_2, n=225/225) running locally on a MacBook Pro M5 Max via LLMKube's Metal Agent. That places it above Claude Sonnet 4 with 32k thinking budget (61.3%), o1-high (61.7%), DeepSeek R1 original (56.9%), and Claude 3.5 Sonnet (51.6%) on the official Aider leaderboard. It also beats every published Qwen-family entry on the Polyglot board.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Devstral-Small-2-2512 Q8&lt;/strong&gt; (Mistral, Apache 2.0) hit &lt;strong&gt;4% on Aider Polyglot diff format&lt;/strong&gt;, &lt;strong&gt;8% on Aider Polyglot whole format&lt;/strong&gt;, and &lt;strong&gt;81.7% on HumanEval+ (164 problems, all passed standard)&lt;/strong&gt;. Same model. 20× swing. Benchmark numbers don't transfer across harnesses, and you should never quote one without naming the other.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;InferCost ran the whole time.&lt;/strong&gt; The new Apple Silicon collector (shipped in &lt;a href="https://github.com/defilantech/infercost/releases/tag/v0.3.0" rel="noopener noreferrer"&gt;InferCost v0.3.0&lt;/a&gt;) reconciled &lt;code&gt;$0.18/hr&lt;/code&gt; against the &lt;code&gt;apple-m5-max&lt;/code&gt; CostProfile, with InferCost's reading agreeing with the LLMKube agent's direct gauge within &lt;code&gt;1.6 W&lt;/code&gt; mean delta over the Qwen window. First widely-published &lt;code&gt;$/MTok&lt;/code&gt; number for an Apple Silicon LLM workload that traces to a real Prometheus scrape.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Two releases shipped alongside this post&lt;/strong&gt; make all of it reproducible on your own Mac: &lt;a href="https://github.com/defilantech/llmkube/releases/tag/v0.7.2" rel="noopener noreferrer"&gt;LLMKube v0.7.2&lt;/a&gt; (Apple power gauges via powermetrics, security-hardened sudoers, and a one-command &lt;code&gt;make install-powermetrics-sudo&lt;/code&gt;) and &lt;a href="https://github.com/defilantech/infercost/releases/tag/v0.3.0" rel="noopener noreferrer"&gt;InferCost v0.3.0&lt;/a&gt; (Metal collector, condition reporting, sample CostProfile).&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  1. The hardware and what's special about it
&lt;/h2&gt;

&lt;p&gt;The bench machine is a MacBook Pro M5 Max, 2026 model:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Spec&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;GPU&lt;/td&gt;
&lt;td&gt;40-core integrated, Metal 4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CPU&lt;/td&gt;
&lt;td&gt;18-core (6 P-cores, 12 E-cores)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Unified memory&lt;/td&gt;
&lt;td&gt;128 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory bandwidth&lt;/td&gt;
&lt;td&gt;614 GB/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OS&lt;/td&gt;
&lt;td&gt;macOS 25.4 (Darwin)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Price&lt;/td&gt;
&lt;td&gt;About $4,500 fully configured&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Source: &lt;a href="https://www.apple.com/newsroom/2026/03/apple-debuts-m5-pro-and-m5-max-to-supercharge-the-most-demanding-pro-workflows/" rel="noopener noreferrer"&gt;Apple newsroom&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The 614 GB/s bandwidth is the constraint that decides everything that follows. For a dense 24B model at Q8, you need to read about 25 GB per generated token, so the upper bound is &lt;code&gt;614 / 25 = 24.56 t/s&lt;/code&gt; and we measured 24 t/s, within 2.3% of the wall. For a MoE like Qwen3.6-35B-A3B, only the ~3B active parameters are read per token, so the wall is ~200 t/s and you actually get to choose how to spend the bandwidth. That's the whole story behind why MoE feels fast on a Mac.&lt;/p&gt;
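&lt;p&gt;The same back-of-the-envelope as a one-liner (a sketch of the arithmetic above):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Decode ceiling = memory bandwidth / bytes read per generated token
awk 'BEGIN {
  bw = 614                                  # GB/s on M5 Max
  printf "dense 24B at Q8:  %.1f t/s ceiling (we measured ~24)\n", bw / 25
  printf "MoE, ~3B active:  ~%.0f t/s ceiling\n", bw / 3
}'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;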

&lt;p&gt;Stack: LLMKube v0.7.x with the Metal Agent feature branch from PR #334 cherry-picked in (now main), &lt;code&gt;llama-server&lt;/code&gt; from llama.cpp Metal, and a kind cluster on the same host for the K8s control plane. InferCost was running locally via &lt;code&gt;go run ./cmd/main.go&lt;/code&gt;, pointed at the LLMKube agent's &lt;code&gt;/metrics&lt;/code&gt; endpoint via a new &lt;code&gt;--metal-endpoint&lt;/code&gt; flag.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. Qwen3.6-35B-A3B Q8 on Aider Polyglot
&lt;/h2&gt;

&lt;p&gt;The Qwen3.6 family includes a dense 27B and an MoE variant at 35B total / 3B active per token. We ran the MoE quantized to Q8_0 (~36 GB on disk, fits comfortably in 128 GB unified memory with room for KV cache and the rest of macOS).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://aider.chat/docs/leaderboards/" rel="noopener noreferrer"&gt;Aider Polyglot&lt;/a&gt; is a 225-problem benchmark across C++, Go, Java, JavaScript, Python, and Rust, designed to keep top frontier coding LLMs in the 5-50% range. Each model gets two attempts per problem: a single-shot solve, and a second attempt after seeing the failed test output. The headline metric is &lt;code&gt;pass_rate_2&lt;/code&gt;, the percentage of problems that passed all tests within those two attempts.&lt;/p&gt;

&lt;p&gt;Aider was driven from inside a Docker container (&lt;code&gt;aider-benchmark&lt;/code&gt; image) talking to llama-server via &lt;code&gt;OPENAI_API_BASE=http://host.docker.internal:&amp;lt;port&amp;gt;/v1&lt;/code&gt;. Edit format was &lt;code&gt;diff&lt;/code&gt; (Aider's standard for capable models). Threads = 4. The model id we passed to LiteLLM was &lt;code&gt;openai/Qwen3.6-35B-A3B-Q8_0.gguf&lt;/code&gt;, the basename llama-server reports.&lt;/p&gt;
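&lt;p&gt;For reference, the shape of the invocation (a sketch that assumes aider's standard &lt;code&gt;benchmark/benchmark.py&lt;/code&gt; entrypoint; the run name and port are placeholders, the model id and flags are the ones described above):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Inside the aider-benchmark container; run name and port are placeholders
export OPENAI_API_BASE=http://host.docker.internal:${PORT}/v1
export OPENAI_API_KEY=dummy     # LiteLLM wants a key set; llama-server ignores it by default
./benchmark/benchmark.py qwen36-m5max-polyglot \
  --model openai/Qwen3.6-35B-A3B-Q8_0.gguf \
  --edit-format diff --threads 4
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;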

&lt;p&gt;The full run accumulated &lt;strong&gt;49.9 hours of inference time&lt;/strong&gt; (summed across the 4 parallel threads) over about 24 hours of real time, plus a follow-up resume cycle to handle a runaway-reasoning failure mode. More on that in §3.&lt;/p&gt;

&lt;h3&gt;
  
  
  The headline result
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;pass_rate_2 = 62.2%&lt;/code&gt; (140 of 225), &lt;code&gt;pass_rate_1 = 34.7%&lt;/code&gt; (78 of 225)&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Verified against the official &lt;a href="https://github.com/Aider-AI/aider/blob/main/aider/website/_data/polyglot_leaderboard.yml" rel="noopener noreferrer"&gt;Aider Polyglot leaderboard yaml&lt;/a&gt; pulled today, here's where that lands among the published baselines:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;pass_rate_2&lt;/th&gt;
&lt;th&gt;format&lt;/th&gt;
&lt;th&gt;model&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;88.0%&lt;/td&gt;
&lt;td&gt;diff&lt;/td&gt;
&lt;td&gt;gpt-5 (high)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;84.9%&lt;/td&gt;
&lt;td&gt;diff&lt;/td&gt;
&lt;td&gt;o3-pro (high)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;81.3%&lt;/td&gt;
&lt;td&gt;diff&lt;/td&gt;
&lt;td&gt;o3 (high)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;79.6%&lt;/td&gt;
&lt;td&gt;diff&lt;/td&gt;
&lt;td&gt;grok-4 (high)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;72.0%&lt;/td&gt;
&lt;td&gt;diff&lt;/td&gt;
&lt;td&gt;Claude Opus 4 (32k thinking)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;71.4%&lt;/td&gt;
&lt;td&gt;diff&lt;/td&gt;
&lt;td&gt;DeepSeek R1 (0528)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;64.9%&lt;/td&gt;
&lt;td&gt;diff&lt;/td&gt;
&lt;td&gt;Claude 3.7 Sonnet (32k thinking)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;64.0%&lt;/td&gt;
&lt;td&gt;architect&lt;/td&gt;
&lt;td&gt;DeepSeek R1 + Claude 3.5 Sonnet&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;62.2%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;diff&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Qwen3.6-35B-A3B Q8 (this run, M5 Max, Apache 2.0, ours)&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;61.7%&lt;/td&gt;
&lt;td&gt;diff&lt;/td&gt;
&lt;td&gt;o1-2024-12-17 (high)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;61.3%&lt;/td&gt;
&lt;td&gt;diff&lt;/td&gt;
&lt;td&gt;Claude Sonnet 4 (32k thinking)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;60.4%&lt;/td&gt;
&lt;td&gt;diff&lt;/td&gt;
&lt;td&gt;Claude 3.7 Sonnet (no thinking)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;59.6%&lt;/td&gt;
&lt;td&gt;diff&lt;/td&gt;
&lt;td&gt;Qwen3 235B A22B (no think, Alibaba API)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;56.9%&lt;/td&gt;
&lt;td&gt;diff&lt;/td&gt;
&lt;td&gt;DeepSeek R1 (original)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;56.4%&lt;/td&gt;
&lt;td&gt;diff&lt;/td&gt;
&lt;td&gt;Claude Sonnet 4 (no thinking)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;51.6%&lt;/td&gt;
&lt;td&gt;diff&lt;/td&gt;
&lt;td&gt;Claude 3.5 Sonnet&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;40.0%&lt;/td&gt;
&lt;td&gt;diff&lt;/td&gt;
&lt;td&gt;Qwen3 32B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8.0%&lt;/td&gt;
&lt;td&gt;diff&lt;/td&gt;
&lt;td&gt;Qwen2.5-Coder-32B&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The defensible reads:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Beats Claude Sonnet 4 with 32k thinking budget by &lt;strong&gt;0.9 percentage points&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Beats o1-high by &lt;strong&gt;0.5 percentage points&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Beats DeepSeek R1 original by &lt;strong&gt;5.3 percentage points&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Beats Claude 3.5 Sonnet by &lt;strong&gt;10.6 percentage points&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Within &lt;strong&gt;2.7 points of Claude 3.7 Sonnet (32k thinking)&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Strongest open-weights Qwen-family number on the Polyglot leaderboard. Qwen3 32B sat at 40.0%, Qwen3 235B A22B at 59.6%. The 35B-A3B MoE quantization is doing real work for its size.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;What we are not claiming: that this beats Opus 4, GPT-5, o3, or DeepSeek V3.2-Exp Reasoner. Those all sit above us on the leaderboard. Qwen3.6 is in the same band as Sonnet 4 thinking, not in the band with o3-high or GPT-5.&lt;/p&gt;

&lt;h3&gt;
  
  
  Per-language
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Language&lt;/th&gt;
&lt;th&gt;n&lt;/th&gt;
&lt;th&gt;pass_1&lt;/th&gt;
&lt;th&gt;pass_2&lt;/th&gt;
&lt;th&gt;p2 %&lt;/th&gt;
&lt;th&gt;avg min/exercise&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;python&lt;/td&gt;
&lt;td&gt;34&lt;/td&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;td&gt;25&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;73.5%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;4.8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;javascript&lt;/td&gt;
&lt;td&gt;49&lt;/td&gt;
&lt;td&gt;20&lt;/td&gt;
&lt;td&gt;35&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;71.4%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;5.3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;go&lt;/td&gt;
&lt;td&gt;39&lt;/td&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;24&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;61.5%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;6.2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;rust&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;td&gt;17&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;56.7%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;9.2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;cpp&lt;/td&gt;
&lt;td&gt;26&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;53.8%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;21.7&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;java&lt;/td&gt;
&lt;td&gt;47&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;25&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;53.2%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;31.9&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Two things worth noting. First, Python (73.5%) and JavaScript (71.4%) look like clean Sonnet-3.5-thinking territory on the languages most developers actually use Aider for. Second, Java at 31.9 minutes per exercise on average is inflated by the runaway-reasoning case described next. Strip the outlier and Java's average is in line with C++.&lt;/p&gt;




&lt;h2&gt;
  
  
  3. The runaway-reasoning failure mode (and the resume that closed it out)
&lt;/h2&gt;

&lt;p&gt;About 21 hours into the run, the container got stuck on a Java exercise that consumed 80 minutes of wall time without writing a new result file or producing meaningful output. The log mtime stayed frozen, the container stayed "Up," and the model was clearly deep in a reasoning loop with no exit strategy. We stopped the container manually at &lt;strong&gt;n=223/225&lt;/strong&gt; and recorded the runaway-reasoning failure mode as a real characteristic of hybrid-thinking MoE models on agentic harnesses.&lt;/p&gt;

&lt;p&gt;The next night, we &lt;strong&gt;resumed via Aider's official &lt;code&gt;--cont&lt;/code&gt; flag&lt;/strong&gt; against the same run directory. Two missing exercises (&lt;code&gt;rust/forth&lt;/code&gt; and &lt;code&gt;javascript/go-counting&lt;/code&gt;) ran in parallel under &lt;code&gt;--threads 4&lt;/code&gt; and completed in about 6 minutes each. Both failed both attempts. Final result: &lt;strong&gt;n=225/225&lt;/strong&gt;, &lt;strong&gt;pass_rate_2 = 62.2%&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The headline ticked &lt;strong&gt;down&lt;/strong&gt; by 0.6 percentage points compared to the n=223 partial (62.8% → 62.2%) because the two missing exercises both failed. That's the most honest defense against any "stopped early to lock in a favorable number" critique: completing the run actually hurt us.&lt;/p&gt;

&lt;p&gt;If you reproduce this and see a similar hang, kill the container, run with &lt;code&gt;--cont&lt;/code&gt; later to fill in the gaps. The full data is healthier than a partial.&lt;/p&gt;




&lt;h2&gt;
  
  
  4. The other thing we wanted to test
&lt;/h2&gt;

&lt;p&gt;With Qwen3.6 in hand, the natural next move was a comparison candidate. The ideal contrast: a dense model purpose-built for agentic coding, not a general-purpose coder.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://huggingface.co/mistralai/Devstral-Small-2-24B-Instruct-2512" rel="noopener noreferrer"&gt;Devstral-Small-2-24B-Instruct-2512&lt;/a&gt; was the obvious pick. Mistral and All Hands AI co-trained it specifically for software-engineering agents, it's Apache 2.0 dense 24B, has a 256K context window, and Mistral published 68.0% SWE-Bench Verified for it (a real number on a real benchmark). Released November 2025, so 5 months old at time of writing. Architecture is the new "Ministral 3 with rope-scaling and Scalable-Softmax" stack from Mistral, structurally different from Devstral 1.x.&lt;/p&gt;

&lt;p&gt;We deployed it via the same LLMKube + Metal Agent path, kicked off Aider Polyglot with &lt;code&gt;--num-tests 25&lt;/code&gt; (random subset, fits a 4-hour window at Devstral's slower decode speed of ~24 t/s), edit format &lt;code&gt;diff&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Result: &lt;strong&gt;&lt;code&gt;pass_rate_2 = 4.0%&lt;/code&gt; (1 of 25)&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Almost wrote it off as broken. Then read the Aider results files more carefully:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;92% of responses were syntactically well-formed diffs.&lt;/li&gt;
&lt;li&gt;Zero exhausted context windows.&lt;/li&gt;
&lt;li&gt;Average 4.4 minutes per exercise (fast, not stuck).&lt;/li&gt;
&lt;li&gt;The model was producing valid-looking edit blocks, they were just semantically wrong.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The model wasn't broken. It was doing what it had been trained to do, which apparently wasn't this.&lt;/p&gt;




&lt;h2&gt;
  
  
  5. Investigation
&lt;/h2&gt;

&lt;p&gt;Three hypotheses, ordered by what we tried:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hypothesis 1: The diff format is the problem.&lt;/strong&gt; Aider supports &lt;code&gt;--edit-format whole&lt;/code&gt; (output complete files instead of diffs). Re-ran with whole format on the same 25-exercise subset.&lt;/p&gt;

&lt;p&gt;Result: &lt;code&gt;pass_rate_2 = 8.0%&lt;/code&gt; (2 of 25). Better, but not by much. Hypothesis weakly supported.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hypothesis 2: llama.cpp isn't handling Devstral 2's new architecture correctly.&lt;/strong&gt; Worth checking before declaring the model bad. We ran HumanEval+ via &lt;a href="https://github.com/evalplus/evalplus" rel="noopener noreferrer"&gt;evalplus&lt;/a&gt;, pointed at the same llama-server endpoint, with a function-level Python coding harness that doesn't require any agentic tool-call discipline. If llama.cpp's tokenizer or attention implementation was off, we'd see it here.&lt;/p&gt;

&lt;p&gt;Result: &lt;strong&gt;&lt;code&gt;HumanEval pass@1 = 85.4%&lt;/code&gt;, &lt;code&gt;HumanEval+ pass@1 = 81.7%&lt;/code&gt;&lt;/strong&gt; (164 problems, scored in &lt;code&gt;ganler/evalplus&lt;/code&gt; Linux container because macOS's &lt;code&gt;setrlimit(RLIMIT_AS)&lt;/code&gt; doesn't behave the way evalplus's sandbox expects).&lt;/p&gt;

&lt;p&gt;That landed Devstral 2 in the same band as the top open-source 24B coders for function-level Python. Architecture is fine. llama.cpp is fine. The model is genuinely capable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hypothesis 3: The harness is the variable.&lt;/strong&gt; We re-read Mistral's README:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Devstral 2 can also be used with the following scaffoldings:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Mistral Vibe (recommended)&lt;/li&gt;
&lt;li&gt;Cline&lt;/li&gt;
&lt;li&gt;Kilo Code&lt;/li&gt;
&lt;li&gt;Claude Code&lt;/li&gt;
&lt;li&gt;OpenHands&lt;/li&gt;
&lt;li&gt;SWE Agent&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;Aider is not on this list. Devstral 2 was trained on tool-call traces from agentic-coding harnesses that use multi-turn function calls, not Aider's single-prompt-with-diff edit format. The model was producing what its training distribution rewarded; Aider's harness was scoring it on a different distribution entirely.&lt;/p&gt;

&lt;p&gt;Mistral itself adds, in the same README:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;we advise everyone to use the Mistral AI API if the model is subpar with local serving&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That's an explicit caveat from the model authors. The 4% wasn't a model failure or a runtime failure. It was a harness-distribution mismatch, exactly the failure mode the README warned about.&lt;/p&gt;




&lt;h2&gt;
  
  
  6. Same model, three benchmarks, three answers
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Benchmark&lt;/th&gt;
&lt;th&gt;Devstral 2 score&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Aider Polyglot, diff format&lt;/td&gt;
&lt;td&gt;4.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Aider Polyglot, whole format&lt;/td&gt;
&lt;td&gt;8.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;HumanEval+ (with adversarial tests)&lt;/td&gt;
&lt;td&gt;81.7%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;HumanEval (base)&lt;/td&gt;
&lt;td&gt;85.4%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Twenty times difference in measured "performance" on the same model, same hardware, same temperature, same week. This is the lesson worth taking away from the entire bench session.&lt;/p&gt;

&lt;p&gt;If you publish a single benchmark number for any agentic coding model, you are publishing a story about that model's compatibility with one specific harness, not a story about the model's coding capability. The Devstral 2 4% on Aider does not mean Devstral 2 is bad at coding. The Devstral 2 81.7% on HumanEval+ does not mean Devstral 2 is good at agentic edits in your IDE. They are both true and they describe different things.&lt;/p&gt;

&lt;p&gt;If you want to evaluate a coding model, run it through the harness you actually use day to day. If you can't, then quote at least two benchmarks from different parts of the harness landscape (one function-level, one agentic) and let the reader see the spread.&lt;/p&gt;




&lt;h2&gt;
  
  
  7. InferCost was running the whole time
&lt;/h2&gt;

&lt;p&gt;While the benchmarks were producing accuracy numbers, &lt;a href="https://infercost.ai" rel="noopener noreferrer"&gt;InferCost&lt;/a&gt; was producing the cost numbers. The new Apple Silicon collector (shipped in &lt;a href="https://github.com/defilantech/infercost/releases/tag/v0.3.0" rel="noopener noreferrer"&gt;InferCost v0.3.0&lt;/a&gt;) was reconciling the &lt;code&gt;apple-m5-max&lt;/code&gt; CostProfile every 30 seconds against the LLMKube Metal Agent's &lt;code&gt;apple_power_combined_watts&lt;/code&gt; gauge.&lt;/p&gt;

&lt;p&gt;Specifically, two things were running in the background of every benchmark above:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A second LLMKube Metal Agent on port 9091 with &lt;code&gt;--apple-power-enabled&lt;/code&gt;, publishing the four new &lt;code&gt;apple_power_*_watts&lt;/code&gt; Prometheus gauges sourced from a sudo'd &lt;code&gt;powermetrics&lt;/code&gt; subprocess. Pinned-argv NOPASSWD sudoers entry to keep the privilege grant tight (security audit caught and fixed three findings before merge: argv pinning, bin override rejection, absolute &lt;code&gt;/usr/bin/sudo&lt;/code&gt; to defeat $PATH attacks).&lt;/li&gt;
&lt;li&gt;InferCost as a local controller, pointed at &lt;code&gt;:9091/metrics&lt;/code&gt; via the new &lt;code&gt;--metal-endpoint&lt;/code&gt; CLI flag, reconciling an &lt;code&gt;apple-m5-max&lt;/code&gt; CostProfile using the new Metal scraper and dispatcher.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Plus a tiny CSV poller that sampled both layers every 60 seconds, writing 388 rows of telemetry across the day.&lt;/p&gt;
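&lt;p&gt;Reproducing that pairing is two commands once the agent is listening on &lt;code&gt;:9091&lt;/code&gt; (a sketch, run from an InferCost checkout; the flag and metric names are the ones quoted above):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Sanity-check the power gauge the Metal Agent publishes, then point InferCost at it
curl -s http://localhost:9091/metrics | grep apple_power_combined_watts
go run ./cmd/main.go --metal-endpoint http://localhost:9091/metrics
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;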

&lt;p&gt;Per-window aggregates, captured live during the runs:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Window&lt;/th&gt;
&lt;th&gt;Duration&lt;/th&gt;
&lt;th&gt;Mean combined W&lt;/th&gt;
&lt;th&gt;Mean InferCost $/hr&lt;/th&gt;
&lt;th&gt;Agent ↔ InferCost Δ (mean)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Qwen3.6-35B-A3B Q8 (full Aider)&lt;/td&gt;
&lt;td&gt;200 min&lt;/td&gt;
&lt;td&gt;27.3 W&lt;/td&gt;
&lt;td&gt;$0.1775&lt;/td&gt;
&lt;td&gt;1.60 W&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Devstral 2, Aider diff&lt;/td&gt;
&lt;td&gt;32 min&lt;/td&gt;
&lt;td&gt;32.7 W&lt;/td&gt;
&lt;td&gt;$0.1773&lt;/td&gt;
&lt;td&gt;6.21 W&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Devstral 2, Aider whole&lt;/td&gt;
&lt;td&gt;29 min&lt;/td&gt;
&lt;td&gt;35.3 W&lt;/td&gt;
&lt;td&gt;$0.1774&lt;/td&gt;
&lt;td&gt;8.08 W&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Devstral 2, HumanEval+&lt;/td&gt;
&lt;td&gt;55 min&lt;/td&gt;
&lt;td&gt;29.0 W&lt;/td&gt;
&lt;td&gt;$0.1770&lt;/td&gt;
&lt;td&gt;0.90 W&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The "Agent ↔ InferCost Δ" column is the validation result. The agent reads powermetrics every second; InferCost samples the gauge during its 30-second reconcile loop. If they were deeply wrong about each other we'd see double-digit deltas. We don't. Across the four windows, mean delta ranged from 0.9 W to 8 W (the 8 W was during Aider whole format, which has bursty prefill that the 30-second reconcile sometimes catches mid-spike). For the longer, sustained windows the agreement stays within a watt or two.&lt;/p&gt;

&lt;p&gt;Here is what &lt;code&gt;kubectl get costprofile apple-m5-max -o yaml&lt;/code&gt; looked like during the run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;currentPowerDrawWatts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;39.13&lt;/span&gt;
  &lt;span class="na"&gt;hourlyCostUSD&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.1805&lt;/span&gt;
  &lt;span class="na"&gt;amortizationRatePerHour&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.17466&lt;/span&gt;
  &lt;span class="na"&gt;electricityCostPerHour&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.00341&lt;/span&gt;
  &lt;span class="na"&gt;conditions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;MetalReachable&lt;/span&gt;
    &lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;True"&lt;/span&gt;
    &lt;span class="na"&gt;reason&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;MetalHealthy&lt;/span&gt;
    &lt;span class="na"&gt;message&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Metal&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;agent&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;healthy&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;at&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;http://localhost:9091/metrics&lt;/span&gt;
              &lt;span class="s"&gt;(39.1W&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;combined;&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;gpu=37.3&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;cpu=1.8&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;ane=0.0)."&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Ready&lt;/span&gt;
    &lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;True"&lt;/span&gt;
    &lt;span class="na"&gt;reason&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;CostComputed&lt;/span&gt;
    &lt;span class="na"&gt;message&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Hourly&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;cost:&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;$0.1805&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;(amort:&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;$0.1747,&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;elec:&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;$0.0059)"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Not a screenshot. Not a slide. The actual reconcile output from a Kubernetes operator scraping a sudo'd &lt;code&gt;powermetrics&lt;/code&gt; subprocess on the same Mac that was running the benchmark.&lt;/p&gt;




&lt;h2&gt;
  
  
  8. The cost economics
&lt;/h2&gt;

&lt;p&gt;The $4,500 laptop, amortized over 3 years, with maintenance at 2% of the purchase price flat:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Amortization per hour: &lt;code&gt;$4,500 × 1.02 / 3 / 8760 = $0.17466/hr&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Electricity at 41 W and $0.08/kWh (Peninsula Light residential rate, WA): &lt;code&gt;0.041 × 0.08 = $0.00328/hr&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Total hourly: &lt;strong&gt;$0.178/hr, of which 98.1% is amortization and 1.9% is electricity&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;
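&lt;p&gt;The same arithmetic as something you can rerun with your own electricity rate (a sketch of the line items above):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Hourly cost breakdown from the bullets above
awk 'BEGIN {
  amort = 4500 * 1.02 / 3 / 8760     # purchase + 2% maintenance, 3-year straight-line, 24/7
  elec  = 0.041 * 0.08               # 41 W at $0.08/kWh
  printf "total $%.3f/hr  (amortization $%.5f + electricity $%.5f)\n", amort + elec, amort, elec
}'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;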

&lt;p&gt;That ratio is the most useful thing the bench taught us. The marginal cost of running an LLM on a laptop you already own is essentially the electricity, which on Apple Silicon is genuinely cheap. The amortized cost is the laptop existing at all, which you pay whether or not the model runs.&lt;/p&gt;

&lt;p&gt;Two &lt;code&gt;$/MTok&lt;/code&gt; numbers from the windows where the token poller was working correctly:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Window&lt;/th&gt;
&lt;th&gt;Total tokens&lt;/th&gt;
&lt;th&gt;$/MTok&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Devstral 2, Aider whole (sustained edits)&lt;/td&gt;
&lt;td&gt;158,614&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$0.30&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Devstral 2, HumanEval+ (sequential function calls)&lt;/td&gt;
&lt;td&gt;90,916&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$1.76&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Aider's whole-file edits keep the GPU producing tokens for longer continuous bursts, which spreads the fixed amortization across more output. HumanEval+ runs many short function-level problems with eval-script setup time between them, which inflates the per-token cost because the laptop is "active" but not generating.&lt;/p&gt;

&lt;p&gt;Stacked against &lt;a href="https://platform.claude.com/docs/en/about-claude/pricing" rel="noopener noreferrer"&gt;Anthropic's published 2026 pricing&lt;/a&gt; of $3/MT input + $15/MT output for Claude Sonnet 4.6, blended around $6 to $9 per million total tokens depending on input:output ratio:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Local Devstral 2 sustained at &lt;strong&gt;$0.30/MTok&lt;/strong&gt;: about &lt;strong&gt;30× cheaper&lt;/strong&gt; at the margin than cloud Sonnet 4.6.&lt;/li&gt;
&lt;li&gt;Local Devstral 2 with idle gaps at &lt;strong&gt;$1.76/MTok&lt;/strong&gt;: about &lt;strong&gt;5× cheaper&lt;/strong&gt; at the margin.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both ratios assume the laptop is running 24/7 for the 3-year amortization horizon. If you actually use the laptop 8 hours a day, the effective amortization-per-active-hour is 3× higher, which compresses the ratio. If you use it 2 hours a day, it's 12× higher and the ratio collapses. The InferCost &lt;code&gt;UsageReport&lt;/code&gt; CRD is built specifically to compute the active vs idle split over a billing period, which is the FinOps question that nobody else is answering for Apple Silicon.&lt;/p&gt;




&lt;h2&gt;
  
  
  9. What we shipped today, and how to use it
&lt;/h2&gt;

&lt;p&gt;Two releases shipped alongside this post, both of which were necessary to do the cost story above end to end:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/defilantech/llmkube/releases/tag/v0.7.2" rel="noopener noreferrer"&gt;LLMKube v0.7.2&lt;/a&gt;: Apple Silicon power gauges via powermetrics + one-command sudoers install&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Adds 4 new Prometheus gauges (&lt;code&gt;combined / gpu / cpu / ane&lt;/code&gt; watts) to the existing Metal Agent (&lt;a href="https://github.com/defilantech/llmkube/pull/334" rel="noopener noreferrer"&gt;PR #334&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Sourced from a sudo'd &lt;code&gt;powermetrics --samplers cpu_power,gpu_power -i 1000&lt;/code&gt; subprocess&lt;/li&gt;
&lt;li&gt;Opt-in via &lt;code&gt;--apple-power-enabled&lt;/code&gt; flag (defaults off)&lt;/li&gt;
&lt;li&gt;NOPASSWD sudoers fragment with &lt;strong&gt;pinned argv&lt;/strong&gt; for safe install (security audit caught and fixed three findings before merge: argv pinning, &lt;code&gt;--powermetrics-bin&lt;/code&gt; override rejection, absolute &lt;code&gt;/usr/bin/sudo&lt;/code&gt; to defeat $PATH substitution attacks)&lt;/li&gt;
&lt;li&gt;New &lt;code&gt;make install-powermetrics-sudo&lt;/code&gt; and &lt;code&gt;make uninstall-powermetrics-sudo&lt;/code&gt; targets (&lt;a href="https://github.com/defilantech/llmkube/pull/336" rel="noopener noreferrer"&gt;PR #336&lt;/a&gt;) so the privileged install is one command instead of a 5-line &lt;code&gt;sed&lt;/code&gt; + &lt;code&gt;visudo&lt;/code&gt; + &lt;code&gt;install&lt;/code&gt; shell incantation&lt;/li&gt;
&lt;li&gt;Coverage gap closed: extracted helper at 100% test coverage&lt;/li&gt;
&lt;li&gt;Zero impact on existing setups; without the flag, behavior is unchanged&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/defilantech/infercost/releases/tag/v0.3.0" rel="noopener noreferrer"&gt;InferCost v0.3.0&lt;/a&gt;: Apple Silicon (Metal) power collector&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Adds &lt;code&gt;internal/scraper/metal.go&lt;/code&gt; mirroring the existing DCGM scraper (&lt;a href="https://github.com/defilantech/infercost/pull/47" rel="noopener noreferrer"&gt;PR #47&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;New &lt;code&gt;MetalReachable&lt;/code&gt; condition with reasons &lt;code&gt;MetalHealthy / MetalNotConfigured / MetalScrapeError / MetalSamplerOff&lt;/code&gt; so operators on a Mac don't see "DCGM unreachable" messages&lt;/li&gt;
&lt;li&gt;A 10-line dispatcher in the CostProfile reconciler keys off whether &lt;code&gt;MetalEndpoint&lt;/code&gt; is set plus &lt;code&gt;looksApple(gpuModel)&lt;/code&gt; (a minimal sketch follows after this list)&lt;/li&gt;
&lt;li&gt;New &lt;code&gt;apple-m5-max.yaml&lt;/code&gt; sample CostProfile and updated &lt;code&gt;apple-m2-ultra.yaml&lt;/code&gt; with real setup steps&lt;/li&gt;
&lt;li&gt;8 controller tests + 5 scraper tests; existing DCGM tests untouched&lt;/li&gt;
&lt;/ul&gt;
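
&lt;p&gt;Here's the minimal sketch promised above. Only &lt;code&gt;MetalEndpoint&lt;/code&gt; and &lt;code&gt;looksApple&lt;/code&gt; are real names from the PR; the &lt;code&gt;powerScraper&lt;/code&gt; interface, the concrete types, and the prefix check inside &lt;code&gt;looksApple&lt;/code&gt; are invented for illustration, and the real dispatcher lives inside the CostProfile reconciler rather than a free function:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;package scraper

import "strings"

// powerScraper is an invented interface for this sketch; InferCost's real
// scraper types live in internal/scraper.
type powerScraper interface {
    CollectWatts() (float64, error)
}

type dcgmScraper struct{}                   // stand-in for the existing DCGM scraper
type metalScraper struct{ endpoint string } // stand-in for internal/scraper/metal.go

func (dcgmScraper) CollectWatts() (float64, error)  { return 0, nil }
func (metalScraper) CollectWatts() (float64, error) { return 0, nil }

// looksApple mirrors the helper named in the PR; the exact check is a guess.
func looksApple(gpuModel string) bool {
    return strings.HasPrefix(strings.ToLower(gpuModel), "apple")
}

// Metal only when a MetalEndpoint is configured AND the profile's GPU model
// looks like Apple Silicon; everything else falls through to DCGM.
func resolveScraper(metalEndpoint, gpuModel string) powerScraper {
    if metalEndpoint != "" &amp;amp;&amp;amp; looksApple(gpuModel) {
        return metalScraper{endpoint: metalEndpoint}
    }
    return dcgmScraper{}
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;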

&lt;p&gt;If you have a MacBook Pro M5 (or M3/M4 Max with enough memory), the full install is now five short steps:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Install llama.cpp (needed by the Metal Agent for serving GGUF weights)&lt;/span&gt;
brew &lt;span class="nb"&gt;install &lt;/span&gt;llama.cpp

&lt;span class="c"&gt;# 2. Install LLMKube via Helm&lt;/span&gt;
helm repo add llmkube https://defilantech.github.io/llmkube
helm &lt;span class="nb"&gt;install &lt;/span&gt;llmkube llmkube/llmkube &lt;span class="nt"&gt;--version&lt;/span&gt; 0.7.2

&lt;span class="c"&gt;# 3. Build + install the Metal Agent and grant powermetrics access&lt;/span&gt;
git clone https://github.com/defilantech/llmkube &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;cd &lt;/span&gt;llmkube
make install-metal-agent          &lt;span class="c"&gt;# builds + installs the launchd service&lt;/span&gt;
make install-powermetrics-sudo    &lt;span class="c"&gt;# one-command pinned-argv NOPASSWD sudoers install&lt;/span&gt;

&lt;span class="c"&gt;# 4. Restart the agent with --apple-power-enabled in your launchd plist&lt;/span&gt;
&lt;span class="c"&gt;#    (edit ~/Library/LaunchAgents/com.llmkube.metal-agent.plist, then reload)&lt;/span&gt;
launchctl unload ~/Library/LaunchAgents/com.llmkube.metal-agent.plist
launchctl load   ~/Library/LaunchAgents/com.llmkube.metal-agent.plist

&lt;span class="c"&gt;# 5. Deploy InferCost pointed at the agent and apply the sample CostProfile&lt;/span&gt;
helm repo add infercost https://defilantech.github.io/infercost
helm &lt;span class="nb"&gt;install &lt;/span&gt;infercost infercost/infercost &lt;span class="nt"&gt;--version&lt;/span&gt; 0.3.0 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set&lt;/span&gt; metal.endpoint&lt;span class="o"&gt;=&lt;/span&gt;http://localhost:9090/metrics
kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; https://raw.githubusercontent.com/defilantech/infercost/main/config/samples/costprofiles/apple-m5-max.yaml

&lt;span class="c"&gt;# Watch the live reconcile&lt;/span&gt;
kubectl get costprofile apple-m5-max &lt;span class="nt"&gt;-o&lt;/span&gt; yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;make install-powermetrics-sudo&lt;/code&gt; step is the one privileged moment: sudo prompts you for your password, the make target validates the sudoers syntax with &lt;code&gt;visudo -cf&lt;/code&gt; before installing, then echoes the granted command back so you can verify exactly what was authorized. The grant is scoped to &lt;code&gt;/usr/bin/powermetrics --samplers cpu_power\,gpu_power -i [0-9]*&lt;/code&gt; and nothing else. To remove it later, &lt;code&gt;make uninstall-powermetrics-sudo&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Edit &lt;code&gt;purchasePriceUSD&lt;/code&gt;, &lt;code&gt;electricity.ratePerKWh&lt;/code&gt;, and &lt;code&gt;nodeSelector&lt;/code&gt; in the CostProfile to match your reality.&lt;/p&gt;

&lt;p&gt;Both projects are open source and hungry for the kind of feedback that comes from running them on hardware we don't have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;LLMKube&lt;/strong&gt; (github.com/defilantech/llmkube). Kubernetes-native LLM serving operator. Runs llama.cpp and vLLM on NVIDIA, Metal Agent for Apple Silicon. Stars and &lt;code&gt;good-first-issue&lt;/code&gt; PRs both very welcome. The Metal Agent in particular benefits enormously from Mac-having developers running it through &lt;code&gt;--apple-power-enabled&lt;/code&gt;, finding the edge cases we missed, and filing issues.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;InferCost&lt;/strong&gt; (github.com/defilantech/infercost). Kubernetes-native AI FinOps. Cost attribution per workload, namespace, and model, with both NVIDIA (DCGM) and now Apple Silicon (this PR) power sources. The &lt;code&gt;UsageReport&lt;/code&gt; CRD is the next thing to push on; if you have a multi-Mac fleet or a mixed NVIDIA+Apple environment, we'd love to hear what reports would help your team.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  10. Reproducibility
&lt;/h2&gt;

&lt;p&gt;Every number in this post traces back to a file you can pull or a benchmark you can re-run.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;LLMKube: github.com/defilantech/llmkube, main branch at commit &lt;code&gt;58a94a7&lt;/code&gt; (PR #334 merged). Issue #335 closed.&lt;/li&gt;
&lt;li&gt;InferCost: github.com/defilantech/infercost, main branch at commit &lt;code&gt;422a4f0&lt;/code&gt; (PR #47 merged). Issue #46 closed.&lt;/li&gt;
&lt;li&gt;Aider Polyglot harness: github.com/Aider-AI/aider with &lt;a href="https://github.com/Aider-AI/polyglot-benchmark" rel="noopener noreferrer"&gt;polyglot-benchmark&lt;/a&gt; exercises.&lt;/li&gt;
&lt;li&gt;Aider Polyglot leaderboard: &lt;a href="https://github.com/Aider-AI/aider/blob/main/aider/website/_data/polyglot_leaderboard.yml" rel="noopener noreferrer"&gt;polyglot_leaderboard.yml on aider main&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;evalplus: github.com/evalplus/evalplus, scored via &lt;code&gt;ganler/evalplus&lt;/code&gt; container for the macOS &lt;code&gt;RLIMIT_AS&lt;/code&gt; workaround.&lt;/li&gt;
&lt;li&gt;Run scripts: &lt;code&gt;aider/run-aider-polyglot.sh&lt;/code&gt; (Qwen) and &lt;code&gt;aider/run-aider-devstral.sh&lt;/code&gt; (Devstral) on this host, both straightforward Bash that invoke the Aider docker container with the right model id and edit format.&lt;/li&gt;
&lt;li&gt;Power + cost telemetry: &lt;code&gt;/tmp/infercost-m5max-telemetry.csv&lt;/code&gt; (388 power samples) and &lt;code&gt;/tmp/infercost-m5max-tokens.csv&lt;/code&gt; (333 llama-server token-counter samples). Window markers (&lt;code&gt;# QWEN_RUN_END&lt;/code&gt;, &lt;code&gt;# DEVSTRAL_RUN_START&lt;/code&gt;, etc.) inline in the CSV.&lt;/li&gt;
&lt;li&gt;Sample CostProfile: &lt;code&gt;config/samples/costprofiles/apple-m5-max.yaml&lt;/code&gt; in the InferCost repo.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If a reproducer hits something different, please open an issue against whichever repo is the closest fit. The Apple Silicon path in particular is brand new, and the cohort of people who could give it a real workout is small but motivated.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's next
&lt;/h2&gt;

&lt;p&gt;A few things the data points to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The InferCost &lt;code&gt;UsageReport&lt;/code&gt; CRD needs a real multi-day test on a Mac running mixed inference + idle. The active vs idle split is the FinOps lever for local models, and we have one day of data; we want a month.&lt;/li&gt;
&lt;li&gt;Multi-Mac fleet support in InferCost (auto-discovery of LLMKube Metal Agents via label selector) would let teams deploy InferCost once and have it follow agents around. An issue tracking that work is already open.&lt;/li&gt;
&lt;li&gt;We benched Devstral 2 on Aider and HumanEval+. We did &lt;em&gt;not&lt;/em&gt; bench it on its native scaffold (Mistral Vibe / OpenHands / Cline). That comparison is the right one for a daily-driver evaluation and it's the next thing we'll publish.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you're running local LLM inference on your own hardware and care about either the serving side (LLMKube) or the cost side (InferCost), the easiest way to push these projects forward is to point them at your environment, file the issue you'd want to fix, and let us know what number would actually help your team.&lt;/p&gt;

&lt;p&gt;Both projects are Apache 2.0. Stars on &lt;a href="https://github.com/defilantech/llmkube" rel="noopener noreferrer"&gt;LLMKube&lt;/a&gt; and &lt;a href="https://github.com/defilantech/infercost" rel="noopener noreferrer"&gt;InferCost&lt;/a&gt; are appreciated and signal the kind of validation that helps prioritize the next round of work.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>ai</category>
      <category>llm</category>
      <category>opensource</category>
    </item>
    <item>
      <title>We ran Qwen3.6-27B on $800 of consumer GPUs, day one: llama.cpp vs vLLM</title>
      <dc:creator>Christopher Maher</dc:creator>
      <pubDate>Fri, 24 Apr 2026 03:06:36 +0000</pubDate>
      <link>https://dev.to/defilan/we-ran-qwen36-27b-on-800-of-consumer-gpus-day-one-llamacpp-vs-vllm-mg1</link>
      <guid>https://dev.to/defilan/we-ran-qwen36-27b-on-800-of-consumer-gpus-day-one-llamacpp-vs-vllm-mg1</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://llmkube.com/blog/qwen3-6-27b-bakeoff" rel="noopener noreferrer"&gt;llmkube.com/blog/qwen3-6-27b-bakeoff&lt;/a&gt;. Cross-posted here for the dev.to audience.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A Kubernetes-native bake-off on 2× RTX 5060 Ti, with reproducible manifests and a cost-per-token number neither cloud nor OSS FinOps tools will tell you.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;This is a runtime comparison, not a model evaluation.&lt;/strong&gt; Both llama.cpp and vLLM serve the same Qwen3.6-27B in every cell; we're measuring how the two serving stacks differ on identical work. Where cloud APIs enter in §8, it's on cost, not capability — this post makes no claim about whether Qwen3.6-27B "beats" GPT-4o or Claude on task quality.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Qwen3.6-27B&lt;/strong&gt; (Tongyi Lab, released 2026-04-21, Apache 2.0) runs on a pair of &lt;strong&gt;RTX 5060 Ti 16 GB&lt;/strong&gt; consumer cards via Kubernetes + LLMKube. Total hardware: about $800 street.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;vLLM wins throughput by 3 to 4×&lt;/strong&gt; at high concurrency thanks to NVFP4 and PagedAttention. &lt;strong&gt;llama.cpp plus TurboQuant wins context&lt;/strong&gt; — we served one 43K-token prompt end-to-end (a single captured sample; higher-concurrency cells timed out on our 300 s harness budget) on hardware where vLLM's in-memory cap is 16K.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost per million tokens is two numbers&lt;/strong&gt;, not one: &lt;strong&gt;$0.13 amortized&lt;/strong&gt; (full cost of ownership) and &lt;strong&gt;$0.010 marginal&lt;/strong&gt; (electricity during active serving). At 32.7% utilization over the bench window, the 13× gap between them is the real FinOps conversation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Everything is reproducible.&lt;/strong&gt; Manifests, harness, and &lt;code&gt;summary.csv&lt;/code&gt; at &lt;a href="https://github.com/defilantech/llmkube-bench" rel="noopener noreferrer"&gt;github.com/defilantech/llmkube-bench&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  1. Why we did this
&lt;/h2&gt;

&lt;p&gt;Two days ago, Tongyi Lab dropped Qwen3.6-27B with the claim it matches frontier agentic-coding models at the 27B parameter count. The community response was predictable: does this actually work locally, or is it another model that benchmarks well but nobody can run? (Note for readers comparing against Qwen3.6-35B-A3B: the 27B is the non-MoE sibling. None of the MoE-specific flags like &lt;code&gt;--cpu-moe&lt;/code&gt; apply here.)&lt;/p&gt;

&lt;p&gt;The ecosystem has a harder time answering "how should I serve it?" There are two dominant open-source inference runtimes for models like this, and they optimize for different things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;llama.cpp&lt;/strong&gt; — ubiquitous, GGUF-based, broad quantization support, runs on almost anything with a GPU. Adopted by the hobbyist and homelab crowd. Recently grew TurboQuant KV-cache compression (&lt;a href="https://github.com/ggml-org/llama.cpp/discussions/20969" rel="noopener noreferrer"&gt;ggml-org/llama.cpp#20969&lt;/a&gt;), pushing achievable context windows on small VRAM into territory nobody else touches.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;vLLM&lt;/strong&gt; — throughput-focused, PagedAttention, continuous batching, FP8/NVFP4 on recent NVIDIA. The production serving runtime for teams running real traffic, targeting data center hardware.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The ecosystem answers "which should I use" with vibes and forum posts. We wanted numbers — from the same hardware, same model, same day the model dropped. If a 27B-class model can genuinely run on a pair of $400 GPUs, the practical question for anyone thinking about on-prem inference is which runtime makes that hardware actually worth something.&lt;/p&gt;

&lt;p&gt;So we benchmarked both, published every configuration, and then turned the token counts into dollars using our companion tool &lt;a href="https://infercost.ai" rel="noopener noreferrer"&gt;InferCost&lt;/a&gt;, so the "is it cheaper than the cloud?" question has an honest answer rather than the usual founder-math.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Hardware and the constraint
&lt;/h2&gt;

&lt;p&gt;The node running this bench is &lt;strong&gt;shadowstack&lt;/strong&gt; — a microk8s cluster on a single box:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;GPUs&lt;/td&gt;
&lt;td&gt;2× NVIDIA GeForce RTX 5060 Ti 16 GB (Blackwell GB206)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPU memory&lt;/td&gt;
&lt;td&gt;15.48 GiB usable per card after driver reserve (30.96 GiB aggregate)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OS&lt;/td&gt;
&lt;td&gt;Ubuntu 24.04.3 LTS, kernel 6.17.0-oem&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Kubernetes&lt;/td&gt;
&lt;td&gt;MicroK8s v1.32.13&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Orchestration&lt;/td&gt;
&lt;td&gt;LLMKube operator (chart 0.7.0) + NVIDIA GPU Operator + DCGM exporter&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Street price&lt;/td&gt;
&lt;td&gt;about $400/card × 2&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The 5060 Ti is a &lt;strong&gt;Blackwell consumer GPU with native FP4 hardware&lt;/strong&gt;. That is load-bearing. Without NVFP4, the 27B class is out of reach. At BF16 the model would need about 55 GB, at FP8 about 28 GB, at NVFP4 about 14 GB. Only the last one fits 2× 16 GB with room for activations and KV cache.&lt;/p&gt;
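
&lt;p&gt;Those footprint numbers are straight bytes-per-parameter arithmetic. A quick sketch (it ignores embedding tables and quantization scale overhead, which is why the real checkpoints land a little higher):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;package main

import "fmt"

func main() {
    const params = 27e9 // Qwen3.6-27B, dense

    for _, f := range []struct {
        name string
        bits float64
    }{
        {"BF16", 16}, {"FP8", 8}, {"NVFP4", 4},
    } {
        gb := params * f.bits / 8 / 1e9
        fmt.Printf("%-5s ~%.0f GB of weights\n", f.name, gb)
    }
    // BF16 ~54 GB, FP8 ~27 GB, NVFP4 ~14 GB: only the last leaves room for
    // activations and KV cache inside 2× 15.48 GiB.
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;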

&lt;p&gt;&lt;strong&gt;The VRAM budget is the whole story.&lt;/strong&gt; On enterprise hardware (H100, A100, even the 3090 that the community's "qwen 27B on a 3090" discourse is built on), most of this bake-off's complexity disappears. On 2× 16 GB consumer cards you are constantly one configuration flag away from an out-of-memory crash, and the runtime that lets you navigate that wins real users.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. The first attempt that didn't work
&lt;/h2&gt;

&lt;p&gt;Our original target was &lt;code&gt;Qwen/Qwen3.5-27B-FP8&lt;/code&gt; (Qwen's official FP8 safetensors, the model everyone was excited about). On paper: 28 GB weights, TP=2, about 14 GB per shard. Should fit.&lt;/p&gt;

&lt;p&gt;It doesn't. Qwen's 27B-class FP8 release is a &lt;strong&gt;VLM&lt;/strong&gt; — the checkpoint includes a vision encoder that stays resident in VRAM whether or not you ever send an image. Three successive mitigations on vLLM, each measured against the crash logs:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Default config.&lt;/strong&gt; OOM during &lt;code&gt;profile_run&lt;/code&gt; on the vision encoder:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CUDA out of memory. Tried to allocate 576.00 MiB.
GPU 0 has a total capacity of 15.48 GiB of which 175.19 MiB is free.
This process has 15.30 GiB memory in use.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2. &lt;code&gt;--limit-mm-per-prompt image=0,video=0&lt;/code&gt;, &lt;code&gt;maxModelLen&lt;/code&gt; 16K, &lt;code&gt;max-num-batched-tokens&lt;/code&gt; 4K.&lt;/strong&gt; Skipped multimodal dummy inputs during profile. The vision encoder weights stay resident. OOM now at &lt;code&gt;determine_available_memory&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Tried to allocate 1.19 GiB.
GPU 0 has 1.02 GiB free.
This process has 14.45 GiB in use.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;3. &lt;code&gt;--gpu-memory-utilization 0.95&lt;/code&gt;, &lt;code&gt;PYTORCH_ALLOC_CONF=expandable_segments:True&lt;/code&gt;.&lt;/strong&gt; Pushed against the wall:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Tried to allocate 32.00 MiB.
GPU 0 has 3.19 MiB free.
This process has 15.47 GiB in use.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;15.47 of 15.48 GiB. No knob left. &lt;strong&gt;Qwen3.5-27B-FP8 cannot be served via vLLM on 2× 16 GB consumer cards in any configuration we found.&lt;/strong&gt; A 3090 or 4090 (24 GB) would have considerably more headroom for the vision encoder plus KV cache (we didn't reproduce on one, but it's plausible the default config would fit there). That's a real hardware-sizing footnote to the "run 27B locally" discourse: a pair of 16 GB cards is not automatically enough.&lt;/p&gt;

&lt;p&gt;Then Qwen3.6-27B dropped, and within 24 hours the community had published &lt;strong&gt;NVFP4&lt;/strong&gt; quants that halve the weight footprint again. That is the pivot that made this bench possible.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Method
&lt;/h2&gt;

&lt;p&gt;Both runtimes run Qwen3.6-27B, served via LLMKube as a Kubernetes Deployment with OpenAI-compatible endpoints, and are benchmarked against each other on identical workloads. All manifests live in the public repo.&lt;/p&gt;

&lt;h3&gt;
  
  
  llama.cpp candidate
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Source&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;unsloth/Qwen3.6-27B-GGUF&lt;/code&gt; Q4_K_M (~17 GB)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Parallelism&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;split-mode=layer&lt;/code&gt; across both GPUs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;KV cache&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;TurboQuant&lt;/strong&gt; &lt;code&gt;tbqp3&lt;/code&gt; (keys) + &lt;code&gt;tbq3&lt;/code&gt; (values) — about 3 bits/element&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Max context&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;65,536&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Image&lt;/td&gt;
&lt;td&gt;AmesianX's TurboQuant fork v1.5.2, built from source (Kaniko manifest in the bench repo; retarget to your own registry to reproduce)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Flash attention&lt;/td&gt;
&lt;td&gt;on&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Parallel slots&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;16 for short patterns&lt;/strong&gt; (chat, coding, agentic), &lt;strong&gt;1 for long-context patterns&lt;/strong&gt; (&lt;code&gt;long_context&lt;/code&gt;, &lt;code&gt;long_context_extreme&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;TurboQuant is AmesianX's llama.cpp fork implementing the KV-cache compression algorithm from &lt;a href="https://arxiv.org/pdf/2504.19874" rel="noopener noreferrer"&gt;Google Research's TurboQuant paper&lt;/a&gt;. The scheme is asymmetric: QJL correction (&lt;code&gt;tbqp*&lt;/code&gt;) is applied to keys only, because keys feed the Q·K inner products, while values only pass through a softmax-weighted sum. Our own internal benchmarks show about 60% KV-cache reduction vs f16 at the same context, which is the table stakes for pushing context on small VRAM.&lt;/p&gt;

&lt;p&gt;The slot count asymmetry matters and we want to be upfront about it: llama.cpp divides &lt;code&gt;--ctx-size&lt;/code&gt; by &lt;code&gt;--parallel&lt;/code&gt; to get per-slot context. With &lt;code&gt;parallelSlots=16&lt;/code&gt; and 65K total context, each slot gets 4 K tokens, which is enough for chat/coding/agentic prompts but rejects 5 K+ long-context requests. Dropping to &lt;code&gt;parallelSlots=1&lt;/code&gt; gives every request the full 65 K, at the cost of serving concurrent long-context requests from a queue. Readers should treat llama.cpp's &lt;code&gt;long_context&lt;/code&gt; c=16/c=64 numbers as queue-behavior measurements, not throughput measurements.&lt;/p&gt;

&lt;h3&gt;
  
  
  vLLM candidate
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Source&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;sakamakismile/Qwen3.6-27B-NVFP4&lt;/code&gt; (~14 GB)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Parallelism&lt;/td&gt;
&lt;td&gt;tensor-parallel (TP=2)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Quantization&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;compressed-tensors&lt;/code&gt; wrapping NVFP4 (Blackwell-native 4-bit float)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;KV cache&lt;/td&gt;
&lt;td&gt;FP8 E4M3 (8 bits)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Max context&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;16,384&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Attention backend&lt;/td&gt;
&lt;td&gt;FLASHINFER&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CUDA graphs&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;disabled&lt;/strong&gt; (&lt;code&gt;--enforce-eager&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Prefix caching&lt;/td&gt;
&lt;td&gt;on&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Chunked prefill&lt;/td&gt;
&lt;td&gt;on&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Image&lt;/td&gt;
&lt;td&gt;&lt;code&gt;vllm/vllm-openai:latest&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Two forced choices here deserve a note:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;--enforce-eager&lt;/code&gt;&lt;/strong&gt; because CUDA graph capture for NVFP4 plus VLM weights plus KV cache exhausts the 15.48 GiB budget before KV init even starts. Skipping graph capture costs about 10 to 15% throughput, which becomes part of the fair comparison: on this hardware class vLLM gives up one of its own optimizations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;maxModelLen: 16384&lt;/code&gt;&lt;/strong&gt; is not "the model's ceiling". It is what fits after NVFP4 weights (14 GB / 2 = 7 GB/shard), vision encoder (~2 GB), KV cache at FP8, and activations. 32K OOMs during profile; 16K fits with about 1 GiB headroom.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Workloads
&lt;/h3&gt;

&lt;p&gt;Five patterns × four concurrency levels per runtime:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Pattern&lt;/th&gt;
&lt;th&gt;Shape&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;chat&lt;/td&gt;
&lt;td&gt;128-in / 256-out, 20 prompts&lt;/td&gt;
&lt;td&gt;Interactive baseline&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;coding&lt;/td&gt;
&lt;td&gt;1K-in / 1K-out, 20 prompts&lt;/td&gt;
&lt;td&gt;Typical code-gen turn&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;long_context&lt;/td&gt;
&lt;td&gt;~5K-in / 1K-out, 10 prompts&lt;/td&gt;
&lt;td&gt;Code review, RAG-heavy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;long_context_extreme&lt;/td&gt;
&lt;td&gt;~43K-in / 1K-out, 10 prompts&lt;/td&gt;
&lt;td&gt;vLLM's 16K cap cannot attempt this&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;agentic&lt;/td&gt;
&lt;td&gt;4K shared prefix + 512 delta / 512-out, 20 prompts&lt;/td&gt;
&lt;td&gt;Stresses prefix caching&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Concurrency &lt;code&gt;1, 4, 16, 64&lt;/code&gt;. Per cell: 2 min warmup (discarded) + 5 min measurement. Temperature 0, seed 42, streaming on.&lt;/p&gt;

&lt;p&gt;The full workload matrix is 40 cells (5 × 4 × 2 runtimes). We run 36 of them. &lt;code&gt;long_context_extreme&lt;/code&gt; is not attempted on vLLM because its 16K cap would reject every prompt before submission. That asymmetry is one of the bake-off's findings, not a methodology gap.&lt;/p&gt;
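
&lt;p&gt;For concreteness, this is roughly the request shape the harness sends to either runtime's OpenAI-compatible endpoint. It's a hedged sketch, not the harness code; the endpoint URL, model id, and prompt are placeholders:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;package main

import (
    "bytes"
    "encoding/json"
    "fmt"
    "net/http"
)

func main() {
    // Placeholder endpoint and model id; the real harness targets the LLMKube
    // Service for whichever runtime is under test.
    body, _ := json.Marshal(map[string]any{
        "model":       "qwen36-27b",
        "messages":    []map[string]string{{"role": "user", "content": "Write a binary search in Go."}},
        "temperature": 0,    // deterministic sampling
        "seed":        42,   // fixed seed, per the method above
        "stream":      true, // streaming on, so TTFT and ITL can be timestamped per chunk
        "max_tokens":  1024,
    })
    resp, err := http.Post("http://localhost:8080/v1/chat/completions", "application/json", bytes.NewReader(body))
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()
    fmt.Println("status:", resp.Status) // the real harness reads the SSE stream and records per-token timings
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;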

&lt;h2&gt;
  
  
  5. Results: throughput and latency
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Single-request latency (c=1)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;pattern&lt;/th&gt;
&lt;th&gt;llama.cpp TTFT p50&lt;/th&gt;
&lt;th&gt;vLLM TTFT p50&lt;/th&gt;
&lt;th&gt;Winner&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;chat&lt;/td&gt;
&lt;td&gt;208 ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;157 ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;vLLM&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;coding&lt;/td&gt;
&lt;td&gt;413 ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;106 ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;vLLM&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;agentic&lt;/td&gt;
&lt;td&gt;911 ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;409 ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;vLLM&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;long_context (5K)&lt;/td&gt;
&lt;td&gt;2,279 ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;581 ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;vLLM&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;vLLM is faster at single-request latency across the board, typically 2 to 4× on prefill-heavy patterns. llama.cpp plus TurboQuant pays a prefill tax: compressing the KV cache to about 3 bits per element is memory-cheap and compute-expensive. On short prompts the gap is narrow; on long prompts it opens up.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Quantization caveat:&lt;/strong&gt; these numbers compare Q4_K_M (llama.cpp) against NVFP4 (vLLM). They are not the same quantization, and on this hardware there is no apples-to-apples option: llama.cpp doesn't ship an NVFP4 runtime, and Q4_K_M has no vLLM implementation. We've filled out a side-by-side output-quality check in &lt;a href="https://github.com/defilantech/llmkube-bench/blob/main/docs/QUALITY-GATE.md" rel="noopener noreferrer"&gt;QUALITY-GATE.md&lt;/a&gt; so readers can judge whether the two quants produce comparable answers at this parameter count. Read the speed numbers as "at each runtime's native quant on this hardware," not "at identical model quality."&lt;/p&gt;

&lt;h3&gt;
  
  
  Throughput under load (c=64)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;pattern&lt;/th&gt;
&lt;th&gt;llama.cpp tok/s&lt;/th&gt;
&lt;th&gt;vLLM tok/s&lt;/th&gt;
&lt;th&gt;Ratio&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;chat&lt;/td&gt;
&lt;td&gt;94&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;345&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;3.7×&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;coding&lt;/td&gt;
&lt;td&gt;133 (60% success)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;377&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;2.8×&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;agentic&lt;/td&gt;
&lt;td&gt;72&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;262&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;3.6×&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This is vLLM's home turf. PagedAttention plus continuous batching turn 64 concurrent requests into about 90% GPU utilization; llama.cpp's slot-based scheduling (even with 16 parallel slots) serializes far more aggressively. The coding c=64 drop to 60% success on llama.cpp is KV cache saturation: with 16 slots and about 2K of per-slot context, heavy coding prompts overflow.&lt;/p&gt;

&lt;h3&gt;
  
  
  Inter-token latency
&lt;/h3&gt;

&lt;p&gt;Stable and tight on both runtimes. Median ITL:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;llama.cpp:&lt;/strong&gt; 49 to 175 ms/token across patterns and concurrencies&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;vLLM:&lt;/strong&gt; 64 to 67 ms/token across patterns and concurrencies (remarkably flat, because continuous batching amortizes decode across the batch)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The llama.cpp ITL spread widens at high concurrency as slot contention kicks in. vLLM's is basically a constant, which is what makes it good for conversational workloads where you care about per-token cadence.&lt;/p&gt;

&lt;h3&gt;
  
  
  The honest version
&lt;/h3&gt;

&lt;p&gt;vLLM wins the throughput axis. That's a real result, not a function of tuning. On 2× 16 GB consumer hardware with Qwen3.6-27B, &lt;strong&gt;if you're trying to maximize requests per second, vLLM is the answer&lt;/strong&gt;, and it wins while giving up about 10 to 15% of its own throughput to &lt;code&gt;--enforce-eager&lt;/code&gt; (disabled CUDA graphs were required to fit VRAM). The NVFP4 kernels on Blackwell, PagedAttention's batching, and continuous prefill scheduling all compound even with that handicap.&lt;/p&gt;

&lt;p&gt;Except…&lt;/p&gt;

&lt;h2&gt;
  
  
  6. Results: context
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The 5K baseline
&lt;/h3&gt;

&lt;p&gt;Both runtimes serve &lt;code&gt;long_context&lt;/code&gt; (about 5K input tokens, 1K output) at c=1 in about 13 seconds end-to-end. llama.cpp measures 20 tok/s, vLLM 19 tok/s. &lt;strong&gt;Near parity&lt;/strong&gt; at this context size.&lt;/p&gt;

&lt;p&gt;At higher concurrency the story differs because we configured llama.cpp with &lt;code&gt;parallelSlots=1&lt;/code&gt; to give every request the full 65K context (required for the extreme pattern, see below). Concurrency c=16 and c=64 on llama.cpp show queue saturation: the harness sends 16 or 64 concurrent requests, but the server processes them serially. That's not a throughput measurement, it's a queue measurement. On production llama.cpp with &lt;code&gt;parallelSlots=16&lt;/code&gt; and a smaller per-request context, short-prompt throughput would match our earlier numbers, but then you can't serve 43K prompts.&lt;/p&gt;

&lt;h3&gt;
  
  
  Which brings us to the real test
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;long_context_extreme: a roughly 43,000-token prompt in, 1024 tokens out.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;vLLM, as configured here, can't attempt this.&lt;/strong&gt; Its &lt;code&gt;maxModelLen&lt;/code&gt; is 16K, set that way because 32K OOMs during graph capture on this hardware. A 43K-token request is rejected before it reaches inference. We did not explore &lt;code&gt;--swap-space&lt;/code&gt; CPU offload, which in principle could trade a lot of latency for more context; that's a follow-up. Out of the box on 2× 16 GB consumer cards with Qwen3.6-27B NVFP4, we did not find an in-memory configuration that serves 43K.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;llama.cpp plus TurboQuant served it.&lt;/strong&gt; One sample captured at c=16 end-to-end:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Prompt tokens: about 43,000&lt;/li&gt;
&lt;li&gt;Prefill time (TTFT): &lt;strong&gt;186 seconds&lt;/strong&gt; (3.1 min)&lt;/li&gt;
&lt;li&gt;Decode rate: &lt;strong&gt;171 ms/token&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Output: 1024 tokens in about 175 seconds&lt;/li&gt;
&lt;li&gt;Total wall time: about 6 minutes per request&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is not fast. It's not meant to be fast. What it is, is &lt;strong&gt;possible&lt;/strong&gt;. TurboQuant's roughly 3-bit KV cache makes the memory math work where FP16 or FP8 KV can't. On the same hardware, at the same moment, one runtime cannot attempt the workload and the other completes it.&lt;/p&gt;

&lt;p&gt;The higher-concurrency cells for this pattern hit our harness's 300s per-request timeout because decode plus prefill combined exceeds 300s. Bumping the harness timeout to 600s would capture all four c-levels cleanly; that's a follow-up. The c=1 and c=16 samples are enough to prove the capability.&lt;/p&gt;

&lt;h3&gt;
  
  
  The real tradeoff
&lt;/h3&gt;

&lt;p&gt;Throughput versus context is the tradeoff, not "vLLM is better" or "llama.cpp is better". On this hardware:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Production chat, interactive coding, short agentic loops&lt;/strong&gt; (≤ 8K context): &lt;strong&gt;vLLM.&lt;/strong&gt; 3 to 4× throughput, lower TTFT, better ITL stability.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Long-document review, RAG with full-file context, overnight batch agentic on 40K+ codebases&lt;/strong&gt; (&amp;gt; 16K context): &lt;strong&gt;llama.cpp plus TurboQuant.&lt;/strong&gt; Slower per token, but it's the only runtime that serves the workload at all.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For many real workloads the answer is "run both." vLLM for the chat endpoint, llama.cpp for the batch endpoint that processes whole PRs overnight.&lt;/p&gt;

&lt;h2&gt;
  
  
  7. What it costs
&lt;/h2&gt;

&lt;p&gt;Throughput numbers are interesting. Dollars per token are what actually get budgets approved.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://infercost.ai" rel="noopener noreferrer"&gt;InferCost&lt;/a&gt; is our companion tool: a Kubernetes operator that reads real-time GPU power draw from DCGM, combines it with hardware amortization and electricity rates declared on a &lt;code&gt;CostProfile&lt;/code&gt; CR, and computes the real cost of inference. It discovers inference pods by the &lt;code&gt;inference.llmkube.dev/model&lt;/code&gt; label LLMKube stamps on each Deployment, scrapes each pod's &lt;code&gt;/metrics&lt;/code&gt; endpoint directly (no Prometheus required), and writes cost attribution into a &lt;code&gt;UsageReport&lt;/code&gt; custom resource.&lt;/p&gt;

&lt;p&gt;Here's a live &lt;code&gt;UsageReport&lt;/code&gt; status from shadowstack, captured after a 10-minute mixed workload:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="s"&gt;$ kubectl -n bench get usagereport bench-window -o yaml&lt;/span&gt;
&lt;span class="nn"&gt;...&lt;/span&gt;
&lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;period&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2026-04-23"&lt;/span&gt;
  &lt;span class="na"&gt;periodStart&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2026-04-23T00:00:00Z"&lt;/span&gt;
  &lt;span class="na"&gt;periodEnd&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;   &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2026-04-23T21:21:42Z"&lt;/span&gt;
  &lt;span class="na"&gt;inputTokens&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  &lt;span class="m"&gt;638&lt;/span&gt;
  &lt;span class="na"&gt;outputTokens&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;12400&lt;/span&gt;
  &lt;span class="na"&gt;activeEnergyKWh&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;     &lt;span class="m"&gt;0.645&lt;/span&gt;
  &lt;span class="na"&gt;activeHoursInPeriod&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;4.53&lt;/span&gt;
  &lt;span class="na"&gt;totalHoursInPeriod&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  &lt;span class="m"&gt;21.36&lt;/span&gt;
  &lt;span class="na"&gt;utilizationPercent&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  &lt;span class="m"&gt;21.20&lt;/span&gt;
  &lt;span class="na"&gt;estimatedCostUSD&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;             &lt;span class="m"&gt;0.83&lt;/span&gt;
  &lt;span class="na"&gt;costPerMillionTokens&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;         &lt;span class="m"&gt;63.79&lt;/span&gt;   &lt;span class="c1"&gt;# amortized&lt;/span&gt;
  &lt;span class="na"&gt;marginalCostPerMillionTokens&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  &lt;span class="m"&gt;3.96&lt;/span&gt;   &lt;span class="c1"&gt;# electricity during active serving&lt;/span&gt;
  &lt;span class="na"&gt;byModel&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;     &lt;span class="s"&gt;qwen36-27b-llamacpp&lt;/span&gt;
    &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;bench&lt;/span&gt;
    &lt;span class="na"&gt;inputTokens&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  &lt;span class="m"&gt;638&lt;/span&gt;
    &lt;span class="na"&gt;outputTokens&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;12400&lt;/span&gt;
    &lt;span class="na"&gt;costPerMillionTokens&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;63.79&lt;/span&gt;
    &lt;span class="na"&gt;estimatedCostUSD&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.83&lt;/span&gt;
  &lt;span class="na"&gt;byNamespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;bench&lt;/span&gt;
    &lt;span class="na"&gt;tokenCount&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;13038&lt;/span&gt;
    &lt;span class="na"&gt;estimatedCostUSD&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.83&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The numbers look alarming at first: &lt;strong&gt;$63.79/MTok amortized&lt;/strong&gt; for a tiny workload against a day's worth of hardware amortization. That's the point. At 21.2% utilization over this window, amortized is &lt;strong&gt;16× higher than marginal&lt;/strong&gt;. Scale up the utilization and the amortized number drops toward the marginal one; that's what the bench window numbers below capture.&lt;/p&gt;

&lt;p&gt;The full bench window (Apr 23, 2026, 00:00 UTC → 10:07 UTC, ~10 hours), from &lt;code&gt;summary.csv&lt;/code&gt; cross-referenced with the &lt;code&gt;CostProfile&lt;/code&gt; spec:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Total input tokens&lt;/td&gt;
&lt;td&gt;2,518,242&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total output tokens&lt;/td&gt;
&lt;td&gt;1,233,143&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total tokens&lt;/td&gt;
&lt;td&gt;3,751,385&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Active GPU energy&lt;/td&gt;
&lt;td&gt;0.459 kWh&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Utilization (active hours / wall-clock hours)&lt;/td&gt;
&lt;td&gt;32.7%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total dollar cost (amortization + electricity)&lt;/td&gt;
&lt;td&gt;$0.50&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Hardware amortization on the &lt;code&gt;CostProfile&lt;/code&gt; spec: 2× RTX 5060 Ti at $480 each = $960, 3-year useful life, 5% annual maintenance. Electricity $0.08/kWh, PUE 1.0.&lt;/p&gt;
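
&lt;p&gt;As a sanity check, both headline numbers can be recomputed from those inputs plus the bench-window table above. This is an illustrative recomputation, not InferCost's code; it treats maintenance as 5% of the purchase price per year and leaves idle electricity out, which is why the amortized figure lands about a cent under the reported $0.13:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;package main

import "fmt"

func main() {
    // Inputs from the CostProfile spec and the bench-window table above.
    const (
        hardwareUSD     = 960.0     // 2× RTX 5060 Ti at $480 each
        lifeYears       = 3.0       // useful life
        maintFraction   = 0.05      // assumed: 5% of purchase price per year
        ratePerKWh      = 0.08      // electricity, PUE 1.0
        windowHours     = 10.12     // Apr 23 00:00 → 10:07 UTC
        activeEnergyKWh = 0.459     // active GPU energy over the window
        totalTokens     = 3751385.0 // input + output tokens over the window
    )

    // Amortization accrues every wall-clock hour, serving or not.
    amortPerHour := (hardwareUSD/lifeYears + hardwareUSD*maintFraction) / (365 * 24)

    activeElectricityUSD := activeEnergyKWh * ratePerKWh
    amortizedUSD := amortPerHour*windowHours + activeElectricityUSD // idle electricity omitted
    mtok := totalTokens / 1e6

    fmt.Printf("amortized: $%.2f over the window, $%.3f/MTok\n", amortizedUSD, amortizedUSD/mtok)
    fmt.Printf("marginal:  $%.3f over the window, $%.4f/MTok\n", activeElectricityUSD, activeElectricityUSD/mtok)
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;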

&lt;h3&gt;
  
  
  The two numbers
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;th&gt;Which question it answers&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;&lt;code&gt;costPerMillionTokens&lt;/code&gt;&lt;/strong&gt; (amortized)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$0.13&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;"What did my hardware cost per token I served today?"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;code&gt;marginalCostPerMillionTokens&lt;/code&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$0.010&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;"What did the electricity actually cost to generate those tokens?"&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Both numbers are correct. They answer different questions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Amortized $0.13/MTok&lt;/strong&gt; spreads the full cost of hardware ownership (amortization, idle electricity, active electricity) across whatever tokens you served today. It tells you the answer to "was today's inference worth what we paid for the hardware?" At 32.7% utilization, you're leaving about two-thirds of the compute capacity you already bought idle, and the amortized rate reflects that.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Marginal $0.010/MTok&lt;/strong&gt; includes only the electricity drawn during active serving. It answers "what did these specific tokens cost me beyond what I'd be paying anyway?", the relevant comparison when cloud APIs only bill marginally.&lt;/p&gt;

&lt;p&gt;The 13× gap between them is the entire FinOps conversation. At 100% utilization the two numbers converge; at low utilization they diverge by more than an order of magnitude. Neither is the "right" number. They describe different things.&lt;/p&gt;

&lt;h2&gt;
  
  
  8. Cloud comparison
&lt;/h2&gt;

&lt;p&gt;Cloud APIs bill marginally. That's how they work: no inference, no invoice. So the fair comparison against on-prem is &lt;strong&gt;marginal versus marginal&lt;/strong&gt;. Cloud prices below are &lt;strong&gt;output token pricing&lt;/strong&gt; on public pricing pages as of April 2026; check each provider for current rates and input-vs-output splits.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Provider / Model&lt;/th&gt;
&lt;th&gt;Output $/MTok&lt;/th&gt;
&lt;th&gt;On-prem ratio (marginal)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;shadowstack marginal&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$0.010&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1×&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OpenAI GPT-4o&lt;/td&gt;
&lt;td&gt;$10.00&lt;/td&gt;
&lt;td&gt;1,000× cheaper on-prem&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Google Gemini 2.5 Pro&lt;/td&gt;
&lt;td&gt;$10.00&lt;/td&gt;
&lt;td&gt;1,000× cheaper on-prem&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Anthropic Claude Opus 4.5&lt;/td&gt;
&lt;td&gt;$25.00&lt;/td&gt;
&lt;td&gt;2,500× cheaper on-prem&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Those ratios are almost offensive. They're also the upper bound — the &lt;strong&gt;ceiling of savings if you saturated this hardware&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The floor, at the bench window's 32.7% utilization (i.e., our actual mixed-workload cost over ten hours), uses the amortized number:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Provider / Model&lt;/th&gt;
&lt;th&gt;Output $/MTok&lt;/th&gt;
&lt;th&gt;On-prem ratio (amortized at 32.7%)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;shadowstack amortized&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$0.13&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1×&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OpenAI GPT-4o&lt;/td&gt;
&lt;td&gt;$10.00&lt;/td&gt;
&lt;td&gt;77× cheaper on-prem&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Google Gemini 2.5 Pro&lt;/td&gt;
&lt;td&gt;$10.00&lt;/td&gt;
&lt;td&gt;77× cheaper on-prem&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Anthropic Claude Opus 4.5&lt;/td&gt;
&lt;td&gt;$25.00&lt;/td&gt;
&lt;td&gt;192× cheaper on-prem&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Even the worst case, amortized cost at 32.7% utilization, is &lt;strong&gt;77× cheaper than GPT-4o or Gemini 2.5 Pro&lt;/strong&gt; on output tokens. Against Claude Opus 4.5 (Anthropic's flagship large-frontier model), on-prem is 192× cheaper dollar for dollar. Those ratios do narrow on a blended input-plus-output basis, but the direction doesn't change.&lt;/p&gt;

&lt;p&gt;For context on the hardware investment: $960 of GPUs pays for itself in Opus 4.5 output tokens at roughly &lt;strong&gt;38.4 million tokens of traffic&lt;/strong&gt;. At a modest 100K output tokens a day that's about a year; at 1M output tokens a day (a small agentic coding team), it's under six weeks. Against GPT-4o or Gemini 2.5 Pro the break-even point is 96M output tokens: ~2.6 years at 100K/day, ~3 months at 1M/day. Input tokens are cheaper on every cloud model, so a realistic blended workload stretches those numbers modestly, but not by an order of magnitude.&lt;/p&gt;
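
&lt;p&gt;The break-even arithmetic is simple enough to check yourself. A small sketch using the output-token prices from the tables above; it ignores the roughly $0.01/MTok of electricity, which doesn't move the answer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;package main

import "fmt"

func main() {
    const hardwareUSD = 960.0 // the two-GPU street price

    clouds := []struct {
        name          string
        outputPerMTok float64
    }{
        {"GPT-4o / Gemini 2.5 Pro", 10.00},
        {"Claude Opus 4.5", 25.00},
    }

    for _, c := range clouds {
        breakEvenMTok := hardwareUSD / c.outputPerMTok
        fmt.Printf("%s: hardware paid off after %.1fM output tokens\n", c.name, breakEvenMTok)
        for _, perDay := range []float64{100_000, 1_000_000} {
            fmt.Printf("  at %.0f output tokens/day: %.0f days\n", perDay, breakEvenMTok*1e6/perDay)
        }
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;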

&lt;p&gt;This math is why enterprises with serious inference budgets are re-examining on-prem. It's not about paranoia or data residency (though those help). It's that the marginal economics on modern consumer GPUs, with the right runtime, genuinely work.&lt;/p&gt;

&lt;h2&gt;
  
  
  9. Reproduce it yourself
&lt;/h2&gt;

&lt;p&gt;Everything is in the public repo: &lt;a href="https://github.com/defilantech/llmkube-bench" rel="noopener noreferrer"&gt;github.com/defilantech/llmkube-bench&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Requires: K8s cluster with LLMKube v0.7+, 2× NVIDIA 16+ GB, DCGM exporter,&lt;/span&gt;
&lt;span class="c"&gt;# hf-token Secret in the bench namespace.&lt;/span&gt;
git clone https://github.com/defilantech/llmkube-bench.git
&lt;span class="nb"&gt;cd &lt;/span&gt;llmkube-bench
make &lt;span class="nb"&gt;install&lt;/span&gt;                                      &lt;span class="c"&gt;# Python deps via uv&lt;/span&gt;
make bench &lt;span class="nv"&gt;RESULTS_DIR&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;results/&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; +%F&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="nt"&gt;-myhw&lt;/span&gt;   &lt;span class="c"&gt;# ~3-4 hours for full matrix&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's the workstation path. The bench also runs &lt;strong&gt;fully in-cluster&lt;/strong&gt; — a Kaniko Job builds the harness image, a bench-runner Job with a scoped ServiceAccount orchestrates the runtime swaps, results land on a hostPath volume. See &lt;code&gt;manifests/bench-runner/README.md&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Every number in this post traces to a row in &lt;code&gt;results/2026-04-23-shadowstack/summary.csv&lt;/code&gt;. Every manifest, every image digest, every Prometheus snapshot is committed.&lt;/p&gt;

&lt;h2&gt;
  
  
  10. What's next
&lt;/h2&gt;

&lt;p&gt;A few things we'd do differently on the next bench:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Raise the harness per-request timeout&lt;/strong&gt; from 300s to 600s so &lt;code&gt;long_context_extreme&lt;/code&gt; at higher concurrencies captures cleanly. The one sample we got is defensible; four clean samples would be better.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test with Qwen's own FP4 release&lt;/strong&gt; once they ship one. The &lt;code&gt;sakamakismile&lt;/code&gt; community NVFP4 has been solid for the throughput measurements, but an official Qwen FP4 would remove a variable from the methodology.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-node llama.cpp&lt;/strong&gt; would close the long-context throughput gap. Splitting layers across 4 GPUs instead of 2 gives per-shard VRAM headroom for higher &lt;code&gt;--parallel&lt;/code&gt; settings and cuts the TurboQuant prefill time roughly in half.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But the big-picture answer is already here. On $800 of consumer GPUs, you can serve the same day's flagship open-source model, either at throughput whose marginal cost undercuts cloud APIs by orders of magnitude or at context lengths this hardware class had no business reaching. And InferCost shows you the honest dollar math instead of the misleading single-number dashboards you'd get from every "AI observability" tool on the market.&lt;/p&gt;

&lt;p&gt;If you want to follow along:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/defilantech/llmkube" rel="noopener noreferrer"&gt;github.com/defilantech/llmkube&lt;/a&gt; — the Kubernetes operator running both runtimes in this bench&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/defilantech/infercost" rel="noopener noreferrer"&gt;github.com/defilantech/infercost&lt;/a&gt; — the cost attribution controller producing the $/MTok numbers&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/defilantech/llmkube-bench" rel="noopener noreferrer"&gt;github.com/defilantech/llmkube-bench&lt;/a&gt; — the full reproducible bench&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://twitter.com/defilan" rel="noopener noreferrer"&gt;@defilan on X&lt;/a&gt; — where the threads go&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If this was useful, star the repos. If it was wrong about something, open an issue; the goal is accurate numbers, not winning arguments.&lt;/p&gt;

&lt;p&gt;— Chris&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>ai</category>
      <category>llm</category>
      <category>opensource</category>
    </item>
    <item>
      <title>LLMKube Now Deploys Any Inference Engine, Not Just llama.cpp</title>
      <dc:creator>Christopher Maher</dc:creator>
      <pubDate>Wed, 08 Apr 2026 01:03:15 +0000</pubDate>
      <link>https://dev.to/defilan/llmkube-now-deploys-any-inference-engine-not-just-llamacpp-fpm</link>
      <guid>https://dev.to/defilan/llmkube-now-deploys-any-inference-engine-not-just-llamacpp-fpm</guid>
      <description>&lt;p&gt;LLMKube started as a Kubernetes operator for llama.cpp. You define a Model, define an InferenceService, and the controller handles GPU scheduling, health probes, model downloads, and Prometheus metrics. It works well for GGUF models.&lt;/p&gt;

&lt;p&gt;But llama.cpp isn't the only inference engine. vLLM has PagedAttention. TGI has continuous batching. PersonaPlex does real-time voice AI. Triton serves multi-framework models. Locking the operator to one runtime limits what you can deploy.&lt;/p&gt;

&lt;p&gt;v0.6.0 changes that with pluggable runtime backends.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;Before v0.6.0, the controller's &lt;code&gt;constructDeployment()&lt;/code&gt; was hardcoded to llama.cpp. Container name, image, command-line args, health probes, model provisioning, everything assumed llama.cpp. If you wanted to deploy vLLM, you had to create a manual Kubernetes Deployment outside of LLMKube.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Fix
&lt;/h2&gt;

&lt;p&gt;A &lt;code&gt;RuntimeBackend&lt;/code&gt; interface that each inference engine implements:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;RuntimeBackend&lt;/span&gt; &lt;span class="k"&gt;interface&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;ContainerName&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;
    &lt;span class="n"&gt;DefaultImage&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;
    &lt;span class="n"&gt;DefaultPort&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="kt"&gt;int32&lt;/span&gt;
    &lt;span class="n"&gt;BuildArgs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;isvc&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;modelPath&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;port&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;
    &lt;span class="n"&gt;BuildProbes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;port&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;startup&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;liveness&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;readiness&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;NeedsModelInit&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="kt"&gt;bool&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The controller calls &lt;code&gt;resolveBackend(isvc)&lt;/code&gt; based on the &lt;code&gt;runtime&lt;/code&gt; field in the CRD, then delegates all container configuration to the backend. llama.cpp is the default. New runtimes register in a simple switch statement.&lt;/p&gt;
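
&lt;p&gt;A minimal sketch of what that registration looks like. The concrete backend type names and container names are illustrative rather than the exact ones in the repo, the interface is trimmed to one method, and the real &lt;code&gt;resolveBackend&lt;/code&gt; takes the InferenceService object rather than a bare string:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;package controller

// Trimmed for the sketch; the full RuntimeBackend interface is shown above.
type RuntimeBackend interface {
    ContainerName() string
}

// Illustrative backend types; the real ones implement the full interface.
type llamaCppBackend struct{}
type vllmBackend struct{}
type personaPlexBackend struct{}

func (llamaCppBackend) ContainerName() string    { return "llama-server" }
func (vllmBackend) ContainerName() string        { return "vllm" }
func (personaPlexBackend) ContainerName() string { return "personaplex" }

// resolveBackend picks a backend from the CRD's runtime field.
// An empty runtime falls through to llama.cpp, the historical default.
func resolveBackend(runtime string) RuntimeBackend {
    switch runtime {
    case "vllm":
        return vllmBackend{}
    case "personaplex":
        return personaPlexBackend{}
    default:
        return llamaCppBackend{}
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;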

&lt;h2&gt;
  
  
  Testing It: PersonaPlex on Kubernetes
&lt;/h2&gt;

&lt;p&gt;To prove the architecture works, I deployed NVIDIA's PersonaPlex on my home lab. PersonaPlex is a 7B speech-to-speech model based on Moshi. It listens and talks at the same time. Sub-300ms latency for interruptions. Completely different from llama.cpp: PyTorch runtime, WebSocket-based health checks, model downloaded via HuggingFace token.&lt;/p&gt;

&lt;p&gt;The InferenceService CRD:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;inference.llmkube.dev/v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;InferenceService&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;personaplex&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;voice-ai&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;modelRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;personaplex-7b&lt;/span&gt;
  &lt;span class="na"&gt;runtime&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;personaplex&lt;/span&gt;
  &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;registry.defilan.net/personaplex:7b-v1-4bit-cuda13&lt;/span&gt;
  &lt;span class="na"&gt;personaPlexConfig&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;quantize4Bit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="na"&gt;hfTokenSecretRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;hf-token&lt;/span&gt;
      &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;HF_TOKEN&lt;/span&gt;
  &lt;span class="na"&gt;endpoint&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8998&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;NodePort&lt;/span&gt;
  &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;gpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
    &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;32Gi"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;kubectl apply&lt;/code&gt; and it's running. The controller:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sets the container command to &lt;code&gt;python -m moshi.server&lt;/code&gt; (via the PersonaPlex backend's &lt;code&gt;CommandBuilder&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Configures TCP socket probes on port 8998 (PersonaPlex uses WebSockets, not HTTP /health)&lt;/li&gt;
&lt;li&gt;Injects &lt;code&gt;HF_TOKEN&lt;/code&gt; from a Kubernetes Secret and &lt;code&gt;NO_TORCH_COMPILE&lt;/code&gt; env var&lt;/li&gt;
&lt;li&gt;Skips the model download init container (model downloads at startup via HF Hub)&lt;/li&gt;
&lt;li&gt;Requests 1 GPU with 32Gi memory&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The result: real-time voice conversation running on a single RTX 5060 Ti, managed by the same operator that handles my llama.cpp text inference.&lt;/p&gt;
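
&lt;p&gt;If you want to sanity-check what the controller wired up, the rendered Deployment shows the TCP probe and env injection described above. A quick look (the Deployment name here is my assumption that the operator names it after the InferenceService):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Probe config: should show a tcpSocket check on 8998, not an HTTP GET
kubectl -n voice-ai get deployment personaplex -o yaml | grep -A3 tcpSocket

# Env wiring: HF_TOKEN (from the Secret) and NO_TORCH_COMPILE should both be present
kubectl -n voice-ai get deployment personaplex \
  -o jsonpath='{.spec.template.spec.containers[0].env[*].name}'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
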

&lt;h2&gt;
  
  
  Built-in vLLM Runtime
&lt;/h2&gt;

&lt;p&gt;vLLM is probably the most requested inference engine in the Kubernetes ecosystem. v0.6.0 ships it as a first-class runtime with typed &lt;code&gt;VLLMConfig&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;inference.llmkube.dev/v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;InferenceService&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;vllm-tinyllama&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;modelRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;tinyllama-1b&lt;/span&gt;
  &lt;span class="na"&gt;runtime&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;vllm&lt;/span&gt;
  &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;vllm/vllm-openai:cu130-nightly&lt;/span&gt;
  &lt;span class="na"&gt;skipModelInit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;vllmConfig&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;maxModelLen&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2048&lt;/span&gt;
    &lt;span class="na"&gt;dtype&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;float16&lt;/span&gt;
    &lt;span class="na"&gt;hfTokenSecretRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;hf-token&lt;/span&gt;
      &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;HF_TOKEN&lt;/span&gt;
  &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;gpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
    &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8Gi"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The controller generates the right args (&lt;code&gt;--model&lt;/code&gt;, &lt;code&gt;--tensor-parallel-size&lt;/code&gt;, &lt;code&gt;--max-model-len&lt;/code&gt;, &lt;code&gt;--quantization&lt;/code&gt;, &lt;code&gt;--dtype&lt;/code&gt;), configures HTTP &lt;code&gt;/health&lt;/code&gt; probes on port 8000, and injects HF_TOKEN from a Secret. I tested this on my cluster with TinyLlama-1.1B and got a working OpenAI-compatible endpoint in under two minutes.&lt;/p&gt;
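
&lt;p&gt;Poking at it is just the standard OpenAI API dance. The Service name below assumes the operator names it after the InferenceService (check &lt;code&gt;kubectl get svc&lt;/code&gt; for the real one); vLLM's OpenAI server listens on port 8000:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl port-forward svc/vllm-tinyllama 8000:8000   # leave running in one terminal

# in another terminal: list the served model id, then send a chat request
curl -s http://localhost:8000/v1/models | jq -r '.data[].id'

# use whatever id /v1/models reported as the "model" value
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "tinyllama-1b", "messages": [{"role": "user", "content": "Say hello"}], "max_tokens": 32}' \
  | jq -r '.choices[0].message.content'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
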

&lt;h2&gt;
  
  
  Built-in TGI Runtime
&lt;/h2&gt;

&lt;p&gt;HuggingFace's Text Generation Inference also ships as a built-in runtime. TGI downloads models directly from HuggingFace Hub, so &lt;code&gt;skipModelInit&lt;/code&gt; isn't even needed. The &lt;code&gt;TGIConfig&lt;/code&gt; supports quantization methods (bitsandbytes, gptq, awq, eetq), max token limits, and dtype.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Generic Runtime
&lt;/h2&gt;

&lt;p&gt;Not every inference engine needs first-class support. The &lt;code&gt;generic&lt;/code&gt; runtime lets you deploy any container:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;runtime&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;generic&lt;/span&gt;
  &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-custom-server:latest&lt;/span&gt;
  &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/app/serve"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--port"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8080"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="na"&gt;containerPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8080&lt;/span&gt;
  &lt;span class="na"&gt;skipModelInit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;probeOverrides&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;startup&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;tcpSocket&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8080&lt;/span&gt;
      &lt;span class="na"&gt;failureThreshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;60&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You provide the image, args, probes, and env. The controller handles GPU scheduling, service creation, and lifecycle management.&lt;/p&gt;

&lt;h2&gt;
  
  
  Per-Runtime Autoscaling
&lt;/h2&gt;

&lt;p&gt;Each runtime defines its default HPA metric via the &lt;code&gt;HPAMetricProvider&lt;/code&gt; interface. When you enable autoscaling without specifying a metric, the controller picks the right one for your runtime:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;llama.cpp: &lt;code&gt;llamacpp:requests_processing&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;vLLM: &lt;code&gt;vllm:num_requests_running&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;TGI: &lt;code&gt;tgi:queue_size&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No more hardcoded metric names.&lt;/p&gt;
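
&lt;p&gt;If you want to see which metric the controller actually picked for a given service, the generated HPA spells it out (assuming your cluster is on the autoscaling/v2 API):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Print the pods metric each HPA scales on
kubectl get hpa -A \
  -o custom-columns='NAME:.metadata.name,METRIC:.spec.metrics[0].pods.metric.name'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
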

&lt;h2&gt;
  
  
  Adding Your Own Runtime
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;docs/adding-a-runtime.md&lt;/code&gt; documents the full process: implement the &lt;code&gt;RuntimeBackend&lt;/code&gt; interface, optionally add &lt;code&gt;CommandBuilder&lt;/code&gt;, &lt;code&gt;EnvBuilder&lt;/code&gt;, or &lt;code&gt;HPAMetricProvider&lt;/code&gt;, register in the switch statement, add your CRD config struct, and run &lt;code&gt;make manifests generate&lt;/code&gt;. The pattern is established with five working examples.&lt;/p&gt;

&lt;h2&gt;
  
  
  Everything Else in v0.6.0
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;CUDA 13 default image for RTX 50-series and Qwen3.5 support&lt;/li&gt;
&lt;li&gt;Custom GPU layer splits for multi-GPU sharding&lt;/li&gt;
&lt;li&gt;Helm image registry/repository separation for air-gapped deployments&lt;/li&gt;
&lt;li&gt;Grafana inference metrics dashboard (tokens/sec, queue depth, KV cache, reconcile health)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;imagePullSecrets&lt;/code&gt; on InferenceService for private registries&lt;/li&gt;
&lt;li&gt;HPA autoscaling for InferenceService&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;Triton Inference Server and Ollama as built-in runtimes. Better Model controller support for non-GGUF formats (HuggingFace repo IDs as sources). And potentially Kubernetes-native voice AI pipelines combining PersonaPlex with LLMKube-managed reasoning models.&lt;/p&gt;

&lt;p&gt;GitHub: &lt;a href="https://github.com/defilantech/llmkube" rel="noopener noreferrer"&gt;https://github.com/defilantech/llmkube&lt;/a&gt;&lt;/p&gt;

</description>
      <category>llm</category>
      <category>opensource</category>
      <category>ai</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>I tested speculative decoding on my home GPU cluster. Here's why it didn't help.</title>
      <dc:creator>Christopher Maher</dc:creator>
      <pubDate>Mon, 06 Apr 2026 03:51:51 +0000</pubDate>
      <link>https://dev.to/defilan/i-tested-speculative-decoding-on-my-home-gpu-cluster-heres-why-it-didnt-help-3ej6</link>
      <guid>https://dev.to/defilan/i-tested-speculative-decoding-on-my-home-gpu-cluster-heres-why-it-didnt-help-3ej6</guid>
      <description>&lt;p&gt;I spent Saturday night testing n-gram speculative decoding on consumer GPUs. The claim: speculative decoding can speed up LLM inference by 2-3x by predicting future tokens and verifying them in parallel.&lt;/p&gt;

&lt;p&gt;I wanted to see if that holds up on real hardware running diverse workloads. For the most part, it doesn't. But the journey was worth it, and I caught a benchmarking pitfall that I think a lot of people are falling into.&lt;/p&gt;

&lt;h2&gt;
  
  
  The setup
&lt;/h2&gt;

&lt;p&gt;My home lab runs Kubernetes on a machine called ShadowStack. Two NVIDIA RTX 5060 Ti GPUs (16GB VRAM each, 32GB total). I use LLMKube, an open source K8s operator I built, to manage LLM inference workloads with llama.cpp.&lt;/p&gt;

&lt;p&gt;For this test I deployed two models:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Gemma 4 26B-A4B&lt;/strong&gt;: Google's Mixture of Experts model. 26B total params but only ~4B active per token. Runs at 88 tok/s on my setup.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Qwen3-32B&lt;/strong&gt;: A dense 32B model. All parameters active per token. Runs at 20 tok/s.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both running Q4_K_M quantization, flash attention enabled, 8K context, split across both GPUs.&lt;/p&gt;

&lt;p&gt;Quick note on why the MoE model is so much faster: Gemma 4 only activates a fraction of its parameters per token, so there's way less weight data to read from VRAM on each forward pass. MoE routing overhead eats into some of that advantage, but it's still a huge win on bandwidth-constrained hardware.&lt;/p&gt;
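
&lt;p&gt;A rough back-of-envelope makes the gap concrete. These are my illustrative numbers, not measurements: assume Q4_K_M lands around 0.6 bytes per parameter and that every generated token streams the active weights from VRAM:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# bytes read per token, dense 32B vs ~4B-active MoE (very rough)
awk 'BEGIN {
  dense = 32e9 * 0.6 / 1e9;  moe = 4e9 * 0.6 / 1e9
  printf "dense: %.1f GB/token   moe: %.1f GB/token   ratio: %.1fx\n", dense, moe, dense / moe
}'
# ~8x fewer bytes per token for the MoE; the observed gap is 88/20 = 4.4x,
# with the rest plausibly eaten by routing overhead and work that does not scale with weights
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
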

&lt;h2&gt;
  
  
  What I tested
&lt;/h2&gt;

&lt;p&gt;llama.cpp has built-in n-gram speculative decoding. No draft model needed, you just pass a few flags:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nt"&gt;--spec-type&lt;/span&gt; ngram-mod
&lt;span class="nt"&gt;--draft-max&lt;/span&gt; 64
&lt;span class="nt"&gt;--draft-min&lt;/span&gt; 48
&lt;span class="nt"&gt;--spec-ngram-size-n&lt;/span&gt; 24
&lt;span class="nt"&gt;--spec-ngram-size-m&lt;/span&gt; 48
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;How it works: llama.cpp builds an n-gram lookup table from the recent context (both the input prompt and generated output so far). When it spots a pattern it's seen before, it speculatively drafts the next several tokens and verifies them in a single forward pass. If the predictions are right, you get multiple tokens for the cost of one.&lt;/p&gt;

&lt;p&gt;Important: this is specifically n-gram speculative decoding, not draft-model approaches like EAGLE-3 or Medusa. Those use a separate trained model to generate speculations. N-gram lookup is simpler and doesn't require any extra model files.&lt;/p&gt;

&lt;p&gt;With LLMKube, switching between configs is just updating the &lt;code&gt;extraArgs&lt;/code&gt; field in the InferenceService CRD and letting the operator restart the pod:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;modelRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gemma4-26b-a4b&lt;/span&gt;
  &lt;span class="na"&gt;extraArgs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--spec-type"&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ngram-mod"&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--draft-max"&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;64"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I tested two variants: &lt;code&gt;ngram-simple&lt;/code&gt; (basic lookup) and &lt;code&gt;ngram-mod&lt;/code&gt; (the variant recommended for MoE models in the llama.cpp docs).&lt;/p&gt;

&lt;h2&gt;
  
  
  The result that fooled me
&lt;/h2&gt;

&lt;p&gt;My first test ran the same prompt 10 times in a row. The numbers looked incredible:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Run&lt;/th&gt;
&lt;th&gt;tok/s&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1 (cold)&lt;/td&gt;
&lt;td&gt;88.3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;105.7&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;112.4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;186.4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;336.5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;419.5&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Almost 5x speedup by run 10. I was ready to write a very different article.&lt;/p&gt;

&lt;p&gt;Then I ran 8 different prompts. Code generation, API design, Go functions, bash scripts, technical explanations. Real variety.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Prompt&lt;/th&gt;
&lt;th&gt;Baseline (tok/s)&lt;/th&gt;
&lt;th&gt;+ ngram-mod (tok/s)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;BST implementation&lt;/td&gt;
&lt;td&gt;88.3&lt;/td&gt;
&lt;td&gt;94.2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;K8s operator explanation&lt;/td&gt;
&lt;td&gt;88.3&lt;/td&gt;
&lt;td&gt;88.3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPU monitoring script&lt;/td&gt;
&lt;td&gt;88.3&lt;/td&gt;
&lt;td&gt;87.6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;REST API design&lt;/td&gt;
&lt;td&gt;88.3&lt;/td&gt;
&lt;td&gt;88.2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GGUF parser in Go&lt;/td&gt;
&lt;td&gt;88.3&lt;/td&gt;
&lt;td&gt;88.2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Parallelism explainer&lt;/td&gt;
&lt;td&gt;88.3&lt;/td&gt;
&lt;td&gt;88.1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Benchmark script&lt;/td&gt;
&lt;td&gt;88.2&lt;/td&gt;
&lt;td&gt;88.2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Helm chart design&lt;/td&gt;
&lt;td&gt;88.1&lt;/td&gt;
&lt;td&gt;88.2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Median&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;88.3&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;88.2&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Zero improvement. The 419 tok/s "speedup" was the n-gram cache memorizing repeated output patterns. With diverse prompts, there's nothing useful to cache.&lt;/p&gt;

&lt;h2&gt;
  
  
  Same story on the dense model
&lt;/h2&gt;

&lt;p&gt;Qwen3-32B showed the same pattern. 20.4 tok/s baseline, 20.6 tok/s with ngram-simple. Within measurement noise.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Baseline&lt;/th&gt;
&lt;th&gt;+ ngram-simple&lt;/th&gt;
&lt;th&gt;+ ngram-mod&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Gemma 4 26B&lt;/td&gt;
&lt;td&gt;MoE&lt;/td&gt;
&lt;td&gt;88.3&lt;/td&gt;
&lt;td&gt;87.2 (-1.2%)&lt;/td&gt;
&lt;td&gt;88.2 (0%)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen3-32B&lt;/td&gt;
&lt;td&gt;Dense&lt;/td&gt;
&lt;td&gt;20.4&lt;/td&gt;
&lt;td&gt;20.6 (+1%)&lt;/td&gt;
&lt;td&gt;not tested&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Why it doesn't help on these GPUs
&lt;/h2&gt;

&lt;p&gt;The bottleneck on RTX 5060 Ti is memory bandwidth, not compute. Every token requires reading model weights from VRAM. Speculative decoding tries to batch multiple verification steps together, but when you're already saturating the memory bus during single-token generation, there's not enough idle compute for the speculative verification to pay for itself.&lt;/p&gt;

&lt;p&gt;This is different from high-end datacenter GPUs (A100, H100) where the compute-to-memory bandwidth ratio is much higher. An H100 has roughly 3,350 GB/s memory bandwidth but nearly 2,000 TFLOPS of FP16 compute. That ratio means there's genuine idle compute at small batch sizes that speculative decoding can exploit. Consumer GPUs don't have that same headroom.&lt;/p&gt;
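
&lt;p&gt;Expressed as a ratio, using the H100 numbers above (a sketch of the intuition, not a proper roofline analysis):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# FLOPs available per byte streamed from HBM
awk 'BEGIN { printf "H100: ~%.0f FLOPs per byte\n", 2000e12 / 3350e9 }'
# ~600 FLOPs per byte of weights read means verifying a small batch of drafted
# tokens is close to free. On a card with a lower compute-to-bandwidth ratio,
# that verification work competes with the memory bus the decode loop is already saturating.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
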

&lt;p&gt;For MoE models specifically, there's an additional wrinkle. Each speculative token in a verification batch may activate different experts, which means more expert weight blocks need to be read. This reduces the batching advantage that speculative decoding relies on in dense models, where weight reads stay roughly constant regardless of batch size.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Caveat:&lt;/strong&gt; there are scenarios where n-gram spec decoding can help even on consumer hardware. If your model is partially CPU-offloaded (doesn't fit in VRAM), the PCIe bandwidth bottleneck is severe enough that speculative batching can provide real gains. And for highly repetitive or templated outputs (think structured JSON, boilerplate code), the n-gram cache hit rate goes way up. My testing focused on single-user inference with fully VRAM-resident models and diverse prompts.&lt;/p&gt;

&lt;h2&gt;
  
  
  What about EAGLE-3?
&lt;/h2&gt;

&lt;p&gt;I originally wanted to test EAGLE-3, which uses a trained draft head instead of n-gram lookup. Three problems:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;No EAGLE-3 draft model exists for Gemma 4 (no one has trained one)&lt;/li&gt;
&lt;li&gt;The llama.cpp EAGLE-3 PR (#18039) is still open and in draft as of April 5, 2026&lt;/li&gt;
&lt;li&gt;The PR's own benchmarks show MoE models getting roughly 0.89-1.06x on certain prompts, with some actually slower due to the expert activation overhead during batch verification&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Even with a trained draft head, the fundamental bandwidth constraint on consumer GPUs would remain.&lt;/p&gt;

&lt;h2&gt;
  
  
  What actually helps on consumer GPUs
&lt;/h2&gt;

&lt;p&gt;If you're running local LLMs on consumer hardware, here's what actually moves the needle:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Flash attention&lt;/strong&gt;: Already standard, significant memory savings&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;KV cache quantization&lt;/strong&gt;: q4_0 or q8_0 reduces cache memory pressure without meaningful quality loss&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MoE over dense&lt;/strong&gt;: Gemma 4 activates ~4B parameters per token vs Qwen3-32B's 32B. That's the primary driver of the throughput difference, though MoE routing overhead means the speedup isn't a clean 8x ratio.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-GPU split&lt;/strong&gt;: Doubles your available memory bandwidth, which is the actual bottleneck&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context size tuning&lt;/strong&gt;: Smaller context = less KV cache = more VRAM headroom&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The benchmarking lesson
&lt;/h2&gt;

&lt;p&gt;The biggest takeaway wasn't about speculative decoding. It was about benchmark methodology.&lt;/p&gt;

&lt;p&gt;If I'd only tested with repeated prompts, I would have reported a 4.75x speedup and been completely wrong. The n-gram cache is doing something real, but only in a narrow scenario where outputs are highly repetitive or templated. For interactive chat, coding assistance, or any workload with diverse inputs, it provides no benefit on this hardware.&lt;/p&gt;

&lt;p&gt;Be skeptical of speculative decoding benchmarks that don't disclose their prompt diversity. And if you see someone reporting huge n-gram gains, check if they're running the same prompt over and over.&lt;/p&gt;
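
&lt;p&gt;If you want to reproduce the diverse-prompt run, a minimal sketch of the loop looks like this. Endpoint, model name, and the prompt list are placeholders for whatever your deployment exposes; throughput comes from the OpenAI-style &lt;code&gt;usage.completion_tokens&lt;/code&gt; field and wall-clock time:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;#!/usr/bin/env bash
# One pass per *distinct* prompt, tok/s per request. No repeated prompts.
ENDPOINT="${ENDPOINT:-http://localhost:8080/v1/chat/completions}"
prompts=(
  "Implement a binary search tree in Python with insert and delete."
  "Explain how a Kubernetes operator reconcile loop works."
  "Write a bash script that alerts when GPU memory passes 90 percent."
  "Design a REST API for a todo service."
)
for p in "${prompts[@]}"; do
  start=$(date +%s.%N)
  resp=$(curl -s "$ENDPOINT" -H "Content-Type: application/json" \
    -d "{\"model\":\"default\",\"messages\":[{\"role\":\"user\",\"content\":\"$p\"}],\"max_tokens\":512}")
  end=$(date +%s.%N)
  tokens=$(echo "$resp" | jq '.usage.completion_tokens')
  awk -v t="$tokens" -v s="$start" -v e="$end" 'BEGIN { printf "%6.1f tok/s\n", t / (e - s) }'
done
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
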

&lt;h2&gt;
  
  
  Try it yourself
&lt;/h2&gt;

&lt;p&gt;Everything I tested runs on Kubernetes via LLMKube. The InferenceService CRD's &lt;code&gt;extraArgs&lt;/code&gt; field makes it trivial to swap between configs without touching your deployment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;inference.llmkube.dev/v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;InferenceService&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gemma4-spec-bench&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;modelRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gemma4-26b-a4b&lt;/span&gt;
  &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ghcr.io/ggml-org/llama.cpp:server-cuda&lt;/span&gt;
  &lt;span class="na"&gt;contextSize&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8192&lt;/span&gt;
  &lt;span class="na"&gt;flashAttention&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;extraArgs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--spec-type"&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ngram-mod"&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--draft-max"&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;64"&lt;/span&gt;
  &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;gpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;LLMKube is open source, Apache 2.0: &lt;a href="https://github.com/defilantech/llmkube" rel="noopener noreferrer"&gt;github.com/defilantech/llmkube&lt;/a&gt;&lt;/p&gt;

</description>
      <category>llm</category>
      <category>kubernetes</category>
      <category>gpu</category>
      <category>ai</category>
    </item>
    <item>
      <title>Google Released Gemma 4 Yesterday. I Had It Fixing Real Bugs by Lunch.</title>
      <dc:creator>Christopher Maher</dc:creator>
      <pubDate>Fri, 03 Apr 2026 16:34:48 +0000</pubDate>
      <link>https://dev.to/defilan/google-released-gemma-4-yesterday-i-had-it-fixing-real-bugs-by-lunch-cp0</link>
      <guid>https://dev.to/defilan/google-released-gemma-4-yesterday-i-had-it-fixing-real-bugs-by-lunch-cp0</guid>
      <description>&lt;p&gt;Google released Gemma 4 yesterday. By the time I went to bed, I had it deployed on my home lab, running real coding benchmarks at 96 tokens per second.&lt;/p&gt;

&lt;p&gt;The catch: no official llama.cpp image supported the &lt;code&gt;gemma4&lt;/code&gt; architecture yet. The stock CUDA images crash with &lt;code&gt;unknown model architecture: 'gemma4'&lt;/code&gt;. So I built it from source, on the same Kubernetes cluster that serves inference.&lt;/p&gt;

&lt;p&gt;This post is about what it took to go from "model dropped" to "running in production" in about two hours on consumer hardware.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Setup
&lt;/h2&gt;

&lt;p&gt;My home inference server (I call it ShadowStack):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;2x NVIDIA RTX 5060 Ti (16GB each, 32GB total VRAM)&lt;/li&gt;
&lt;li&gt;AMD Ryzen 9 7900X, 64GB DDR5&lt;/li&gt;
&lt;li&gt;Ubuntu 24.04, MicroK8s&lt;/li&gt;
&lt;li&gt;NVIDIA driver 590.48.01 (CUDA 13.1)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Everything is managed by &lt;a href="https://github.com/defilantech/llmkube" rel="noopener noreferrer"&gt;LLMKube&lt;/a&gt;, a Kubernetes operator I built for running llama.cpp inference. One CRD to define the model, one CRD to define the service, the operator handles the rest.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1: The Architecture Problem
&lt;/h2&gt;

&lt;p&gt;First attempt, I tried the &lt;code&gt;server-cuda13&lt;/code&gt; image (CUDA 13 build of llama.cpp):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;llama_model_load: error loading model: error loading model architecture: unknown model architecture: 'gemma4'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Gemma 4 architecture hadn't shipped in any released llama.cpp build yet. The support was only in HEAD.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 2: Build From HEAD On-Cluster
&lt;/h2&gt;

&lt;p&gt;I have a Kaniko build pipeline on the cluster from a previous project (TurboQuant benchmarking). I wrote a Dockerfile that clones llama.cpp HEAD and builds with CUDA targeting SM 86 (Ampere) and SM 120 (Blackwell):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;nvidia/cuda:12.8.0-devel-ubuntu24.04&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;builder&lt;/span&gt;

&lt;span class="k"&gt;RUN &lt;/span&gt;git clone &lt;span class="nt"&gt;--depth&lt;/span&gt; 1 https://github.com/ggml-org/llama.cpp.git /build/llama.cpp

&lt;span class="k"&gt;WORKDIR&lt;/span&gt;&lt;span class="s"&gt; /build/llama.cpp&lt;/span&gt;

&lt;span class="k"&gt;RUN &lt;/span&gt;&lt;span class="nb"&gt;ln&lt;/span&gt; &lt;span class="nt"&gt;-s&lt;/span&gt; /usr/local/cuda/lib64/stubs/libcuda.so &lt;span class="se"&gt;\
&lt;/span&gt;          /usr/local/cuda/lib64/stubs/libcuda.so.1
&lt;span class="k"&gt;ENV&lt;/span&gt;&lt;span class="s"&gt; LIBRARY_PATH=/usr/local/cuda/lib64/stubs:${LIBRARY_PATH}&lt;/span&gt;

&lt;span class="k"&gt;RUN &lt;/span&gt;cmake &lt;span class="nt"&gt;-B&lt;/span&gt; build &lt;span class="se"&gt;\
&lt;/span&gt;    &lt;span class="nt"&gt;-DGGML_CUDA&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ON &lt;span class="se"&gt;\
&lt;/span&gt;    &lt;span class="nt"&gt;-DCMAKE_CUDA_ARCHITECTURES&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"86;120"&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;    &lt;span class="nt"&gt;-DCMAKE_BUILD_TYPE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;Release &lt;span class="se"&gt;\
&lt;/span&gt;    &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; cmake &lt;span class="nt"&gt;--build&lt;/span&gt; build &lt;span class="nt"&gt;--target&lt;/span&gt; llama-server &lt;span class="nt"&gt;-j&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;nproc&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A Kaniko Job on the cluster built this in about 15 minutes and pushed it to my local container registry. The same cluster that runs inference also builds its own inference server. No external CI needed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 3: Deploy
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;llmkube deploy gemma4-26b &lt;span class="nt"&gt;--gpu&lt;/span&gt; &lt;span class="nt"&gt;--accelerator&lt;/span&gt; cuda &lt;span class="nt"&gt;--gpu-count&lt;/span&gt; 2 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--source&lt;/span&gt; https://huggingface.co/Trilogix1/gemma-4-26B-A4B-it-GGUF/resolve/main/gemma-4-26B-A4B-it-Q4_K_M.gguf &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--image&lt;/span&gt; registry.defilan.net/llama-server-latest:gemma4 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--flash-attn&lt;/span&gt; &lt;span class="nt"&gt;--jinja&lt;/span&gt; &lt;span class="nt"&gt;--context&lt;/span&gt; 32768
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The model is 15.6 GB at Q4_K_M. With both GPUs, that leaves about 16 GB for KV cache. Plenty for 32K context.&lt;/p&gt;

&lt;p&gt;The operator downloaded the model, created the Deployment with the right GPU flags, set up health probes, and exposed an OpenAI-compatible endpoint. From the deploy command to the first inference request was about 3 minutes (mostly model download time).&lt;/p&gt;

&lt;h2&gt;
  
  
  The Numbers
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Single Request
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Generation&lt;/td&gt;
&lt;td&gt;96 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Prompt processing&lt;/td&gt;
&lt;td&gt;128 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Model size (Q4_K_M)&lt;/td&gt;
&lt;td&gt;15.6 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Active parameters per token&lt;/td&gt;
&lt;td&gt;4B (MoE)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Under Load (4 concurrent workers, 2 minutes)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Aggregate throughput&lt;/td&gt;
&lt;td&gt;170 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total requests&lt;/td&gt;
&lt;td&gt;110&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Error rate&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;P50 latency&lt;/td&gt;
&lt;td&gt;~2s per request&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For context, the generic benchmarks floating around say Gemma 4 26B-A4B "exceeds 40 tok/s on consumer hardware." We're doing 96 tok/s on a single request and 170 tok/s aggregate under concurrent load. The dual-GPU split and the MoE architecture (only 4B parameters active per token) make this model surprisingly fast.&lt;/p&gt;

&lt;h2&gt;
  
  
  Real Coding Benchmarks
&lt;/h2&gt;

&lt;p&gt;I didn't just run "hello world" tests. I fed it actual bug reports from my own project and asked it to generate fixes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Bug: GPU Rolling Update Deadlock
&lt;/h3&gt;

&lt;p&gt;The issue: Kubernetes rolling updates deadlock on GPU workloads because the new pod can't schedule (old pod holds GPUs) and the old pod won't terminate (waiting for new pod to be Ready).&lt;/p&gt;

&lt;p&gt;Gemma 4's response: correctly identified that GPU workloads should use &lt;code&gt;Recreate&lt;/code&gt; strategy instead of &lt;code&gt;RollingUpdate&lt;/code&gt;, with a conditional check on GPU count. Showed the chain-of-thought reasoning, considered edge cases, and verified against the pattern before outputting.&lt;/p&gt;

&lt;p&gt;Time: 10.6 seconds for a 1024-token response including the full reasoning chain.&lt;/p&gt;

&lt;h3&gt;
  
  
  Bug: Stale Endpoints After Deletion
&lt;/h3&gt;

&lt;p&gt;The issue: deleting an InferenceService leaves orphaned Kubernetes Endpoints.&lt;/p&gt;

&lt;p&gt;Gemma 4's response: generated a complete &lt;code&gt;UnregisterEndpoint&lt;/code&gt; method with DNS name sanitization, Service and Endpoints deletion, &lt;code&gt;NotFound&lt;/code&gt; error handling, and logging. Production-quality Go code on the first try.&lt;/p&gt;

&lt;p&gt;Time: 11.1 seconds.&lt;/p&gt;

&lt;h3&gt;
  
  
  Code Generation: Ginkgo BDD Tests
&lt;/h3&gt;

&lt;p&gt;I asked it to write tests following an existing pattern in the codebase. It generated 4 correct test cases with &lt;code&gt;BeforeEach&lt;/code&gt; setup, proper assertions, and the right Gomega matchers. Used &lt;code&gt;ContainElements&lt;/code&gt; for present checks and &lt;code&gt;NotTo(ContainElement())&lt;/code&gt; for absent checks, matching the exact conventions from the rest of the test suite.&lt;/p&gt;

&lt;p&gt;Time: 12.3 seconds.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Actually Means
&lt;/h2&gt;

&lt;p&gt;I'm not claiming Gemma 4 replaces Claude or GPT-4. It doesn't. The reasoning is shallower on complex multi-step problems, and it occasionally cuts off mid-response at the token limit.&lt;/p&gt;

&lt;p&gt;What I am claiming: the gap between "Google releases a new model" and "it's running on your hardware fixing real bugs" has shrunk to hours, not weeks. The pieces are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;GGUF quantization appears on HuggingFace within hours of a model release&lt;/li&gt;
&lt;li&gt;llama.cpp HEAD usually has architecture support on day one (the tokenizer and template fixes were already committed)&lt;/li&gt;
&lt;li&gt;Kaniko or similar tools let you build from source on-cluster without a separate CI pipeline&lt;/li&gt;
&lt;li&gt;A Kubernetes operator (in my case, LLMKube) lets you deploy with one command and get health checks, metrics, and an OpenAI-compatible API&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is the same workflow regardless of whether the model is Gemma 4, Qwen3.5, Llama, or whatever ships next week. The infrastructure is model-agnostic.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Hardware Math
&lt;/h2&gt;

&lt;p&gt;This entire setup cost about $2,400:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;2x RTX 5060 Ti: ~$800&lt;/li&gt;
&lt;li&gt;Ryzen 9 7900X + motherboard + RAM + SSD + case + PSU: ~$1,600&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Running 24/7, the system draws about 50-60W idle and 500-600W under full inference load. At $0.12/kWh, that's roughly $30-50/month in electricity for unlimited inference.&lt;/p&gt;
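
&lt;p&gt;The math behind that range, roughly. The duty cycle is the variable: the low end is a box that spends part of the day loaded, the high end is pegged around the clock:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# monthly electricity = average draw (kW) x 730 h x $/kWh
awk 'BEGIN {
  rate = 0.12; hours = 730
  printf "idle 24/7               (0.055 kW): $%.0f/month\n", 0.055 * hours * rate
  printf "loaded ~60%% of the time (0.35 kW):  $%.0f/month\n", 0.35 * hours * rate
  printf "pegged 24/7             (0.55 kW):  $%.0f/month\n", 0.55 * hours * rate
}'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
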

&lt;p&gt;Compare to API costs: at OpenAI's pricing for a comparable model, 110 requests in 2 minutes would cost roughly $5-10. Scale that to continuous use and the hardware pays for itself in a month or two.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;p&gt;LLMKube is open source (Apache 2.0): &lt;a href="https://github.com/defilantech/llmkube" rel="noopener noreferrer"&gt;github.com/defilantech/llmkube&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you have a GPU and a Kubernetes cluster (even a single-node K3s or MicroK8s), you can deploy any GGUF model with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;helm &lt;span class="nb"&gt;install &lt;/span&gt;llmkube llmkube/llmkube
llmkube deploy llama-3.1-8b &lt;span class="nt"&gt;--gpu&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For Gemma 4 specifically, you'll need a custom llama.cpp image until the official builds ship with &lt;code&gt;gemma4&lt;/code&gt; architecture support. The Dockerfile above works.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Benchmarks run on April 2, 2026 on ShadowStack (2x RTX 5060 Ti, 32GB VRAM, Blackwell SM 12.0, CUDA 13.1, driver 590.48.01). Gemma 4 26B-A4B-it Q4_K_M via llama.cpp built from HEAD commit f851fa5a.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>llm</category>
      <category>homelab</category>
      <category>ai</category>
    </item>
    <item>
      <title>I Tested TurboQuant KV Cache Compression on Consumer GPUs. Here's What Actually Happened.</title>
      <dc:creator>Christopher Maher</dc:creator>
      <pubDate>Mon, 30 Mar 2026 15:12:24 +0000</pubDate>
      <link>https://dev.to/defilan/i-tested-turboquant-kv-cache-compression-on-consumer-gpus-heres-what-actually-happened-beg</link>
      <guid>https://dev.to/defilan/i-tested-turboquant-kv-cache-compression-on-consumer-gpus-heres-what-actually-happened-beg</guid>
      <description>&lt;p&gt;I spent this weekend testing TurboQuant KV cache compression on my home lab Kubernetes cluster. The paper (ICLR 2026, Google Research) promises up to 4.57x compression of the KV cache with minimal quality loss. That sounded like exactly what I needed. I'm always bumping up against VRAM limits trying to run larger models or longer contexts on consumer hardware.&lt;/p&gt;

&lt;p&gt;Here's what I found: it works, but there are real tradeoffs nobody's talking about yet.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem: KV Cache Eats Your VRAM
&lt;/h2&gt;

&lt;p&gt;If you've run LLMs locally, you know the drill. You load a 32B model that fits in 20GB of VRAM, set the context to 32K, and suddenly you're at 28GB. The model weights didn't change. It's the KV cache growing linearly with context length.&lt;/p&gt;

&lt;p&gt;For every token in the context, the model stores key and value vectors for every attention head at every layer. In FP16, that adds up fast. A 32B model at 32K context can burn through 8+ GB of VRAM just for the KV cache.&lt;/p&gt;
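
&lt;p&gt;The arithmetic is worth internalizing. A sketch with Llama-3.1-8B-shaped numbers (32 layers, 8 KV heads of dimension 128 thanks to GQA); bigger models scale up with layer and head counts:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# per-token KV cache = 2 (K and V) x layers x kv_heads x head_dim x bytes/element
layers=32; kv_heads=8; head_dim=128; bytes=2   # f16
per_token=$((2 * layers * kv_heads * head_dim * bytes))
echo "per token:   $((per_token / 1024)) KiB"                          # 128 KiB
echo "at 32K ctx:  $((per_token * 32768 / 1024 / 1024 / 1024)) GiB"    # 4 GiB
echo "at 131K ctx: $((per_token * 131072 / 1024 / 1024 / 1024)) GiB"   # 16 GiB
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
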

&lt;p&gt;TurboQuant's approach is to apply a Walsh-Hadamard Transform (WHT) rotation to KV cache vectors before quantizing them to 3 bits. The rotation "gaussianizes" the distribution, making scalar quantization much more effective. The result is TQ3_0: roughly 3 bits per element (closer to 3.5 once per-block scales are counted) instead of 16, which is where the theoretical 4.57x compression figure comes from.&lt;/p&gt;

&lt;h2&gt;
  
  
  My Setup
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Hardware&lt;/strong&gt;: ShadowStack, my home inference server&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;2x NVIDIA RTX 5060 Ti (16GB GDDR7 each, 32GB total)&lt;/li&gt;
&lt;li&gt;AMD Ryzen 9 7900X, 64GB DDR5&lt;/li&gt;
&lt;li&gt;Ubuntu 24.04, MicroK8s&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Software&lt;/strong&gt;: &lt;a href="https://github.com/defilantech/llmkube" rel="noopener noreferrer"&gt;LLMKube&lt;/a&gt;, an open-source Kubernetes operator I built for managing llama.cpp inference workloads. It handles model downloads, GPU scheduling, multi-GPU sharding, health probes, and Prometheus metrics through Kubernetes CRDs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TurboQuant build&lt;/strong&gt;: I used the &lt;a href="https://github.com/animehacker/llama-turboquant" rel="noopener noreferrer"&gt;animehacker/llama-turboquant&lt;/a&gt; fork, which has working CUDA kernels for the WHT-based TQ3_0 type. This is a Stage 1 implementation (no QJL residual correction from the full paper). I built it with Kaniko directly on my cluster targeting SM 86 (Ampere) and SM 120 (Blackwell).&lt;/p&gt;

&lt;h3&gt;
  
  
  The Wrapper Entrypoint Pattern
&lt;/h3&gt;

&lt;p&gt;LLMKube's InferenceService CRD doesn't have a &lt;code&gt;--cache-type&lt;/code&gt; flag yet, so I built a custom Docker image with a wrapper entrypoint that injects the TurboQuant flags transparently:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;
&lt;span class="c"&gt;# entrypoint.sh - passes through all LLMKube args, appends TQ flags&lt;/span&gt;
&lt;span class="nv"&gt;TQ_CACHE_TYPE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;TQ_CACHE_TYPE&lt;/span&gt;&lt;span class="k"&gt;:-&lt;/span&gt;&lt;span class="nv"&gt;tq3_0&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="nv"&gt;TQ_ENABLED&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;TQ_ENABLED&lt;/span&gt;&lt;span class="k"&gt;:-&lt;/span&gt;&lt;span class="nv"&gt;true&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;TQ_ENABLED&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"true"&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
    &lt;/span&gt;&lt;span class="nb"&gt;exec &lt;/span&gt;llama-server &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$@&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;--cache-type-k&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;TQ_CACHE_TYPE&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;--cache-type-v&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;TQ_CACHE_TYPE&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="k"&gt;else
    &lt;/span&gt;&lt;span class="nb"&gt;exec &lt;/span&gt;llama-server &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$@&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="k"&gt;fi&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Using &lt;code&gt;exec&lt;/code&gt; is important. It makes llama-server PID 1 so Kubernetes health probes and signal handling work correctly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Benchmark Methodology
&lt;/h2&gt;

&lt;p&gt;Apples-to-apples. Same model weights, same context size, same concurrency. The only variable was the KV cache type (FP16 vs TQ3_0). Flash attention was enabled for all tests.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Throughput test&lt;/strong&gt;: 5 minutes of sustained load at 4 concurrent requests, 8K context.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Context sweep&lt;/strong&gt;: Deploy at each context size (4K through 131K), run a 2-minute stress test, record VRAM via nvidia-smi.&lt;/p&gt;
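
&lt;p&gt;The sweep itself is nothing fancy. A simplified sketch (the deploy flags follow the llmkube CLI, and the &lt;code&gt;sleep&lt;/code&gt; stands in for the 2-minute stress run):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# redeploy at each context size, load it, record total VRAM across both GPUs
for ctx in 4096 8192 16384 32768 65536 98304 131072; do
  llmkube deploy llama-3.1-8b --gpu --context "$ctx"   # real runs also pin image and model source
  sleep 120                                            # stand-in for the stress test
  vram=$(nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits | paste -sd+ - | bc)
  echo "ctx=$ctx total_vram=${vram} MiB"
done
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
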

&lt;p&gt;&lt;strong&gt;Models tested&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Llama 3.1 8B (Q5_K_M), small model with lots of headroom&lt;/li&gt;
&lt;li&gt;Qwen 2.5 14B (Q5_K_M), medium model that fills one GPU&lt;/li&gt;
&lt;li&gt;Qwen 2.5 32B (Q4_K_M), large model that requires both GPUs&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Results: Throughput
&lt;/h2&gt;

&lt;p&gt;This is where TurboQuant hurts.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Variant&lt;/th&gt;
&lt;th&gt;Gen tok/s&lt;/th&gt;
&lt;th&gt;Prompt tok/s&lt;/th&gt;
&lt;th&gt;Requests (5min)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Llama 8B&lt;/td&gt;
&lt;td&gt;FP16 cache&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;50.0&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;565.5&lt;/td&gt;
&lt;td&gt;771&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Llama 8B&lt;/td&gt;
&lt;td&gt;TQ3_0 cache&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;8.4&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;93.4&lt;/td&gt;
&lt;td&gt;74&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen 14B&lt;/td&gt;
&lt;td&gt;FP16 cache&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;28.1&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;122.0&lt;/td&gt;
&lt;td&gt;128&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen 14B&lt;/td&gt;
&lt;td&gt;TQ3_0 cache&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;5.3&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;63.4&lt;/td&gt;
&lt;td&gt;53&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen 32B&lt;/td&gt;
&lt;td&gt;FP16 cache&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;14.3&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;133.3&lt;/td&gt;
&lt;td&gt;108&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen 32B&lt;/td&gt;
&lt;td&gt;TQ3_0 cache&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;5.5&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;85.5&lt;/td&gt;
&lt;td&gt;53&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Generation throughput dropped 5-6x across all models. Prompt processing dropped roughly 2-6x depending on model size. This is consistent with what the PR benchmarks showed on CPU, but I expected Blackwell's tensor cores to help more than they did. The animehacker CUDA kernels were optimized for Ampere (SM 86), not Blackwell (SM 120), so there's likely performance left on the table.&lt;/p&gt;

&lt;h2&gt;
  
  
  Results: VRAM Usage
&lt;/h2&gt;

&lt;p&gt;This is where it gets interesting.&lt;/p&gt;

&lt;h3&gt;
  
  
  Llama 3.1 8B, Context Sweep
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Context&lt;/th&gt;
&lt;th&gt;FP16 VRAM (total)&lt;/th&gt;
&lt;th&gt;TQ3_0 VRAM (total)&lt;/th&gt;
&lt;th&gt;Savings&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;4K&lt;/td&gt;
&lt;td&gt;6.4 GB&lt;/td&gt;
&lt;td&gt;10.1 GB&lt;/td&gt;
&lt;td&gt;-58% (worse)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8K&lt;/td&gt;
&lt;td&gt;6.9 GB&lt;/td&gt;
&lt;td&gt;14.3 GB&lt;/td&gt;
&lt;td&gt;-107% (worse)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;16K&lt;/td&gt;
&lt;td&gt;8.0 GB&lt;/td&gt;
&lt;td&gt;22.8 GB&lt;/td&gt;
&lt;td&gt;-185% (worse)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;32K&lt;/td&gt;
&lt;td&gt;10.1 GB&lt;/td&gt;
&lt;td&gt;6.9 GB&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;31% better&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;65K&lt;/td&gt;
&lt;td&gt;14.3 GB&lt;/td&gt;
&lt;td&gt;8.4 GB&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;41% better&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;98K&lt;/td&gt;
&lt;td&gt;18.5 GB&lt;/td&gt;
&lt;td&gt;9.8 GB&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;47% better&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;131K&lt;/td&gt;
&lt;td&gt;22.7 GB&lt;/td&gt;
&lt;td&gt;11.2 GB&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;51% better&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Qwen 2.5 14B, Context Sweep
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Context&lt;/th&gt;
&lt;th&gt;FP16 VRAM (total)&lt;/th&gt;
&lt;th&gt;TQ3_0 VRAM (total)&lt;/th&gt;
&lt;th&gt;Savings&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;4K&lt;/td&gt;
&lt;td&gt;11.1 GB&lt;/td&gt;
&lt;td&gt;16.7 GB&lt;/td&gt;
&lt;td&gt;-50% (worse)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8K&lt;/td&gt;
&lt;td&gt;11.9 GB&lt;/td&gt;
&lt;td&gt;23.0 GB&lt;/td&gt;
&lt;td&gt;-93% (worse)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;16K&lt;/td&gt;
&lt;td&gt;13.4 GB&lt;/td&gt;
&lt;td&gt;11.0 GB&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;18% better&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;32K&lt;/td&gt;
&lt;td&gt;16.6 GB&lt;/td&gt;
&lt;td&gt;11.8 GB&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;29% better&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;65K&lt;/td&gt;
&lt;td&gt;22.8 GB&lt;/td&gt;
&lt;td&gt;13.7 GB&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;40% better&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Qwen 2.5 32B, Context Sweep
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Context&lt;/th&gt;
&lt;th&gt;FP16 VRAM (total)&lt;/th&gt;
&lt;th&gt;TQ3_0 VRAM (total)&lt;/th&gt;
&lt;th&gt;Savings&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;2K&lt;/td&gt;
&lt;td&gt;19.9 GB&lt;/td&gt;
&lt;td&gt;23.7 GB&lt;/td&gt;
&lt;td&gt;-19% (worse)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4K&lt;/td&gt;
&lt;td&gt;20.5 GB&lt;/td&gt;
&lt;td&gt;27.9 GB&lt;/td&gt;
&lt;td&gt;-36% (worse)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8K&lt;/td&gt;
&lt;td&gt;21.6 GB&lt;/td&gt;
&lt;td&gt;19.8 GB&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;8% better&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;16K&lt;/td&gt;
&lt;td&gt;23.7 GB&lt;/td&gt;
&lt;td&gt;20.3 GB&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;14% better&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;32K&lt;/td&gt;
&lt;td&gt;28.0 GB&lt;/td&gt;
&lt;td&gt;21.4 GB&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;24% better&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  The Surprise: TQ Uses MORE VRAM at Small Contexts
&lt;/h2&gt;

&lt;p&gt;I wasn't expecting this. Below each model's crossover point, TQ3_0 consistently used more VRAM than the FP16 baseline, sometimes dramatically more. Llama 8B at 16K context used 22.8 GB with TQ vs 8.0 GB with FP16.&lt;/p&gt;

&lt;p&gt;My theory: the WHT rotation machinery has a fixed overhead (lookup tables, rotation matrices, codebooks) that gets allocated regardless of context size. When the KV cache is small, this overhead dwarfs the compression savings. The crossover point where TQ starts winning varies by model:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Llama 8B: around 32K context&lt;/li&gt;
&lt;li&gt;Qwen 14B: around 16K context&lt;/li&gt;
&lt;li&gt;Qwen 32B: around 8K context&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Larger models cross over earlier because their per-token KV cache is larger (more layers, more attention heads), so the compression pays off sooner.&lt;/p&gt;

&lt;h2&gt;
  
  
  When Is TurboQuant Worth It?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Use TQ3_0 when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You need 32K+ context on consumer GPUs&lt;/li&gt;
&lt;li&gt;You're hitting VRAM limits and can't afford more hardware&lt;/li&gt;
&lt;li&gt;Throughput isn't critical (batch processing, RAG with long documents, analysis tasks)&lt;/li&gt;
&lt;li&gt;You're running a large model (32B+) where the crossover point is lower&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Don't use TQ3_0 when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Context is under 16K (you'll actually use more VRAM)&lt;/li&gt;
&lt;li&gt;You need interactive throughput (the 5x penalty makes chat unusable)&lt;/li&gt;
&lt;li&gt;You're on Blackwell and want optimal performance (wait for SM 120-optimized kernels)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The sweet spot in my testing was Qwen 32B at 32K context. Baseline uses 28 GB, which is dangerously close to my 32 GB ceiling. One concurrent request could OOM. TQ drops it to 21.4 GB, leaving over 10 GB of headroom for parallel slots or longer contexts.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;The throughput penalty is the main blocker. The animehacker CUDA kernels use a fused MMVQ approach that avoids dequantization during attention, but the WHT butterfly transform still runs 160 integer ops per element in registers. On Blackwell with its new SM architecture, these kernels likely aren't hitting optimal occupancy.&lt;/p&gt;

&lt;p&gt;Things I'm watching:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/ggml-org/llama.cpp/pull/21089" rel="noopener noreferrer"&gt;PR #21089&lt;/a&gt; on ggml-org/llama.cpp, the only open upstream PR for TurboQuant (CPU-only for now)&lt;/li&gt;
&lt;li&gt;Whether &lt;code&gt;ggerganov&lt;/code&gt; engages with it. If he requests changes rather than closing, it'll eventually land.&lt;/li&gt;
&lt;li&gt;SM 120-optimized CUDA kernels. Blackwell has new instructions that could close the throughput gap.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For LLMKube, I'm planning to add &lt;code&gt;cacheTypeK&lt;/code&gt; and &lt;code&gt;cacheTypeV&lt;/code&gt; fields to the InferenceService CRD so users can configure this without the wrapper entrypoint hack. Also an &lt;code&gt;extraArgs&lt;/code&gt; escape hatch for any llama.cpp flag we don't have a typed field for yet.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It Yourself
&lt;/h2&gt;

&lt;p&gt;All the benchmarking infrastructure is in the &lt;a href="https://github.com/defilantech/llmkube" rel="noopener noreferrer"&gt;LLMKube&lt;/a&gt; repo. The operator is open source (Apache 2.0) and handles the full lifecycle: model downloads, GPU scheduling, multi-GPU sharding, health probes, and Prometheus metrics. If you have a GPU cluster and want to test TurboQuant:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Build the custom image from &lt;code&gt;animehacker/llama-turboquant&lt;/code&gt; with &lt;code&gt;-DGGML_CUDA=ON&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Set &lt;code&gt;spec.image&lt;/code&gt; on your InferenceService to point at it&lt;/li&gt;
&lt;li&gt;The wrapper entrypoint handles the rest&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you run these benchmarks on different hardware (A100, RTX 3090, etc.), I'd love to see the numbers. Drop a comment or find me on the &lt;a href="https://discord.gg/5GavYFPBBr" rel="noopener noreferrer"&gt;LLMKube Discord&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Benchmarks run on 2026-03-30 on ShadowStack (2x RTX 5060 Ti, 32GB VRAM, Blackwell SM 12.0, CUDA 13.0).&lt;/em&gt;&lt;/p&gt;

</description>
      <category>llm</category>
      <category>kubernetes</category>
      <category>gpu</category>
      <category>ai</category>
    </item>
    <item>
      <title>The $0 Problem: Why Every Tool Says Your On-Prem Inference is Free</title>
      <dc:creator>Christopher Maher</dc:creator>
      <pubDate>Mon, 23 Mar 2026 15:49:13 +0000</pubDate>
      <link>https://dev.to/defilan/the-0-problem-why-every-tool-says-your-on-prem-inference-is-free-3mcb</link>
      <guid>https://dev.to/defilan/the-0-problem-why-every-tool-says-your-on-prem-inference-is-free-3mcb</guid>
      <description>&lt;p&gt;If you run LLMs on your own hardware, every cost tracking tool in the ecosystem has the same answer for what it costs: &lt;strong&gt;$0&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;OpenCost sees your GPU pods but has no concept of tokens. LiteLLM tracks tokens per user but hardcodes on-prem cost to zero. Langfuse traces requests but only prices cloud APIs. The FinOps Foundation's own working group explicitly says on-premises AI cost is "outside the scope."&lt;/p&gt;

&lt;p&gt;Meanwhile, your GPUs cost real money. The H100s draw 700 watts each. Your electricity bill is real. The three-year amortization on $280K of hardware is real. But no tool computes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;true cost per token = (hourly hardware amortization + electricity rate x GPU power draw) / tokens per hour
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We built InferCost to fix this.&lt;/p&gt;

&lt;h3&gt;
  
  
  What InferCost does
&lt;/h3&gt;

&lt;p&gt;InferCost is an open-source Kubernetes operator (Apache 2.0) that computes the true cost of running AI inference on your own hardware. It's a single controller pod. No database, no UI to host. It plugs into Prometheus and Grafana you already run.&lt;/p&gt;

&lt;p&gt;You declare your hardware economics in a CRD:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;finops.infercost.ai/v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;CostProfile&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gpu-cluster&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;hardware&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;gpuModel&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;NVIDIA&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;GeForce&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;RTX&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;5060&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Ti"&lt;/span&gt;
    &lt;span class="na"&gt;gpuCount&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
    &lt;span class="na"&gt;purchasePriceUSD&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;960&lt;/span&gt;
    &lt;span class="na"&gt;amortizationYears&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;
  &lt;span class="na"&gt;electricity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;ratePerKWh&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.08&lt;/span&gt;
    &lt;span class="na"&gt;pueFactor&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1.0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;InferCost reads real-time GPU power draw from DCGM, scrapes token counts from your inference engine (llama.cpp, vLLM), does the math, and tells you what your inference actually costs. Per model. Per team. Per token.&lt;/p&gt;

&lt;h3&gt;
  
  
  What we found on real hardware
&lt;/h3&gt;

&lt;p&gt;We deployed InferCost on a homelab running Qwen3-32B on 2x RTX 5060 Ti GPUs. Here are the real numbers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Hourly infrastructure cost&lt;/strong&gt;: $0.053 (amortization + electricity at actual GPU power draw)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost per million tokens&lt;/strong&gt;: $0.41 under sustained load&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monthly projected&lt;/strong&gt;: $38&lt;/li&gt;
&lt;/ul&gt;
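
&lt;p&gt;For the curious, that hourly figure roughly checks out by hand from the CostProfile values, if you assume an average combined draw of about 200 W across the two cards (the draw is my assumption here; InferCost reads the real value from DCGM):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Back-of-the-envelope check of the $0.053/hour figure above.
# The 200 W average combined draw is an assumption for illustration;
# InferCost reads the real draw from DCGM.
amortization_per_hour = 960 / (3 * 365 * 24)   # ~$0.037
electricity_per_hour = 0.200 * 0.08            # kW x $/kWh = $0.016
print(round(amortization_per_hour + electricity_per_hour, 3))   # 0.053
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
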

&lt;p&gt;Then we compared against cloud APIs (verified pricing as of March 2026):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Provider&lt;/th&gt;
&lt;th&gt;Cloud Cost&lt;/th&gt;
&lt;th&gt;On-Prem Cost&lt;/th&gt;
&lt;th&gt;Savings&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Claude Opus 4.6&lt;/td&gt;
&lt;td&gt;$9.82&lt;/td&gt;
&lt;td&gt;$0.62&lt;/td&gt;
&lt;td&gt;94%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-5.4&lt;/td&gt;
&lt;td&gt;$5.83&lt;/td&gt;
&lt;td&gt;$0.62&lt;/td&gt;
&lt;td&gt;89%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemini 2.5 Pro&lt;/td&gt;
&lt;td&gt;$3.84&lt;/td&gt;
&lt;td&gt;$0.62&lt;/td&gt;
&lt;td&gt;84%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-5.4-nano&lt;/td&gt;
&lt;td&gt;$0.41&lt;/td&gt;
&lt;td&gt;$0.62&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Cloud 34% cheaper&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;That last row matters. When the cheapest cloud model is actually cheaper than your hardware, InferCost tells you. The point is not to prove on-prem always wins. The point is to give you the real numbers so you can decide.&lt;/p&gt;

&lt;h3&gt;
  
  
  A note on how we calculate cost
&lt;/h3&gt;

&lt;p&gt;The $28/month on-prem number is your total infrastructure cost: hardware amortization plus electricity, running 24/7. Your GPUs cost money whether or not they're serving requests. The $0.41 per million tokens is the marginal cost during active inference (what each token costs when the system is busy).&lt;/p&gt;

&lt;p&gt;The savings comparison uses total infrastructure cost because that's the honest number. If your GPUs sit idle half the time, that idle time still costs you. This is the same logic as any hardware TCO calculation: you amortize the full purchase price, not just the hours you used it.&lt;/p&gt;

&lt;p&gt;This means your actual savings percentage depends on utilization. At high utilization (GPUs busy most of the day), the savings are dramatic. At low utilization, the math shifts toward cloud APIs for cheap models. InferCost shows you both realities so you can make the right call for each workload.&lt;/p&gt;
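
&lt;p&gt;A quick way to see where the crossover sits for your own workload: divide the fixed monthly cost by the cloud rate you'd otherwise pay. A minimal sketch with a placeholder cloud price:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Breakeven volume: above this many tokens per month the fixed on-prem cost
# wins; below it, the cheap cloud model does. The cloud rate is a placeholder.
monthly_infra_cost = 28.0          # $/month, paid whether GPUs are busy or idle
cloud_price_per_million = 0.50     # $/1M tokens, illustrative budget-model rate

breakeven_millions = monthly_infra_cost / cloud_price_per_million
print(f"on-prem wins above ~{breakeven_millions:.0f}M tokens/month")   # ~56M
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
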

&lt;h3&gt;
  
  
  The CLI
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;brew &lt;span class="nb"&gt;install &lt;/span&gt;defilantech/tap/infercost
&lt;span class="nv"&gt;$ &lt;/span&gt;infercost compare &lt;span class="nt"&gt;--monthly&lt;/span&gt;

PROVIDER    MODEL              CLOUD/MONTH  ON-PREM/MONTH  SAVINGS/MONTH
Anthropic   claude-opus-4-6    &lt;span class="nv"&gt;$409&lt;/span&gt;         &lt;span class="nv"&gt;$28&lt;/span&gt;            &lt;span class="nv"&gt;$381&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;93%&lt;span class="o"&gt;)&lt;/span&gt;
OpenAI      gpt-5.4            &lt;span class="nv"&gt;$242&lt;/span&gt;         &lt;span class="nv"&gt;$28&lt;/span&gt;            &lt;span class="nv"&gt;$214&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;88%&lt;span class="o"&gt;)&lt;/span&gt;
Google      gemini-2.5-pro     &lt;span class="nv"&gt;$159&lt;/span&gt;         &lt;span class="nv"&gt;$28&lt;/span&gt;            &lt;span class="nv"&gt;$131&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;82%&lt;span class="o"&gt;)&lt;/span&gt;
Google      gemini-2.5-flash   &lt;span class="nv"&gt;$40&lt;/span&gt;          &lt;span class="nv"&gt;$28&lt;/span&gt;            &lt;span class="nv"&gt;$12&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;30%&lt;span class="o"&gt;)&lt;/span&gt;
OpenAI      gpt-5.4-nano       &lt;span class="nv"&gt;$20&lt;/span&gt;          &lt;span class="nv"&gt;$28&lt;/span&gt;            -&lt;span class="nv"&gt;$8&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;cloud cheaper&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  What InferCost is NOT
&lt;/h3&gt;

&lt;p&gt;It is not a cloud API cost tracker. If you want to monitor your OpenAI bill, tools like Helicone and LangSmith do that well. InferCost solves a different problem: the cost of running inference on hardware you own, where the economics involve amortization schedules and electricity bills, not API invoices.&lt;/p&gt;

&lt;p&gt;It is also not locked to any specific inference stack. It works with LLMKube, but also with any Kubernetes deployment that runs llama.cpp or vLLM with Prometheus metrics exposed.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why open source
&lt;/h3&gt;

&lt;p&gt;The organizations that need on-prem cost tracking the most (healthcare, defense, finance, government) are the same ones that can't send cost data to a SaaS dashboard. They chose on-prem for data sovereignty. A cost tracking tool that phones home defeats the purpose.&lt;/p&gt;

&lt;p&gt;InferCost runs entirely in your cluster. Your cost data never leaves your infrastructure. Apache 2.0, no telemetry, no cloud dependency.&lt;/p&gt;

&lt;h3&gt;
  
  
  Get started
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install the CLI&lt;/span&gt;
brew &lt;span class="nb"&gt;install &lt;/span&gt;defilantech/tap/infercost

&lt;span class="c"&gt;# Or deploy via Helm&lt;/span&gt;
helm repo add infercost https://defilantech.github.io/infercost
helm &lt;span class="nb"&gt;install &lt;/span&gt;infercost infercost/infercost &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set&lt;/span&gt; dcgm.endpoint&lt;span class="o"&gt;=&lt;/span&gt;http://dcgm-exporter:9400/metrics
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GitHub&lt;/strong&gt;: &lt;a href="https://github.com/defilantech/infercost" rel="noopener noreferrer"&gt;github.com/defilantech/infercost&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Website&lt;/strong&gt;: &lt;a href="https://infercost.ai" rel="noopener noreferrer"&gt;infercost.ai&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Companion project&lt;/strong&gt;: &lt;a href="https://llmkube.com" rel="noopener noreferrer"&gt;LLMKube&lt;/a&gt; (K8s operator for LLM inference)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you're running inference on your own hardware and want to know what it actually costs, give it a try. Issues and PRs welcome.&lt;/p&gt;

</description>
      <category>finops</category>
      <category>ai</category>
      <category>kubernetes</category>
      <category>opensource</category>
    </item>
    <item>
      <title>llama.cpp on Kubernetes: The Guide I Wish Existed</title>
      <dc:creator>Christopher Maher</dc:creator>
      <pubDate>Tue, 17 Mar 2026 06:50:51 +0000</pubDate>
      <link>https://dev.to/defilan/llamacpp-on-kubernetes-the-guide-i-wish-existed-59nm</link>
      <guid>https://dev.to/defilan/llamacpp-on-kubernetes-the-guide-i-wish-existed-59nm</guid>
      <description>&lt;p&gt;It started at my kitchen table.&lt;/p&gt;

&lt;p&gt;I was spending an evening on my laptop, fascinated by how LLMs actually work under the hood. Not the API calls, not the chat interfaces, but the actual inference process. I installed Ollama on my Mac, pulled a model, and within a few hours I was completely hooked.&lt;/p&gt;

&lt;p&gt;If you've done this yourself, you know the feeling. A language model running on your own hardware. No API keys, no usage limits, no data leaving your network. Just you and the model.&lt;/p&gt;

&lt;p&gt;Ollama made it easy to get started, but I quickly wanted to understand what was happening underneath. That led me to llama.cpp, which Ollama uses under the hood, and that's where things really clicked. I could see exactly how the model was being loaded, how layers were offloaded to the GPU, how the inference loop worked. I went from curious to obsessed pretty quickly.&lt;/p&gt;

&lt;p&gt;But then the questions started piling up.&lt;/p&gt;

&lt;p&gt;How do I serve this to my team? How do I run multiple models? What happens when I want to use the NVIDIA GPUs on my Linux server AND the Metal GPU on my Mac? How do I monitor it? How do I manage model versions?&lt;/p&gt;

&lt;p&gt;I come from a DevOps background, so my brain immediately went to Kubernetes. I figured someone had already built this. And while there are some incredible tools out there (Ollama for single-machine use, vLLM for high-throughput NVIDIA clusters), nothing quite did what I wanted: a Kubernetes operator that treats LLM inference as a first-class workload across heterogeneous hardware, including Apple Silicon.&lt;/p&gt;

&lt;p&gt;So I started building &lt;a href="https://github.com/defilantech/LLMKube" rel="noopener noreferrer"&gt;LLMKube&lt;/a&gt;, an open-source Kubernetes operator for running LLMs with llama.cpp. I'm a big believer in open source, and I wanted this to be open source from day one. The best infrastructure tools are built by communities, not individuals. This guide is everything I've learned along the way.&lt;/p&gt;

&lt;h2&gt;
  
  
  What We're Building Toward
&lt;/h2&gt;

&lt;p&gt;By the end of this guide, you'll understand how to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Run llama.cpp on Kubernetes with proper lifecycle management&lt;/li&gt;
&lt;li&gt;Deploy models with a single command or a two-resource YAML&lt;/li&gt;
&lt;li&gt;Use NVIDIA GPUs with CUDA acceleration&lt;/li&gt;
&lt;li&gt;Use Apple Silicon Macs as GPU inference nodes in your cluster&lt;/li&gt;
&lt;li&gt;Split models across multiple GPUs for larger models&lt;/li&gt;
&lt;li&gt;Monitor everything with Prometheus and Grafana&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you just want to try it out quickly, skip ahead to the hands-on quickstart.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem with "Just Run llama.cpp"
&lt;/h2&gt;

&lt;p&gt;llama.cpp is an outstanding project. It runs on virtually any hardware, supports dozens of model architectures, and the GGUF format has become the standard for local inference. If you need to run one model on one machine, llama.cpp with llama-server is honestly all you need.&lt;/p&gt;

&lt;p&gt;The challenges show up when you want to operationalize it:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Model lifecycle.&lt;/strong&gt; You need to download models, verify their integrity, cache them so pods don't re-download 30GB files on every restart, and keep track of what's deployed where.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GPU scheduling.&lt;/strong&gt; If you have multiple models competing for limited GPU memory, you need something smarter than "first pod wins." Priority queues, memory budgets, and graceful handling of GPU contention all matter when you have real workloads.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Heterogeneous hardware.&lt;/strong&gt; This is the big one. Apple Silicon's Metal GPU can't be accessed from inside a container. Every Kubernetes-based LLM tool I found either ignored Macs entirely or ran them in CPU-only mode, which throws away the best part of the hardware. If you have a Mac Studio with an M4 Ultra sitting on your desk and a Linux server with NVIDIA GPUs in your closet, you shouldn't have to choose between them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Observability.&lt;/strong&gt; If you're already running Prometheus and Grafana (and if you're running Kubernetes, you probably are), you want inference metrics in the same stack as everything else. Tokens per second, prompt processing time, GPU utilization, model load times, all in one place.&lt;/p&gt;

&lt;h2&gt;
  
  
  How LLMKube Approaches This
&lt;/h2&gt;

&lt;p&gt;LLMKube adds two Custom Resource Definitions to your Kubernetes cluster:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Model&lt;/strong&gt; defines what you want to run: the GGUF source URL, quantization level, GPU requirements, and hardware preferences.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;InferenceService&lt;/strong&gt; defines how you want to run it: replicas, resource limits, endpoint configuration, and which Model to reference.&lt;/p&gt;

&lt;p&gt;The operator watches these resources and handles everything in between: downloading the model, creating deployments, configuring health checks, setting up llama-server with the right flags, exposing an OpenAI-compatible API, and cleaning up when you delete resources.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;inference.llmkube.dev/v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Model&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;llama-3-8b&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;source&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf&lt;/span&gt;
  &lt;span class="na"&gt;format&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gguf&lt;/span&gt;
  &lt;span class="na"&gt;quantization&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Q4_K_M&lt;/span&gt;
  &lt;span class="na"&gt;hardware&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;accelerator&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;cuda&lt;/span&gt;
    &lt;span class="na"&gt;gpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;count&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;inference.llmkube.dev/v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;InferenceService&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;llama-3-8b&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;modelRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;llama-3-8b&lt;/span&gt;
  &lt;span class="na"&gt;replicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
  &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2"&lt;/span&gt;
    &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;4Gi"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. The operator takes it from there.&lt;/p&gt;

&lt;h2&gt;
  
  
  My Actual Setup
&lt;/h2&gt;

&lt;p&gt;I want to be transparent about the hardware I run this on, because I think it's important for people to see that you don't need datacenter-grade equipment to make this work.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Shadowstack&lt;/strong&gt; is my primary inference server. It's a desktop PC I built specifically for this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AMD Ryzen 9 7900X (12 cores / 24 threads)&lt;/li&gt;
&lt;li&gt;64GB DDR5-6000&lt;/li&gt;
&lt;li&gt;2x NVIDIA RTX 5060 Ti (16GB VRAM each, 32GB total)&lt;/li&gt;
&lt;li&gt;Samsung 990 Pro 1TB NVMe&lt;/li&gt;
&lt;li&gt;Running MicroK8s as a single-node Kubernetes cluster&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Mac Studio&lt;/strong&gt; (M4 Ultra, 36GB unified memory) runs the Metal Agent, which lets Kubernetes orchestrate llama-server natively on macOS with full Metal GPU access.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mac Mini&lt;/strong&gt; handles other orchestration workloads.&lt;/p&gt;

&lt;p&gt;On Shadowstack, I run &lt;strong&gt;Qwen3 32B&lt;/strong&gt; with the model split across both 5060 Tis using tensor parallelism. On the Mac Studio, I run &lt;strong&gt;Qwen 30B-A3B&lt;/strong&gt; (a mixture-of-experts model that fits comfortably in 36GB of unified memory). Both are managed by the same LLMKube operator, using the same CRDs, visible through the same monitoring stack.&lt;/p&gt;

&lt;p&gt;Is 36GB of unified memory on the Mac Studio less than I wish I had? Sure. But it still runs a 30B MoE model for real workloads, and that's the point. You work with the hardware you have.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Metal Agent: Running Apple Silicon in Your Cluster
&lt;/h2&gt;

&lt;p&gt;This is the part that gets me the most excited, and the part that I haven't seen anyone else solve.&lt;/p&gt;

&lt;p&gt;Here's the core problem: Apple Silicon GPUs use Metal, not CUDA. Metal isn't accessible from inside a Docker container. So if you put a Mac in your Kubernetes cluster and deploy a pod to it, that pod can only use the CPU. Your M4 Ultra's GPU sits idle.&lt;/p&gt;

&lt;p&gt;The Metal Agent works around this by inverting the typical Kubernetes model. Instead of running inference inside a container, the Metal Agent runs as a native macOS daemon that:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Watches the Kubernetes API for InferenceService resources with &lt;code&gt;accelerator: metal&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Spawns llama-server natively on macOS with full Metal GPU access&lt;/li&gt;
&lt;li&gt;Registers the endpoint back into Kubernetes so other services can route to it&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;From the perspective of any other service in your cluster, the model running on your Mac looks like any other Kubernetes-managed endpoint. You can hit the same OpenAI-compatible API, the same health checks work, the same Prometheus metrics are exposed.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# On your Mac&lt;/span&gt;
brew &lt;span class="nb"&gt;install &lt;/span&gt;llama.cpp
llmkube-metal-agent &lt;span class="nt"&gt;--host-ip&lt;/span&gt; 192.168.1.x

&lt;span class="c"&gt;# From anywhere in the cluster&lt;/span&gt;
llmkube deploy qwen-30b-a3b &lt;span class="nt"&gt;--accelerator&lt;/span&gt; metal
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The same CRD that deploys a model on NVIDIA with CUDA deploys on Apple Silicon with Metal. Just change &lt;code&gt;accelerator: cuda&lt;/code&gt; to &lt;code&gt;accelerator: metal&lt;/code&gt;.&lt;/p&gt;
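
&lt;p&gt;If you want a feel for what that inversion looks like in code, here's a minimal sketch of the watch-and-spawn loop using the Kubernetes Python client. To be clear, this is not the Metal Agent's actual implementation; the resource plural, field locations, model path, and llama-server flags are assumptions for illustration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Minimal sketch of a watch-and-spawn loop; NOT the actual Metal Agent code.
# The resource plural, spec field locations, model path, and server flags
# below are assumptions for illustration.
import subprocess
from kubernetes import client, config, watch

config.load_kube_config()
api = client.CustomObjectsApi()

for event in watch.Watch().stream(
    api.list_cluster_custom_object,
    group="inference.llmkube.dev",
    version="v1alpha1",
    plural="inferenceservices",        # assumed plural for the CRD
):
    svc = event["object"]
    if event["type"] != "ADDED":
        continue
    if svc["spec"].get("accelerator") != "metal":   # assumed field location
        continue
    # Spawn llama-server natively so it gets full Metal GPU access.
    subprocess.Popen([
        "llama-server",
        "-m", f"/var/models/{svc['spec']['modelRef']}.gguf",   # assumed path
        "--port", "8080",
        "-ngl", "999",                 # offload all layers to the GPU
    ])
    # Step 3 (registering the endpoint back into Kubernetes) is omitted here.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
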

&lt;h2&gt;
  
  
  Multi-GPU: Splitting Models Across Cards
&lt;/h2&gt;

&lt;p&gt;If you want to run models larger than what fits on a single GPU, llama.cpp supports tensor parallelism across multiple GPUs on the same node. LLMKube automates this through the GPU sharding spec.&lt;/p&gt;

&lt;p&gt;On my Shadowstack box, Qwen3 32B (quantized to Q4_K_M, roughly 20GB) gets split across both 5060 Tis. Each GPU handles a portion of the model's layers, and llama.cpp coordinates the inference across both cards.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;hardware&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;accelerator&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;cuda&lt;/span&gt;
    &lt;span class="na"&gt;gpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;count&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
      &lt;span class="na"&gt;sharding&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;strategy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;layer&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The operator automatically calculates the tensor split ratios and passes the right flags to llama-server. On the dual 5060 Ti setup, I see consistent ~53 tokens/second for 3-8B models and solid performance on the 32B model with the split.&lt;/p&gt;
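
&lt;p&gt;For reference, a layer split across two identical cards boils down to a ratio over available VRAM and a couple of llama-server flags. Here's a rough sketch of what that flag construction looks like (LLMKube's actual output may differ):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Rough sketch of how a layer split across two equal cards maps to
# llama-server flags. LLMKube's actual flag construction may differ,
# and the model path is illustrative.
vram_gib = [16, 16]                                  # 2x RTX 5060 Ti
total = sum(vram_gib)
ratios = [round(v / total, 2) for v in vram_gib]     # [0.5, 0.5]

args = [
    "llama-server",
    "-m", "/models/qwen3-32b-q4_k_m.gguf",
    "--split-mode", "layer",
    "--tensor-split", ",".join(str(r) for r in ratios),
    "-ngl", "999",
]
print(" ".join(args))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
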

&lt;h2&gt;
  
  
  Hands-On: Try It in 10 Minutes
&lt;/h2&gt;

&lt;p&gt;You don't need my hardware to try this. Here's the quickest path from zero to running inference on Kubernetes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;A Kubernetes cluster (Minikube, kind, K3s, or any managed cluster)&lt;/li&gt;
&lt;li&gt;kubectl configured&lt;/li&gt;
&lt;li&gt;Helm 3&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Install LLMKube
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install the CLI&lt;/span&gt;
brew &lt;span class="nb"&gt;install &lt;/span&gt;defilantech/tap/llmkube

&lt;span class="c"&gt;# Add the Helm repo and install the operator&lt;/span&gt;
helm repo add llmkube https://defilantech.github.io/LLMKube
helm &lt;span class="nb"&gt;install &lt;/span&gt;llmkube llmkube/llmkube &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--namespace&lt;/span&gt; llmkube-system &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--create-namespace&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Deploy Your First Model
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Deploy Phi-4 Mini (3.8B params, from the built-in catalog)&lt;/span&gt;
llmkube deploy phi-4-mini
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That single command creates both the Model and InferenceService resources. The operator downloads the GGUF file, spins up a pod with llama-server, and exposes an OpenAI-compatible API. You can also deploy any GGUF model by providing a &lt;code&gt;--source&lt;/code&gt; URL pointing to HuggingFace or any HTTP endpoint.&lt;/p&gt;

&lt;h3&gt;
  
  
  Query It
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Port-forward and test&lt;/span&gt;
kubectl port-forward svc/phi-4-mini 8080:8080 &amp;amp;

curl http://localhost:8080/v1/chat/completions &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "messages": [
      {"role": "user", "content": "What is Kubernetes in one sentence?"}
    ],
    "max_tokens": 100
  }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Use It With the OpenAI SDK
&lt;/h3&gt;

&lt;p&gt;Since the API is OpenAI-compatible, you can point any OpenAI SDK client at it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:8080/v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;not-needed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;phi-4-mini&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Hello!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This works with LangChain, LlamaIndex, and anything else that speaks the OpenAI API.&lt;/p&gt;

&lt;h3&gt;
  
  
  Add GPU Acceleration
&lt;/h3&gt;

&lt;p&gt;If you have an NVIDIA GPU available in your cluster:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;llmkube deploy llama-3.1-8b &lt;span class="nt"&gt;--gpu&lt;/span&gt; &lt;span class="nt"&gt;--gpu-count&lt;/span&gt; 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The difference is dramatic. On an NVIDIA L4 in GKE, prompt processing goes from 29 tok/s (CPU) to 1,026 tok/s (GPU). Token generation jumps from 4.6 tok/s to 64 tok/s. That's roughly a 14x speedup on generation and 35x on prompt processing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Air-Gapped Deployments
&lt;/h2&gt;

&lt;p&gt;Early in my career, I worked in medical IT. That experience gave me an appreciation for environments where data simply cannot leave the network. Healthcare, defense, finance, government: these industries have strict compliance requirements that make cloud-hosted AI a non-starter.&lt;/p&gt;

&lt;p&gt;LLMKube supports air-gapped deployment through PVC-based model sources with SHA256 integrity verification:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;source&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pvc://model-storage/models/llama-3-8b-q4.gguf&lt;/span&gt;
  &lt;span class="na"&gt;sha256&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;a1b2c3d4e5f6...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You stage models to a PersistentVolumeClaim, provide the checksum, and the operator verifies integrity before deploying. No outbound network calls, no container registry pulls at runtime, no data leaving your network.&lt;/p&gt;
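
&lt;p&gt;Producing that checksum before you stage the file is quick: &lt;code&gt;sha256sum&lt;/code&gt; on Linux does it, or a few lines of Python if you're scripting the staging step:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Compute the sha256 you paste into the Model spec, reading in chunks so a
# 20-30 GB GGUF file never has to fit in memory.
import hashlib
import sys

h = hashlib.sha256()
with open(sys.argv[1], "rb") as f:
    for chunk in iter(lambda: f.read(1024 * 1024), b""):
        h.update(chunk)
print(h.hexdigest())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
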

&lt;p&gt;This is an area where I think llama.cpp really shines for Kubernetes deployments. The GGUF format is a single file. There's no Python dependency tree, no model sharding across dozens of files, no runtime downloads of tokenizers. You put one file on a PVC, point a CRD at it, and you're running.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where LLMKube Fits (and Where It Doesn't)
&lt;/h2&gt;

&lt;p&gt;I want to be honest about this, because there are great tools in this space and picking the right one matters.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If you need maximum throughput for high-concurrency workloads (50+ simultaneous users), use vLLM or SGLang.&lt;/strong&gt; They use PagedAttention, continuous batching, and other optimizations that llama.cpp doesn't have. At scale, vLLM delivers significantly higher request throughput. That's just the reality.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If you just need to run one model on one machine, use Ollama.&lt;/strong&gt; It's simpler, it's elegant, and it handles the single-machine case better than a Kubernetes operator ever will.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LLMKube is for the space in between.&lt;/strong&gt; You have a Kubernetes cluster. You have a mix of hardware (maybe NVIDIA GPUs, maybe Apple Silicon, maybe both). You want Kubernetes-native lifecycle management with CRDs, GitOps workflows, and your inference metrics in the same Prometheus/Grafana stack as everything else. You care about air-gapped deployments, GPU scheduling, and model versioning. You're serving a team or a set of internal workloads, not a public-facing API with thousands of concurrent users.&lt;/p&gt;

&lt;p&gt;If that sounds like your situation, LLMKube might be what you're looking for. If it doesn't, I genuinely hope one of the other tools solves your problem. We all benefit from this ecosystem getting better.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;LLMKube is open source (Apache 2.0) and actively developed. Some things I'm excited about on the roadmap:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Edge deployment support&lt;/strong&gt; for lightweight Kubernetes distributions like K3s and MicroK8s&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AMD GPU support (ROCm)&lt;/strong&gt; with a community contributor already testing on Framework hardware with a Ryzen AI Max+ 395&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;llmkube chat&lt;/code&gt;&lt;/strong&gt; for testing models directly from the CLI without needing curl&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I'll be honest about one thing that comes up a lot: multi-node distributed inference. llama.cpp has an RPC backend that can split a model across machines over ethernet, and I've been watching it closely. The reality is that over consumer networking (1GbE, 2.5GbE), the performance hit from network round-trips makes it marginal for interactive use. Jeff Geerling tested a four-node Framework cluster and got 0.7 tok/s on Llama 405B. The tech is improving, but today my advice is to scale vertically first. Get a bigger GPU or more unified memory before trying to split across machines. If the RPC backend matures to the point where it's genuinely usable over ethernet, LLMKube will support it, but I'm not going to promise something that isn't ready.&lt;/p&gt;

&lt;p&gt;If any of this is interesting to you, I'd love to hear from you. The project is at &lt;a href="https://github.com/defilantech/LLMKube" rel="noopener noreferrer"&gt;github.com/defilantech/LLMKube&lt;/a&gt;, and we have a &lt;a href="https://discord.gg/Ktz85RFHDv" rel="noopener noreferrer"&gt;Discord&lt;/a&gt; where I hang out and talk about this stuff regularly.&lt;/p&gt;

&lt;p&gt;If you hit issues, open a GitHub issue. If you want to contribute, check the issues labeled &lt;code&gt;good-first-issue&lt;/code&gt;. And if you just want to say hi, that's cool too.&lt;/p&gt;

&lt;p&gt;Thanks for reading. I hope this saves you some of the time I spent figuring all this out.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>ai</category>
      <category>opensource</category>
      <category>devops</category>
    </item>
  </channel>
</rss>
