<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Christopher Maher</title>
    <description>The latest articles on DEV Community by Christopher Maher (@defilan).</description>
    <link>https://dev.to/defilan</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3828578%2Fd03de6fc-1dcb-419b-b336-0d9c7d86f7cc.jpeg</url>
      <title>DEV Community: Christopher Maher</title>
      <link>https://dev.to/defilan</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/defilan"/>
    <language>en</language>
    <item>
      <title>A local model opened 41 of our pull requests in five weeks. The model is the least interesting part.</title>
      <dc:creator>Christopher Maher</dc:creator>
      <pubDate>Thu, 25 Jun 2026 08:11:57 +0000</pubDate>
      <link>https://dev.to/defilan/a-local-model-opened-41-of-our-pull-requests-in-five-weeks-the-model-is-the-least-interesting-part-cc4</link>
      <guid>https://dev.to/defilan/a-local-model-opened-41-of-our-pull-requests-in-five-weeks-the-model-is-the-least-interesting-part-cc4</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;This was originally published on the &lt;a href="https://llmkube.com/blog/a-local-model-opened-41-pull-requests" rel="noopener noreferrer"&gt;LLMKube blog&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Here is the claim, up front and checkable: between May 21 and June 25, 2026, a fleet of local models opened &lt;strong&gt;41 pull requests that we merged into LLMKube&lt;/strong&gt;, our open-source Kubernetes operator for self-hosted inference. No code or prompts left the building. The marginal inference cost was a few cents of electricity. Across those five weeks they were about a fifth of everything merged into the repo, and closer to half in the busiest recent stretch, sitting next to pull requests from five human contributors who showed up in the same weeks.&lt;/p&gt;

&lt;p&gt;If you have used a 27-billion-parameter open-weight model as a coding agent, your first reaction is correct skepticism. A model that size is a coin flip on a non-trivial issue. It drifts. It writes tests that do not test anything. It declares victory on code that does not compile.&lt;/p&gt;

&lt;p&gt;That is all true, and it is also beside the point. We never bet on the model. We bet on the harness around it. This post is the evidence for that bet, including the parts where it failed.&lt;/p&gt;

&lt;h2&gt;
  
  
  The setup: a weak model, a strict harness, heterogeneous hardware
&lt;/h2&gt;

&lt;p&gt;The agentic coder is a component of LLMKube called Foreman. Its design premise is one sentence: &lt;strong&gt;trust the harness, not the model.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The model is whatever local coder we have loaded. Over these five weeks that was mostly a dense 27B (Qwopus-27B-Coder) on an AMD Strix Halo mini-PC over Vulkan, and a 35B mixture-of-experts (Qwen3.6-35B-A3B) on an Apple Silicon Mac over Metal. A second, different model on a Mac Studio acts as the reviewer. None of them is a frontier model. None of them is close.&lt;/p&gt;

&lt;p&gt;The harness is where the work went. Around every run sits a stack of &lt;strong&gt;deterministic&lt;/strong&gt; checks, each of which can reject the model's output regardless of how confident the model is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;A fast in-workspace gate:&lt;/strong&gt; gofmt, go vet, go build, golangci-lint, and the scoped unit tests. If any fail, the failure is fed back to the coder for up to three fix attempts. No PR opens on a red gate.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A scope-drift guard:&lt;/strong&gt; if the diff touches a subsystem the issue does not imply, the run is rejected rather than approved. A confidently wrong change to the wrong package never reaches a PR.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A bite check:&lt;/strong&gt; every new test is run against the pre-fix baseline. If a test passes without the fix, it does not actually test the fix, and the run is rejected. This is the single most common failure mode of LLM-written tests, and it is now a gate, not a hope.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;An issueAsk check:&lt;/strong&gt; the reviewer has to demonstrate, against the actual fetched issue body, that it understood what was asked. A reviewer that confabulates a plausible-but-wrong summary is demoted, not trusted.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A separate reviewer model&lt;/strong&gt; on separate hardware, so the thing judging the work is not the thing that produced it.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The coder is stochastic. Every one of those rails is deterministic. That asymmetry is the entire product.&lt;/p&gt;
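&lt;p&gt;As a rough illustration of the bite check, here is a minimal Go sketch. It is not Foreman's implementation: the &lt;code&gt;TestOutcome&lt;/code&gt; type, the &lt;code&gt;biteCheck&lt;/code&gt; function, and the test names are hypothetical, and it assumes each new test has already been run once against the pre-fix baseline.&lt;/p&gt;

```go
package main

import "fmt"

// TestOutcome is a hypothetical record of one newly written test
// after it has been run against the pre-fix baseline.
type TestOutcome struct {
	Name             string
	PassesOnBaseline bool
}

// biteCheck returns the names of new tests that pass without the fix.
// Such tests cannot detect the bug, so a non-empty result rejects the run.
func biteCheck(outcomes []TestOutcome) []string {
	var rejected []string
	for _, o := range outcomes {
		if o.PassesOnBaseline {
			rejected = append(rejected, o.Name)
		}
	}
	return rejected
}

func main() {
	// Illustrative data only: one test that bites, one that does not.
	outcomes := []TestOutcome{
		{Name: "TestUsesConfiguredTimeout", PassesOnBaseline: false},
		{Name: "TestCounterIncrements", PassesOnBaseline: true},
	}
	rejected := biteCheck(outcomes)
	if len(rejected) > 0 {
		fmt.Println("REJECT: tests that do not bite:", rejected)
		return
	}
	fmt.Println("GO")
}
```

&lt;p&gt;The decision depends only on recorded test results, never on the model's own report, which is what makes the rail deterministic.&lt;/p&gt;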

&lt;h2&gt;
  
  
  The numbers
&lt;/h2&gt;

&lt;p&gt;Five weeks. The verifiable shape of it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;41 merged PRs&lt;/strong&gt; authored by Foreman, all from &lt;code&gt;foreman/issue-*&lt;/code&gt; branches, May 21 to June 25.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;39 of them in June&lt;/strong&gt;, as the harness matured and we trusted it with more.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;31 of the 41 carried no human commit at all&lt;/strong&gt;: Foreman is the only author on the branch. The other ten took a small human touch-up before merge, the same hand-finishing this post is honest about.&lt;/li&gt;
&lt;li&gt;Across the full five weeks they were about &lt;strong&gt;20% of everything merged&lt;/strong&gt; (201 PRs in total); in the most active recent stretch, closer to &lt;strong&gt;half&lt;/strong&gt;. They sat alongside five human contributors (adebrie, arychj, eleboucher, joryirving, matiasinsaurralde) working the same repo in the same weeks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;$0.00 in API spend.&lt;/strong&gt; A representative overnight batch of six issues cost roughly &lt;strong&gt;six cents of electricity&lt;/strong&gt; versus an estimated eighteen to thirty cents of equivalent cloud-API tokens. The pennies are not the point. The &lt;strong&gt;shape&lt;/strong&gt; is: zero per-call cost, constant and repeatable, and nothing routed through someone else's data center.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The work was real maintenance, not toy issues: CLI flags, controller reconciliation fixes, metrics plumbing, test-coverage slices, a supply-chain CI scan, observability spans. The kind of backlog that is too important to ignore and too unglamorous to prioritize. Exactly the kind a tireless overnight coworker should take.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three times it went wrong (and what caught it)
&lt;/h2&gt;

&lt;p&gt;A case study that only reports wins is marketing. Here is where the model was not good enough, in detail, because the failures are the argument.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Issue #731 took six runs to converge.&lt;/strong&gt; A single feature-plus-tests task. The mixture-of-experts coder kept getting partway and stalling. We drove it forward by tightening the harness one layer at a time: a budget guard, then an edit-streak forcing function, then a test tier. Each layer fixed one failure mode and revealed the next. The rails always contained it. It never shipped a broken change. But a 3-billion-active-parameter mixture-of-experts could not nail that task autonomously. What finally cleared it was not a sharper prompt or a human rescue but a different model: a dense 27B coder, on the same AMD hardware, later landed the same issue cleanly and autonomously, gate-verified, in about forty minutes. That is the thesis from another angle. The harness is the constant. The model is a dial you turn when the one you have is not enough.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Issue #813 is still not done.&lt;/strong&gt; It is a harness change that needs a hermetic git-fixture test. Two autonomous attempts, including a second with a sharpened prompt that explicitly demanded a fixture and a fast fallback, both came back INCOMPLETE: the first hung for 180 seconds on a test that did real I/O instead of using a fixture; the second still failed the gate. After two honest tries, we marked it a human hand-finish. The harness did its job by refusing to land a failing change. The model did not do the job at all.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A run produced a false GO, and CI caught what the in-workspace gate could not.&lt;/strong&gt; We routed one coder loop onto an in-cluster agent that, it turned out, had no Go toolchain installed. The fast gate runs in the coder's own workspace, so with no Go it silently no-op'd, and the run reported a confident GO on code with a backwards test assertion and a formatting violation. The model was sure. The local gate was blind. The full CI suite caught both immediately, we fixed them by hand, and we filed the toolchain gap as a tracked issue. The lesson is the thesis restated: a harness is only as good as its coverage, and the moment one rail goes dark, you find out exactly how much you were leaning on it.&lt;/p&gt;

&lt;p&gt;None of these three shipped a broken PR. That is the whole claim. The model is unreliable; the system is not.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why "harness, not model" is the right bet for weak models
&lt;/h2&gt;

&lt;p&gt;There is a tidy intuition under all of this. Generating a correct change is hard and open-ended. &lt;strong&gt;Verifying&lt;/strong&gt; one is narrow and mechanical: does it compile, do the tests bite, did the diff stay in scope, did the reviewer actually read the issue. The recent verifier literature makes the same point more formally, that a stack of weak, independent verifiers can close most of the gap to an oracle, and that the weaker your generator is, the more load-bearing your verifiers become.&lt;/p&gt;

&lt;p&gt;A frontier cloud model is good enough that you can get away with a thin harness. A local 27B is not, which is precisely why a local 27B forces you to build the harness you should have built anyway. We did not set out to prove a research point. We set out to fix our own backlog without paying for or trusting a cloud API, and the harness is what fell out of taking that constraint seriously.&lt;/p&gt;

&lt;h2&gt;
  
  
  The cloud's other problem: the bill stopped being predictable
&lt;/h2&gt;

&lt;p&gt;There is a second reason to run the coder on hardware you own, and over 2025 and 2026 it stopped being hypothetical.&lt;/p&gt;

&lt;p&gt;The flat-rate era of cloud AI coding ended in public. OpenAI's CEO admitted in early 2025 that the company was &lt;a href="https://techcrunch.com/2025/01/05/openai-is-losing-money-on-its-pricey-chatgpt-pro-plan-ceo-sam-altman-says/" rel="noopener noreferrer"&gt;losing money on its $200-a-month ChatGPT Pro plan&lt;/a&gt; because people were using it far more than the company expected. Cursor &lt;a href="https://techcrunch.com/2025/07/07/cursor-apologizes-for-unclear-pricing-changes-that-upset-users/" rel="noopener noreferrer"&gt;apologized and issued refunds&lt;/a&gt; after a June 2025 repricing left users burning a month of credits in a single agentic session. GitHub &lt;a href="https://github.blog/news-insights/company-news/github-copilot-is-moving-to-usage-based-billing/" rel="noopener noreferrer"&gt;ended flat-rate Copilot billing&lt;/a&gt; on June 1, 2026, after its own product chief called the prior premium-request model "no longer sustainable"; developers posting their own bills projected typical agentic workflows costing several times more.&lt;/p&gt;

&lt;p&gt;The cause is structural, not a botched rollout. A human types for eight hours and stops. An agent has no natural ceiling: point it at a backlog and it consumes tokens until you tell it not to. Usage-metered pricing meets unbounded consumption, and the bill stops being a line you can budget.&lt;/p&gt;

&lt;p&gt;At enterprise scale that gets vivid. Uber rolled an AI coding assistant out to its engineering organization and &lt;a href="https://fortune.com/2026/05/26/uber-coo-ai-spending-tokens-claude-code/" rel="noopener noreferrer"&gt;burned through its entire 2026 AI-tools budget in four months&lt;/a&gt;, with its COO openly questioning whether the spend tied to features the company actually shipped. Microsoft, by separate reporting, began canceling Claude Code licenses across a division and steering engineers to a flat-rate tool over the per-seat-plus-tokens math. Even Meta capped internal token budgets after costs approached the billions and its CTO pushed back on the "tokenmaxxing" culture, writing that "token usage alone is not a measure of impact of any kind."&lt;/p&gt;

&lt;p&gt;On-prem inverts the whole model. The hardware is a one-time capital cost; every run after that carries a zero marginal token bill. The API bill is the same whether the number of PRs is 41 or 4,100, which is to say zero. That is not a discount, it is a different axis. To be honest about it, the hardware, the power, and the operations are real total cost of ownership. The claim is narrower and it is the one that matters: the &lt;em&gt;marginal&lt;/em&gt; cost of the next agentic run, the thing that blew up Uber's budget, is gone.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this matters if you cannot use the cloud at all
&lt;/h2&gt;

&lt;p&gt;For most teams this is a cost-and-control story. For some teams it is the only story there is.&lt;/p&gt;

&lt;p&gt;GitHub Copilot and Amazon Q do not run on-premises. For an organization on GitHub Enterprise Server behind an air gap, in defense, in regulated finance, in healthcare with code that touches protected data, the dominant agentic coding tools are not a policy fight, they are architecturally unavailable. Sending source code to a third party's inference endpoint is the thing the compliance regime exists to prevent.&lt;/p&gt;

&lt;p&gt;A coder that runs entirely on hardware you own, talks only to a model you host, and emits an auditable record of every gate it passed is a different kind of object. It is agentic coding for the rooms that cloud agents cannot enter. That is the same constraint that makes LLMKube exist at all, applied to the act of building LLMKube.&lt;/p&gt;

&lt;p&gt;That auditability is not aspirational. This week we shipped a durable, exportable audit record for every Foreman run, capturing which model and endpoint served it, the verdict, and which rails fired, surviving long enough to be a real compliance trail. The harness now writes down what it checked.&lt;/p&gt;

&lt;h2&gt;
  
  
  The crowd that already owns the hardware
&lt;/h2&gt;

&lt;p&gt;None of this is news to one group: the people who have been running local models in their own home labs for years.&lt;/p&gt;

&lt;p&gt;That community is not fringe anymore. r/LocalLLaMA crossed &lt;a href="https://gummysearch.com/r/LocalLLaMA/" rel="noopener noreferrer"&gt;757,000 members as of June 2026&lt;/a&gt;, up more than 267,000 in a single year, and the growth tracks the arrival of open-weight coding models that are genuinely good. These are not toys you settle for. According to its own model card, Devstral Small 2, a 24B model, scores 68% on SWE-bench Verified and runs in about 15GB of VRAM, which fits a single RTX 4090 or a 32GB Mac. Qwen's coder models run on the same class of hardware. Capability first; the price is a consequence.&lt;/p&gt;

&lt;p&gt;When Ethereum's Vitalik Buterin &lt;a href="https://vitalik.eth.limo/general/2026/04/02/secure_llms.html" rel="noopener noreferrer"&gt;published his own fully local inference stack&lt;/a&gt; in April 2026, running a 35B model on a single laptop GPU, his reason was not the bill. It was not wanting to "take ten steps backward" on privacy just as the tools got good. That is the same instinct underneath the enterprise compliance story: when inference runs on hardware you own, no prompt leaves the machine, no terms of service govern what gets logged, and the ten-thousandth run costs exactly what the first one did.&lt;/p&gt;

&lt;p&gt;LLMKube and Foreman are that instinct taken to production. The same hardware a hobbyist already has, plus the operator that schedules it across a fleet and the harness that makes a coin-flip model trustworthy enough to leave pointed at a real repository overnight. We are not going to tell you it is push-button. Running your own inference is a real operational surface, and this crowd knows that better than anyone. We are telling you it is worth it, and that the gate is the thing that turns "a model on my 4090" into "a coworker I can actually hand the backlog to."&lt;/p&gt;

&lt;h2&gt;
  
  
  What is next, honestly
&lt;/h2&gt;

&lt;p&gt;The most useful thing we can publish next is a number we do not have yet: the &lt;strong&gt;harness uplift&lt;/strong&gt; on a standard benchmark. Not our resolved rate versus a frontier model, a race we would lose, but the same local model's resolved rate with the rails on versus off, per hardware tier, with the false-GO rate alongside it. The delta is the product. We will run it and publish it, good or bad.&lt;/p&gt;

&lt;p&gt;Until then, the honest framing is the one we actually operate under. Foreman is a tireless coworker for the routine and the well-scoped, with a human triage queue for everything it declines. It is not a sprint that finishes itself overnight. It is a backlog that gets quietly smaller while the machines work and the gate refuses to lie.&lt;/p&gt;

&lt;p&gt;The model will keep being a coin flip. We are going to keep not betting on it.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;LLMKube is Apache 2.0 and runs on NVIDIA, Apple Silicon, and AMD. Whether you are a regulated team that cannot send code to the cloud, an organization watching its Copilot bill go usage-based, or someone with a spare GPU and a backlog that never shrinks, the bet is the same: own the hardware, trust the harness. The repo, including every one of those 41 PRs, is on &lt;a href="https://github.com/defilantech/llmkube" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;, and we are in &lt;a href="https://discord.gg/Ktz85RFHDv" rel="noopener noreferrer"&gt;Discord&lt;/a&gt;. If you are running local models on your own hardware, we would like to hear what you are building.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>kubernetes</category>
      <category>opensource</category>
      <category>llm</category>
    </item>
    <item>
      <title>A 27B model on an AMD mini-PC fixed a bug in our operator. Then it overreached.</title>
      <dc:creator>Christopher Maher</dc:creator>
      <pubDate>Wed, 24 Jun 2026 02:43:32 +0000</pubDate>
      <link>https://dev.to/defilan/a-27b-model-on-an-amd-mini-pc-fixed-a-bug-in-our-operator-then-it-overreached-39ob</link>
      <guid>https://dev.to/defilan/a-27b-model-on-an-amd-mini-pc-fixed-a-bug-in-our-operator-then-it-overreached-39ob</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://llmkube.com/blog/operator-fixed-its-own-bug-on-amd" rel="noopener noreferrer"&gt;llmkube.com/blog/operator-fixed-its-own-bug-on-amd&lt;/a&gt;. Cross-posted here for the dev.to audience.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;em&gt;(LLMKube is an open-source, Apache-2.0 Kubernetes operator for self-hosted LLM inference across NVIDIA, Apple Silicon, and AMD. Foreman is its agentic harness.)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://llmkube.com/blog/operator-built-its-own-feature" rel="noopener noreferrer"&gt;Last time&lt;/a&gt; Foreman built a feature for itself. This is the sequel, and it is a better story: this time it fixed a bug I had just shipped, the model doing the fixing ran on a consumer AMD box on my desk, and the most useful moment in the whole run is where the model got it &lt;em&gt;wrong&lt;/em&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Wiring up Claude Code against one of my own local models surfaced a real bug in the operator: a hardcoded 60-second timeout that silently capped every request regardless of what you configured.&lt;/li&gt;
&lt;li&gt;I handed it to Foreman with a &lt;strong&gt;27B dense coder (Qwopus, a Qwen3.6 distill) on a consumer AMD Strix Halo machine over Vulkan&lt;/strong&gt;. No datacenter, no NVIDIA, no cloud GPU.&lt;/li&gt;
&lt;li&gt;The model produced the correct fix and the gate confirmed it compiled, vetted, linted, and passed. &lt;strong&gt;It also quietly wrote two tests for unrelated code, and the gate passed those too.&lt;/strong&gt; I caught that on review and trimmed them. That gap is the whole point.&lt;/li&gt;
&lt;li&gt;Verified end-to-end: a request that used to die at exactly 60 seconds now completes in &lt;strong&gt;128 seconds&lt;/strong&gt;. Shipped in &lt;code&gt;0.8.16&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Every step of the loop ran on hardware I own.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  1. A coding agent on a model I run myself
&lt;/h2&gt;

&lt;p&gt;The fleet here is deliberately mixed: an NVIDIA box, a couple of Apple Silicon machines, an AMD Strix Halo machine with 128 GB of unified memory. The point of LLMKube is to serve models across all of it from one spec. So the obvious experiment was to stop renting a frontier model for agentic coding and point Claude Code at one of my own models, through the gateway, on my own hardware.&lt;/p&gt;

&lt;p&gt;Short tasks worked. Then I gave it a real one, and after about twelve minutes of churning it died with an opaque API error. Not the model's fault. Mine. The agent had just walked straight into a bug in the operator the hard way.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. The bug it surfaced
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;ModelRouter&lt;/code&gt; compiles onto an Envoy AI Gateway, and when it generated the retry policy it hardcoded the per-attempt timeout:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="s"&gt;"perRetry"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="k"&gt;interface&lt;/span&gt;&lt;span class="p"&gt;{}{&lt;/span&gt;
    &lt;span class="s"&gt;"timeout"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"60s"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c"&gt;// &amp;lt;- hardcoded&lt;/span&gt;
&lt;span class="p"&gt;},&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Envoy applies that per-attempt timeout &lt;em&gt;on top of&lt;/em&gt; the route-level request timeout. So even though the route was set to 30 minutes for long generations, every request was silently capped at 60 seconds. The resulting 504 was not in the retry list, so the request failed outright instead of retrying. Short turns finished under a minute and looked healthy. The first turn whose prefill grew past a minute fell off the cliff.&lt;/p&gt;

&lt;p&gt;Clean, embarrassing, entirely self-inflicted. Exactly what Foreman is for.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Handing the bug to the fleet
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://llmkube.com/docs" rel="noopener noreferrer"&gt;Foreman&lt;/a&gt; is the agentic harness built into LLMKube: point it at an issue, a coder model works the fix in a sandbox under a set of rails, and a verification gate checks the result before anything counts. The choice worth dwelling on here is the hardware.&lt;/p&gt;

&lt;p&gt;The coder was &lt;strong&gt;Qwopus3.6-27B-Coder&lt;/strong&gt;, a dense 27B Qwen3.6 distill trained on Claude-Opus traces, 4-bit quantized, served through llama.cpp's TurboQuant fork on a &lt;strong&gt;consumer AMD Strix Halo machine&lt;/strong&gt; (gfx1151, Vulkan). Not an H100. Not even an NVIDIA card. A mini-PC class box.&lt;/p&gt;

&lt;p&gt;I filed the bug as &lt;a href="https://github.com/defilantech/LLMKube/issues/817" rel="noopener noreferrer"&gt;#817&lt;/a&gt;, scoped the task to the one file and its test, and dispatched it. The coder found the function and made the right change: a small helper that derives the per-attempt timeout from the largest timeout actually configured across the router's rules and backends, with a named-constant fallback. It wrote tests and submitted &lt;code&gt;GO&lt;/code&gt;. The gate ran &lt;code&gt;gofmt&lt;/code&gt;, &lt;code&gt;go vet&lt;/code&gt;, &lt;code&gt;go build&lt;/code&gt;, a Linux-target &lt;code&gt;golangci-lint&lt;/code&gt;, and the full suite in a clean room. All green.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Where it overreached, and the gate didn't catch it
&lt;/h2&gt;

&lt;p&gt;Here is the part the "AI fixed my bug" posts skip.&lt;/p&gt;

&lt;p&gt;The fix was correct, and its own two tests were good: one asserts the policy uses the configured 30-minute timeout, one asserts the 60-second fallback. But the model had also written &lt;strong&gt;two more tests for a completely unrelated function&lt;/strong&gt;, default-route compilation, that the change never touched. They compiled. They passed. The gate had no opinion, because they were green, and green is all a gate measures.&lt;/p&gt;

&lt;p&gt;Scope is not something a compiler or a linter can judge. That is a human call, and on review I made it: trimmed the two unrelated tests so the change was exactly the fix plus its own tests, and pushed the clean version.&lt;/p&gt;

&lt;p&gt;This is why I build harnesses instead of waiting for a bigger model. A 27B model on a desktop is not Claude, and the trick is not pretending it is. What it does reliably is produce a correct, compiling, tested change inside a gate that guarantees a verified floor. What it cannot do is judge scope and intent. The harness gives you the floor for free; you bring the judgment. &lt;strong&gt;Trust the harness, not the model.&lt;/strong&gt; That division of labor is what makes a model this size genuinely useful instead of a parlor trick.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Proof, on the cluster
&lt;/h2&gt;

&lt;p&gt;A passing test is not a working system, so I verified before believing it.&lt;/p&gt;

&lt;p&gt;I built the patched operator the way everything gets built here: in-cluster, a kaniko job on my own node, from the fix branch. Deployed it, watched the operator regenerate the gateway policy, and confirmed the per-attempt timeout flipped from &lt;code&gt;60s&lt;/code&gt; to &lt;code&gt;30m0s&lt;/code&gt; live. Then I fired the request that used to die at the one-minute mark:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;HTTP 200 in 128.1s
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Before the fix, a 504 at exactly sixty seconds. After it, two minutes of clean generation. That is the difference between a green checkmark and a fixed bug, and it is worth the extra ten minutes every time. Shipped in &lt;code&gt;0.8.16&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  6. Where this ran
&lt;/h2&gt;

&lt;p&gt;Look at where every piece of that loop lived. The coder: a consumer AMD box on my desk. The build: a job on my own node. The deploy and verification: my own cluster. The model behind the original coding session: an Apple Silicon machine in the same house. Nothing touched a cloud GPU, a rented datacenter fleet, or a control plane sitting above someone else's clusters.&lt;/p&gt;

&lt;p&gt;That is the thesis, not a footnote. Most tooling around self-hosted inference still assumes "self-hosted" means a rack of datacenter accelerators operated at scale. LLMKube assumes the opposite: that the frontier worth building for is making inference, and now agentic work on top of it, run well across the heterogeneous hardware you already own. Apple Silicon, consumer AMD, a couple of NVIDIA cards, an edge box at a remote site. One operator, one model spec, hardware you control.&lt;/p&gt;

&lt;p&gt;A 27B model on a mini-PC fixing a real bug in the operator that serves it, verified on the cluster, is the smallest concrete proof of that. As of today it is also a true story.&lt;/p&gt;

&lt;h2&gt;
  
  
  Run it
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://llmkube.com/docs/getting-started" rel="noopener noreferrer"&gt;Quickstart&lt;/a&gt;: serve a model on your GPU, Apple Silicon, or AMD box in a few minutes.&lt;/li&gt;
&lt;li&gt;The fix: &lt;a href="https://github.com/defilantech/LLMKube/issues/817" rel="noopener noreferrer"&gt;#817&lt;/a&gt;, &lt;a href="https://github.com/defilantech/LLMKube/pull/818" rel="noopener noreferrer"&gt;PR #818&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;The cost math behind any of this: &lt;a href="https://infercost.ai" rel="noopener noreferrer"&gt;InferCost&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you run it on hardware I haven't tried, I want to hear about it. A &lt;a href="https://github.com/defilantech/LLMKube" rel="noopener noreferrer"&gt;star&lt;/a&gt; helps more people find it.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>ai</category>
      <category>llm</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Trust the harness, not the model: a weekend of local agents building their own guardrails</title>
      <dc:creator>Christopher Maher</dc:creator>
      <pubDate>Mon, 22 Jun 2026 15:27:10 +0000</pubDate>
      <link>https://dev.to/defilan/trust-the-harness-not-the-model-a-weekend-of-local-agents-building-their-own-guardrails-52nl</link>
      <guid>https://dev.to/defilan/trust-the-harness-not-the-model-a-weekend-of-local-agents-building-their-own-guardrails-52nl</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Cross-posted from the &lt;a href="https://llmkube.com/blog/trust-the-harness-not-the-model" rel="noopener noreferrer"&gt;LLMKube blog&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A local 27B coding model, running on hardware in my house, is a coin flip. Some runs it nails the fix in twenty minutes. Some runs it edits the wrong file, writes a test that passes no matter what the code does, and tells you it's done. The bet behind LLMKube's Foreman was never that I would find a local model good enough to trust. It was that I could build a &lt;em&gt;harness&lt;/em&gt; I trust more than any single model's output. This weekend tested that bet harder than any benchmark could, because the harness spent the weekend building its own guardrails.&lt;/p&gt;

&lt;p&gt;Here is the short version of what happened across 0.8.12 and 0.8.13. My local coder built three new gates for itself. One of them shipped with the exact flaw it was written to catch, and the review caught it. Three new contributors sent four clean pull requests while the machines worked. The same model ran on an AMD box and an Apple Silicon Mac, and the Mac quietly won a round nobody expected. And not one byte of any of it touched a cloud API.&lt;/p&gt;

&lt;h2&gt;
  
  
  The thesis, stated plainly
&lt;/h2&gt;

&lt;p&gt;Trust the harness, not the model. A coding agent on a local model produces output of wildly variable quality, and no amount of prompt tuning makes a 27B as reliable as a frontier model. So Foreman does not ask the model to be reliable. It wraps the model in a pipeline that &lt;em&gt;is&lt;/em&gt;: the coder works in a cloned workspace; a fast in-workspace gate runs gofmt, vet, build, lint, and the unit tests for the packages it touched; a reviewer reads the diff against the issue; and a clean-room Kubernetes Job re-runs the full suite before anything is allowed to call itself a GO. Around all of that sit deterministic rails: scope checks, edit-free-streak detection, repo-map context. The model is a stochastic component inside a system whose job is to make the system's verdict trustworthy even when the component is not.&lt;/p&gt;

&lt;p&gt;The interesting question is never "is the model good." It is "does the harness catch the model when it is bad." This weekend gave me an unusually honest answer.&lt;/p&gt;

&lt;h2&gt;
  
  
  The audit that started it
&lt;/h2&gt;

&lt;p&gt;It opened with a regression. I shipped 0.8.12 and rolled it across the fleet, and the metal agent on my Macs stopped serving. The cause was a change Foreman itself had authored a few days earlier: it made the agent honor a per-service &lt;code&gt;runtime&lt;/code&gt; field, but the agent registered its llama.cpp backend under the key &lt;code&gt;llama-server&lt;/code&gt; while every InferenceService in my fleet (and the in-cluster controller, and the CRD's own default) uses the canonical value &lt;code&gt;llamacpp&lt;/code&gt;. The two halves of the codebase disagreed on a name. Backward-incompatible, and it had passed the gate, passed review, and shipped.&lt;/p&gt;

&lt;p&gt;That stung enough that I audited every PR Foreman had landed that weekend, looking for the same class of miss. I found a second one. A metrics change registered a time-to-first-token histogram and a request-error counter, complete with recording rules and a Grafana panel, that &lt;em&gt;no production code ever emitted&lt;/em&gt;. The dashboard would have shown a confident, permanent zero.&lt;/p&gt;

&lt;p&gt;Both bugs had the same shape, and it is the shape that should keep anyone running an agentic harness up at night: &lt;strong&gt;the tests passed without testing anything&lt;/strong&gt;. The runtime change was tested with a made-up runtime value, never the real one the whole fleet uses. The metrics were "tested" by a unit test that incremented the counter itself and then asserted it went up. Self-confirming tests. The gate runs the tests and they are green, so the gate is happy. The gate never asked whether the tests would fail if the code were wrong. That is the harness's blind spot, and a stochastic model will find a blind spot every time you give it enough runs.&lt;/p&gt;
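&lt;p&gt;To make that shape concrete, here is a minimal Python sketch of a self-confirming test next to one that actually exercises the production path. It is an illustration only: &lt;code&gt;Counter&lt;/code&gt; and &lt;code&gt;handle_request&lt;/code&gt; are hypothetical stand-ins, not LLMKube code, and the "bug" is simply a handler that never touches the counter.&lt;/p&gt;

```python
class Counter:
    """Tiny stand-in for a metrics counter (hypothetical, not a real metrics client)."""

    def __init__(self):
        self.value = 0

    def inc(self):
        self.value += 1


def handle_request(fail, errors):
    """Production path with the bug class from the audit: the counter is never emitted."""
    if fail:
        # A wired-up implementation would call errors.inc() here.
        return "error"
    return "ok"


def self_confirming_test():
    """Increments the counter itself, then asserts it went up. Green no matter what."""
    c = Counter()
    c.inc()
    return c.value == 1


def biting_test():
    """Drives the production path, then checks the counter. Fails while the bug exists."""
    c = Counter()
    handle_request(fail=True, errors=c)
    return c.value == 1


print("self-confirming test passes:", self_confirming_test())
print("biting test passes:", biting_test())
```

&lt;p&gt;The first test stays green with or without the feature, which is exactly why a gate that only asks "are the tests green" cannot see the problem.&lt;/p&gt;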

&lt;h2&gt;
  
  
  So the harness built its own guardrails
&lt;/h2&gt;

&lt;p&gt;Every catch this weekend turned into a gate. I filed three issues for the exact failure classes the audit surfaced, and then I did the thing this whole project is about: I handed them back to Foreman and let the harness build the gates that make the harness better.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;A scope guard.&lt;/strong&gt; Score the issue's relevant files with the repo map, and reject a GO whose diff has zero overlap with them. This is the "you edited the wrong subsystem" catch, the one that used to need me watching.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A reviewer rubric.&lt;/strong&gt; Two new checks the reviewer must apply: do the tests use the real values the system uses in production, not placeholders; and is every new metric, flag, or field actually wired into a production path, or only touched by tests. These are the two failure classes from the audit, written down as rules.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A bite check.&lt;/strong&gt; The strongest one. A new or changed test must &lt;em&gt;fail&lt;/em&gt; against the pre-change code. If it passes against both the old and the new code, it is not testing the change, and the gate rejects it as a non-biting test. This is the deterministic catch for the entire self-confirming class.&lt;/li&gt;
&lt;/ul&gt;
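&lt;p&gt;The scope guard and the bite check both reduce to small deterministic decisions. The sketch below is an assumption about how such verdicts could be expressed, not Foreman's implementation: the function names, verdict strings, and file paths are hypothetical, and the real gate has to obtain the inputs by scoring files with the repo map and by running tests against both versions of the code.&lt;/p&gt;

```python
def scope_verdict(relevant_files, diff_files):
    """Reject a diff that shares no file with the issue's relevant set."""
    shared = set(relevant_files).intersection(diff_files)
    if not shared:
        return "reject: diff has zero overlap with the issue's relevant files"
    return "ok"


def bite_verdict(passes_on_old_code, passes_on_new_code):
    """A new or changed test must fail before the change and pass after it."""
    if not passes_on_new_code:
        return "reject: test fails on the new code"
    if passes_on_old_code:
        return "reject: non-biting test (green with and without the change)"
    return "ok"


# Hypothetical inputs, for illustration only.
relevant = ["pkg/agent/runtime.go", "pkg/agent/runtime_test.go"]
print(scope_verdict(relevant, ["docs/README.md"]))
print(scope_verdict(relevant, ["pkg/agent/runtime.go"]))
print(bite_verdict(passes_on_old_code=True, passes_on_new_code=True))
print(bite_verdict(passes_on_old_code=False, passes_on_new_code=True))
```

&lt;p&gt;Keeping the verdict logic this small is the point of a deterministic rail: the expensive part is gathering honest inputs, not deciding what to do with them.&lt;/p&gt;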

&lt;p&gt;The first two landed clean. The coder produced both on the first try, gate-verified, on a dense 27B model running over Vulkan on an AMD Strix Halo box on my desk. The reviewer rubric is even pleasingly self-referential: its own "is this wired up" change is, in fact, wired up. I checked.&lt;/p&gt;

&lt;h2&gt;
  
  
  The part where it ate its own tail
&lt;/h2&gt;

&lt;p&gt;The bite check is where it got honest. I ran a deep review on the three branches before signing off on any of them, the same kind of adversarial review the harness runs: an isolated worktree, revert the implementation, re-run the new tests, and confirm they fail without the feature. The scope guard passed. The rubric passed. The bite check did not.&lt;/p&gt;

&lt;p&gt;The gate built to reject non-biting tests &lt;strong&gt;arrived with four of its own six tests non-biting&lt;/strong&gt;. They asserted the baseline-equivalent happy path, so they stayed green with the feature removed. The exact anti-pattern the feature exists to catch, in the feature's own test file. It also had a real correctness bug (it could not revert a brand-new production file, so it would falsely reject a legitimate new-file PR), and it had been built into the fast gate when it belonged in the clean-room Job.&lt;/p&gt;

&lt;p&gt;I want to be clear that this is not a story about the harness failing. It is the opposite. The model produced a flawed gate, and the review (which is part of the harness) caught it, cold, with empirical evidence, before a line of it merged. That is the entire thesis demonstrated at its sharpest: &lt;strong&gt;even when the model writes the harness, you trust the harness over the model.&lt;/strong&gt; I sharpened the issue with the specific fixes and sent it back for another run.&lt;/p&gt;

&lt;p&gt;The rerun, for the record, died on turn 16 from an unexpected EOF on the model's streaming connection, a transient network blip. And the harness did the right thing again: it classified the run as an infrastructure error, marked it incomplete, and pushed nothing. No half-finished branch, no false GO. A blip is not a bug, and the system knew the difference. I confirmed the model server was healthy and re-dispatched it. That is the unglamorous reliability work that makes "leave it running overnight" an actual sentence I can say.&lt;/p&gt;

&lt;h2&gt;
  
  
  Two coders, one model, and a surprise
&lt;/h2&gt;

&lt;p&gt;The fleet running all this is heterogeneous on purpose. The coder model is a dense 27B, and this weekend I had it serving on two very different machines: an AMD Strix Halo box over Vulkan, and an Apple Silicon M5 Max over Metal. Same model, same quant, two accelerators that share almost nothing.&lt;/p&gt;

&lt;p&gt;I expected the dedicated AMD box to be the workhorse and the Mac to be the slower second lane. The early numbers say otherwise. Measured at a realistic context depth, the Mac's prompt-processing throughput came in well above what the Strix turns in on its stable configuration, and the Mac is stable where the Strix's fastest decode path falls over at long context. This is an early, deliberately un-matched read (different KV configs, not run side by side), and I will not put a clean number on it until I run them back to back. But the direction is the interesting part: on this workload the small Apple node is not the slow one.&lt;/p&gt;

&lt;h2&gt;
  
  
  The other half of trusting the harness
&lt;/h2&gt;

&lt;p&gt;Here is the part I did not expect to be writing about. While the machines worked through the weekend, the repository did something a repository with a pulse does: other people showed up. Two contributors I had not worked with before sent three pull requests against LLMKube's router and inference APIs: a default-route strategy that kills a class of boilerplate, topology-spread and affinity passthrough for the inference pods, and a revision-history-limit knob for the deployments. All three were clean. Complete tests, both CRD copies synced, docs updated, CI green. I reviewed each one closely and the only notes I had were minor. Then, while I was literally drafting this post, a third contributor opened a fourth: a tidy fix for a Helm chart bug where setting &lt;code&gt;modelCache.enabled: false&lt;/code&gt; did not actually disable the cache. Root-caused, tested, approved. Same story.&lt;/p&gt;

&lt;p&gt;And it clicked that this is the same thesis. The gates and the review that make a coin-flip 27B trustworthy are the same gates and review that let a newcomer's pull request land clean. The harness is not an AI feature. It is the project's quality floor, and it does not care whether the diff came from a local model on my desk or a human across the internet. Sometime in the middle of all this, someone dropped a note in our Discord: "Great project folks. You just saved me two hours of debugging vllm." That is the whole point, on both ends.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I actually believe now
&lt;/h2&gt;

&lt;p&gt;You do not need a frontier model on your own hardware to do real engineering work locally. You need a harness you trust more than any single model's output. Build that, and a 27B on a desktop becomes a useful, supervised coworker, one whose mistakes are caught by a system instead of by you reading every diff at midnight. Build that, and the same system becomes the thing that lets a community build on top of you.&lt;/p&gt;

&lt;p&gt;The model produced a broken gate this weekend. The harness caught it. Three new contributors improved the project, and the harness vouched for their work the same way it vouches for the model's. That is not a contradiction to manage. That is the design working. Trust the harness, not the model.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;LLMKube is Apache 2.0 and runs on Kubernetes with NVIDIA, Apple Silicon, and AMD. The project is on &lt;a href="https://github.com/defilantech/llmkube" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; and we are in &lt;a href="https://discord.gg/Ktz85RFHDv" rel="noopener noreferrer"&gt;Discord&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>kubernetes</category>
      <category>llm</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Making a fleet of self-hosted LLM agents trustworthy</title>
      <dc:creator>Christopher Maher</dc:creator>
      <pubDate>Sun, 14 Jun 2026 18:26:35 +0000</pubDate>
      <link>https://dev.to/defilan/making-a-fleet-of-self-hosted-llm-agents-trustworthy-49e4</link>
      <guid>https://dev.to/defilan/making-a-fleet-of-self-hosted-llm-agents-trustworthy-49e4</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://llmkube.com/blog/making-self-hosted-llm-agents-trustworthy" rel="noopener noreferrer"&gt;llmkube.com/blog/making-self-hosted-llm-agents-trustworthy&lt;/a&gt;. Cross-posted here for the dev.to audience.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Running a single local LLM node is a solved problem. You write an InferenceService, the operator schedules it, llama.cpp or MLX serves it, and you get an OpenAI-compatible endpoint. We have been doing that for months.&lt;/p&gt;

&lt;p&gt;Running a &lt;em&gt;fleet&lt;/em&gt; of them is where it stops being easy. My fleet is heterogeneous on purpose: CUDA pods in the cluster, and Apple Silicon Macs sitting off-cluster on the homelab network, each one running two separate agents (one for inference, one for the agentic coding harness). The day I shipped 0.8.4 to that fleet, I learned exactly how it does not scale.&lt;/p&gt;

&lt;p&gt;I updated each Mac by hand. The control plane had no idea what version any agent was running. And the launchd reload I used to restart an agent was a silent no-op on an already-loaded service, so the old binary kept running while I believed I had updated it. I found that out by hand-inspecting a process tree. Three machines made it annoying. Thirty would make it impossible, and the whole pitch for sovereign, on-prem AI is that you run a lot more than three.&lt;/p&gt;

&lt;p&gt;So the last stretch of work on LLMKube was not about a faster runtime or a bigger model. It was about making the fleet &lt;em&gt;trustworthy&lt;/em&gt;: able to update itself safely, and unable to lie to the control plane about its own state. Here is what that took.&lt;/p&gt;

&lt;h2&gt;
  
  
  Helm and brew for the edge
&lt;/h2&gt;

&lt;p&gt;The fix is a new cluster-scoped CRD, &lt;code&gt;AgentRelease&lt;/code&gt;, and a self-update path in the agents themselves. You describe the release you want once, the operator rolls it out, and the agents pull and apply it. The design borrows directly from prior art that already solved this for Kubernetes nodes: Rancher's system-upgrade-controller, k0s autopilot's per-platform SHA-256 staging, and Teleport's outbound-only poll model.&lt;/p&gt;

&lt;p&gt;The properties that make it safe to leave running:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Declarative and approved.&lt;/strong&gt; An &lt;code&gt;AgentRelease&lt;/code&gt; names the agent, the version, and the per-platform artifacts (URL plus SHA-256). Nothing moves until a human flips &lt;code&gt;approved: true&lt;/code&gt;. The approved CR is the trust anchor.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Staged and health-gated.&lt;/strong&gt; The operator updates one node at a time. A freshly updated node has to come back, register, and stay healthy past a soak window before the next node is touched.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Halt-on-failure.&lt;/strong&gt; If a node does not reach the target version inside the timeout, the rollout stops cold. Blast radius is exactly one node, and you go look at it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Verified and reversible.&lt;/strong&gt; The agent downloads the artifact, checks the SHA-256 before it touches anything, stages the new binary beside the old one, flips a single &lt;code&gt;current&lt;/code&gt; symlink atomically, and keeps a &lt;code&gt;previous&lt;/code&gt; symlink for a one-command rollback. A bad checksum leaves the running version untouched.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Outbound-only.&lt;/strong&gt; Edge agents are behind NAT and Tailscale. They poll out; nothing reaches in. The same shape that lets a laptop update itself lets a Mac in a closet three sites away update itself.&lt;/li&gt;
&lt;/ul&gt;
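&lt;p&gt;The verify, stage, and flip sequence can be sketched in a few lines of Python. This is a sketch under stated assumptions rather than the agent's actual code: the &lt;code&gt;versions/&lt;/code&gt; directory, the &lt;code&gt;agent&lt;/code&gt; file name, and the helper names are hypothetical. Only the &lt;code&gt;current&lt;/code&gt; and &lt;code&gt;previous&lt;/code&gt; symlinks and the SHA-256 check come from the description above. The atomic flip relies on renaming a temporary symlink over the live one, which is atomic on POSIX filesystems.&lt;/p&gt;

```python
import hashlib
import os
import tempfile


def sha256_of(data):
    return hashlib.sha256(data).hexdigest()


def swap_link(target, link_path):
    """Point link_path at target by renaming a temp symlink over it (atomic on POSIX)."""
    tmp = link_path + ".tmp"
    if os.path.lexists(tmp):
        os.remove(tmp)
    os.symlink(target, tmp)
    os.replace(tmp, link_path)


def install(root, version, artifact, expected_sha256):
    """Verify first, stage beside the old version, then flip `current`."""
    if sha256_of(artifact) != expected_sha256:
        # Bad checksum: nothing on disk has been touched yet.
        raise ValueError("checksum mismatch, keeping the running version")

    version_dir = os.path.join(root, "versions", version)  # hypothetical layout
    os.makedirs(version_dir, exist_ok=True)
    with open(os.path.join(version_dir, "agent"), "wb") as f:
        f.write(artifact)

    current = os.path.join(root, "current")
    if os.path.islink(current):
        # Remember the old target so a rollback is a single link swap.
        swap_link(os.readlink(current), os.path.join(root, "previous"))
    swap_link(version_dir, current)


root = tempfile.mkdtemp()
old, new = b"agent build one", b"agent build two"
install(root, "v1", old, sha256_of(old))
install(root, "v2", new, sha256_of(new))
print("current:", os.path.basename(os.readlink(os.path.join(root, "current"))))
print("previous:", os.path.basename(os.readlink(os.path.join(root, "previous"))))
```

&lt;p&gt;Because the checksum is verified before any write, a corrupted download raises and leaves both symlinks exactly where they were.&lt;/p&gt;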

&lt;p&gt;The end state is that a release I cut becomes a one-line &lt;code&gt;kubectl apply&lt;/code&gt; and an approval, instead of an afternoon of SSH. I proved the whole loop on a live node: publish a version, apply the &lt;code&gt;AgentRelease&lt;/code&gt;, watch it sit at &lt;code&gt;AwaitingApproval&lt;/code&gt;, approve it, and watch the node drain, download, verify, flip, restart onto the new binary, and report back, the rollout closing out at &lt;code&gt;Succeeded&lt;/code&gt;. The first one is still a manual hop (an agent on the old, unaware binary cannot update itself to the version that teaches it how), but every release after that is hands-off.&lt;/p&gt;

&lt;h2&gt;
  
  
  Trustworthy is more than updatable
&lt;/h2&gt;

&lt;p&gt;An auto-updating fleet that lies about its health is worse than a manual one. So alongside the update path, a batch of less glamorous reliability work, the "trustworthy fleet" milestone, had to land.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Liveness, not optimism.&lt;/strong&gt; A metal node used to register an endpoint and then keep reporting one ready replica forever, even after the host went offline for weeks. Now agents heartbeat, and the controller expires a registration that goes stale. A dead backend stops counting as a live one.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Admission validation.&lt;/strong&gt; A new validating webhook checks agent and task definitions at &lt;code&gt;kubectl apply&lt;/code&gt; time, so an invalid spec is rejected at the door instead of failing confusingly three steps later when a task gets dispatched to it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A real end-to-end test.&lt;/strong&gt; Unit tests and envtest cover a lot, but nothing was exercising the full install path: helm install the chart, the operator comes up, an agent registers a node, the scheduler routes a task to it, and the task actually runs to completion. Now a kind-based CI job does exactly that.&lt;/li&gt;
&lt;/ul&gt;
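&lt;p&gt;The liveness rule is the easiest of these to picture in code. The following Python sketch assumes a simple time-to-live model; the &lt;code&gt;Registry&lt;/code&gt; class, the node names, and the 30-second TTL are all hypothetical and chosen only for illustration. Timestamps are passed in explicitly so the behavior is deterministic.&lt;/p&gt;

```python
class Registry:
    """Tracks the last heartbeat per node and expires stale registrations."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.last_seen = {}

    def heartbeat(self, node, now):
        self.last_seen[node] = now

    def ready_nodes(self, now):
        """A node counts as ready only if its last heartbeat is within the TTL."""
        return sorted(
            node
            for node, seen in self.last_seen.items()
            if self.ttl >= now - seen
        )


registry = Registry(ttl_seconds=30)  # TTL value is an assumption
registry.heartbeat("mac-closet", now=0)
registry.heartbeat("mac-desk", now=100)

# At t=110 the closet Mac has been silent for 110 seconds and no longer counts.
print(registry.ready_nodes(now=110))
```

&lt;p&gt;The old behavior corresponds to ignoring &lt;code&gt;now&lt;/code&gt; entirely: once registered, always ready.&lt;/p&gt;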

&lt;p&gt;None of these are features you would put on a billboard. They are the difference between a demo and something you would leave pointed at production hardware.&lt;/p&gt;

&lt;h2&gt;
  
  
  The part where dogfooding earns its keep
&lt;/h2&gt;

&lt;p&gt;Here is the honest build-in-public bit, and the reason I trust this work more than I would trust a green test suite alone.&lt;/p&gt;

&lt;p&gt;When I ran the very first live self-update against a real node, it did not engage. The agent logged that self-update was disabled because it was "not running from a managed install root", which was wrong: it was running from exactly that root. The detection compared the running binary's resolved path against the literal &lt;code&gt;current/&lt;/code&gt; symlink path, but resolving the binary's path followed the symlink to the real versioned directory, so the two could never match. The unit test had passed for two reasons: it fed the check an unresolved path that never happens in production, and it cached its answer once, forever, so it could not have noticed anyway. The feature had quietly disabled itself on every real install, and only dogfooding the actual rollout surfaced it. The fix was small. Finding it required running the thing for real.&lt;/p&gt;
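&lt;p&gt;The mismatch is easy to reproduce outside the agent. The Python sketch below is a reconstruction of the bug class, not the agent's code: the directory layout and function names are hypothetical, and resolving both sides before comparing is only one plausible fix, since the actual fix is not described here.&lt;/p&gt;

```python
import os
import tempfile


def is_managed_buggy(binary_path, root):
    """Resolved binary path vs. the literal `current` path: never equal under a symlink."""
    resolved_dir = os.path.dirname(os.path.realpath(binary_path))
    return resolved_dir == os.path.join(root, "current")


def is_managed_fixed(binary_path, root):
    """Resolve both sides before comparing (one possible fix, assumed)."""
    resolved_dir = os.path.dirname(os.path.realpath(binary_path))
    return resolved_dir == os.path.realpath(os.path.join(root, "current"))


# Build a throwaway install root: versions/v1/agent, with current pointing at v1.
root = os.path.realpath(tempfile.mkdtemp())
version_dir = os.path.join(root, "versions", "v1")
os.makedirs(version_dir)
open(os.path.join(version_dir, "agent"), "w").close()
os.symlink(version_dir, os.path.join(root, "current"))

binary = os.path.join(root, "current", "agent")
print("buggy check says managed:", is_managed_buggy(binary, root))
print("fixed check says managed:", is_managed_fixed(binary, root))
```

&lt;p&gt;A unit test that hands the buggy check an unresolved path would pass, which is the same trap: the input never occurs on a real install.&lt;/p&gt;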

&lt;p&gt;Then there was the end-to-end test. I wrote it specifically to catch install-path bugs that unit tests cannot see, and it caught one on its first CI run: a task reached &lt;code&gt;Scheduled&lt;/code&gt; and then stalled, because the agent was watching one namespace while the task lived in another. The scheduler assigned the work; the agent never saw it. That is exactly the class of bug a real apiserver surfaces and a mock does not. The test earned its place before it had even merged.&lt;/p&gt;

&lt;p&gt;I am not going to pretend the rest of the cycle was clean either. Pinning a webhook's TLS certs the simple way tripped a CI script that had been quietly passing a giant blob through an environment variable, which works on macOS and dies on Linux. A glob model pattern that routed correctly one way compiled to a literal that matched nothing the other way, while reporting itself healthy. Every one of these passed review or local checks and got caught by the next layer: a full lint, a real cluster, an adversarial second look. That layering is the point. The goal was never zero bugs. It was no bug that survives to a node you cannot reach.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this is the hard part of sovereign AI
&lt;/h2&gt;

&lt;p&gt;It is tempting to think the hard problem in self-hosted AI is the inference: the quantization, the GPU memory, the tokens per second. Those are hard, and we spend plenty of time there. But the thing that actually keeps people on a managed cloud is not raw capability. It is that someone else runs the fleet. Updates land, dead nodes get pulled, bad config gets rejected, and you do not think about any of it.&lt;/p&gt;

&lt;p&gt;If sovereign AI is going to be a real alternative and not a hobby, it has to offer that same "do not think about it" property while keeping the data and the models on hardware you own. A fleet you have to babysit by hand is not sovereign in any way that matters; it is just someone else's operational burden moved onto you. The work in this post is the unglamorous half of closing that gap: a fleet that updates itself safely, tells the truth about its own health, and refuses to accept a configuration that would break it.&lt;/p&gt;

&lt;p&gt;That is the control plane I want for local AI at scale. It is in LLMKube now, it is open source, and it caught its own bugs on the way in.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;LLMKube is a Kubernetes operator for self-hosted LLM inference: CUDA, Apple Silicon Metal, multi-GPU, and a heterogeneous fleet under one control plane. Apache 2.0, github.com/defilantech/LLMKube.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>kubernetes</category>
      <category>opensource</category>
    </item>
    <item>
      <title>TurboQuant on a MacBook Pro, part 2: perplexity, KL divergence, and asymmetric K/V on M5 Max</title>
      <dc:creator>Christopher Maher</dc:creator>
      <pubDate>Wed, 29 Apr 2026 19:52:16 +0000</pubDate>
      <link>https://dev.to/defilan/turboquant-on-a-macbook-pro-part-2-perplexity-kl-divergence-and-asymmetric-kv-on-m5-max-1gb1</link>
      <guid>https://dev.to/defilan/turboquant-on-a-macbook-pro-part-2-perplexity-kl-divergence-and-asymmetric-kv-on-m5-max-1gb1</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://llmkube.com/blog/turboquant-m5-max-quality-and-asymmetric" rel="noopener noreferrer"&gt;llmkube.com/blog/turboquant-m5-max-quality-and-asymmetric&lt;/a&gt;. Cross-posted here for the dev.to audience.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Yesterday's &lt;a href="https://llmkube.com/blog/turboquant-m5-max-long-context" rel="noopener noreferrer"&gt;M5 Max KV cache post&lt;/a&gt; drew a clean set of asks in the comments: where are the perplexity numbers, what about KL divergence, did you try asymmetric K/V combos, can you fill the 32K to 128K gap with a 64K row. I ran them overnight on the same hardware. Numbers below.&lt;/p&gt;




&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;q8_0&lt;/code&gt; KV cache is essentially free at 4k context.&lt;/strong&gt; PPL delta vs &lt;code&gt;f16&lt;/code&gt; is −0.0005 (well inside the ±0.036 stderr). KL is 0.0016. Top-1 token agreement is 98.64%.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;turbo3&lt;/code&gt; and &lt;code&gt;turbo4&lt;/code&gt; cost real but small quality.&lt;/strong&gt; turbo3: ~1% PPL increase, 5pp top-token disagreement, KL roughly 12× q8_0. turbo4 sits between, in line with its lower compression ratio.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;-ctk q8_0 -ctv turbo4&lt;/code&gt; is the new winner for long-context.&lt;/strong&gt; Matches symmetric q8_0 throughput at every depth tested and fits 512K, where symmetric q8_0 OOM'd. q8_0-grade prefill, turbo4-grade memory ceiling.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;-ctk f16 -ctv turbo4&lt;/code&gt; is broken on this fork on Metal.&lt;/strong&gt; The Metal FlashAttention kernel doesn't fast-path that K/V combination, so it falls back to a generic dequant-then-attention path. 34× slower at 8K, 78× slower at 128K. Don't use it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The 64K row shows the prefill curves nearly converged.&lt;/strong&gt; turbo3 at 470 tok/s sits within 2% of q8_0 at 479 tok/s. The bandwidth-bound regime kicks in somewhere between 64K and 128K on this hardware, earlier than the 128K crossover from yesterday's post had me estimating.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Quality eval: perplexity and KL divergence
&lt;/h2&gt;

&lt;p&gt;The original post had no quality numbers. The first comment under it flagged that gap and asked for perplexity, and a follow-on comment added KL divergence to the list. Both are now in the bag.&lt;/p&gt;

&lt;p&gt;Setup: &lt;code&gt;llama-perplexity&lt;/code&gt; from TheTom's TurboQuant fork build, wikitext-2-raw test set, context size 4096. The canonical 512 doesn't fill enough KV cache to surface cache-quantization effects, so I bumped it to 4096 to let the cache actually fill. The &lt;code&gt;f16&lt;/code&gt; run saves a baseline logits file via &lt;code&gt;--kl-divergence-base&lt;/code&gt;. Each subsequent run computes KL against that baseline, which means the comparisons are pinned to the exact same model weights and tokenization.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Cache type&lt;/th&gt;
&lt;th&gt;PPL&lt;/th&gt;
&lt;th&gt;KL vs f16&lt;/th&gt;
&lt;th&gt;Top-1 token agreement&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;f16&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;5.7438 ± 0.0355&lt;/td&gt;
&lt;td&gt;baseline&lt;/td&gt;
&lt;td&gt;n/a&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;q8_0&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;5.7433 ± 0.0355&lt;/td&gt;
&lt;td&gt;0.0016 ± 0.0001&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;98.64% ± 0.03&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;turbo3&lt;/code&gt; (~4.9×)&lt;/td&gt;
&lt;td&gt;5.8092 ± 0.0360&lt;/td&gt;
&lt;td&gt;0.0199 ± 0.0002&lt;/td&gt;
&lt;td&gt;93.93% ± 0.06&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;turbo4&lt;/code&gt; (~3.8×)&lt;/td&gt;
&lt;td&gt;5.7810 ± 0.0359&lt;/td&gt;
&lt;td&gt;0.0131 ± 0.0003&lt;/td&gt;
&lt;td&gt;95.28% ± 0.06&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;q8_0 KV is essentially free at this depth.&lt;/strong&gt; PPL delta is −0.0005, which is noise inside the ±0.0355 stderr. KL is 0.0016, roughly an order of magnitude smaller than the turbo3 number (0.0199, about 12×). The quantized cache picks the same top-1 token as f16 98.64% of the time. The community worry about q8_0 corroding output quality doesn't bear out here.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;turbo3 costs measurable but small quality.&lt;/strong&gt; ~1% perplexity increase, about 5 percentage points lower top-1 agreement than q8_0 (93.93% vs 98.64%), KL roughly 12× q8_0's. turbo4 sits between turbo3 and q8_0 on every metric, matching its lower compression ratio. Quality cost scales monotonically with compression, no surprises in the ranking.&lt;/p&gt;
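&lt;p&gt;For anyone who wants to check the arithmetic behind those summaries, this short Python snippet recomputes the deltas directly from the table above. The numbers are copied from the table; the helper names are just for this illustration.&lt;/p&gt;

```python
# Values copied from the quality table (context size 4096).
ppl = {"f16": 5.7438, "q8_0": 5.7433, "turbo3": 5.8092, "turbo4": 5.7810}
kl = {"q8_0": 0.0016, "turbo3": 0.0199, "turbo4": 0.0131}
top1 = {"q8_0": 98.64, "turbo3": 93.93, "turbo4": 95.28}


def ppl_delta(name):
    """Absolute perplexity change relative to the f16 baseline."""
    return ppl[name] - ppl["f16"]


def kl_ratio(name):
    """KL divergence expressed as a multiple of q8_0's KL."""
    return kl[name] / kl["q8_0"]


def top1_gap_vs_q8(name):
    """Percentage points of top-1 agreement lost relative to q8_0."""
    return top1["q8_0"] - top1[name]


for name in ("q8_0", "turbo3", "turbo4"):
    pct = 100 * ppl_delta(name) / ppl["f16"]
    print(
        f"{name}: PPL delta {ppl_delta(name):+.4f} ({pct:+.2f}%), "
        f"KL {kl_ratio(name):.1f}x q8_0, "
        f"top-1 gap vs q8_0 {top1_gap_vs_q8(name):.2f}pp"
    )
```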

&lt;p&gt;&lt;em&gt;One caveat I'd want to underline: PPL was at 4096 context. Quality at deeper contexts, where the cache is more saturated and dequant errors compound across more attention steps, might tell a different story. That's a bench for a future weekend.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Asymmetric K/V: which combos work, which don't
&lt;/h2&gt;

&lt;p&gt;One commenter on the original post pointed out that the big issue asymmetric KV tackles is exactly the K-precision problem: compressing the keys hurts quality a great deal more than compressing the values. The original post called this out in its caveats too but didn't bench it. Now we have data.&lt;/p&gt;

&lt;p&gt;Three combinations, same &lt;code&gt;llama-bench&lt;/code&gt; flags as yesterday's symmetric sweep. Decode tok/s (token generation):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Depth&lt;/th&gt;
&lt;th&gt;q8_0 K / turbo4 V&lt;/th&gt;
&lt;th&gt;q8_0 K / turbo3 V&lt;/th&gt;
&lt;th&gt;f16 K / turbo4 V&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;82.9&lt;/td&gt;
&lt;td&gt;81.8&lt;/td&gt;
&lt;td&gt;72.8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8K&lt;/td&gt;
&lt;td&gt;75.4&lt;/td&gt;
&lt;td&gt;75.6&lt;/td&gt;
&lt;td&gt;16.9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;32K&lt;/td&gt;
&lt;td&gt;66.0&lt;/td&gt;
&lt;td&gt;63.2&lt;/td&gt;
&lt;td&gt;8.6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;128K&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;41.0&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;38.2&lt;/td&gt;
&lt;td&gt;2.8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;256K&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;27.1&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;25.0&lt;/td&gt;
&lt;td&gt;skipped&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;512K&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;16.5&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;14.8&lt;/td&gt;
&lt;td&gt;skipped&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Prompt processing tells a similar story (skipping the full table for length, the relative ordering matches): q8_0/turbo4 lands within roughly 1-3% of symmetric q8_0 prefill at every shared depth, and q8_0/turbo3 is similarly close.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;-ctk q8_0 -ctv turbo4&lt;/code&gt;: the new long-context winner
&lt;/h3&gt;

&lt;p&gt;This is the standout combination. At 256K context it puts up &lt;strong&gt;27.1 tok/s decode&lt;/strong&gt; against yesterday's symmetric q8_0 baseline of 26.6 tok/s. Prefill at 256K hits 128 tok/s versus symmetric q8_0's 124. The throughput is effectively indistinguishable from symmetric q8_0 at every depth they share.&lt;/p&gt;

&lt;p&gt;And it fits 512K, where symmetric q8_0 OOM'd in yesterday's post. Decode at 512K is &lt;strong&gt;16.5 tok/s&lt;/strong&gt;, almost identical to symmetric turbo4 at 16.0. So the asymmetric configuration gets you q8_0-level prefill behavior with a turbo4-level context ceiling, on a single MacBook Pro.&lt;/p&gt;

&lt;p&gt;The hypothesis that V compresses cheap and K compresses expensive looks right on the throughput side. On the quality side, I'd want a PPL run on the asymmetric combos to fully close the loop, since I haven't measured KL or PPL with mixed K/V types yet.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;-ctk q8_0 -ctv turbo3&lt;/code&gt;: similar trick, worse decode
&lt;/h3&gt;

&lt;p&gt;Same prefill behavior as the q8_0/turbo4 combo (within 1-2% at every depth) but decode is consistently lower. Tighter V quantization taxes the per-token attention pass more, since decode is bottlenecked by dequantization work rather than total bytes read. If you have memory headroom, q8_0/turbo4 dominates.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;-ctk f16 -ctv turbo4&lt;/code&gt;: kernel fallback, do not use
&lt;/h3&gt;

&lt;p&gt;Putting &lt;code&gt;f16&lt;/code&gt; on K and &lt;code&gt;turbo4&lt;/code&gt; on V breaks the Metal FlashAttention kernel's fast path. The fork falls back to a generic dequant-then-attention implementation that's catastrophically slow:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Depth&lt;/th&gt;
&lt;th&gt;Symmetric f16 pp512&lt;/th&gt;
&lt;th&gt;f16 K / turbo4 V pp512&lt;/th&gt;
&lt;th&gt;Slowdown&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;8K&lt;/td&gt;
&lt;td&gt;2098&lt;/td&gt;
&lt;td&gt;61.0&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;34×&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;32K&lt;/td&gt;
&lt;td&gt;1063&lt;/td&gt;
&lt;td&gt;16.4&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;65×&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;128K&lt;/td&gt;
&lt;td&gt;321&lt;/td&gt;
&lt;td&gt;4.1&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;78×&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;I cut the run before 256K once the trajectory was clear. The slowdown widens with depth, which is consistent with the fallback attention path doing dequantization work that grows with the number of cached tokens. Don't use this combination on this fork on Metal until kernel coverage lands. If you're on a different backend (CUDA), verify the same combo before assuming it works.&lt;/p&gt;
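&lt;p&gt;The slowdown column is just the ratio of the two pp512 figures. A few lines of Python reproduce it from the table above; the values are copied from the table and the function name is only for this illustration.&lt;/p&gt;

```python
# (depth, symmetric f16 pp512, f16 K / turbo4 V pp512), copied from the table.
rows = [("8K", 2098, 61.0), ("32K", 1063, 16.4), ("128K", 321, 4.1)]


def slowdown(symmetric_f16, mixed):
    """How many times slower the mixed K/V prefill is than symmetric f16."""
    return symmetric_f16 / mixed


for depth, f16_pp, mixed_pp in rows:
    print(f"{depth}: {slowdown(f16_pp, mixed_pp):.0f}x slower")
```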




&lt;h2&gt;
  
  
  The 64K row: filling the gap
&lt;/h2&gt;

&lt;p&gt;One commenter asked for a 64K data point sitting between 32K and 128K, particularly on the prefill side. Reasonable ask: yesterday's prefill curves dropped 3–4× between those two depths, so 64K is exactly the depth where the bandwidth-bound regime is supposed to kick in.&lt;/p&gt;

&lt;p&gt;All seven configurations at depth 65536:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Cache&lt;/th&gt;
&lt;th&gt;pp512 (tok/s)&lt;/th&gt;
&lt;th&gt;tg128 (tok/s)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;f16&lt;/code&gt; (symmetric)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;602.0&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;59.8&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;q8_0&lt;/code&gt; (symmetric)&lt;/td&gt;
&lt;td&gt;479.2&lt;/td&gt;
&lt;td&gt;57.9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;turbo3&lt;/code&gt; (symmetric)&lt;/td&gt;
&lt;td&gt;469.8&lt;/td&gt;
&lt;td&gt;49.9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;turbo4&lt;/code&gt; (symmetric)&lt;/td&gt;
&lt;td&gt;418.0&lt;/td&gt;
&lt;td&gt;55.2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;q8_0 K / turbo4 V&lt;/td&gt;
&lt;td&gt;468.2&lt;/td&gt;
&lt;td&gt;55.9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;q8_0 K / turbo3 V&lt;/td&gt;
&lt;td&gt;465.6&lt;/td&gt;
&lt;td&gt;52.6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;f16 K / turbo4 V&lt;/td&gt;
&lt;td&gt;8.3&lt;/td&gt;
&lt;td&gt;4.9&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Two things stood out. First, the prefill curves are nearly converged at 64K already. &lt;code&gt;turbo3&lt;/code&gt; at 470 tok/s is within 2% of &lt;code&gt;q8_0&lt;/code&gt; at 479 tok/s. Yesterday's data showed turbo3 actually pulling ahead of q8_0 by 128K (253 vs 245), so the bandwidth-bound regime kicks in somewhere in the 64K to 128K range on this hardware. Earlier than I'd estimated when I wrote the original post.&lt;/p&gt;

&lt;p&gt;Second, the asymmetric q8_0/turbo* rows track symmetric q8_0 prefill closely at this depth, same as they do at the deeper depths. Same story all the way down the curve: as long as K stays at q8_0, V-side compression is essentially free on prefill.&lt;/p&gt;
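&lt;p&gt;To make "closely" concrete, here is a small Python sketch that computes each configuration's prefill gap relative to symmetric &lt;code&gt;q8_0&lt;/code&gt; at 64K. The numbers are copied from the table above; the variable names are illustrative, not part of any tool.&lt;/p&gt;

```python
# pp512 throughput (tok/s) at depth 65536, copied from the table above.
Q8_0_PP = 479.2
pp512_at_64k = {
    "turbo3": 469.8,
    "q8_0 K / turbo4 V": 468.2,
    "q8_0 K / turbo3 V": 465.6,
    "f16 K / turbo4 V": 8.3,
}

# Percentage by which each configuration trails symmetric q8_0 on prefill.
gap_pct = {
    name: round((Q8_0_PP - pp) / Q8_0_PP * 100, 1)
    for name, pp in pp512_at_64k.items()
}

for name, gap in gap_pct.items():
    print(f"{name}: {gap}% behind q8_0")
```

&lt;p&gt;The q8_0-K rows land within about 3% of symmetric &lt;code&gt;q8_0&lt;/code&gt;, while the f16-K fallback row trails by roughly 98%.&lt;/p&gt;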




&lt;h2&gt;
  
  
  Updated cache-type recommendations
&lt;/h2&gt;

&lt;p&gt;Same shape as yesterday's recommendations, with the asymmetric data folded in:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Workload&lt;/th&gt;
&lt;th&gt;Cache type&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Coding agents (deep context, lots of generated tokens)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;&lt;code&gt;-ctk q8_0 -ctv turbo4&lt;/code&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;q8_0-grade quality on K, turbo4 memory savings on V, fits 512K, decode 27 tok/s at 256K&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RAG or batch QA (heavy prefill, short answers)&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;-ctk q8_0 -ctv turbo4&lt;/code&gt; or symmetric &lt;code&gt;turbo3&lt;/code&gt; at the deepest depths&lt;/td&gt;
&lt;td&gt;Prefill is bandwidth-bound past ~64K, both options work&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pure 1M context maxing&lt;/td&gt;
&lt;td&gt;Symmetric &lt;code&gt;turbo3&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Only thing that fits 1M on a 128 GB Mac&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Short interactive (under 32K)&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;f16&lt;/code&gt; if memory allows, else &lt;code&gt;q8_0&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Quality cost is genuinely zero, throughput is best&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The asymmetric combos are expressible directly in LLMKube's InferenceService spec via the &lt;code&gt;cacheTypeCustomK&lt;/code&gt; and &lt;code&gt;cacheTypeCustomV&lt;/code&gt; fields that landed in 0.7.3. So if you're running this through the operator, the spec for the new long-context winner is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;runtime&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;llamacpp&lt;/span&gt;
  &lt;span class="na"&gt;cacheTypeCustomK&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;q8_0&lt;/span&gt;
  &lt;span class="na"&gt;cacheTypeCustomV&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;turbo4&lt;/span&gt;
  &lt;span class="na"&gt;contextSize&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;524288&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Caveats
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Perplexity was measured at 4096 context. Quality at deeper contexts might tell a different story, since the cache fills more and dequant errors have more attention steps to compound through.&lt;/li&gt;
&lt;li&gt;Asymmetric quality numbers (PPL or KL on the q8_0/turbo* combos) are not yet measured. The throughput data argues V-side compression is cheap, but I haven't verified the quality side end-to-end.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;-ctk f16 -ctv turbo*&lt;/code&gt; is a kernel fallback on this fork on Metal. Verify before assuming the same combination works on other backends. CUDA may have different kernel coverage.&lt;/li&gt;
&lt;li&gt;Single hardware data point (M5 Max, 128 GB). The crossover depths and the prefill/decode split likely shift with memory bandwidth and GPU core count.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  What's still in flight
&lt;/h2&gt;

&lt;p&gt;Three asks from the original thread that this followup didn't fully address. Running them next:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Aider Polyglot pass for f16, turbo3, turbo4.&lt;/strong&gt; A commenter asked whether the fast cache types still produce useful code, not just fast tokens. q8_0 scored 62.2% on Polyglot earlier this week (n=225). Each Polyglot run takes roughly 6 to 12 hours of wall-clock time on the M5 Max, so this means a few overnight runs back to back. Running later this week.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Wider quant types: q4_0, q4_1, iq4_nl, q5_0, q5_1.&lt;/strong&gt; Another commenter asked for these to extend the depth sweep with more cache options. After the Aider runs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Same sweep on a non-MoE non-DeltaNet model.&lt;/strong&gt; A third commenter asked whether these results transfer to other architectures. Qwen 3.6 uses DeltaNet hybrid attention, which already shrinks the per-token KV footprint. On a dense GQA model where cache is the dominant bottleneck the splits should be larger, not smaller. After the wider quant types.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Methodology
&lt;/h2&gt;

&lt;p&gt;Hardware: MacBook Pro M5 Max, 128 GB unified memory. Build: TheTom's &lt;a href="https://github.com/TheTom/llama-cpp-turboquant" rel="noopener noreferrer"&gt;llama-cpp-turboquant fork&lt;/a&gt;, branch &lt;code&gt;feature/turboquant-kv-cache&lt;/code&gt;, built with &lt;code&gt;cmake -B build -DGGML_METAL=ON&lt;/code&gt;. Model: Qwen3.6-35B-A3B Q8_0 GGUF.&lt;/p&gt;

&lt;p&gt;Quality bench: &lt;code&gt;llama-perplexity&lt;/code&gt; on wikitext-2-raw test set, &lt;code&gt;-c 4096&lt;/code&gt;, full corpus (~60 chunks). f16 baseline saved via &lt;code&gt;--kl-divergence-base&lt;/code&gt;; each quant run loaded the same baseline file via &lt;code&gt;--kl-divergence&lt;/code&gt; for KL computation against pinned logits. Same model, same tokenization, only the KV cache type varies.&lt;/p&gt;

&lt;p&gt;Throughput bench: &lt;code&gt;llama-bench&lt;/code&gt;, &lt;code&gt;-p 512 -n 128 -ngl 99 -fa 1 --threads 6 --batch-size 2048 -r 3&lt;/code&gt;, depth sweep via &lt;code&gt;-d&lt;/code&gt;. Same flags as yesterday's symmetric sweep so rows are directly comparable. Metal-agent stopped during the run for clean memory budget. Total wall-clock for the asymmetric sweep was about 8.5 hours; the 64K supplement added another 80 minutes.&lt;/p&gt;

&lt;p&gt;If you have non-M5-Max Apple Silicon and want to run a slice of this matrix on your hardware, let me know. A second data point would help characterize how the crossover shifts with memory bandwidth.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>kubernetes</category>
      <category>opensource</category>
    </item>
    <item>
      <title>TurboQuant on a MacBook Pro: two findings the upstream discussion missed</title>
      <dc:creator>Christopher Maher</dc:creator>
      <pubDate>Tue, 28 Apr 2026 16:38:41 +0000</pubDate>
      <link>https://dev.to/defilan/turboquant-on-a-macbook-pro-two-findings-the-upstream-discussion-missed-5ae7</link>
      <guid>https://dev.to/defilan/turboquant-on-a-macbook-pro-two-findings-the-upstream-discussion-missed-5ae7</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://llmkube.com/blog/turboquant-m5-max-long-context" rel="noopener noreferrer"&gt;llmkube.com/blog/turboquant-m5-max-long-context&lt;/a&gt;. Cross-posted here for the dev.to audience.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A 7-hour overnight bench on an M5 Max, two findings I haven't seen in the upstream community thread, and two PRs back to the LLMKube operator to make TurboQuant a first-class citizen of the InferenceService CRD.&lt;/p&gt;




&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;A TurboQuant-enabled &lt;code&gt;llama-server&lt;/code&gt; on Apple Silicon &lt;strong&gt;runs Qwen3.6-35B-A3B Q8 at up to 1M-token context&lt;/strong&gt; on a 128 GB MacBook Pro M5 Max. Standard &lt;code&gt;f16&lt;/code&gt; KV cache OOMs at 256K. Two findings worth quoting:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;At 128K+ context, the 3-bit KV cache (&lt;code&gt;turbo3&lt;/code&gt;) matches or beats the 8-bit cache (&lt;code&gt;q8_0&lt;/code&gt;) on prompt processing.&lt;/strong&gt; Smaller cache means less memory bandwidth pressure during attention, and the throughput gap that exists at short context flips by ~128K depth.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;turbo3&lt;/code&gt; and &lt;code&gt;turbo4&lt;/code&gt; split by workload phase.&lt;/strong&gt; Long-context &lt;strong&gt;prefill&lt;/strong&gt; favors &lt;code&gt;turbo3&lt;/code&gt; (~27% faster than &lt;code&gt;turbo4&lt;/code&gt; at 256K). Long-context &lt;strong&gt;decode&lt;/strong&gt; favors &lt;code&gt;turbo4&lt;/code&gt; (~11% faster than &lt;code&gt;turbo3&lt;/code&gt; at 256K). They are not interchangeable — different attention bottlenecks dominate during prefill and decode.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We built &lt;a href="https://github.com/TheTom/llama-cpp-turboquant" rel="noopener noreferrer"&gt;TheTom's &lt;code&gt;feature/turboquant-kv-cache&lt;/code&gt; fork of llama.cpp&lt;/a&gt; for Metal, validated on M5 Max, and took two PRs back to LLMKube to make TurboQuant first-class on the InferenceService CRD.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why KV cache, why now
&lt;/h2&gt;

&lt;p&gt;If you're running coding agents locally — single-model or architect+editor combos — the binding constraint isn't model weights. It's KV cache.&lt;/p&gt;

&lt;p&gt;Weights you can quantize once, store on disk, and forget. KV cache is generated &lt;strong&gt;per token of context&lt;/strong&gt; at inference time, sized by the model's depth and head dimensions, and held in working memory the entire session. A 35B-class model with &lt;code&gt;flash-attn&lt;/code&gt; on uses roughly &lt;strong&gt;256 KB of fp16 KV per token&lt;/strong&gt;. That sounds small until you do the multiplication:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Context&lt;/th&gt;
&lt;th&gt;fp16 KV&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;32K&lt;/td&gt;
&lt;td&gt;~8 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;64K&lt;/td&gt;
&lt;td&gt;~16 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;128K&lt;/td&gt;
&lt;td&gt;~32 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;256K&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~64 GB&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;512K&lt;/td&gt;
&lt;td&gt;~128 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1M&lt;/td&gt;
&lt;td&gt;~256 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
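&lt;p&gt;The multiplication behind that table can be sketched in a few lines of Python. This is a back-of-envelope check rather than a measurement: it assumes the rough 256 KB-per-token figure quoted above and treats K and M as powers of two. The names &lt;code&gt;KV_BYTES_PER_TOKEN&lt;/code&gt; and &lt;code&gt;kv_cache_gib&lt;/code&gt; are illustrative.&lt;/p&gt;

```python
# Assumption from the text: roughly 256 KB of fp16 KV cache per token.
KV_BYTES_PER_TOKEN = 256 * 1024


def kv_cache_gib(context_tokens, bytes_per_token=KV_BYTES_PER_TOKEN):
    """Return the KV cache size in GiB for a given context length."""
    return context_tokens * bytes_per_token / 2**30


# Context lengths from the table, with K = 1024 and M = 1024 * 1024.
contexts = {
    "32K": 32 * 1024,
    "64K": 64 * 1024,
    "128K": 128 * 1024,
    "256K": 256 * 1024,
    "512K": 512 * 1024,
    "1M": 1024 * 1024,
}

kv_table = {label: kv_cache_gib(n) for label, n in contexts.items()}

for label, gib in kv_table.items():
    print(f"{label}: ~{gib:.0f} GB")
```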

&lt;p&gt;A 128 GB MacBook with &lt;code&gt;flash-attn&lt;/code&gt; and &lt;code&gt;mlock&lt;/code&gt; on can fit one 35B model at 128K with f16 KV, just barely. 256K doesn't fit. Co-resident two-model setups (architect + editor) don't fit at all past 64K.&lt;/p&gt;

&lt;p&gt;Standard &lt;code&gt;q8_0&lt;/code&gt; quantization halves the KV footprint with sub-1% perplexity penalty. That gets you to 256K with a single model on the Mac.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TurboQuant&lt;/strong&gt; (&lt;a href="https://research.google/blog/turboquant-redefining-ai-efficiency-with-extreme-compression/" rel="noopener noreferrer"&gt;Google Research, ICLR 2026&lt;/a&gt;, &lt;a href="https://arxiv.org/abs/2504.19874" rel="noopener noreferrer"&gt;arxiv:2504.19874&lt;/a&gt;) compresses further. Randomized Walsh-Hadamard transforms decorrelate KV blocks before scalar quantization, hitting &lt;strong&gt;~3.25 bits per value&lt;/strong&gt; (&lt;code&gt;turbo3&lt;/code&gt;) or &lt;strong&gt;~4.25 bits per value&lt;/strong&gt; (&lt;code&gt;turbo4&lt;/code&gt;) with attention-fidelity loss inside the noise floor of normal sampling variance.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Cache type&lt;/th&gt;
&lt;th&gt;bits/value&lt;/th&gt;
&lt;th&gt;Compression vs fp16&lt;/th&gt;
&lt;th&gt;KV at 256K&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;f16&lt;/td&gt;
&lt;td&gt;16.0&lt;/td&gt;
&lt;td&gt;1.0×&lt;/td&gt;
&lt;td&gt;~64 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;q8_0&lt;/td&gt;
&lt;td&gt;8.0&lt;/td&gt;
&lt;td&gt;2.0×&lt;/td&gt;
&lt;td&gt;~32 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;turbo4&lt;/td&gt;
&lt;td&gt;4.25&lt;/td&gt;
&lt;td&gt;3.8×&lt;/td&gt;
&lt;td&gt;~17 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;turbo3&lt;/td&gt;
&lt;td&gt;3.25&lt;/td&gt;
&lt;td&gt;4.9×&lt;/td&gt;
&lt;td&gt;~13 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
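&lt;p&gt;The last two columns follow directly from the bits-per-value column. A minimal Python sketch of that arithmetic, assuming the ~64 GB fp16 figure at 256K from the earlier table and ignoring any per-block metadata beyond what the bits-per-value numbers already include (helper names are illustrative):&lt;/p&gt;

```python
FP16_BITS = 16.0
FP16_KV_GB_AT_256K = 64.0  # from the fp16 KV table earlier in the post

# Bits per value as listed in the table above.
bits_per_value = {"f16": 16.0, "q8_0": 8.0, "turbo4": 4.25, "turbo3": 3.25}


def compression_vs_fp16(bits):
    """Compression ratio relative to a 16-bit cache."""
    return FP16_BITS / bits


def kv_gb_at_256k(bits):
    """Scale the fp16 footprint at 256K by the bit-width ratio."""
    return FP16_KV_GB_AT_256K * bits / FP16_BITS


summary = {
    name: (round(compression_vs_fp16(b), 1), round(kv_gb_at_256k(b)))
    for name, b in bits_per_value.items()
}

for name, (ratio, gb) in summary.items():
    print(f"{name}: {ratio}x, ~{gb} GB at 256K")
```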

&lt;p&gt;The upstream discussion is at &lt;a href="https://github.com/ggml-org/llama.cpp/discussions/20969" rel="noopener noreferrer"&gt;ggml-org/llama.cpp#20969&lt;/a&gt;. TurboQuant is not yet in main; it is landing in forks, one per backend. &lt;strong&gt;TheTom's fork&lt;/strong&gt; is the Metal-supporting variant.&lt;/p&gt;




&lt;h2&gt;
  
  
  The bench
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;llama-bench&lt;/code&gt; from TheTom's fork build, single Qwen3.6-35B-A3B Q8 model, sweep across cache types and KV-depths.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./build/bin/llama-bench &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-m&lt;/span&gt; Qwen3.6-35B-A3B-Q8_0.gguf &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-ctk&lt;/span&gt; turbo3 &lt;span class="nt"&gt;-ctv&lt;/span&gt; turbo3 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; 0 &lt;span class="nt"&gt;-d&lt;/span&gt; 8192 &lt;span class="nt"&gt;-d&lt;/span&gt; 32768 &lt;span class="nt"&gt;-d&lt;/span&gt; 131072 &lt;span class="nt"&gt;-d&lt;/span&gt; 262144 &lt;span class="nt"&gt;-d&lt;/span&gt; 524288 &lt;span class="nt"&gt;-d&lt;/span&gt; 1048576 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; 512 &lt;span class="nt"&gt;-n&lt;/span&gt; 128 &lt;span class="nt"&gt;-ngl&lt;/span&gt; 99 &lt;span class="nt"&gt;-fa&lt;/span&gt; 1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--threads&lt;/span&gt; 6 &lt;span class="nt"&gt;--batch-size&lt;/span&gt; 2048 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-r&lt;/span&gt; 3 &lt;span class="nt"&gt;-o&lt;/span&gt; md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;-d N&lt;/code&gt; pre-fills the KV cache with N tokens of context before measuring throughput. Each cell is the mean of 3 reps. Metal-agent was stopped during the run for a clean memory budget. The 1M cell on &lt;code&gt;turbo3&lt;/code&gt; alone took several hours wall-clock; full sweep ran ~7 hours overnight.&lt;/p&gt;




&lt;h2&gt;
  
  
  The numbers
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Generation throughput (tok/s)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Depth&lt;/th&gt;
&lt;th&gt;f16&lt;/th&gt;
&lt;th&gt;q8_0&lt;/th&gt;
&lt;th&gt;turbo3&lt;/th&gt;
&lt;th&gt;turbo4&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;89.4&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;87.4&lt;/td&gt;
&lt;td&gt;79.5&lt;/td&gt;
&lt;td&gt;79.7&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8K&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;84.2&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;79.2&lt;/td&gt;
&lt;td&gt;72.2&lt;/td&gt;
&lt;td&gt;71.2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;32K&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;72.6&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;67.8&lt;/td&gt;
&lt;td&gt;61.5&lt;/td&gt;
&lt;td&gt;61.8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;64K&lt;/td&gt;
&lt;td&gt;60.7&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;128K&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;44.4&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;40.7&lt;/td&gt;
&lt;td&gt;36.0&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;37.7&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;256K&lt;/td&gt;
&lt;td&gt;OOM&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;26.6&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;22.9&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;25.5&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;512K&lt;/td&gt;
&lt;td&gt;OOM&lt;/td&gt;
&lt;td&gt;OOM&lt;/td&gt;
&lt;td&gt;13.3&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;16.0&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1M&lt;/td&gt;
&lt;td&gt;OOM&lt;/td&gt;
&lt;td&gt;OOM&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;6.51&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;OOM&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Prompt processing throughput (tok/s)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Depth&lt;/th&gt;
&lt;th&gt;f16&lt;/th&gt;
&lt;th&gt;q8_0&lt;/th&gt;
&lt;th&gt;turbo3&lt;/th&gt;
&lt;th&gt;turbo4&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2962&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;2948&lt;/td&gt;
&lt;td&gt;2904&lt;/td&gt;
&lt;td&gt;2854&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8K&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2098&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1623&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1653&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1439&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;32K&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1063&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;802&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;784&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;678&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;128K&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;321&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;245&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;253&lt;/strong&gt; ← turbo3 ≥ q8_0&lt;/td&gt;
&lt;td&gt;206&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;256K&lt;/td&gt;
&lt;td&gt;OOM&lt;/td&gt;
&lt;td&gt;124&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;128&lt;/strong&gt; ← turbo3 &amp;gt; q8_0&lt;/td&gt;
&lt;td&gt;101&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;512K&lt;/td&gt;
&lt;td&gt;OOM&lt;/td&gt;
&lt;td&gt;OOM&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;66&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;56&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1M&lt;/td&gt;
&lt;td&gt;OOM&lt;/td&gt;
&lt;td&gt;OOM&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;30.1&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;OOM&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Full grid is final. Bench ran 8h 20m wall-clock.&lt;/p&gt;




&lt;h2&gt;
  
  
  Finding 1: turbo3 beats q8_0 at long context
&lt;/h2&gt;

&lt;p&gt;The framing in the upstream discussion is approximately &lt;em&gt;"turbo3 trades a small (~10%) generation throughput hit for ~2.5× more KV memory headroom."&lt;/em&gt; That's true at short context. At long context, &lt;strong&gt;the trade flips&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;At 128K depth, f16 wins prefill at 321 tok/s, but &lt;strong&gt;turbo3 at 253 tok/s edges out q8_0 at 245 tok/s&lt;/strong&gt;. At 256K (where f16 OOMs), &lt;strong&gt;turbo3 at 128 tok/s beats q8_0 at 124 tok/s&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;What's happening: at 35B-class model size with deep contexts, the GPU spends most of its time during attention reading KV cache from memory rather than computing on it. Smaller cache → less bandwidth pressure → throughput recovers, even though there's more dequantization work per access. The break-even is somewhere between 32K and 128K on M5 Max.&lt;/p&gt;

&lt;p&gt;For coding-agent workloads where context grows monotonically across a session, &lt;strong&gt;this is the regime that matters&lt;/strong&gt;. You're spending most of your tokens at 32K+ depth, not at depth 0.&lt;/p&gt;




&lt;h2&gt;
  
  
  Finding 2: turbo3 and turbo4 split by workload phase
&lt;/h2&gt;

&lt;p&gt;The roughly 31% extra bits per value in &lt;code&gt;turbo4&lt;/code&gt; (4.25 vs 3.25 bits) buys you something specific, and what it buys depends on the phase.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prefill (prompt processing) at long context:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Depth&lt;/th&gt;
&lt;th&gt;turbo3 pp&lt;/th&gt;
&lt;th&gt;turbo4 pp&lt;/th&gt;
&lt;th&gt;turbo3 advantage&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;8K&lt;/td&gt;
&lt;td&gt;1653&lt;/td&gt;
&lt;td&gt;1439&lt;/td&gt;
&lt;td&gt;+15%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;32K&lt;/td&gt;
&lt;td&gt;784&lt;/td&gt;
&lt;td&gt;678&lt;/td&gt;
&lt;td&gt;+16%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;128K&lt;/td&gt;
&lt;td&gt;253&lt;/td&gt;
&lt;td&gt;206&lt;/td&gt;
&lt;td&gt;+23%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;256K&lt;/td&gt;
&lt;td&gt;128&lt;/td&gt;
&lt;td&gt;101&lt;/td&gt;
&lt;td&gt;+27%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;512K&lt;/td&gt;
&lt;td&gt;66&lt;/td&gt;
&lt;td&gt;56&lt;/td&gt;
&lt;td&gt;+18%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Smaller cache means less data to read per attention step; during prefill the GPU pulls huge contiguous batches through attention, and the bandwidth-bound regime favors &lt;code&gt;turbo3&lt;/code&gt; cleanly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Decode (generation) at long context:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Depth&lt;/th&gt;
&lt;th&gt;turbo3 tg&lt;/th&gt;
&lt;th&gt;turbo4 tg&lt;/th&gt;
&lt;th&gt;turbo4 advantage&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;128K&lt;/td&gt;
&lt;td&gt;36.0&lt;/td&gt;
&lt;td&gt;37.7&lt;/td&gt;
&lt;td&gt;+5%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;256K&lt;/td&gt;
&lt;td&gt;22.9&lt;/td&gt;
&lt;td&gt;25.5&lt;/td&gt;
&lt;td&gt;+11%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;512K&lt;/td&gt;
&lt;td&gt;13.3&lt;/td&gt;
&lt;td&gt;16.0&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+20%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;During decode the dequantization overhead per access matters more than total bytes read. &lt;code&gt;turbo4&lt;/code&gt;'s simpler representation (4.25 bits has less complex quantization geometry than 3.25 bits) wins at the per-token attention pass — and the gap &lt;strong&gt;widens&lt;/strong&gt; with depth.&lt;/p&gt;
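&lt;p&gt;The advantage columns in both tables are plain throughput ratios. As a quick sanity check, this Python sketch recomputes them from the tok/s figures above; the function and variable names are illustrative only.&lt;/p&gt;

```python
# Throughput in tok/s as (turbo3, turbo4), copied from the tables above.
prefill = {
    "8K": (1653, 1439),
    "32K": (784, 678),
    "128K": (253, 206),
    "256K": (128, 101),
    "512K": (66, 56),
}
decode = {
    "128K": (36.0, 37.7),
    "256K": (22.9, 25.5),
    "512K": (13.3, 16.0),
}


def advantage_pct(winner, loser):
    """Percentage by which the winner's throughput exceeds the loser's."""
    return round((winner / loser - 1) * 100)


turbo3_prefill_adv = {d: advantage_pct(t3, t4) for d, (t3, t4) in prefill.items()}
turbo4_decode_adv = {d: advantage_pct(t4, t3) for d, (t3, t4) in decode.items()}

print("turbo3 prefill advantage (%):", turbo3_prefill_adv)
print("turbo4 decode advantage (%):", turbo4_decode_adv)
```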

&lt;p&gt;&lt;strong&gt;Practical implications by workload:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Workload shape&lt;/th&gt;
&lt;th&gt;Cache type&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Aider/OpenCode coding agents (deep context, lots of generated tokens)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;turbo4&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Wins decode at depth&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RAG-heavy / batch question answering (heavy prefill, short answers)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;turbo3&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Wins prefill at depth&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pure context-window maximization (1M context)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;turbo3&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Only it fits at 1M&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Short-context interactive (≤32K)&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;f16&lt;/code&gt; if it fits, else &lt;code&gt;q8_0&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Both turbos are ~10% slower&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This isn't a framing the upstream community discussion has surfaced clearly. Different bottleneck regimes for different phases, and the right cache type depends on which phase dominates your workload.&lt;/p&gt;




&lt;h2&gt;
  
  
  What this enables on a MacBook
&lt;/h2&gt;

&lt;p&gt;Three concrete capabilities:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;256K context for two co-resident coding models.&lt;/strong&gt; turbo3 KV at 256K (~13 GB) plus 37 GB Qwen3.6 weights, alongside Devstral-Small-2-24B at the same context with comparable footprint, totals ~88 GB. Under the 100 GB practical budget.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;1M context for batch / agentic workloads.&lt;/strong&gt; turbo3 KV at 1M is ~52 GB. We measured &lt;strong&gt;30 tok/s prefill, 6.5 tok/s decode at 1M&lt;/strong&gt; on Qwen3.6-35B-A3B Q8. Slow — a 4K-token agent response at 1M context is ~10 minutes wall-clock — but &lt;strong&gt;it works&lt;/strong&gt;. Overnight agentic batches that need the full context window are feasible. As far as we can tell, nobody else has demonstrated this on Apple Silicon yet.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;More headroom for non-attention buffers.&lt;/strong&gt; Cutting KV by 5× makes batch buffers, prefix cache, and draft models for speculative decoding actually composable.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
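&lt;p&gt;The 1M figures in the second item can be reproduced with simple arithmetic. This Python sketch assumes the decode rate stays at the measured ~6.5 tok/s for the whole response and scales the ~256 GB fp16 footprint by turbo3's bits per value; it ignores prefill time entirely. The names are illustrative.&lt;/p&gt;

```python
DECODE_TOK_S_AT_1M = 6.5   # measured turbo3 decode rate at 1M depth
FP16_KV_GB_AT_1M = 256.0   # from the fp16 KV table
TURBO3_BITS = 3.25
FP16_BITS = 16.0


def decode_minutes(response_tokens, tok_per_s):
    """Wall-clock minutes to generate a response at a constant decode rate."""
    return response_tokens / tok_per_s / 60


# KV footprint: fp16 size scaled by the bit-width ratio.
turbo3_kv_gb_at_1m = FP16_KV_GB_AT_1M * TURBO3_BITS / FP16_BITS

# Time for a 4K-token (4096) response, decode only.
minutes_for_4k = decode_minutes(4096, DECODE_TOK_S_AT_1M)

print(f"turbo3 KV at 1M: ~{turbo3_kv_gb_at_1m:.0f} GB")
print(f"4K-token response at 1M: ~{minutes_for_4k:.1f} minutes")
```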




&lt;h2&gt;
  
  
  Caveats
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;TheTom's fork is research-grade.&lt;/strong&gt; Pinned to commit &lt;code&gt;11a241d0d&lt;/code&gt;; rebases needed as upstream moves.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LLMKube's metal-runtime can't drive turbo3/turbo4 yet&lt;/strong&gt; because of &lt;a href="https://github.com/defilantech/LLMKube/issues/349" rel="noopener noreferrer"&gt;#349&lt;/a&gt; and &lt;a href="https://github.com/defilantech/LLMKube/issues/350" rel="noopener noreferrer"&gt;#350&lt;/a&gt;. &lt;a href="https://github.com/defilantech/LLMKube/pull/353" rel="noopener noreferrer"&gt;PR #353&lt;/a&gt; closes #350; #349 is next.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No perplexity numbers in this run.&lt;/strong&gt; Throughput and memory ceilings only. The +1% perplexity penalty for turbo3 in the upstream discussion is on Qwen 3.5 — we'll re-run on Qwen 3.6 in a follow-up.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Single hardware sample.&lt;/strong&gt; M5 Max only. Crossover point and prefill/decode split likely shift with memory bandwidth (614 GB/s on M5 Max) and GPU core count.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  What we contributed back
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/defilantech/LLMKube/pull/351" rel="noopener noreferrer"&gt;LLMKube PR #351&lt;/a&gt;&lt;/strong&gt; (merged): &lt;code&gt;cacheTypeCustomK&lt;/code&gt;/&lt;code&gt;cacheTypeCustomV&lt;/code&gt; on &lt;code&gt;InferenceServiceSpec&lt;/code&gt;. Closes &lt;a href="https://github.com/defilantech/LLMKube/issues/282" rel="noopener noreferrer"&gt;#282&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/defilantech/LLMKube/pull/353" rel="noopener noreferrer"&gt;LLMKube PR #353&lt;/a&gt;&lt;/strong&gt; (open): metal-agent respawns on ISVC spec drift; honors &lt;code&gt;replicas: 0&lt;/code&gt;. Closes &lt;a href="https://github.com/defilantech/LLMKube/issues/350" rel="noopener noreferrer"&gt;#350&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Issues filed:&lt;/strong&gt; &lt;a href="https://github.com/defilantech/LLMKube/issues/349" rel="noopener noreferrer"&gt;#349&lt;/a&gt;, &lt;a href="https://github.com/defilantech/LLMKube/issues/350" rel="noopener noreferrer"&gt;#350&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Comment going to llama.cpp discussion #20969&lt;/strong&gt; with the M5 Max numbers and the prefill/decode split.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  How to try it yourself
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Build TheTom's fork&lt;/span&gt;
git clone https://github.com/TheTom/llama-cpp-turboquant.git
&lt;span class="nb"&gt;cd &lt;/span&gt;llama-cpp-turboquant
git checkout feature/turboquant-kv-cache
cmake &lt;span class="nt"&gt;-B&lt;/span&gt; build &lt;span class="nt"&gt;-DGGML_METAL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ON &lt;span class="nt"&gt;-DCMAKE_BUILD_TYPE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;Release
cmake &lt;span class="nt"&gt;--build&lt;/span&gt; build &lt;span class="nt"&gt;-j&lt;/span&gt;

&lt;span class="c"&gt;# 2. Run the bench (turbo3 and turbo4 separately to see the split)&lt;/span&gt;
./build/bin/llama-bench &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-m&lt;/span&gt; /path/to/your/model.gguf &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-ctk&lt;/span&gt; turbo3 &lt;span class="nt"&gt;-ctv&lt;/span&gt; turbo3 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; 0 &lt;span class="nt"&gt;-d&lt;/span&gt; 32768 &lt;span class="nt"&gt;-d&lt;/span&gt; 131072 &lt;span class="nt"&gt;-d&lt;/span&gt; 262144 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; 512 &lt;span class="nt"&gt;-n&lt;/span&gt; 128 &lt;span class="nt"&gt;-ngl&lt;/span&gt; 99 &lt;span class="nt"&gt;-fa&lt;/span&gt; 1 &lt;span class="nt"&gt;-r&lt;/span&gt; 3 &lt;span class="nt"&gt;-o&lt;/span&gt; md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Memory ceiling depends on your unified-memory budget; sub-64 GB Macs probably can't reach 256K with a 35B-class model at any cache type. M3 Pro/Max territory is more realistic for 13B models at 128K with turbo3.&lt;/p&gt;

&lt;p&gt;For NVIDIA: &lt;a href="https://github.com/spiritbuun/llama-cpp-turboquant-cuda" rel="noopener noreferrer"&gt;@spiritbuun's CUDA fork&lt;/a&gt; is the equivalent path.&lt;/p&gt;




&lt;h2&gt;
  
  
  Open invitation
&lt;/h2&gt;

&lt;p&gt;If you have non-M5-Max Apple Silicon (M2 Pro/Max, M3 Ultra, M4 Max) and want to run the same bench, &lt;strong&gt;we want your numbers&lt;/strong&gt;. The crossover point and the prefill/decode split likely shift with memory bandwidth.&lt;/p&gt;

&lt;p&gt;Drop results in &lt;a href="https://github.com/ggml-org/llama.cpp/discussions/20969" rel="noopener noreferrer"&gt;llama.cpp discussion #20969&lt;/a&gt; or open an issue on &lt;a href="https://github.com/defilantech/llmkube" rel="noopener noreferrer"&gt;defilantech/llmkube&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>kubernetes</category>
      <category>opensource</category>
    </item>
    <item>
      <title>62.2% on Aider Polyglot from a MacBook Pro. Then the other model we tried scored 4%. Here's what actually happened, with a working cost loop attached.</title>
      <dc:creator>Christopher Maher</dc:creator>
      <pubDate>Mon, 27 Apr 2026 06:24:59 +0000</pubDate>
      <link>https://dev.to/defilan/628-on-aider-polyglot-from-a-macbook-pro-then-the-other-model-we-tried-scored-4-heres-what-17ed</link>
      <guid>https://dev.to/defilan/628-on-aider-polyglot-from-a-macbook-pro-then-the-other-model-we-tried-scored-4-heres-what-17ed</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://llmkube.com/blog/m5-max-aider-polyglot-and-finops" rel="noopener noreferrer"&gt;llmkube.com/blog/m5-max-aider-polyglot-and-finops&lt;/a&gt;. Cross-posted here for the dev.to audience.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A 24-hour Aider Polyglot run, a follow-up bench that blew up in interesting ways, and a working &lt;code&gt;$/MTok&lt;/code&gt; number from a Kubernetes operator that scrapes Apple Silicon power live. Two open-source PRs landed today to make all of this reproducible on any M-series Mac.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;This is a coding-model benchmark on locally-served weights, plus a FinOps story.&lt;/strong&gt; Every benchmark number traces to results files we can show you. Every cost number traces to a CSV captured by InferCost during the run. The point is the methodology and the tooling; the model rankings are along for the ride.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Qwen3.6-35B-A3B Q8&lt;/strong&gt; (Tongyi Lab, Apache 2.0) hit &lt;strong&gt;62.2% on Aider Polyglot&lt;/strong&gt; (pass_rate_2, n=225/225) running locally on a MacBook Pro M5 Max via LLMKube's Metal Agent. That places it above Claude Sonnet 4 with 32k thinking budget (61.3%), o1-high (61.7%), DeepSeek R1 original (56.9%), and Claude 3.5 Sonnet (51.6%) on the official Aider leaderboard. It also beats every published Qwen-family entry on the Polyglot board.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Devstral-Small-2-2512 Q8&lt;/strong&gt; (Mistral, Apache 2.0) hit &lt;strong&gt;4% on Aider Polyglot diff format&lt;/strong&gt;, &lt;strong&gt;8% on Aider Polyglot whole format&lt;/strong&gt;, and &lt;strong&gt;81.7% on HumanEval+ (164 problems; 85.4% on base HumanEval)&lt;/strong&gt;. Same model. 20× swing. Benchmark numbers don't transfer across harnesses, and you should never quote a score without naming the harness that produced it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;InferCost ran the whole time.&lt;/strong&gt; The new Apple Silicon collector (shipped in &lt;a href="https://github.com/defilantech/infercost/releases/tag/v0.3.0" rel="noopener noreferrer"&gt;InferCost v0.3.0&lt;/a&gt;) reconciled &lt;code&gt;$0.18/hr&lt;/code&gt; against the &lt;code&gt;apple-m5-max&lt;/code&gt; CostProfile, with InferCost's reading agreeing with the LLMKube agent's direct gauge within &lt;code&gt;1.6 W&lt;/code&gt; mean delta over the Qwen window. First widely-published &lt;code&gt;$/MTok&lt;/code&gt; number for an Apple Silicon LLM workload that traces to a real Prometheus scrape.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Two releases shipped alongside this post&lt;/strong&gt; make all of it reproducible on your own Mac: &lt;a href="https://github.com/defilantech/llmkube/releases/tag/v0.7.2" rel="noopener noreferrer"&gt;LLMKube v0.7.2&lt;/a&gt; (Apple power gauges via powermetrics, security-hardened sudoers, and a one-command &lt;code&gt;make install-powermetrics-sudo&lt;/code&gt;) and &lt;a href="https://github.com/defilantech/infercost/releases/tag/v0.3.0" rel="noopener noreferrer"&gt;InferCost v0.3.0&lt;/a&gt; (Metal collector, condition reporting, sample CostProfile).&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  1. The hardware and what's special about it
&lt;/h2&gt;

&lt;p&gt;The bench machine is a MacBook Pro M5 Max, 2026 model:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Spec&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;GPU&lt;/td&gt;
&lt;td&gt;40-core integrated, Metal 4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CPU&lt;/td&gt;
&lt;td&gt;18-core (6 P-cores, 12 E-cores)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Unified memory&lt;/td&gt;
&lt;td&gt;128 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory bandwidth&lt;/td&gt;
&lt;td&gt;614 GB/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OS&lt;/td&gt;
&lt;td&gt;macOS 25.4 (Darwin)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Price&lt;/td&gt;
&lt;td&gt;About $4,500 fully configured&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Source: &lt;a href="https://www.apple.com/newsroom/2026/03/apple-debuts-m5-pro-and-m5-max-to-supercharge-the-most-demanding-pro-workflows/" rel="noopener noreferrer"&gt;Apple newsroom&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The 614 GB/s bandwidth is the constraint that decides everything that follows. For a dense 24B model at Q8, you need to read about 25 GB per generated token, so the upper bound is &lt;code&gt;614 / 25 = 24.56 t/s&lt;/code&gt; and we measured 24 t/s, within 2.3% of the wall. For a MoE like Qwen3.6-35B-A3B, only the active 3B parameters are read per token, so the wall is ~200 t/s and you actually get to choose how to spend the bandwidth. That's the whole story behind why MoE feels fast on a Mac.&lt;/p&gt;
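
&lt;p&gt;The bandwidth arithmetic above can be sketched in a few lines. This is a rough roofline estimate, not a profiler: the function name &lt;code&gt;decode_ceiling_tps&lt;/code&gt; is ours, and the ~3 GB per token for the MoE case is an assumption (3B active parameters at roughly one byte each under Q8).&lt;/p&gt;

```python
def decode_ceiling_tps(bandwidth_gb_s: float, gb_read_per_token: float) -> float:
    """Upper bound on decode speed when each token has to stream the weights once."""
    if gb_read_per_token == 0:
        raise ValueError("gb_read_per_token must be non-zero")
    return bandwidth_gb_s / gb_read_per_token


M5_MAX_BANDWIDTH_GB_S = 614.0  # from the hardware table

# Dense 24B at Q8: about 25 GB read per generated token (figure from the post).
dense_ceiling = decode_ceiling_tps(M5_MAX_BANDWIDTH_GB_S, 25.0)

# Assumption: about 3 GB per token for the 3B active parameters at Q8.
moe_ceiling = decode_ceiling_tps(M5_MAX_BANDWIDTH_GB_S, 3.0)

measured_dense = 24.0  # t/s, measured in the post
shortfall_pct = 100.0 * (1.0 - measured_dense / dense_ceiling)

print(f"dense ceiling: {dense_ceiling:.2f} t/s")
print(f"MoE ceiling:   {moe_ceiling:.0f} t/s")
print(f"measured dense decode sits {shortfall_pct:.1f}% under the ceiling")
```

&lt;p&gt;The estimate ignores KV-cache reads and compute limits, so treat it as a ceiling rather than a prediction.&lt;/p&gt;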

&lt;p&gt;Stack: LLMKube v0.7.x with the Metal Agent feature branch from PR #334 cherry-picked in (now main), &lt;code&gt;llama-server&lt;/code&gt; from llama.cpp Metal, and a kind cluster on the same host for the K8s control plane. InferCost was running locally via &lt;code&gt;go run ./cmd/main.go&lt;/code&gt;, pointed at the LLMKube agent's &lt;code&gt;/metrics&lt;/code&gt; endpoint via a new &lt;code&gt;--metal-endpoint&lt;/code&gt; flag.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. Qwen3.6-35B-A3B Q8 on Aider Polyglot
&lt;/h2&gt;

&lt;p&gt;The Qwen3.6 family includes a dense 27B and an MoE variant at 35B total / 3B active per token. We ran the MoE quantized to Q8_0 (~36 GB on disk, fits comfortably in 128 GB unified memory with room for KV cache and the rest of macOS).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://aider.chat/docs/leaderboards/" rel="noopener noreferrer"&gt;Aider Polyglot&lt;/a&gt; is a 225-problem benchmark across C++, Go, Java, JavaScript, Python, and Rust, designed to keep top frontier coding LLMs in the 5-50% range. Each model gets two attempts per problem: a single-shot solve, and a second attempt after seeing the failed test output. The headline metric is &lt;code&gt;pass_rate_2&lt;/code&gt;, the percentage of problems that passed all tests within those two attempts.&lt;/p&gt;

&lt;p&gt;Aider was driven from inside a Docker container (&lt;code&gt;aider-benchmark&lt;/code&gt; image) talking to llama-server via &lt;code&gt;OPENAI_API_BASE=http://host.docker.internal:&amp;lt;port&amp;gt;/v1&lt;/code&gt;. Edit format was &lt;code&gt;diff&lt;/code&gt; (Aider's standard for capable models). Threads = 4. The model id we passed to LiteLLM was &lt;code&gt;openai/Qwen3.6-35B-A3B-Q8_0.gguf&lt;/code&gt;, the basename llama-server reports.&lt;/p&gt;

&lt;p&gt;The full run took &lt;strong&gt;49.9 hours of inference wall-clock time&lt;/strong&gt; stretched across about 24 hours of real time, plus a follow-up resume cycle to handle a runaway-reasoning failure mode. More on that in §3.&lt;/p&gt;

&lt;h3&gt;
  
  
  The headline result
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;pass_rate_2 = 62.2%&lt;/code&gt; (140 of 225), &lt;code&gt;pass_rate_1 = 34.7%&lt;/code&gt; (78 of 225)&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Verified against the official &lt;a href="https://github.com/Aider-AI/aider/blob/main/aider/website/_data/polyglot_leaderboard.yml" rel="noopener noreferrer"&gt;Aider Polyglot leaderboard yaml&lt;/a&gt; pulled today, here's where that lands among the published baselines:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;pass_rate_2&lt;/th&gt;
&lt;th&gt;format&lt;/th&gt;
&lt;th&gt;model&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;88.0%&lt;/td&gt;
&lt;td&gt;diff&lt;/td&gt;
&lt;td&gt;gpt-5 (high)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;84.9%&lt;/td&gt;
&lt;td&gt;diff&lt;/td&gt;
&lt;td&gt;o3-pro (high)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;81.3%&lt;/td&gt;
&lt;td&gt;diff&lt;/td&gt;
&lt;td&gt;o3 (high)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;79.6%&lt;/td&gt;
&lt;td&gt;diff&lt;/td&gt;
&lt;td&gt;grok-4 (high)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;72.0%&lt;/td&gt;
&lt;td&gt;diff&lt;/td&gt;
&lt;td&gt;Claude Opus 4 (32k thinking)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;71.4%&lt;/td&gt;
&lt;td&gt;diff&lt;/td&gt;
&lt;td&gt;DeepSeek R1 (0528)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;64.9%&lt;/td&gt;
&lt;td&gt;diff&lt;/td&gt;
&lt;td&gt;Claude 3.7 Sonnet (32k thinking)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;64.0%&lt;/td&gt;
&lt;td&gt;architect&lt;/td&gt;
&lt;td&gt;DeepSeek R1 + Claude 3.5 Sonnet&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;62.2%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;diff&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Qwen3.6-35B-A3B Q8 (this run, M5 Max, Apache 2.0, ours)&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;61.7%&lt;/td&gt;
&lt;td&gt;diff&lt;/td&gt;
&lt;td&gt;o1-2024-12-17 (high)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;61.3%&lt;/td&gt;
&lt;td&gt;diff&lt;/td&gt;
&lt;td&gt;Claude Sonnet 4 (32k thinking)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;60.4%&lt;/td&gt;
&lt;td&gt;diff&lt;/td&gt;
&lt;td&gt;Claude 3.7 Sonnet (no thinking)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;59.6%&lt;/td&gt;
&lt;td&gt;diff&lt;/td&gt;
&lt;td&gt;Qwen3 235B A22B (no think, Alibaba API)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;56.9%&lt;/td&gt;
&lt;td&gt;diff&lt;/td&gt;
&lt;td&gt;DeepSeek R1 (original)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;56.4%&lt;/td&gt;
&lt;td&gt;diff&lt;/td&gt;
&lt;td&gt;Claude Sonnet 4 (no thinking)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;51.6%&lt;/td&gt;
&lt;td&gt;diff&lt;/td&gt;
&lt;td&gt;Claude 3.5 Sonnet&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;40.0%&lt;/td&gt;
&lt;td&gt;diff&lt;/td&gt;
&lt;td&gt;Qwen3 32B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8.0%&lt;/td&gt;
&lt;td&gt;diff&lt;/td&gt;
&lt;td&gt;Qwen2.5-Coder-32B&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The defensible reads:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Beats Claude Sonnet 4 with 32k thinking budget by &lt;strong&gt;0.9 percentage points&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Beats o1-high by &lt;strong&gt;0.5 percentage points&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Beats DeepSeek R1 original by &lt;strong&gt;5.3 percentage points&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Beats Claude 3.5 Sonnet by &lt;strong&gt;10.6 percentage points&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Within &lt;strong&gt;2.7 points of Claude 3.7 Sonnet (32k thinking)&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Strongest open-weights Qwen-family number on the Polyglot leaderboard. Qwen3 32B sat at 40.0%, Qwen3 235B A22B at 59.6%. The 35B-A3B MoE quantization is doing real work for its size.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;What we are not claiming: that this beats Opus 4, GPT-5, o3, or DeepSeek V3.2-Exp Reasoner. Those all sit above us on the leaderboard. Qwen3.6 is in the same band as Sonnet 4 thinking, not in the band with o3-high or GPT-5.&lt;/p&gt;

&lt;h3&gt;
  
  
  Per-language
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Language&lt;/th&gt;
&lt;th&gt;n&lt;/th&gt;
&lt;th&gt;pass_1&lt;/th&gt;
&lt;th&gt;pass_2&lt;/th&gt;
&lt;th&gt;pass_rate_2&lt;/th&gt;
&lt;th&gt;avg min/exercise&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;python&lt;/td&gt;
&lt;td&gt;34&lt;/td&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;td&gt;25&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;73.5%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;4.8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;javascript&lt;/td&gt;
&lt;td&gt;49&lt;/td&gt;
&lt;td&gt;20&lt;/td&gt;
&lt;td&gt;35&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;71.4%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;5.3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;go&lt;/td&gt;
&lt;td&gt;39&lt;/td&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;24&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;61.5%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;6.2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;rust&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;td&gt;17&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;56.7%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;9.2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;cpp&lt;/td&gt;
&lt;td&gt;26&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;53.8%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;21.7&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;java&lt;/td&gt;
&lt;td&gt;47&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;25&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;53.2%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;31.9&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Two things worth noting. First, Python and JavaScript at 71 to 74% look like clean Sonnet-3.5-thinking territory on the languages most developers actually use Aider for. Second, Java at 31.9 minutes per exercise on average is inflated by the runaway-reasoning case described next. Strip the outlier and Java's average is in line with C++.&lt;/p&gt;
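
&lt;p&gt;If you want to check that the per-language rows add up to the headline, a small sketch like this does it. The counts are copied from the table above; the helper name &lt;code&gt;pct&lt;/code&gt; is ours.&lt;/p&gt;

```python
# (language, n, pass_1, pass_2) copied from the per-language table.
rows = [
    ("python", 34, 14, 25),
    ("javascript", 49, 20, 35),
    ("go", 39, 16, 24),
    ("rust", 30, 14, 17),
    ("cpp", 26, 4, 14),
    ("java", 47, 10, 25),
]


def pct(passed: int, total: int) -> float:
    """Pass rate as a percentage, rounded to one decimal like the leaderboard."""
    return round(100.0 * passed / total, 1)


total_n = sum(n for _, n, _, _ in rows)
total_p1 = sum(p1 for _, _, p1, _ in rows)
total_p2 = sum(p2 for _, _, _, p2 in rows)

for lang, n, _, p2 in rows:
    print(f"{lang:12s} {pct(p2, n):5.1f}%")

print(f"pass_rate_1 = {pct(total_p1, total_n)}% ({total_p1} of {total_n})")
print(f"pass_rate_2 = {pct(total_p2, total_n)}% ({total_p2} of {total_n})")
```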




&lt;h2&gt;
  
  
  3. The runaway-reasoning failure mode (and the resume that closed it out)
&lt;/h2&gt;

&lt;p&gt;About 21 hours into the run, the container settled into a Java exercise that consumed 80 minutes of wall time without writing a new result file or producing meaningful output. The log mtime stayed frozen, the container stayed "Up," and the model was clearly deep in a reasoning loop with no exit strategy. We stopped the container manually at &lt;strong&gt;n=223/225&lt;/strong&gt; and recorded the runaway-reasoning failure mode as a real characteristic of hybrid-thinking MoE models on agentic harnesses.&lt;/p&gt;

&lt;p&gt;The next night, we &lt;strong&gt;resumed via Aider's official &lt;code&gt;--cont&lt;/code&gt; flag&lt;/strong&gt; against the same run directory. Two missing exercises (&lt;code&gt;rust/forth&lt;/code&gt; and &lt;code&gt;javascript/go-counting&lt;/code&gt;) ran in parallel under &lt;code&gt;--threads 4&lt;/code&gt; and completed in about 6 minutes each. Both failed both attempts. Final result: &lt;strong&gt;n=225/225&lt;/strong&gt;, &lt;strong&gt;pass_rate_2 = 62.2%&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The headline ticked &lt;strong&gt;down&lt;/strong&gt; by 0.6 percentage points compared to the n=223 partial (62.8% → 62.2%) because the two missing exercises both failed. That's the most honest defense against any "stopped early to lock in a favorable number" critique: completing the run actually hurt us.&lt;/p&gt;

&lt;p&gt;If you reproduce this and see a similar hang, kill the container, then rerun with &lt;code&gt;--cont&lt;/code&gt; later to fill in the gaps. The full data is healthier than a partial.&lt;/p&gt;
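
&lt;p&gt;The frozen log mtime was the signal we noticed by hand, and it is easy to watch for automatically. Here is a minimal sketch of such a check. The helper names and the 30-minute threshold are hypothetical, not part of Aider or LLMKube, and the demo backdates a temp file rather than reading a real benchmark log.&lt;/p&gt;

```python
import os
import tempfile
import time


def seconds_since_modified(path: str) -> float:
    """Age of the file's last write, in seconds."""
    return time.time() - os.path.getmtime(path)


def looks_stalled(path: str, limit_seconds: float) -> bool:
    """True when the log has not been written to for longer than the limit."""
    return seconds_since_modified(path) > limit_seconds


# Demo on a throwaway file. 30 minutes is an arbitrary threshold (assumption).
LIMIT_SECONDS = 30 * 60

with tempfile.NamedTemporaryFile(delete=False) as handle:
    demo_log = handle.name

fresh = looks_stalled(demo_log, LIMIT_SECONDS)

# Backdate the file by two hours to simulate a frozen log.
two_hours_ago = time.time() - 2 * 60 * 60
os.utime(demo_log, (two_hours_ago, two_hours_ago))
stale = looks_stalled(demo_log, LIMIT_SECONDS)

os.remove(demo_log)
print(f"fresh log stalled? {fresh}; backdated log stalled? {stale}")
```

&lt;p&gt;In practice you would point &lt;code&gt;looks_stalled&lt;/code&gt; at the run's log file from a cron job or a loop, and pick a limit comfortably above your slowest legitimate exercise.&lt;/p&gt;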




&lt;h2&gt;
  
  
  4. The other thing we wanted to test
&lt;/h2&gt;

&lt;p&gt;With Qwen3.6 in hand, the natural next move was a comparison candidate. The ideal contrast: a dense model purpose-built for agentic coding, not a general-purpose coder.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://huggingface.co/mistralai/Devstral-Small-2-24B-Instruct-2512" rel="noopener noreferrer"&gt;Devstral-Small-2-24B-Instruct-2512&lt;/a&gt; was the obvious pick. Mistral and All Hands AI co-trained it specifically for software-engineering agents, it's Apache 2.0 dense 24B, has a 256K context window, and Mistral published 68.0% SWE-Bench Verified for it (a real number on a real benchmark). Released November 2025, so 5 months old at time of writing. Architecture is the new "Ministral 3 with rope-scaling and Scalable-Softmax" stack from Mistral, structurally different from Devstral 1.x.&lt;/p&gt;

&lt;p&gt;We deployed it via the same LLMKube + Metal Agent path, kicked off Aider Polyglot with &lt;code&gt;--num-tests 25&lt;/code&gt; (random subset, fits a 4-hour window at Devstral's slower decode speed of ~24 t/s), edit format &lt;code&gt;diff&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Result: &lt;strong&gt;&lt;code&gt;pass_rate_2 = 4.0%&lt;/code&gt; (1 of 25)&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;We almost wrote it off as broken, then read the Aider results files more carefully:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;92% of responses were syntactically well-formed diffs.&lt;/li&gt;
&lt;li&gt;Zero exhausted context windows.&lt;/li&gt;
&lt;li&gt;Average 4.4 minutes per exercise (fast, not stuck).&lt;/li&gt;
&lt;li&gt;The model was producing valid-looking edit blocks; they were just semantically wrong.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The model wasn't broken. It was doing what it had been trained to do, which apparently wasn't this.&lt;/p&gt;




&lt;h2&gt;
  
  
  5. Investigation
&lt;/h2&gt;

&lt;p&gt;Three hypotheses, ordered by what we tried:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hypothesis 1: The diff format is the problem.&lt;/strong&gt; Aider supports &lt;code&gt;--edit-format whole&lt;/code&gt; (output complete files instead of diffs). Re-ran with whole format on the same 25-exercise subset.&lt;/p&gt;

&lt;p&gt;Result: &lt;code&gt;pass_rate_2 = 8.0%&lt;/code&gt; (2 of 25). Better, but not by much. Hypothesis weakly supported.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hypothesis 2: llama.cpp isn't handling Devstral 2's new architecture correctly.&lt;/strong&gt; Worth checking before declaring the model bad. We ran HumanEval+ via &lt;a href="https://github.com/evalplus/evalplus" rel="noopener noreferrer"&gt;evalplus&lt;/a&gt;, pointed at the same llama-server endpoint, with a function-level Python coding harness that doesn't require any agentic tool-call discipline. If llama.cpp's tokenizer or attention implementation was off, we'd see it here.&lt;/p&gt;

&lt;p&gt;Result: &lt;strong&gt;&lt;code&gt;HumanEval pass@1 = 85.4%&lt;/code&gt;, &lt;code&gt;HumanEval+ pass@1 = 81.7%&lt;/code&gt;&lt;/strong&gt; (164 problems, scored in &lt;code&gt;ganler/evalplus&lt;/code&gt; Linux container because macOS's &lt;code&gt;setrlimit(RLIMIT_AS)&lt;/code&gt; doesn't behave the way evalplus's sandbox expects).&lt;/p&gt;

&lt;p&gt;That landed Devstral 2 in the same band as the top open-source 24B coders for function-level Python. Architecture is fine. llama.cpp is fine. The model is genuinely capable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hypothesis 3: The harness is the variable.&lt;/strong&gt; We re-read Mistral's README:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Devstral 2 can also be used with the following scaffoldings:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Mistral Vibe (recommended)&lt;/li&gt;
&lt;li&gt;Cline&lt;/li&gt;
&lt;li&gt;Kilo Code&lt;/li&gt;
&lt;li&gt;Claude Code&lt;/li&gt;
&lt;li&gt;OpenHands&lt;/li&gt;
&lt;li&gt;SWE Agent&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;Aider is not on this list. Devstral 2 was trained on tool-call traces from agentic-coding harnesses that use multi-turn function calls, not Aider's single-prompt-with-diff edit format. The model was producing what its training distribution rewarded; Aider's harness was scoring it on a different distribution entirely.&lt;/p&gt;

&lt;p&gt;Mistral itself adds, in the same README:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;we advise everyone to use the Mistral AI API if the model is subpar with local serving&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That's an explicit caveat from the model authors. The 4% wasn't a model failure or a runtime failure. It was a harness-distribution mismatch, exactly the failure mode the README warned about.&lt;/p&gt;




&lt;h2&gt;
  
  
  6. Same model, three benchmarks, three answers
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Benchmark&lt;/th&gt;
&lt;th&gt;Devstral 2 score&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Aider Polyglot, diff format&lt;/td&gt;
&lt;td&gt;4.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Aider Polyglot, whole format&lt;/td&gt;
&lt;td&gt;8.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;HumanEval+ (with adversarial tests)&lt;/td&gt;
&lt;td&gt;81.7%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;HumanEval (base)&lt;/td&gt;
&lt;td&gt;85.4%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;That is a twentyfold difference in measured "performance" on the same model, same hardware, same temperature, same week. This is the lesson worth taking away from the entire bench session.&lt;/p&gt;

&lt;p&gt;If you publish a single benchmark number for any agentic coding model, you are publishing a story about that model's compatibility with one specific harness, not a story about the model's coding capability. The Devstral 2 4% on Aider does not mean Devstral 2 is bad at coding. The Devstral 2 81.7% on HumanEval+ does not mean Devstral 2 is good at agentic edits in your IDE. They are both true and they describe different things.&lt;/p&gt;

&lt;p&gt;If you want to evaluate a coding model, run it through the harness you actually use day to day. If you can't, then quote at least two benchmarks from different parts of the harness landscape (one function-level, one agentic) and let the reader see the spread.&lt;/p&gt;
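
&lt;p&gt;One low-effort way to "let the reader see the spread" is to report the ratio alongside the scores. A small sketch using the Devstral 2 numbers from the table above (the &lt;code&gt;spread&lt;/code&gt; helper is ours):&lt;/p&gt;

```python
# Devstral 2 scores from the table above, in percent.
scores = {
    "Aider Polyglot (diff)": 4.0,
    "Aider Polyglot (whole)": 8.0,
    "HumanEval+": 81.7,
    "HumanEval (base)": 85.4,
}


def spread(results: dict) -> tuple:
    """Return (lowest harness, highest harness, ratio between their scores)."""
    low = min(results, key=results.get)
    high = max(results, key=results.get)
    return low, high, results[high] / results[low]


# The "20x" in this post compares HumanEval+ against Aider diff specifically.
headline_ratio = scores["HumanEval+"] / scores["Aider Polyglot (diff)"]
low, high, widest = spread(scores)

print(f"HumanEval+ vs Aider diff: {headline_ratio:.1f}x")
print(f"widest gap: {high} vs {low}: {widest:.1f}x")
```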




&lt;h2&gt;
  
  
  7. InferCost was running the whole time
&lt;/h2&gt;

&lt;p&gt;While the benchmarks were producing accuracy numbers, &lt;a href="https://infercost.ai" rel="noopener noreferrer"&gt;InferCost&lt;/a&gt; was producing the cost numbers. The new Apple Silicon collector (shipped in &lt;a href="https://github.com/defilantech/infercost/releases/tag/v0.3.0" rel="noopener noreferrer"&gt;InferCost v0.3.0&lt;/a&gt;) was reconciling the &lt;code&gt;apple-m5-max&lt;/code&gt; CostProfile every 30 seconds against the LLMKube Metal Agent's &lt;code&gt;apple_power_combined_watts&lt;/code&gt; gauge.&lt;/p&gt;

&lt;p&gt;Specifically, two things were running in the background of every benchmark above:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A second LLMKube Metal Agent on port 9091 with &lt;code&gt;--apple-power-enabled&lt;/code&gt;, publishing the four new &lt;code&gt;apple_power_*_watts&lt;/code&gt; Prometheus gauges sourced from a sudo'd &lt;code&gt;powermetrics&lt;/code&gt; subprocess. Pinned-argv NOPASSWD sudoers entry to keep the privilege grant tight (security audit caught and fixed three findings before merge: argv pinning, bin override rejection, absolute &lt;code&gt;/usr/bin/sudo&lt;/code&gt; to defeat $PATH attacks).&lt;/li&gt;
&lt;li&gt;InferCost as a local controller, pointed at &lt;code&gt;:9091/metrics&lt;/code&gt; via the new &lt;code&gt;--metal-endpoint&lt;/code&gt; CLI flag, reconciling an &lt;code&gt;apple-m5-max&lt;/code&gt; CostProfile using the new Metal scraper and dispatcher.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Plus a tiny CSV poller that sampled both layers every 60 seconds, writing 388 rows of telemetry across the day.&lt;/p&gt;
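
&lt;p&gt;For a sense of what the poller reads, here is a minimal sketch that pulls one gauge out of a Prometheus text payload. The gauge name &lt;code&gt;apple_power_combined_watts&lt;/code&gt; comes from the agent; the &lt;code&gt;read_gauge&lt;/code&gt; helper, the HELP text in the sample, and the assumption that the gauge carries no labels are ours. This is not the actual poller.&lt;/p&gt;

```python
def read_gauge(metrics_text: str, name: str) -> float:
    """Return the value of an unlabelled gauge from Prometheus text exposition."""
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        parts = line.split()
        if parts[0] == name:
            return float(parts[1])
    raise KeyError(name)


# Illustrative payload, not captured output. The HELP wording is made up.
sample = """\
# HELP apple_power_combined_watts Combined package power in watts.
# TYPE apple_power_combined_watts gauge
apple_power_combined_watts 39.13
"""

combined_watts = read_gauge(sample, "apple_power_combined_watts")
print(f"combined power: {combined_watts} W")
```

&lt;p&gt;Against a live agent you would fetch the text from &lt;code&gt;http://localhost:9091/metrics&lt;/code&gt; first (for example with &lt;code&gt;urllib.request.urlopen&lt;/code&gt;) and pass the decoded body to &lt;code&gt;read_gauge&lt;/code&gt;.&lt;/p&gt;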

&lt;p&gt;Per-window aggregates, captured live during the runs:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Window&lt;/th&gt;
&lt;th&gt;Duration&lt;/th&gt;
&lt;th&gt;Mean combined W&lt;/th&gt;
&lt;th&gt;Mean InferCost $/hr&lt;/th&gt;
&lt;th&gt;Agent ↔ InferCost Δ (mean)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Qwen3.6-35B-A3B Q8 (full Aider)&lt;/td&gt;
&lt;td&gt;200 min&lt;/td&gt;
&lt;td&gt;27.3 W&lt;/td&gt;
&lt;td&gt;$0.1775&lt;/td&gt;
&lt;td&gt;1.60 W&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Devstral 2, Aider diff&lt;/td&gt;
&lt;td&gt;32 min&lt;/td&gt;
&lt;td&gt;32.7 W&lt;/td&gt;
&lt;td&gt;$0.1773&lt;/td&gt;
&lt;td&gt;6.21 W&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Devstral 2, Aider whole&lt;/td&gt;
&lt;td&gt;29 min&lt;/td&gt;
&lt;td&gt;35.3 W&lt;/td&gt;
&lt;td&gt;$0.1774&lt;/td&gt;
&lt;td&gt;8.08 W&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Devstral 2, HumanEval+&lt;/td&gt;
&lt;td&gt;55 min&lt;/td&gt;
&lt;td&gt;29.0 W&lt;/td&gt;
&lt;td&gt;$0.1770&lt;/td&gt;
&lt;td&gt;0.90 W&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The "Agent ↔ InferCost Δ" column is the validation result. The agent reads powermetrics every second; InferCost samples the gauge during its 30-second reconcile loop. If they were deeply wrong about each other we'd see double-digit deltas. We don't. Across the four windows, mean delta ranged from 0.9 W to 8 W (the 8 W was during Aider whole format, which has bursty prefill that the 30-second reconcile sometimes catches mid-spike). For sustained workloads the agreement is sub-watt.&lt;/p&gt;

&lt;p&gt;Here is what &lt;code&gt;kubectl get costprofile apple-m5-max -o yaml&lt;/code&gt; looked like during the run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;currentPowerDrawWatts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;39.13&lt;/span&gt;
  &lt;span class="na"&gt;hourlyCostUSD&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.1805&lt;/span&gt;
  &lt;span class="na"&gt;amortizationRatePerHour&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.17466&lt;/span&gt;
  &lt;span class="na"&gt;electricityCostPerHour&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.00341&lt;/span&gt;
  &lt;span class="na"&gt;conditions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;MetalReachable&lt;/span&gt;
    &lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;True"&lt;/span&gt;
    &lt;span class="na"&gt;reason&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;MetalHealthy&lt;/span&gt;
    &lt;span class="na"&gt;message&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Metal&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;agent&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;healthy&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;at&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;http://localhost:9091/metrics&lt;/span&gt;
              &lt;span class="s"&gt;(39.1W&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;combined;&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;gpu=37.3&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;cpu=1.8&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;ane=0.0)."&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Ready&lt;/span&gt;
    &lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;True"&lt;/span&gt;
    &lt;span class="na"&gt;reason&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;CostComputed&lt;/span&gt;
    &lt;span class="na"&gt;message&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Hourly&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;cost:&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;$0.1805&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;(amort:&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;$0.1747,&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;elec:&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;$0.0059)"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Not a screenshot. Not a slide. The actual reconcile output from a Kubernetes operator scraping a sudo'd &lt;code&gt;powermetrics&lt;/code&gt; subprocess on the same Mac that was running the benchmark.&lt;/p&gt;




&lt;h2&gt;
  
  
  8. The cost economics
&lt;/h2&gt;

&lt;p&gt;The $4,500 laptop, amortized over 3 years, with maintenance at 2% of the purchase price flat:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Amortization per hour: &lt;code&gt;$4,500 × 1.02 / 3 / 8760 = $0.17466/hr&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Electricity at 41 W and $0.08/kWh (Peninsula Light residential rate, WA): &lt;code&gt;0.041 × 0.08 = $0.00328/hr&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Total hourly: &lt;strong&gt;$0.178/hr, of which 98.2% is amortization and 1.8% is electricity&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That ratio is the most useful thing the bench taught us. The marginal cost of running an LLM on a laptop you already own is essentially the electricity, which on Apple Silicon is genuinely cheap. The amortized cost is the laptop existing at all, which you pay whether or not the model runs.&lt;/p&gt;
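
&lt;p&gt;The hourly figure is simple enough to recompute yourself. A minimal sketch of the arithmetic with the inputs listed above; the function names are ours and this is not InferCost's reconciler code.&lt;/p&gt;

```python
HOURS_PER_YEAR = 8760


def amortization_per_hour(price_usd: float, years: float, maintenance_fraction: float) -> float:
    """Purchase price plus flat maintenance, spread over every hour of the horizon."""
    return price_usd * (1.0 + maintenance_fraction) / years / HOURS_PER_YEAR


def electricity_per_hour(watts: float, usd_per_kwh: float) -> float:
    """Cost of one hour at a constant power draw."""
    return watts / 1000.0 * usd_per_kwh


amort = amortization_per_hour(4500.0, 3, 0.02)
elec = electricity_per_hour(41.0, 0.08)
total = amort + elec

print(f"amortization: ${amort:.5f}/hr")
print(f"electricity:  ${elec:.5f}/hr")
print(f"total:        ${total:.3f}/hr")
print(f"amortization share: {100.0 * amort / total:.1f}%")
```

&lt;p&gt;Swap in your own purchase price, horizon, and utility rate; the amortization share stays dominant for any plausible laptop power draw.&lt;/p&gt;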

&lt;p&gt;Two &lt;code&gt;$/MTok&lt;/code&gt; numbers from the windows where the token poller was working correctly:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Window&lt;/th&gt;
&lt;th&gt;Total tokens&lt;/th&gt;
&lt;th&gt;$/MTok&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Devstral 2, Aider whole (sustained edits)&lt;/td&gt;
&lt;td&gt;158,614&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$0.30&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Devstral 2, HumanEval+ (sequential function calls)&lt;/td&gt;
&lt;td&gt;90,916&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$1.76&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Aider's whole-file edits keep the GPU producing tokens for longer continuous bursts, which spreads the fixed amortization across more output. HumanEval+ runs many short function-level problems with eval-script setup time between them, which inflates the per-token cost because the laptop is "active" but not generating.&lt;/p&gt;
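
&lt;p&gt;The mechanism is easier to see with numbers. A sketch of the &lt;code&gt;$/MTok&lt;/code&gt; arithmetic, using two hypothetical one-hour windows at the same hourly rate; the token counts are invented to show the effect and are not the measured windows in the table.&lt;/p&gt;

```python
def usd_per_mtok(hourly_cost_usd: float, window_minutes: float, total_tokens: int) -> float:
    """Cost of the window divided by millions of tokens produced in it."""
    if total_tokens == 0:
        raise ValueError("no tokens in window")
    window_cost = hourly_cost_usd * window_minutes / 60.0
    return window_cost / (total_tokens / 1_000_000)


# Hypothetical windows: same rate and duration, different token throughput.
busy = usd_per_mtok(0.178, 60, 500_000)   # GPU generating most of the hour
gappy = usd_per_mtok(0.178, 60, 100_000)  # long idle gaps between problems

print(f"busy window:  ${busy:.2f}/MTok")
print(f"gappy window: ${gappy:.2f}/MTok")
```

&lt;p&gt;Same laptop, same hour, same bill; the only thing that moved is how many tokens the fixed cost was divided by.&lt;/p&gt;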

&lt;p&gt;Stacked against &lt;a href="https://platform.claude.com/docs/en/about-claude/pricing" rel="noopener noreferrer"&gt;Anthropic's published 2026 pricing&lt;/a&gt; of $3/MTok input + $15/MTok output for Claude Sonnet 4.6, blended around $6 to $9 per million total tokens depending on input:output ratio:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Local Devstral 2 sustained at &lt;strong&gt;$0.30/MTok&lt;/strong&gt;: about &lt;strong&gt;20× to 30× cheaper&lt;/strong&gt; than cloud Sonnet 4.6, depending on where in the $6 to $9 blend you land.&lt;/li&gt;
&lt;li&gt;Local Devstral 2 with idle gaps at &lt;strong&gt;$1.76/MTok&lt;/strong&gt;: about &lt;strong&gt;3× to 5× cheaper&lt;/strong&gt; over the same range.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both ratios assume the laptop is running 24/7 for the 3-year amortization horizon. If you actually use the laptop 8 hours a day, the effective amortization-per-active-hour is 3× higher, which compresses the ratio. If you use it 2 hours a day, it is 12× higher and the ratio collapses. The InferCost &lt;code&gt;UsageReport&lt;/code&gt; CRD is built specifically to compute the active vs idle split over a billing period, which is the FinOps question that nobody else is answering for Apple Silicon.&lt;/p&gt;
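
&lt;p&gt;The utilization effect is plain division. A sketch under the same amortization rate as above; the helper name is ours, and this is not how &lt;code&gt;UsageReport&lt;/code&gt; is implemented.&lt;/p&gt;

```python
def amortization_per_active_hour(rate_24x7_usd: float, active_hours_per_day: float) -> float:
    """Spread a 24/7 amortization rate over only the hours the machine is used."""
    if not (24 >= active_hours_per_day > 0):
        raise ValueError("active_hours_per_day must be in (0, 24]")
    return rate_24x7_usd * 24.0 / active_hours_per_day


RATE_24X7 = 0.17466  # $/hr, the amortization figure from section 8

for hours in (24, 8, 2):
    rate = amortization_per_active_hour(RATE_24X7, hours)
    print(f"{hours:2d} h/day: ${rate:.3f} per active hour ({rate / RATE_24X7:.0f}x)")
```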




&lt;h2&gt;
  
  
  9. What we shipped today, and how to use it
&lt;/h2&gt;

&lt;p&gt;Two releases shipped alongside this post, both of which were necessary to do the cost story above end to end:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/defilantech/llmkube/releases/tag/v0.7.2" rel="noopener noreferrer"&gt;LLMKube v0.7.2&lt;/a&gt;: Apple Silicon power gauges via powermetrics + one-command sudoers install&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Adds 4 new Prometheus gauges (&lt;code&gt;combined / gpu / cpu / ane&lt;/code&gt; watts) to the existing Metal Agent (&lt;a href="https://github.com/defilantech/llmkube/pull/334" rel="noopener noreferrer"&gt;PR #334&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Sourced from a sudo'd &lt;code&gt;powermetrics --samplers cpu_power,gpu_power -i 1000&lt;/code&gt; subprocess&lt;/li&gt;
&lt;li&gt;Opt-in via &lt;code&gt;--apple-power-enabled&lt;/code&gt; flag (defaults off)&lt;/li&gt;
&lt;li&gt;NOPASSWD sudoers fragment with &lt;strong&gt;pinned argv&lt;/strong&gt; for safe install (security audit caught and fixed three findings before merge: argv pinning, &lt;code&gt;--powermetrics-bin&lt;/code&gt; override rejection, absolute &lt;code&gt;/usr/bin/sudo&lt;/code&gt; to defeat $PATH substitution attacks)&lt;/li&gt;
&lt;li&gt;New &lt;code&gt;make install-powermetrics-sudo&lt;/code&gt; and &lt;code&gt;make uninstall-powermetrics-sudo&lt;/code&gt; targets (&lt;a href="https://github.com/defilantech/llmkube/pull/336" rel="noopener noreferrer"&gt;PR #336&lt;/a&gt;) so the privileged install is one command instead of a 5-line &lt;code&gt;sed&lt;/code&gt; + &lt;code&gt;visudo&lt;/code&gt; + &lt;code&gt;install&lt;/code&gt; shell incantation&lt;/li&gt;
&lt;li&gt;Coverage gap closed: extracted helper at 100% test coverage&lt;/li&gt;
&lt;li&gt;Zero impact on existing setups; without the flag, behavior is unchanged&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/defilantech/infercost/releases/tag/v0.3.0" rel="noopener noreferrer"&gt;InferCost v0.3.0&lt;/a&gt;: Apple Silicon (Metal) power collector&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Adds &lt;code&gt;internal/scraper/metal.go&lt;/code&gt; mirroring the existing DCGM scraper (&lt;a href="https://github.com/defilantech/infercost/pull/47" rel="noopener noreferrer"&gt;PR #47&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;New &lt;code&gt;MetalReachable&lt;/code&gt; condition with reasons &lt;code&gt;MetalHealthy / MetalNotConfigured / MetalScrapeError / MetalSamplerOff&lt;/code&gt; so operators on a Mac don't see "DCGM unreachable" messages&lt;/li&gt;
&lt;li&gt;A 10-line dispatcher in the CostProfile reconciler picks the Metal path when &lt;code&gt;MetalEndpoint&lt;/code&gt; is set and &lt;code&gt;looksApple(gpuModel)&lt;/code&gt; matches
&lt;/li&gt;
&lt;li&gt;New &lt;code&gt;apple-m5-max.yaml&lt;/code&gt; sample CostProfile and updated &lt;code&gt;apple-m2-ultra.yaml&lt;/code&gt; with real setup steps&lt;/li&gt;
&lt;li&gt;8 controller tests + 5 scraper tests; existing DCGM tests untouched&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you have a MacBook Pro M5 (or M3/M4 Max with enough memory), the full install is now five short steps:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Install llama.cpp (needed by the Metal Agent for serving GGUF weights)&lt;/span&gt;
brew &lt;span class="nb"&gt;install &lt;/span&gt;llama.cpp

&lt;span class="c"&gt;# 2. Install LLMKube via Helm&lt;/span&gt;
helm repo add llmkube https://defilantech.github.io/llmkube
helm &lt;span class="nb"&gt;install &lt;/span&gt;llmkube llmkube/llmkube &lt;span class="nt"&gt;--version&lt;/span&gt; 0.7.2

&lt;span class="c"&gt;# 3. Build + install the Metal Agent and grant powermetrics access&lt;/span&gt;
git clone https://github.com/defilantech/llmkube &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;cd &lt;/span&gt;llmkube
make install-metal-agent          &lt;span class="c"&gt;# builds + installs the launchd service&lt;/span&gt;
make install-powermetrics-sudo    &lt;span class="c"&gt;# one-command pinned-argv NOPASSWD sudoers install&lt;/span&gt;

&lt;span class="c"&gt;# 4. Restart the agent with --apple-power-enabled in your launchd plist&lt;/span&gt;
&lt;span class="c"&gt;#    (edit ~/Library/LaunchAgents/com.llmkube.metal-agent.plist, then reload)&lt;/span&gt;
launchctl unload ~/Library/LaunchAgents/com.llmkube.metal-agent.plist
launchctl load   ~/Library/LaunchAgents/com.llmkube.metal-agent.plist

&lt;span class="c"&gt;# 5. Deploy InferCost pointed at the agent and apply the sample CostProfile&lt;/span&gt;
helm repo add infercost https://defilantech.github.io/infercost
helm &lt;span class="nb"&gt;install &lt;/span&gt;infercost infercost/infercost &lt;span class="nt"&gt;--version&lt;/span&gt; 0.3.0 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set&lt;/span&gt; metal.endpoint&lt;span class="o"&gt;=&lt;/span&gt;http://localhost:9090/metrics
kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; https://raw.githubusercontent.com/defilantech/infercost/main/config/samples/costprofiles/apple-m5-max.yaml

&lt;span class="c"&gt;# Watch the live reconcile&lt;/span&gt;
kubectl get costprofile apple-m5-max &lt;span class="nt"&gt;-o&lt;/span&gt; yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;make install-powermetrics-sudo&lt;/code&gt; step is the one privileged moment: sudo prompts you for your password, the make target validates the sudoers syntax with &lt;code&gt;visudo -cf&lt;/code&gt; before installing, then echoes the granted command back so you can verify exactly what was authorized. The grant is scoped to &lt;code&gt;/usr/bin/powermetrics --samplers cpu_power\,gpu_power -i [0-9]*&lt;/code&gt; and nothing else. To remove it later, &lt;code&gt;make uninstall-powermetrics-sudo&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Edit &lt;code&gt;purchasePriceUSD&lt;/code&gt;, &lt;code&gt;electricity.ratePerKWh&lt;/code&gt;, and &lt;code&gt;nodeSelector&lt;/code&gt; in the CostProfile to match your reality.&lt;/p&gt;

&lt;p&gt;Both projects are open source and hungry for the kind of feedback that comes from running them on hardware we don't have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;LLMKube&lt;/strong&gt; (github.com/defilantech/llmkube). Kubernetes-native LLM serving operator. Runs llama.cpp and vLLM on NVIDIA, Metal Agent for Apple Silicon. Stars and &lt;code&gt;good-first-issue&lt;/code&gt; PRs both very welcome. The Metal Agent in particular benefits enormously from Mac-having developers running it through &lt;code&gt;--apple-power-enabled&lt;/code&gt;, finding the edge cases we missed, and filing issues.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;InferCost&lt;/strong&gt; (github.com/defilantech/infercost). Kubernetes-native AI FinOps. Cost attribution per workload, namespace, and model, with both NVIDIA (DCGM) and now Apple Silicon (new in v0.3.0) power sources. The &lt;code&gt;UsageReport&lt;/code&gt; CRD is the next thing to push on; if you have a multi-Mac fleet or a mixed NVIDIA+Apple environment, we'd love to hear what reports would help your team.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  10. Reproducibility
&lt;/h2&gt;

&lt;p&gt;Every number in this post traces back to a file you can pull or a benchmark you can re-run.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;LLMKube: github.com/defilantech/llmkube, main branch at commit &lt;code&gt;58a94a7&lt;/code&gt; (PR #334 merged). Issue #335 closed.&lt;/li&gt;
&lt;li&gt;InferCost: github.com/defilantech/infercost, main branch at commit &lt;code&gt;422a4f0&lt;/code&gt; (PR #47 merged). Issue #46 closed.&lt;/li&gt;
&lt;li&gt;Aider Polyglot harness: github.com/Aider-AI/aider with &lt;a href="https://github.com/Aider-AI/polyglot-benchmark" rel="noopener noreferrer"&gt;polyglot-benchmark&lt;/a&gt; exercises.&lt;/li&gt;
&lt;li&gt;Aider Polyglot leaderboard: &lt;a href="https://github.com/Aider-AI/aider/blob/main/aider/website/_data/polyglot_leaderboard.yml" rel="noopener noreferrer"&gt;polyglot_leaderboard.yml on aider main&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;evalplus: github.com/evalplus/evalplus, scored via &lt;code&gt;ganler/evalplus&lt;/code&gt; container for the macOS &lt;code&gt;RLIMIT_AS&lt;/code&gt; workaround.&lt;/li&gt;
&lt;li&gt;Run scripts: &lt;code&gt;aider/run-aider-polyglot.sh&lt;/code&gt; (Qwen) and &lt;code&gt;aider/run-aider-devstral.sh&lt;/code&gt; (Devstral) on this host, both straightforward Bash scripts that invoke the Aider Docker container with the right model id and edit format.&lt;/li&gt;
&lt;li&gt;Power + cost telemetry: &lt;code&gt;/tmp/infercost-m5max-telemetry.csv&lt;/code&gt; (388 power samples) and &lt;code&gt;/tmp/infercost-m5max-tokens.csv&lt;/code&gt; (333 llama-server token-counter samples). Window markers (&lt;code&gt;# QWEN_RUN_END&lt;/code&gt;, &lt;code&gt;# DEVSTRAL_RUN_START&lt;/code&gt;, etc.) inline in the CSV.&lt;/li&gt;
&lt;li&gt;Sample CostProfile: &lt;code&gt;config/samples/costprofiles/apple-m5-max.yaml&lt;/code&gt; in the InferCost repo.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If your reproduction turns up something different, please open an issue against whichever repo is the closest fit. The Apple Silicon path in particular is brand new, and the cohort of people who could give it a real workout is small but motivated.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's next
&lt;/h2&gt;

&lt;p&gt;A few things the data points to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The InferCost &lt;code&gt;UsageReport&lt;/code&gt; CRD needs a real multi-day test on a Mac running mixed inference + idle. The active vs idle split is the FinOps lever for local models, and we have one day of data; we want a month.&lt;/li&gt;
&lt;li&gt;Multi-Mac fleet support in InferCost (auto-discovery of LLMKube Metal Agents via label selector) would let teams deploy InferCost once and have it follow agents around. An issue tracking this is open.&lt;/li&gt;
&lt;li&gt;We benched Devstral 2 on Aider and HumanEval+. We did &lt;em&gt;not&lt;/em&gt; bench it on its native scaffold (Mistral Vibe / OpenHands / Cline). That comparison is the right one for a daily-driver evaluation and it's the next thing we'll publish.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you're running local LLM inference on your own hardware and care about either the serving side (LLMKube) or the cost side (InferCost), the easiest way to push these projects forward is to point them at your environment, file the issue you'd want to fix, and let us know what number would actually help your team.&lt;/p&gt;

&lt;p&gt;Both projects are Apache 2.0. Stars on &lt;a href="https://github.com/defilantech/llmkube" rel="noopener noreferrer"&gt;LLMKube&lt;/a&gt; and &lt;a href="https://github.com/defilantech/infercost" rel="noopener noreferrer"&gt;InferCost&lt;/a&gt; are appreciated and signal the kind of validation that helps prioritize the next round of work.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>ai</category>
      <category>llm</category>
      <category>opensource</category>
    </item>
    <item>
      <title>We ran Qwen3.6-27B on $800 of consumer GPUs, day one: llama.cpp vs vLLM</title>
      <dc:creator>Christopher Maher</dc:creator>
      <pubDate>Fri, 24 Apr 2026 03:06:36 +0000</pubDate>
      <link>https://dev.to/defilan/we-ran-qwen36-27b-on-800-of-consumer-gpus-day-one-llamacpp-vs-vllm-mg1</link>
      <guid>https://dev.to/defilan/we-ran-qwen36-27b-on-800-of-consumer-gpus-day-one-llamacpp-vs-vllm-mg1</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://llmkube.com/blog/qwen3-6-27b-bakeoff" rel="noopener noreferrer"&gt;llmkube.com/blog/qwen3-6-27b-bakeoff&lt;/a&gt;. Cross-posted here for the dev.to audience.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A Kubernetes-native bake-off on 2× RTX 5060 Ti, with reproducible manifests and a cost-per-token number neither cloud nor OSS FinOps tools will tell you.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;This is a runtime comparison, not a model evaluation.&lt;/strong&gt; Both llama.cpp and vLLM serve the same Qwen3.6-27B in every cell; we're measuring how the two serving stacks differ on identical work. Where cloud APIs enter in §8, it's on cost, not capability — this post makes no claim about whether Qwen3.6-27B "beats" GPT-4o or Claude on task quality.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Qwen3.6-27B&lt;/strong&gt; (Tongyi Lab, released 2026-04-21, Apache 2.0) runs on a pair of &lt;strong&gt;RTX 5060 Ti 16 GB&lt;/strong&gt; consumer cards via Kubernetes + LLMKube. Total hardware: about $800 street.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;vLLM wins throughput by 3 to 4×&lt;/strong&gt; at high concurrency thanks to NVFP4 and PagedAttention. &lt;strong&gt;llama.cpp plus TurboQuant wins context&lt;/strong&gt; — we served one 43K-token prompt end-to-end (a single captured sample; higher-concurrency cells timed out on our 300 s harness budget) on hardware where vLLM's in-memory cap is 16K.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost per million tokens is two numbers&lt;/strong&gt;, not one: &lt;strong&gt;$0.13 amortized&lt;/strong&gt; (full cost of ownership) and &lt;strong&gt;$0.010 marginal&lt;/strong&gt; (electricity during active serving). At 32.7% utilization over the bench window, the 13× gap between them is the real FinOps conversation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Everything is reproducible.&lt;/strong&gt; Manifests, harness, and &lt;code&gt;summary.csv&lt;/code&gt; at &lt;a href="https://github.com/defilantech/llmkube-bench" rel="noopener noreferrer"&gt;github.com/defilantech/llmkube-bench&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  1. Why we did this
&lt;/h2&gt;

&lt;p&gt;Two days ago, Tongyi Lab dropped Qwen3.6-27B with the claim it matches frontier agentic-coding models at the 27B parameter count. The community response was predictable: does this actually work locally, or is it another model that benchmarks well but nobody can run? (Note for readers comparing against Qwen3.6-35B-A3B: the 27B is the non-MoE sibling. None of the MoE-specific flags like &lt;code&gt;--cpu-moe&lt;/code&gt; apply here.)&lt;/p&gt;

&lt;p&gt;The ecosystem has a harder time answering "how should I serve it?" There are two dominant open-source inference runtimes for models like this, and they optimize for different things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;llama.cpp&lt;/strong&gt; — ubiquitous, GGUF-based, broad quantization support, runs on almost anything with a GPU. Adopted by the hobbyist and homelab crowd. Recently grew TurboQuant KV-cache compression (&lt;a href="https://github.com/ggml-org/llama.cpp/discussions/20969" rel="noopener noreferrer"&gt;ggml-org/llama.cpp#20969&lt;/a&gt;), pushing achievable context windows on small VRAM into territory nobody else touches.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;vLLM&lt;/strong&gt; — throughput-focused, PagedAttention, continuous batching, FP8/NVFP4 on recent NVIDIA. The production serving runtime for teams running real traffic, targeting data center hardware.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The ecosystem answers "which should I use" with vibes and forum posts. We wanted numbers — from the same hardware, same model, same day the model dropped. If a 27B-class model can genuinely run on a pair of $400 GPUs, the practical question for anyone thinking about on-prem inference is which runtime makes that hardware actually worth something.&lt;/p&gt;

&lt;p&gt;So we benchmarked both, published every configuration, and then turned the token counts into dollars using our companion tool &lt;a href="https://infercost.ai" rel="noopener noreferrer"&gt;InferCost&lt;/a&gt;, so the "is it cheaper than the cloud?" question has an honest answer rather than the usual founder-math.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Hardware and the constraint
&lt;/h2&gt;

&lt;p&gt;The node running this bench is &lt;strong&gt;shadowstack&lt;/strong&gt; — a microk8s cluster on a single box:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;GPUs&lt;/td&gt;
&lt;td&gt;2× NVIDIA GeForce RTX 5060 Ti 16 GB (Blackwell GB206)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPU memory&lt;/td&gt;
&lt;td&gt;15.48 GiB usable per card after driver reserve (30.96 GiB aggregate)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OS&lt;/td&gt;
&lt;td&gt;Ubuntu 24.04.3 LTS, kernel 6.17.0-oem&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Kubernetes&lt;/td&gt;
&lt;td&gt;MicroK8s v1.32.13&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Orchestration&lt;/td&gt;
&lt;td&gt;LLMKube operator (chart 0.7.0) + NVIDIA GPU Operator + DCGM exporter&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Street price&lt;/td&gt;
&lt;td&gt;about $400/card × 2&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The 5060 Ti is a &lt;strong&gt;Blackwell consumer GPU with native FP4 hardware&lt;/strong&gt;. That is load-bearing. Without NVFP4, the 27B class is out of reach. At BF16 the model would need about 55 GB, at FP8 about 28 GB, at NVFP4 about 14 GB. Only the last one fits 2× 16 GB with room for activations and KV cache.&lt;/p&gt;
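
&lt;p&gt;Those footprints follow from parameter count times bytes per weight. Here is a minimal sketch of the arithmetic, assuming a round 27 billion parameters and counting weights only (no quantization scales, vision encoder, activations, or KV cache). The function name and the byte-width table are ours, for illustration.&lt;/p&gt;

```python
# Rough weight-only footprint of a dense model at different precisions.
# Assumption: exactly 27e9 parameters. Real checkpoints carry extra tensors
# (scales, layers kept at higher precision), so published sizes run a bit higher.
BYTES_PER_WEIGHT = {"BF16": 2.0, "FP8": 1.0, "NVFP4": 0.5}


def weight_footprint_gb(num_params: float, fmt: str) -> float:
    """Return the weight-only size in decimal gigabytes."""
    return num_params * BYTES_PER_WEIGHT[fmt] / 1e9


footprints = {fmt: weight_footprint_gb(27e9, fmt) for fmt in BYTES_PER_WEIGHT}
for fmt, gb in footprints.items():
    print(f"{fmt}: about {gb:.1f} GB of weights")
```

&lt;p&gt;The idealized results (54, 27, and 13.5 GB) land just under the rounded figures quoted above, which is what you would expect once real checkpoint overhead is added.&lt;/p&gt;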

&lt;p&gt;&lt;strong&gt;The VRAM budget is the whole story.&lt;/strong&gt; On enterprise hardware (H100, A100, even the 3090 that the community's "qwen 27B on a 3090" discourse is built on), most of this bake-off's complexity disappears. On 2× 16 GB consumer cards you are constantly one configuration flag away from an out-of-memory crash, and the runtime that lets you navigate that wins real users.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. The first attempt that didn't work
&lt;/h2&gt;

&lt;p&gt;Our original target was &lt;code&gt;Qwen/Qwen3.5-27B-FP8&lt;/code&gt; (Qwen's official FP8 safetensors, the model everyone was excited about). On paper: 28 GB weights, TP=2, about 14 GB per shard. Should fit.&lt;/p&gt;

&lt;p&gt;It doesn't. Qwen's 27B-class FP8 release is a &lt;strong&gt;VLM&lt;/strong&gt;: the checkpoint includes a vision encoder that stays resident in VRAM whether or not you ever send an image. We made three successive attempts on vLLM, each checked against the crash logs:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Default config.&lt;/strong&gt; OOM during &lt;code&gt;profile_run&lt;/code&gt; on the vision encoder:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CUDA out of memory. Tried to allocate 576.00 MiB.
GPU 0 has a total capacity of 15.48 GiB of which 175.19 MiB is free.
This process has 15.30 GiB memory in use.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2. &lt;code&gt;--limit-mm-per-prompt image=0,video=0&lt;/code&gt;, &lt;code&gt;maxModelLen&lt;/code&gt; 16K, &lt;code&gt;max-num-batched-tokens&lt;/code&gt; 4K.&lt;/strong&gt; Skipped multimodal dummy inputs during profile. The vision encoder weights stay resident. OOM now at &lt;code&gt;determine_available_memory&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Tried to allocate 1.19 GiB.
GPU 0 has 1.02 GiB free.
This process has 14.45 GiB in use.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;3. &lt;code&gt;--gpu-memory-utilization 0.95&lt;/code&gt;, &lt;code&gt;PYTORCH_ALLOC_CONF=expandable_segments:True&lt;/code&gt;.&lt;/strong&gt; Pushed against the wall:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Tried to allocate 32.00 MiB.
GPU 0 has 3.19 MiB free.
This process has 15.47 GiB in use.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;15.47 of 15.48 GiB. No knob left. &lt;strong&gt;Qwen3.5-27B-FP8 cannot be served via vLLM on 2× 16 GB consumer cards in any configuration we found.&lt;/strong&gt; A 3090 or 4090 (24 GB) would have considerably more headroom for the vision encoder plus KV cache (we didn't reproduce on one, but it's plausible the default config would fit there). That's a real hardware-sizing footnote to the "run 27B locally" discourse, since not every pair of 16 GB cards is enough.&lt;/p&gt;

&lt;p&gt;Then Qwen3.6-27B dropped, and within 24 hours the community had published &lt;strong&gt;NVFP4&lt;/strong&gt; quants that halve the weight footprint again. That is the pivot that made this bench possible.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Method
&lt;/h2&gt;

&lt;p&gt;Both runtimes run Qwen3.6-27B, served via LLMKube as a Kubernetes Deployment with OpenAI-compatible endpoints, and are benchmarked against each other on identical workloads. All manifests live in the public repo.&lt;/p&gt;

&lt;h3&gt;
  
  
  llama.cpp candidate
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Source&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;unsloth/Qwen3.6-27B-GGUF&lt;/code&gt; Q4_K_M (~17 GB)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Parallelism&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;split-mode=layer&lt;/code&gt; across both GPUs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;KV cache&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;TurboQuant&lt;/strong&gt; &lt;code&gt;tbqp3&lt;/code&gt; (keys) + &lt;code&gt;tbq3&lt;/code&gt; (values) — about 3 bits/element&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Max context&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;65,536&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Image&lt;/td&gt;
&lt;td&gt;AmesianX's TurboQuant fork v1.5.2, built from source (Kaniko manifest in the bench repo; retarget to your own registry to reproduce)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Flash attention&lt;/td&gt;
&lt;td&gt;on&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Parallel slots&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;16 for short patterns&lt;/strong&gt; (chat, coding, agentic), &lt;strong&gt;1 for long-context patterns&lt;/strong&gt; (&lt;code&gt;long_context&lt;/code&gt;, &lt;code&gt;long_context_extreme&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The TurboQuant build is AmesianX's llama.cpp fork, which implements the KV-cache compression algorithm from &lt;a href="https://arxiv.org/pdf/2504.19874" rel="noopener noreferrer"&gt;Google Research's TurboQuant paper&lt;/a&gt;. The scheme is asymmetric: QJL correction (tbqp*) is applied to keys only, because keys feed Q·K inner products while values go through a softmax-weighted sum. Our own internal benchmarks show about 60% KV cache reduction versus f16 at the same context, which is table stakes for pushing context on small VRAM.&lt;/p&gt;

&lt;p&gt;The slot count asymmetry matters and we want to be upfront about it: llama.cpp divides &lt;code&gt;--ctx-size&lt;/code&gt; by &lt;code&gt;--parallel&lt;/code&gt; to get per-slot context. With &lt;code&gt;parallelSlots=16&lt;/code&gt; and 65K total context, each slot gets 4K tokens, which is enough for chat/coding/agentic prompts but rejects 5K+ long-context requests. Dropping to &lt;code&gt;parallelSlots=1&lt;/code&gt; gives every request the full 65K, at the cost of serving concurrent long-context requests from a queue. Readers should treat llama.cpp's &lt;code&gt;long_context&lt;/code&gt; c=16/c=64 numbers as queue-behavior measurements, not throughput measurements.&lt;/p&gt;
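
&lt;p&gt;The per-slot division is easy to check by hand. The sketch below uses the 65,536-token context and the 16/1 slot counts from the table above; the &lt;code&gt;fits&lt;/code&gt; helper is a simplification of ours (prompt plus output must fit in one slot), not llama.cpp's actual admission logic, and the prompt sizes are the approximate workload shapes.&lt;/p&gt;

```python
# Per-slot context under even slot splitting: total context / parallel slots.
# Assumption: a request fits when prompt + output tokens stay within one slot.


def per_slot_context(ctx_size: int, parallel: int) -> int:
    """Tokens of context each slot receives when ctx_size is split evenly."""
    return ctx_size // parallel


def fits(prompt_tokens: int, output_tokens: int, slot_ctx: int) -> bool:
    """True when prompt plus generated tokens stay within one slot's context."""
    return slot_ctx >= prompt_tokens + output_tokens


short_cfg = per_slot_context(65_536, 16)  # short-pattern config
long_cfg = per_slot_context(65_536, 1)    # long-context config

print("coding on 16 slots:", short_cfg, fits(1_024, 1_024, short_cfg))
print("long_context on 16 slots:", short_cfg, fits(5_000, 1_024, short_cfg))
print("long_context_extreme on 1 slot:", long_cfg, fits(43_000, 1_024, long_cfg))
```

&lt;p&gt;With 16 slots each request sees 4,096 tokens, so a 1K-in / 1K-out coding turn fits and a roughly 5K-token prompt does not; with a single slot the full 65,536 tokens are available to a 43K-token prompt.&lt;/p&gt;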

&lt;h3&gt;
  
  
  vLLM candidate
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Source&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;sakamakismile/Qwen3.6-27B-NVFP4&lt;/code&gt; (~14 GB)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Parallelism&lt;/td&gt;
&lt;td&gt;tensor-parallel (TP=2)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Quantization&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;compressed-tensors&lt;/code&gt; wrapping NVFP4 (Blackwell-native 4-bit float)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;KV cache&lt;/td&gt;
&lt;td&gt;FP8 E4M3 (8 bits)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Max context&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;16,384&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Attention backend&lt;/td&gt;
&lt;td&gt;FLASHINFER&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CUDA graphs&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;disabled&lt;/strong&gt; (&lt;code&gt;--enforce-eager&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Prefix caching&lt;/td&gt;
&lt;td&gt;on&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Chunked prefill&lt;/td&gt;
&lt;td&gt;on&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Image&lt;/td&gt;
&lt;td&gt;&lt;code&gt;vllm/vllm-openai:latest&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Two forced choices here deserve a note:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;--enforce-eager&lt;/code&gt;&lt;/strong&gt; because CUDA graph capture for NVFP4 plus VLM weights plus KV cache exhausts the 15.48 GiB budget before KV init even starts. Skipping graph capture costs about 10 to 15% throughput, which becomes part of the fair comparison: on this hardware class vLLM gives up one of its own optimizations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;maxModelLen: 16384&lt;/code&gt;&lt;/strong&gt; is not "the model's ceiling". It is what fits after NVFP4 weights (14 GB / 2 = 7 GB/shard), vision encoder (~2 GB), KV cache at FP8, and activations. 32K OOMs during profile; 16K fits with about 1 GiB headroom.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Workloads
&lt;/h3&gt;

&lt;p&gt;Five patterns × four concurrency levels per runtime:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Pattern&lt;/th&gt;
&lt;th&gt;Shape&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;chat&lt;/td&gt;
&lt;td&gt;128-in / 256-out, 20 prompts&lt;/td&gt;
&lt;td&gt;Interactive baseline&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;coding&lt;/td&gt;
&lt;td&gt;1K-in / 1K-out, 20 prompts&lt;/td&gt;
&lt;td&gt;Typical code-gen turn&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;long_context&lt;/td&gt;
&lt;td&gt;~5K-in / 1K-out, 10 prompts&lt;/td&gt;
&lt;td&gt;Code review, RAG-heavy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;long_context_extreme&lt;/td&gt;
&lt;td&gt;~43K-in / 1K-out, 10 prompts&lt;/td&gt;
&lt;td&gt;vLLM's 16K cap cannot attempt this&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;agentic&lt;/td&gt;
&lt;td&gt;4K shared prefix + 512 delta / 512-out, 20 prompts&lt;/td&gt;
&lt;td&gt;Stresses prefix caching&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Concurrency &lt;code&gt;1, 4, 16, 64&lt;/code&gt;. Per cell: 2 min warmup (discarded) + 5 min measurement. Temperature 0, seed 42, streaming on.&lt;/p&gt;

&lt;p&gt;The full workload matrix is 40 cells (5 × 4 × 2 runtimes). We run 36 of them. &lt;code&gt;long_context_extreme&lt;/code&gt; is not attempted on vLLM because its 16K cap would reject every prompt before submission. That asymmetry is one of the bake-off's findings, not a methodology gap.&lt;/p&gt;
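
&lt;p&gt;For readers who want to regenerate the cell list, here is a small sketch of the matrix enumeration. The pattern names and concurrency levels come from the tables above; the runtime labels and the &lt;code&gt;SKIPPED&lt;/code&gt; set are our own illustrative names, not identifiers from the bench harness.&lt;/p&gt;

```python
from itertools import product

PATTERNS = ["chat", "coding", "long_context", "long_context_extreme", "agentic"]
CONCURRENCY = [1, 4, 16, 64]
RUNTIMES = ["llama.cpp", "vllm"]

# (runtime, pattern) pairs that are not attempted because every prompt
# would exceed that runtime's configured context cap.
SKIPPED = {("vllm", "long_context_extreme")}

full_matrix = list(product(RUNTIMES, PATTERNS, CONCURRENCY))
cells_to_run = [cell for cell in full_matrix if (cell[0], cell[1]) not in SKIPPED]

print(f"full matrix: {len(full_matrix)} cells, attempted: {len(cells_to_run)} cells")
```

&lt;p&gt;Dropping one pattern for one runtime removes four concurrency levels, which is where 40 minus 4 equals 36 comes from.&lt;/p&gt;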

&lt;h2&gt;
  
  
  5. Results: throughput and latency
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Single-request latency (c=1)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;pattern&lt;/th&gt;
&lt;th&gt;llama.cpp TTFT p50&lt;/th&gt;
&lt;th&gt;vLLM TTFT p50&lt;/th&gt;
&lt;th&gt;Winner&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;chat&lt;/td&gt;
&lt;td&gt;208 ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;157 ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;vLLM&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;coding&lt;/td&gt;
&lt;td&gt;413 ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;106 ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;vLLM&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;agentic&lt;/td&gt;
&lt;td&gt;911 ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;409 ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;vLLM&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;long_context (5K)&lt;/td&gt;
&lt;td&gt;2,279 ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;581 ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;vLLM&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;vLLM is faster at single-request latency across the board, typically 2 to 4× on prefill-heavy patterns. llama.cpp plus TurboQuant pays a prefill tax: compressing the KV cache to about 3 bits per element is memory-cheap and compute-expensive. On short prompts the gap is narrow; on long prompts it opens up.&lt;/p&gt;
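
&lt;p&gt;The table has no ratio column, so here is a short sketch that derives the per-pattern speedup from the TTFT p50 values above. Only the numbers come from the table; the variable names are ours.&lt;/p&gt;

```python
# TTFT p50 at c=1 in milliseconds, copied from the table above:
# pattern -> (llama.cpp, vLLM)
TTFT_MS = {
    "chat": (208, 157),
    "coding": (413, 106),
    "agentic": (911, 409),
    "long_context": (2279, 581),
}

# Speedup is llama.cpp's time to first token divided by vLLM's.
speedups = {
    pattern: round(llama_ms / vllm_ms, 1)
    for pattern, (llama_ms, vllm_ms) in TTFT_MS.items()
}

for pattern, ratio in speedups.items():
    print(f"{pattern}: vLLM reaches the first token {ratio}x sooner")
```

&lt;p&gt;Chat comes out at roughly 1.3×, while coding, agentic, and long_context fall between 2.2× and 3.9×.&lt;/p&gt;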

&lt;p&gt;&lt;strong&gt;Quantization caveat:&lt;/strong&gt; these numbers compare Q4_K_M (llama.cpp) against NVFP4 (vLLM). They are not the same quantization, and on this hardware there is no apples-to-apples option: llama.cpp doesn't ship an NVFP4 runtime, and Q4_K_M has no vLLM implementation. We've filled out a side-by-side output-quality check in &lt;a href="https://github.com/defilantech/llmkube-bench/blob/main/docs/QUALITY-GATE.md" rel="noopener noreferrer"&gt;QUALITY-GATE.md&lt;/a&gt; so readers can judge whether the two quants produce comparable answers at this parameter count. Read the speed numbers as "at each runtime's native quant on this hardware," not "at identical model quality."&lt;/p&gt;

&lt;h3&gt;
  
  
  Throughput under load (c=64)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;pattern&lt;/th&gt;
&lt;th&gt;llama.cpp tok/s&lt;/th&gt;
&lt;th&gt;vLLM tok/s&lt;/th&gt;
&lt;th&gt;Ratio&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;chat&lt;/td&gt;
&lt;td&gt;94&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;345&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;3.7×&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;coding&lt;/td&gt;
&lt;td&gt;133 (60% success)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;377&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;2.8×&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;agentic&lt;/td&gt;
&lt;td&gt;72&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;262&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;3.6×&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This is vLLM's home turf. PagedAttention plus continuous batching turn 64 concurrent requests into about 90% GPU utilization; llama.cpp's slot-based scheduling (even with 16 parallel slots) serializes far more aggressively. The coding c=64 drop to 60% success on llama.cpp is KV cache saturation: at 16 slots by about 2K per-slot context, heavy coding prompts overflow.&lt;/p&gt;

&lt;h3&gt;
  
  
  Inter-token latency
&lt;/h3&gt;

&lt;p&gt;Stable and tight on both runtimes. Median ITL:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;llama.cpp:&lt;/strong&gt; 49 to 175 ms/token across patterns and concurrencies&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;vLLM:&lt;/strong&gt; 64 to 67 ms/token across patterns and concurrencies (remarkably flat, because continuous batching amortizes decode across the batch)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The llama.cpp ITL spread widens at high concurrency as slot contention kicks in. vLLM's is basically a constant, which is what makes it good for conversational workloads where you care about per-token cadence.&lt;/p&gt;

&lt;h3&gt;
  
  
  The honest version
&lt;/h3&gt;

&lt;p&gt;vLLM wins the throughput axis. That's a real result, not a function of tuning. On 2× 16 GB consumer hardware with Qwen3.6-27B, &lt;strong&gt;if you're trying to maximize requests per second, vLLM is the answer&lt;/strong&gt;, and it wins while giving up about 10 to 15% of its own throughput to &lt;code&gt;--enforce-eager&lt;/code&gt; (disabled CUDA graphs were required to fit VRAM). The NVFP4 kernels on Blackwell, PagedAttention's batching, and continuous prefill scheduling all compound even with that handicap.&lt;/p&gt;

&lt;p&gt;Except…&lt;/p&gt;

&lt;h2&gt;
  
  
  6. Results: context
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The 5K baseline
&lt;/h3&gt;

&lt;p&gt;Both runtimes serve &lt;code&gt;long_context&lt;/code&gt; (about 5K input tokens, 1K output) at c=1 in about 13 seconds end-to-end. llama.cpp measures 20 tok/s, vLLM 19 tok/s. &lt;strong&gt;Near parity&lt;/strong&gt; at this context size.&lt;/p&gt;

&lt;p&gt;At higher concurrency the story differs because we configured llama.cpp with &lt;code&gt;parallelSlots=1&lt;/code&gt; to give every request the full 65K context (required for the extreme pattern, see below). Concurrency c=16 and c=64 on llama.cpp show queue saturation: the harness sends 16 or 64 concurrent requests, but the server processes them serially. That's not a throughput measurement, it's a queue measurement. On production llama.cpp with &lt;code&gt;parallelSlots=16&lt;/code&gt; and a smaller per-request context, short-prompt throughput would match our earlier numbers, but then you can't serve 43K prompts.&lt;/p&gt;

&lt;h3&gt;
  
  
  Which brings us to the real test
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;long_context_extreme: a roughly 43,000-token prompt in, 1024 tokens out.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;vLLM, as configured here, can't attempt this.&lt;/strong&gt; Its &lt;code&gt;maxModelLen&lt;/code&gt; is 16K, set that way because 32K OOMs during profiling on this hardware. A 43K-token request is rejected before it reaches inference. We did not explore &lt;code&gt;--swap-space&lt;/code&gt; CPU offload, which in principle could trade a lot of latency for more context; that's a follow-up. Out of the box on 2× 16 GB consumer cards with Qwen3.6-27B NVFP4, we did not find an in-memory configuration that serves 43K.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;llama.cpp plus TurboQuant served it.&lt;/strong&gt; One sample captured at c=16 end-to-end:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Prompt tokens: about 43,000&lt;/li&gt;
&lt;li&gt;Prefill time (TTFT): &lt;strong&gt;186 seconds&lt;/strong&gt; (3.1 min)&lt;/li&gt;
&lt;li&gt;Decode rate: &lt;strong&gt;171 ms/token&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Output: 1024 tokens in about 175 seconds&lt;/li&gt;
&lt;li&gt;Total wall time: about 6 minutes per request&lt;/li&gt;
&lt;/ul&gt;
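
&lt;p&gt;The totals in that list follow directly from the two measured rates. A quick check of the arithmetic, using only the figures quoted above (the variable names are ours, not the harness's):&lt;/p&gt;

```python
# Figures quoted in the long_context_extreme sample above.
PREFILL_S = 186.0           # time to first token, in seconds
DECODE_MS_PER_TOKEN = 171   # decode rate, in milliseconds per token
OUTPUT_TOKENS = 1024

decode_s = OUTPUT_TOKENS * DECODE_MS_PER_TOKEN / 1000
total_s = PREFILL_S + decode_s

print(f"decode: {decode_s:.0f} s")
print(f"total:  {total_s:.0f} s ({total_s / 60:.1f} min)")
```

&lt;p&gt;Prefill and decode each take about half of the wall time, so speeding up either one alone would not get this request under three minutes.&lt;/p&gt;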

&lt;p&gt;This is not fast. It's not meant to be fast. What it is, is &lt;strong&gt;possible&lt;/strong&gt;. TurboQuant's roughly 3-bit KV cache makes the memory math work where FP16 or FP8 KV can't. On the same hardware, at the same moment, one runtime cannot attempt the workload and the other completes it.&lt;/p&gt;
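
&lt;p&gt;To make "the memory math" concrete, here is a rough sketch of how KV cache size scales with bits per value. It assumes a standard attention layout (one K and one V tensor per layer) and ignores quantization scale overhead. The model shape below is hypothetical, picked only to give round numbers; it is not the published Qwen3.6-27B configuration.&lt;/p&gt;

```python
def kv_cache_gib(tokens, bits_per_value, layers, kv_heads, head_dim):
    """Size of a standard attention KV cache in GiB.

    Two tensors (K and V) per layer, each holding kv_heads * head_dim
    values per token, stored at bits_per_value bits each.
    """
    values = 2 * layers * kv_heads * head_dim * tokens
    return values * bits_per_value / 8 / 2**30


# HYPOTHETICAL model shape, chosen only to make the arithmetic concrete.
# It is not the published Qwen3.6-27B configuration.
SHAPE = dict(layers=64, kv_heads=8, head_dim=128)
CONTEXT_TOKENS = 65_536  # the 65K per-request context from the llama.cpp config

for label, bits in [("FP16", 16), ("FP8", 8), ("~3-bit", 3)]:
    print(f"{label:7s} {kv_cache_gib(CONTEXT_TOKENS, bits, **SHAPE):5.1f} GiB")
```

&lt;p&gt;Whatever the real shape is, the ratio holds: a 3-bit cache is 3/16 the size of an FP16 one, and that difference has to fit next to the weights in 32 GB of total VRAM.&lt;/p&gt;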

&lt;p&gt;The higher-concurrency cells for this pattern hit our harness's 300s per-request timeout because decode plus prefill combined exceeds 300s. Bumping the harness timeout to 600s would capture all four c-levels cleanly; that's a follow-up. The c=1 and c=16 samples are enough to prove the capability.&lt;/p&gt;

&lt;h3&gt;
  
  
  The real tradeoff
&lt;/h3&gt;

&lt;p&gt;Throughput versus context is the tradeoff, not "vLLM is better" or "llama.cpp is better". On this hardware:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Production chat, interactive coding, short agentic loops&lt;/strong&gt; (≤ 8K context): &lt;strong&gt;vLLM.&lt;/strong&gt; 3 to 4× throughput, lower TTFT, better ITL stability.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Long-document review, RAG with full-file context, overnight batch agentic runs over 40K+ token codebases&lt;/strong&gt; (&amp;gt; 16K context): &lt;strong&gt;llama.cpp plus TurboQuant.&lt;/strong&gt; Slower per token, but as configured on this hardware it's the only runtime that serves the workload at all.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For many real workloads the answer is "run both." vLLM for the chat endpoint, llama.cpp for the batch endpoint that processes whole PRs overnight.&lt;/p&gt;
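
&lt;p&gt;A "run both" setup needs something in front that decides where each request goes. A minimal sketch, assuming the router knows the prompt's token count up front. The Service URLs are hypothetical, and we read "16K" and "65K" as 16,384 and 65,536 tokens.&lt;/p&gt;

```python
VLLM_MAX_MODEL_LEN = 16_384   # vLLM maxModelLen in this bench (16K, assumed 16,384)
LLAMACPP_CONTEXT = 65_536     # llama.cpp per-request context (65K, assumed 65,536)

# Hypothetical in-cluster Service URLs; substitute your own.
ENDPOINTS = {
    "vllm": "http://qwen-vllm.bench.svc:8000/v1",
    "llamacpp": "http://qwen-llamacpp.bench.svc:8080/v1",
}


def pick_runtime(prompt_tokens, max_output_tokens):
    """Return the name of the runtime whose context holds prompt plus output."""
    needed = prompt_tokens + max_output_tokens
    if VLLM_MAX_MODEL_LEN >= needed:
        return "vllm"      # faster path for anything that fits
    if LLAMACPP_CONTEXT >= needed:
        return "llamacpp"  # slower, but the only one with room
    raise ValueError(f"{needed} tokens exceeds every configured context window")


print(pick_runtime(5_000, 1_000))    # the long_context pattern
print(pick_runtime(43_000, 1_024))   # the long_context_extreme pattern
```

&lt;p&gt;Preferring vLLM whenever the request fits keeps the interactive path fast and reserves the single llama.cpp slot for prompts that need it.&lt;/p&gt;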

&lt;h2&gt;
  
  
  7. What it costs
&lt;/h2&gt;

&lt;p&gt;Throughput numbers are interesting. Dollars per token are what actually get budgets approved.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://infercost.ai" rel="noopener noreferrer"&gt;InferCost&lt;/a&gt; is our companion tool: a Kubernetes operator that reads real-time GPU power draw from DCGM, combines it with hardware amortization and electricity rates declared on a &lt;code&gt;CostProfile&lt;/code&gt; CR, and computes the real cost of inference. It discovers inference pods by the &lt;code&gt;inference.llmkube.dev/model&lt;/code&gt; label LLMKube stamps on each Deployment, scrapes each pod's &lt;code&gt;/metrics&lt;/code&gt; endpoint directly (no Prometheus required), and writes cost attribution into a &lt;code&gt;UsageReport&lt;/code&gt; custom resource.&lt;/p&gt;

&lt;p&gt;Here's a live &lt;code&gt;UsageReport&lt;/code&gt; status from shadowstack, captured after a 10-minute mixed workload:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="s"&gt;$ kubectl -n bench get usagereport bench-window -o yaml&lt;/span&gt;
&lt;span class="nn"&gt;...&lt;/span&gt;
&lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;period&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2026-04-23"&lt;/span&gt;
  &lt;span class="na"&gt;periodStart&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2026-04-23T00:00:00Z"&lt;/span&gt;
  &lt;span class="na"&gt;periodEnd&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;   &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2026-04-23T21:21:42Z"&lt;/span&gt;
  &lt;span class="na"&gt;inputTokens&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  &lt;span class="m"&gt;638&lt;/span&gt;
  &lt;span class="na"&gt;outputTokens&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;12400&lt;/span&gt;
  &lt;span class="na"&gt;activeEnergyKWh&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;     &lt;span class="m"&gt;0.645&lt;/span&gt;
  &lt;span class="na"&gt;activeHoursInPeriod&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;4.53&lt;/span&gt;
  &lt;span class="na"&gt;totalHoursInPeriod&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  &lt;span class="m"&gt;21.36&lt;/span&gt;
  &lt;span class="na"&gt;utilizationPercent&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  &lt;span class="m"&gt;21.20&lt;/span&gt;
  &lt;span class="na"&gt;estimatedCostUSD&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;             &lt;span class="m"&gt;0.83&lt;/span&gt;
  &lt;span class="na"&gt;costPerMillionTokens&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;         &lt;span class="m"&gt;63.79&lt;/span&gt;   &lt;span class="c1"&gt;# amortized&lt;/span&gt;
  &lt;span class="na"&gt;marginalCostPerMillionTokens&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  &lt;span class="m"&gt;3.96&lt;/span&gt;   &lt;span class="c1"&gt;# electricity during active serving&lt;/span&gt;
  &lt;span class="na"&gt;byModel&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;     &lt;span class="s"&gt;qwen36-27b-llamacpp&lt;/span&gt;
    &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;bench&lt;/span&gt;
    &lt;span class="na"&gt;inputTokens&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  &lt;span class="m"&gt;638&lt;/span&gt;
    &lt;span class="na"&gt;outputTokens&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;12400&lt;/span&gt;
    &lt;span class="na"&gt;costPerMillionTokens&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;63.79&lt;/span&gt;
    &lt;span class="na"&gt;estimatedCostUSD&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.83&lt;/span&gt;
  &lt;span class="na"&gt;byNamespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;bench&lt;/span&gt;
    &lt;span class="na"&gt;tokenCount&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;13038&lt;/span&gt;
    &lt;span class="na"&gt;estimatedCostUSD&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.83&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The numbers look alarming at first: &lt;strong&gt;$63.79/MTok amortized&lt;/strong&gt; for a tiny workload against a day's worth of hardware amortization. That's the point. At 21.2% utilization over this window, amortized is &lt;strong&gt;16× higher than marginal&lt;/strong&gt;. Scale up the utilization and the amortized number drops toward the marginal one; that's what the bench window numbers below capture.&lt;/p&gt;
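
&lt;p&gt;You can recompute the derived fields from the raw ones in that status block. This is our own back-of-the-envelope check, not InferCost's code; the only value not in the block is the $0.08/kWh electricity rate from the &lt;code&gt;CostProfile&lt;/code&gt;.&lt;/p&gt;

```python
# Values copied from the UsageReport status above.
input_tokens = 638
output_tokens = 12_400
active_energy_kwh = 0.645
active_hours = 4.53
total_hours = 21.36
estimated_cost_usd = 0.83

ELECTRICITY_USD_PER_KWH = 0.08  # rate declared on the CostProfile

total_mtok = (input_tokens + output_tokens) / 1_000_000
utilization_pct = 100 * active_hours / total_hours
amortized = estimated_cost_usd / total_mtok
marginal = active_energy_kwh * ELECTRICITY_USD_PER_KWH / total_mtok

print(f"utilization: {utilization_pct:.1f}%")
print(f"amortized:   ${amortized:.2f}/MTok")
print(f"marginal:    ${marginal:.2f}/MTok")
print(f"ratio:       {amortized / marginal:.0f}x")
```

&lt;p&gt;The amortized figure lands a few cents off the reported 63.79 because &lt;code&gt;estimatedCostUSD&lt;/code&gt; is shown rounded to two decimals.&lt;/p&gt;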

&lt;p&gt;The full bench window (Apr 23, 2026, 00:00 UTC → 10:07 UTC, ~10 hours), from &lt;code&gt;summary.csv&lt;/code&gt; cross-referenced with the &lt;code&gt;CostProfile&lt;/code&gt; spec:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Total input tokens&lt;/td&gt;
&lt;td&gt;2,518,242&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total output tokens&lt;/td&gt;
&lt;td&gt;1,233,143&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total tokens&lt;/td&gt;
&lt;td&gt;3,751,385&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Active GPU energy&lt;/td&gt;
&lt;td&gt;0.459 kWh&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Utilization (active hours / wall-clock hours)&lt;/td&gt;
&lt;td&gt;32.7%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total dollar cost (amortization + electricity)&lt;/td&gt;
&lt;td&gt;$0.50&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Hardware amortization on the &lt;code&gt;CostProfile&lt;/code&gt; spec: 2× RTX 5060 Ti at $480 each = $960, 3-year useful life, 5% annual maintenance. Electricity $0.08/kWh, PUE 1.0.&lt;/p&gt;
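
&lt;p&gt;One plausible reading of those inputs, assuming straight-line depreciation and maintenance charged each year as a percentage of the purchase price. InferCost's exact formula may differ; this is only a sanity check against the $0.50 total in the table.&lt;/p&gt;

```python
HOURS_PER_YEAR = 365 * 24

purchase_usd = 2 * 480      # 2x RTX 5060 Ti
useful_life_years = 3
maintenance_rate = 0.05     # per year, ASSUMED to apply to the purchase price

depreciation_per_year = purchase_usd / useful_life_years
maintenance_per_year = purchase_usd * maintenance_rate
hardware_per_hour = (depreciation_per_year + maintenance_per_year) / HOURS_PER_YEAR

window_hours = 10 + 7 / 60          # 00:00 to 10:07 UTC
active_electricity = 0.459 * 0.08   # active kWh times $/kWh

print(f"hardware:           ${hardware_per_hour:.4f}/hour")
print(f"hardware in window: ${hardware_per_hour * window_hours:.2f}")
print(f"active electricity: ${active_electricity:.3f}")
```

&lt;p&gt;Under these assumptions hardware and active electricity together come to about $0.46 for the window. The rest of the $0.50 would be idle power draw, which the table does not break out.&lt;/p&gt;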

&lt;h3&gt;
  
  
  The two numbers
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;th&gt;Which question it answers&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;&lt;code&gt;costPerMillionTokens&lt;/code&gt;&lt;/strong&gt; (amortized)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$0.13&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;"What did my hardware cost per token I served today?"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;code&gt;marginalCostPerMillionTokens&lt;/code&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$0.010&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;"What did the electricity actually cost to generate those tokens?"&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Both numbers are correct. They answer different questions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Amortized $0.13/MTok&lt;/strong&gt; spreads the full cost of hardware ownership (amortization, idle electricity, active electricity) across whatever tokens you served today. It answers "was today's inference worth what we paid for the hardware?" At 32.7% utilization, you're leaving about two-thirds of the compute capacity you already bought idle, and the amortized rate reflects that.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Marginal $0.010/MTok&lt;/strong&gt; includes only the electricity drawn during active serving. It answers "what did these specific tokens cost me beyond what I'd be paying anyway?", the relevant comparison when cloud APIs only bill marginally.&lt;/p&gt;

&lt;p&gt;The 13× gap between them is the entire FinOps conversation. As utilization rises the amortized number falls toward the marginal one, though hardware amortization keeps it from ever quite getting there; at low utilization they diverge by more than an order of magnitude. Neither is the "right" number. They describe different things.&lt;/p&gt;
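
&lt;p&gt;To see how the gap moves with utilization, here is a simplified model built from the bench-window table. It assumes tokens and active energy scale linearly with utilization while everything else (amortization plus idle draw) stays fixed for the window. That is our simplification, not how InferCost computes anything.&lt;/p&gt;

```python
# Bench-window figures from the tables above.
TOTAL_MTOK = 3.751385
TOTAL_COST_USD = 0.50
ACTIVE_KWH = 0.459
USD_PER_KWH = 0.08
UTILIZATION = 0.327

active_cost = ACTIVE_KWH * USD_PER_KWH
fixed_cost = TOTAL_COST_USD - active_cost   # amortization plus idle draw
marginal = active_cost / TOTAL_MTOK


def amortized_at(utilization):
    """Amortized $/MTok if tokens and active energy scaled linearly with utilization."""
    scale = utilization / UTILIZATION
    return (fixed_cost + active_cost * scale) / (TOTAL_MTOK * scale)


for u in (0.10, 0.327, 0.60, 1.00):
    print(f"{u:6.1%}  amortized ${amortized_at(u):.3f}/MTok  "
          f"gap {amortized_at(u) / marginal:.1f}x")
```

&lt;p&gt;In this model the gap shrinks steadily as utilization climbs, because the same fixed cost is spread over more tokens.&lt;/p&gt;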

&lt;h2&gt;
  
  
  8. Cloud comparison
&lt;/h2&gt;

&lt;p&gt;Cloud APIs bill marginally. That's how they work: no inference, no invoice. So the fair comparison against on-prem is &lt;strong&gt;marginal versus marginal&lt;/strong&gt;. Cloud prices below are &lt;strong&gt;output token pricing&lt;/strong&gt; on public pricing pages as of April 2026; check each provider for current rates and input-vs-output splits.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Provider / Model&lt;/th&gt;
&lt;th&gt;Output $/MTok&lt;/th&gt;
&lt;th&gt;On-prem ratio (marginal)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;shadowstack marginal&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$0.010&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1×&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OpenAI GPT-4o&lt;/td&gt;
&lt;td&gt;$10.00&lt;/td&gt;
&lt;td&gt;1,000× cheaper on-prem&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Google Gemini 2.5 Pro&lt;/td&gt;
&lt;td&gt;$10.00&lt;/td&gt;
&lt;td&gt;1,000× cheaper on-prem&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Anthropic Claude Opus 4.5&lt;/td&gt;
&lt;td&gt;$25.00&lt;/td&gt;
&lt;td&gt;2,500× cheaper on-prem&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Those ratios are almost offensive. They're also the upper bound — the &lt;strong&gt;ceiling of savings if you saturated this hardware&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The floor, at the bench window's 32.7% utilization (i.e., our actual mixed-workload cost over ten hours), uses the amortized number:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Provider / Model&lt;/th&gt;
&lt;th&gt;Output $/MTok&lt;/th&gt;
&lt;th&gt;On-prem ratio (amortized at 32.7%)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;shadowstack amortized&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$0.13&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1×&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OpenAI GPT-4o&lt;/td&gt;
&lt;td&gt;$10.00&lt;/td&gt;
&lt;td&gt;77× cheaper on-prem&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Google Gemini 2.5 Pro&lt;/td&gt;
&lt;td&gt;$10.00&lt;/td&gt;
&lt;td&gt;77× cheaper on-prem&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Anthropic Claude Opus 4.5&lt;/td&gt;
&lt;td&gt;$25.00&lt;/td&gt;
&lt;td&gt;192× cheaper on-prem&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Even the worst case, amortized cost at 32.7% utilization, is &lt;strong&gt;77× cheaper than GPT-4o or Gemini 2.5 Pro&lt;/strong&gt; on output tokens. Against Claude Opus 4.5 (Anthropic's flagship large-frontier model), on-prem is 192× cheaper, dollar for dollar. Those ratios do narrow on a blended input-plus-output basis, but the direction doesn't change.&lt;/p&gt;

&lt;p&gt;For context on the hardware investment: $960 of GPUs pays for itself in Opus 4.5 output tokens at roughly &lt;strong&gt;38.4 million tokens of traffic&lt;/strong&gt;. At a modest 100K output tokens a day that's about a year; at 1M output tokens a day (a small agentic coding team), it's under six weeks. Against GPT-4o or Gemini 2.5 Pro the break-even point is 96M output tokens: ~2.6 years at 100K/day, ~3 months at 1M/day. Input tokens are cheaper on every cloud model, so a realistic blended workload stretches those numbers modestly, but not by an order of magnitude.&lt;/p&gt;
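
&lt;p&gt;The break-even figures are simple division. A short sketch using the prices from the tables above; like the paragraph, it compares the GPU purchase price against cloud output-token spend and ignores the roughly one cent per million tokens of on-prem electricity.&lt;/p&gt;

```python
GPU_COST_USD = 960
OUTPUT_PRICE_USD_PER_MTOK = {
    "Claude Opus 4.5": 25.00,
    "GPT-4o": 10.00,
    "Gemini 2.5 Pro": 10.00,
}


def break_even_mtok(price_per_mtok):
    """Millions of output tokens at which cloud spend equals the GPU purchase."""
    return GPU_COST_USD / price_per_mtok


for model, price in OUTPUT_PRICE_USD_PER_MTOK.items():
    mtok = break_even_mtok(price)
    days_100k = mtok / 0.1   # at 100K output tokens per day
    days_1m = mtok / 1.0     # at 1M output tokens per day
    print(f"{model:16s} {mtok:5.1f}M tokens  "
          f"{days_100k:5.0f} days at 100K/day  {days_1m:4.0f} days at 1M/day")
```

&lt;p&gt;Swap in your own purchase price and daily volume to get a break-even date for your hardware.&lt;/p&gt;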

&lt;p&gt;This math is why enterprises with serious inference budgets are re-examining on-prem. It's not about paranoia or data residency (though those help). It's that the marginal economics on modern consumer GPUs, with the right runtime, genuinely work.&lt;/p&gt;

&lt;h2&gt;
  
  
  9. Reproduce it yourself
&lt;/h2&gt;

&lt;p&gt;Everything is in the public repo: &lt;a href="https://github.com/defilantech/llmkube-bench" rel="noopener noreferrer"&gt;github.com/defilantech/llmkube-bench&lt;/a&gt;.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Requires: K8s cluster with LLMKube v0.7+, 2× NVIDIA 16+ GB, DCGM exporter,&lt;/span&gt;
&lt;span class="c"&gt;# hf-token Secret in the bench namespace.&lt;/span&gt;
git clone https://github.com/defilantech/llmkube-bench.git
&lt;span class="nb"&gt;cd &lt;/span&gt;llmkube-bench
make &lt;span class="nb"&gt;install&lt;/span&gt;                                      &lt;span class="c"&gt;# Python deps via uv&lt;/span&gt;
make bench &lt;span class="nv"&gt;RESULTS_DIR&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;results/&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; +%F&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="nt"&gt;-myhw&lt;/span&gt;   &lt;span class="c"&gt;# ~3-4 hours for full matrix&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's the workstation path. The bench also runs &lt;strong&gt;fully in-cluster&lt;/strong&gt; — a Kaniko Job builds the harness image, a bench-runner Job with a scoped ServiceAccount orchestrates the runtime swaps, results land on a hostPath volume. See &lt;code&gt;manifests/bench-runner/README.md&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Every number in this post traces to a row in &lt;code&gt;results/2026-04-23-shadowstack/summary.csv&lt;/code&gt;. Every manifest, every image digest, every Prometheus snapshot is committed.&lt;/p&gt;

&lt;h2&gt;
  
  
  10. What's next
&lt;/h2&gt;

&lt;p&gt;A few things we'd do differently on the next bench:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Raise the harness per-request timeout&lt;/strong&gt; from 300s to 600s so &lt;code&gt;long_context_extreme&lt;/code&gt; at higher concurrencies captures cleanly. The one sample we got is defensible; four clean samples would be better.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test with Qwen's own FP4 release&lt;/strong&gt; once they ship one. The &lt;code&gt;sakamakismile&lt;/code&gt; community NVFP4 has been solid for the throughput measurements, but an official Qwen FP4 would remove a variable from the methodology.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-node llama.cpp&lt;/strong&gt; could narrow the long-context throughput gap. Splitting layers across 4 GPUs instead of 2 should give per-shard VRAM headroom for higher &lt;code&gt;--parallel&lt;/code&gt; settings, and we expect it to cut the TurboQuant prefill time, possibly by about half. That is untested so far.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But the big-picture answer is already here. On $960 of consumer GPUs, you can serve the same day's flagship open-source model, at either throughput that crushes cloud APIs or context lengths that no cloud provider offers at any price. And InferCost shows you the honest dollar math instead of the misleading single-number dashboards you'd get from every "AI observability" tool on the market.&lt;/p&gt;

&lt;p&gt;If you want to follow along:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/defilantech/llmkube" rel="noopener noreferrer"&gt;github.com/defilantech/llmkube&lt;/a&gt; — the Kubernetes operator running both runtimes in this bench&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/defilantech/infercost" rel="noopener noreferrer"&gt;github.com/defilantech/infercost&lt;/a&gt; — the cost attribution controller producing the $/MTok numbers&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/defilantech/llmkube-bench" rel="noopener noreferrer"&gt;github.com/defilantech/llmkube-bench&lt;/a&gt; — the full reproducible bench&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://twitter.com/defilan" rel="noopener noreferrer"&gt;@defilan on X&lt;/a&gt; — where the threads go&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If this was useful, star the repos. If it was wrong about something, open an issue; the goal is accurate numbers, not winning arguments.&lt;/p&gt;

&lt;p&gt;— Chris&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>ai</category>
      <category>llm</category>
      <category>opensource</category>
    </item>
    <item>
      <title>LLMKube Now Deploys Any Inference Engine, Not Just llama.cpp</title>
      <dc:creator>Christopher Maher</dc:creator>
      <pubDate>Wed, 08 Apr 2026 01:03:15 +0000</pubDate>
      <link>https://dev.to/defilan/llmkube-now-deploys-any-inference-engine-not-just-llamacpp-fpm</link>
      <guid>https://dev.to/defilan/llmkube-now-deploys-any-inference-engine-not-just-llamacpp-fpm</guid>
      <description>&lt;p&gt;LLMKube started as a Kubernetes operator for llama.cpp. You define a Model, define an InferenceService, and the controller handles GPU scheduling, health probes, model downloads, and Prometheus metrics. It works well for GGUF models.&lt;/p&gt;

&lt;p&gt;But llama.cpp isn't the only inference engine. vLLM has PagedAttention. TGI has continuous batching. PersonaPlex does real-time voice AI. Triton serves multi-framework models. Locking the operator to one runtime limits what you can deploy.&lt;/p&gt;

&lt;p&gt;v0.6.0 changes that with pluggable runtime backends.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;Before v0.6.0, the controller's &lt;code&gt;constructDeployment()&lt;/code&gt; was hardcoded to llama.cpp. Container name, image, command-line args, health probes, model provisioning, everything assumed llama.cpp. If you wanted to deploy vLLM, you had to create a manual Kubernetes Deployment outside of LLMKube.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Fix
&lt;/h2&gt;

&lt;p&gt;A &lt;code&gt;RuntimeBackend&lt;/code&gt; interface that each inference engine implements:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;RuntimeBackend&lt;/span&gt; &lt;span class="k"&gt;interface&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;ContainerName&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;
    &lt;span class="n"&gt;DefaultImage&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;
    &lt;span class="n"&gt;DefaultPort&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="kt"&gt;int32&lt;/span&gt;
    &lt;span class="n"&gt;BuildArgs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;isvc&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;modelPath&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;port&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;
    &lt;span class="n"&gt;BuildProbes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;port&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;startup&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;liveness&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;readiness&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;NeedsModelInit&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="kt"&gt;bool&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The controller calls &lt;code&gt;resolveBackend(isvc)&lt;/code&gt; based on the &lt;code&gt;runtime&lt;/code&gt; field in the CRD, then delegates all container configuration to the backend. llama.cpp is the default. New runtimes register in a simple switch statement.&lt;/p&gt;
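
&lt;p&gt;The dispatch is easy to picture. Below is a rough Python rendering of the pattern (the controller itself is Go, and this is not its code). The class names, the &lt;code&gt;llamacpp&lt;/code&gt; runtime string, and the 8080 port are assumptions for illustration. The PersonaPlex port and its skipped model init come from the PersonaPlex section of this post.&lt;/p&gt;

```python
from typing import Protocol


class RuntimeBackend(Protocol):
    """Subset of the interface above, enough to show the dispatch."""

    def default_port(self) -> int: ...
    def needs_model_init(self) -> bool: ...


class LlamaCppBackend:
    def default_port(self) -> int:
        return 8080  # ASSUMED default, for illustration only

    def needs_model_init(self) -> bool:
        return True  # the model file is fetched before the server starts


class PersonaPlexBackend:
    def default_port(self) -> int:
        return 8998

    def needs_model_init(self) -> bool:
        return False  # the model downloads at startup via HF Hub


def resolve_backend(isvc_spec: dict) -> RuntimeBackend:
    """Pick a backend from the spec's runtime field; llama.cpp is the default."""
    runtime = isvc_spec.get("runtime") or "llamacpp"  # hypothetical default key
    if runtime == "llamacpp":
        return LlamaCppBackend()
    if runtime == "personaplex":
        return PersonaPlexBackend()
    raise ValueError(f"unknown runtime: {runtime}")


backend = resolve_backend({"runtime": "personaplex"})
print(backend.default_port(), backend.needs_model_init())
```

&lt;p&gt;Everything after the lookup works only through the interface, which is why adding a runtime is one new branch plus one new type.&lt;/p&gt;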

&lt;h2&gt;
  
  
  Testing It: PersonaPlex on Kubernetes
&lt;/h2&gt;

&lt;p&gt;To prove the architecture works, I deployed NVIDIA's PersonaPlex on my home lab. PersonaPlex is a 7B speech-to-speech model based on Moshi. It listens and talks at the same time. Sub-300ms latency for interruptions. Completely different from llama.cpp: PyTorch runtime, WebSocket-based health checks, model downloaded via HuggingFace token.&lt;/p&gt;

&lt;p&gt;The InferenceService CRD:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;inference.llmkube.dev/v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;InferenceService&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;personaplex&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;voice-ai&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;modelRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;personaplex-7b&lt;/span&gt;
  &lt;span class="na"&gt;runtime&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;personaplex&lt;/span&gt;
  &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;registry.defilan.net/personaplex:7b-v1-4bit-cuda13&lt;/span&gt;
  &lt;span class="na"&gt;personaPlexConfig&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;quantize4Bit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="na"&gt;hfTokenSecretRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;hf-token&lt;/span&gt;
      &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;HF_TOKEN&lt;/span&gt;
  &lt;span class="na"&gt;endpoint&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8998&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;NodePort&lt;/span&gt;
  &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;gpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
    &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;32Gi"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;kubectl apply&lt;/code&gt; and it's running. The controller:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sets the container command to &lt;code&gt;python -m moshi.server&lt;/code&gt; (via the PersonaPlex backend's &lt;code&gt;CommandBuilder&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Configures TCP socket probes on port 8998 (PersonaPlex uses WebSockets, not HTTP /health)&lt;/li&gt;
&lt;li&gt;Injects &lt;code&gt;HF_TOKEN&lt;/code&gt; from a Kubernetes Secret and sets the &lt;code&gt;NO_TORCH_COMPILE&lt;/code&gt; env var&lt;/li&gt;
&lt;li&gt;Skips the model download init container (model downloads at startup via HF Hub)&lt;/li&gt;
&lt;li&gt;Requests 1 GPU with 32Gi memory&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The result: real-time voice conversation running on a single RTX 5060 Ti, managed by the same operator that handles my llama.cpp text inference.&lt;/p&gt;

&lt;h2&gt;
  
  
  Built-in vLLM Runtime
&lt;/h2&gt;

&lt;p&gt;vLLM is probably the most requested inference engine in the Kubernetes ecosystem. v0.6.0 ships it as a first-class runtime with typed &lt;code&gt;VLLMConfig&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;inference.llmkube.dev/v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;InferenceService&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;vllm-tinyllama&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;modelRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;tinyllama-1b&lt;/span&gt;
  &lt;span class="na"&gt;runtime&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;vllm&lt;/span&gt;
  &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;vllm/vllm-openai:cu130-nightly&lt;/span&gt;
  &lt;span class="na"&gt;skipModelInit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;vllmConfig&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;maxModelLen&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2048&lt;/span&gt;
    &lt;span class="na"&gt;dtype&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;float16&lt;/span&gt;
    &lt;span class="na"&gt;hfTokenSecretRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;hf-token&lt;/span&gt;
      &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;HF_TOKEN&lt;/span&gt;
  &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;gpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
    &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8Gi"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The controller generates the right args (&lt;code&gt;--model&lt;/code&gt;, &lt;code&gt;--tensor-parallel-size&lt;/code&gt;, &lt;code&gt;--max-model-len&lt;/code&gt;, &lt;code&gt;--quantization&lt;/code&gt;, &lt;code&gt;--dtype&lt;/code&gt;), configures HTTP &lt;code&gt;/health&lt;/code&gt; probes on port 8000, and injects HF_TOKEN from a Secret. I tested this on my cluster with TinyLlama-1.1B and got a working OpenAI-compatible endpoint in under two minutes.&lt;/p&gt;
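
&lt;p&gt;Roughly, the arg generation is a mapping from config fields to flags. A Python sketch of that idea, not the controller's Go code. The flags are the ones listed above; deriving &lt;code&gt;--tensor-parallel-size&lt;/code&gt; from the GPU count and the &lt;code&gt;quantization&lt;/code&gt; config key are assumptions.&lt;/p&gt;

```python
def build_vllm_args(model, vllm_config, gpu_count=1):
    """Translate a vllmConfig-style mapping into vLLM server flags."""
    # ASSUMPTION: tensor parallelism follows the requested GPU count.
    args = ["--model", model, "--tensor-parallel-size", str(gpu_count)]
    optional_flags = [
        ("maxModelLen", "--max-model-len"),
        ("quantization", "--quantization"),  # hypothetical config key
        ("dtype", "--dtype"),
    ]
    for key, flag in optional_flags:
        value = vllm_config.get(key)
        if value is not None:   # unset fields leave vLLM's defaults in place
            args += [flag, str(value)]
    return args


# The model ID is a placeholder; the config values mirror the manifest above.
print(build_vllm_args("tinyllama-1b", {"maxModelLen": 2048, "dtype": "float16"}))
```

&lt;p&gt;Only fields that are actually set produce a flag, so an empty config falls back to vLLM's own defaults.&lt;/p&gt;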

&lt;h2&gt;
  
  
  Built-in TGI Runtime
&lt;/h2&gt;

&lt;p&gt;HuggingFace's Text Generation Inference also ships as a built-in runtime. TGI downloads models directly from HuggingFace Hub, so &lt;code&gt;skipModelInit&lt;/code&gt; isn't even needed. The &lt;code&gt;TGIConfig&lt;/code&gt; supports quantization methods (bitsandbytes, gptq, awq, eetq), max token limits, and dtype.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Generic Runtime
&lt;/h2&gt;

&lt;p&gt;Not every inference engine needs first-class support. The &lt;code&gt;generic&lt;/code&gt; runtime lets you deploy any container:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;runtime&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;generic&lt;/span&gt;
  &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-custom-server:latest&lt;/span&gt;
  &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/app/serve"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--port"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8080"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="na"&gt;containerPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8080&lt;/span&gt;
  &lt;span class="na"&gt;skipModelInit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;probeOverrides&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;startup&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;tcpSocket&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8080&lt;/span&gt;
      &lt;span class="na"&gt;failureThreshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;60&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You provide the image, args, probes, and env. The controller handles GPU scheduling, service creation, and lifecycle management.&lt;/p&gt;

&lt;h2&gt;
  
  
  Per-Runtime Autoscaling
&lt;/h2&gt;

&lt;p&gt;Each runtime defines its default HPA metric via the &lt;code&gt;HPAMetricProvider&lt;/code&gt; interface. When you enable autoscaling without specifying a metric, the controller picks the right one for your runtime:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;llama.cpp: &lt;code&gt;llamacpp:requests_processing&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;vLLM: &lt;code&gt;vllm:num_requests_running&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;TGI: &lt;code&gt;tgi:queue_size&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No more hardcoded metric names.&lt;/p&gt;
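
&lt;p&gt;The selection logic amounts to "explicit metric wins, otherwise ask the runtime." A small Python sketch of that rule; the metric names are the ones listed above, while the runtime keys and function name are ours.&lt;/p&gt;

```python
# Metric names from the list above; the dictionary keys are hypothetical spellings.
DEFAULT_HPA_METRIC = {
    "llamacpp": "llamacpp:requests_processing",
    "vllm": "vllm:num_requests_running",
    "tgi": "tgi:queue_size",
}


def hpa_metric(runtime, override=None):
    """Use the explicitly configured metric if given, else the runtime's default."""
    if override:
        return override
    try:
        return DEFAULT_HPA_METRIC[runtime]
    except KeyError:
        raise ValueError(f"no default HPA metric for runtime {runtime!r}") from None


print(hpa_metric("vllm"))
print(hpa_metric("vllm", override="my_custom_queue_depth"))
```

&lt;p&gt;A runtime with no sensible default fails loudly here instead of silently scaling on the wrong signal.&lt;/p&gt;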

&lt;h2&gt;
  
  
  Adding Your Own Runtime
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;docs/adding-a-runtime.md&lt;/code&gt; documents the full process: implement the &lt;code&gt;RuntimeBackend&lt;/code&gt; interface, optionally add &lt;code&gt;CommandBuilder&lt;/code&gt;, &lt;code&gt;EnvBuilder&lt;/code&gt;, or &lt;code&gt;HPAMetricProvider&lt;/code&gt;, register in the switch statement, add your CRD config struct, and run &lt;code&gt;make manifests generate&lt;/code&gt;. The pattern is established with five working examples.&lt;/p&gt;

&lt;h2&gt;
  
  
  Everything Else in v0.6.0
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;CUDA 13 default image for RTX 50-series and Qwen3.5 support&lt;/li&gt;
&lt;li&gt;Custom GPU layer splits for multi-GPU sharding&lt;/li&gt;
&lt;li&gt;Helm image registry/repository separation for air-gapped deployments&lt;/li&gt;
&lt;li&gt;Grafana inference metrics dashboard (tokens/sec, queue depth, KV cache, reconcile health)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;imagePullSecrets&lt;/code&gt; on InferenceService for private registries&lt;/li&gt;
&lt;li&gt;HPA autoscaling for InferenceService&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;Triton Inference Server and Ollama as built-in runtimes. Better Model controller support for non-GGUF formats (HuggingFace repo IDs as sources). And potentially Kubernetes-native voice AI pipelines combining PersonaPlex with LLMKube-managed reasoning models.&lt;/p&gt;

&lt;p&gt;GitHub: &lt;a href="https://github.com/defilantech/llmkube" rel="noopener noreferrer"&gt;https://github.com/defilantech/llmkube&lt;/a&gt;&lt;/p&gt;

</description>
      <category>llm</category>
      <category>opensource</category>
      <category>ai</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>I tested speculative decoding on my home GPU cluster. Here's why it didn't help.</title>
      <dc:creator>Christopher Maher</dc:creator>
      <pubDate>Mon, 06 Apr 2026 03:51:51 +0000</pubDate>
      <link>https://dev.to/defilan/i-tested-speculative-decoding-on-my-home-gpu-cluster-heres-why-it-didnt-help-3ej6</link>
      <guid>https://dev.to/defilan/i-tested-speculative-decoding-on-my-home-gpu-cluster-heres-why-it-didnt-help-3ej6</guid>
      <description>&lt;p&gt;I spent Saturday night testing n-gram speculative decoding on consumer GPUs. The claim: speculative decoding can speed up LLM inference by 2-3x by predicting future tokens and verifying them in parallel.&lt;/p&gt;

&lt;p&gt;I wanted to see if that holds up on real hardware running diverse workloads. For the most part, it doesn't. But the journey was worth it, and I caught a benchmarking pitfall that I think a lot of people are falling into.&lt;/p&gt;

&lt;h2&gt;
  
  
  The setup
&lt;/h2&gt;

&lt;p&gt;My home lab runs Kubernetes on a machine called ShadowStack. It has two NVIDIA RTX 5060 Ti GPUs (16GB VRAM each, 32GB total). I use LLMKube, an open source K8s operator I built, to manage LLM inference workloads with llama.cpp.&lt;/p&gt;

&lt;p&gt;For this test I deployed two models:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Gemma 4 26B-A4B&lt;/strong&gt;: Google's Mixture of Experts model. 26B total params but only ~4B active per token. Runs at 88 tok/s on my setup.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Qwen3-32B&lt;/strong&gt;: A dense 32B model. All parameters active per token. Runs at 20 tok/s.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both running Q4_K_M quantization, flash attention enabled, 8K context, split across both GPUs.&lt;/p&gt;

&lt;p&gt;Quick note on why the MoE model is so much faster: Gemma 4 only activates a fraction of its parameters per token, so there's way less weight data to read from VRAM on each forward pass. MoE routing overhead eats into some of that advantage, but it's still a huge win on bandwidth-constrained hardware.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I tested
&lt;/h2&gt;

&lt;p&gt;llama.cpp has built-in n-gram speculative decoding. No draft model is needed; you just pass a few flags:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nt"&gt;--spec-type&lt;/span&gt; ngram-mod
&lt;span class="nt"&gt;--draft-max&lt;/span&gt; 64
&lt;span class="nt"&gt;--draft-min&lt;/span&gt; 48
&lt;span class="nt"&gt;--spec-ngram-size-n&lt;/span&gt; 24
&lt;span class="nt"&gt;--spec-ngram-size-m&lt;/span&gt; 48
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;How it works: llama.cpp builds an n-gram lookup table from the recent context (both the input prompt and generated output so far). When it spots a pattern it's seen before, it speculatively drafts the next several tokens and verifies them in a single forward pass. If the predictions are right, you get multiple tokens for the cost of one.&lt;/p&gt;
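
&lt;p&gt;To make the lookup step concrete, here is a toy Python sketch of the idea. It is not llama.cpp's implementation: the function names are made up, tokens are plain strings, and "verification" compares the draft against a known continuation instead of running a model.&lt;/p&gt;

```python
def draft_from_ngrams(history, n, max_draft):
    """Find the most recent earlier occurrence of the last n tokens and
    propose the tokens that followed it as the draft."""
    key = tuple(history[-n:])
    # Walk backwards so the most recent match wins.
    for start in range(len(history) - n - 1, -1, -1):
        if tuple(history[start:start + n]) == key:
            return list(history[start + n:start + n + max_draft])
    return []


def count_accepted(draft, true_continuation):
    """Count draft tokens that match before the first mismatch.
    In a real server the 'true' tokens come from one batched forward pass."""
    accepted = 0
    for guess, truth in zip(draft, true_continuation):
        if guess != truth:
            break
        accepted += 1
    return accepted


history = "the cat sat on the mat and the cat".split()
draft = draft_from_ngrams(history, n=2, max_draft=3)
accepted = count_accepted(draft, ["sat", "on", "a"])
print(draft, accepted)
```

&lt;p&gt;When the same prompt is replayed, nearly every n-gram has been seen before and most drafts are accepted. With fresh prompts the lookup rarely matches, and decoding falls back to one token per pass.&lt;/p&gt;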

&lt;p&gt;Important: this is specifically n-gram speculative decoding, not a trained-draft approach like EAGLE-3 or Medusa. Those rely on extra trained components (a draft model or draft heads) to generate speculations. N-gram lookup is simpler and doesn't require any extra model files.&lt;/p&gt;

&lt;p&gt;With LLMKube, switching between configs is just updating the &lt;code&gt;extraArgs&lt;/code&gt; field in the InferenceService CRD and letting the operator restart the pod:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;modelRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gemma4-26b-a4b&lt;/span&gt;
  &lt;span class="na"&gt;extraArgs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--spec-type"&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ngram-mod"&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--draft-max"&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;64"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I tested two variants: &lt;code&gt;ngram-simple&lt;/code&gt; (basic lookup) and &lt;code&gt;ngram-mod&lt;/code&gt; (the variant recommended for MoE models in the llama.cpp docs).&lt;/p&gt;

&lt;h2&gt;
  
  
  The result that fooled me
&lt;/h2&gt;

&lt;p&gt;My first test ran the same prompt 10 times in a row. The numbers looked incredible:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Run&lt;/th&gt;
&lt;th&gt;tok/s&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1 (cold)&lt;/td&gt;
&lt;td&gt;88.3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;105.7&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;112.4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;186.4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;336.5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;419.5&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Almost 5x speedup by run 10. I was ready to write a very different article.&lt;/p&gt;

&lt;p&gt;Then I ran 8 different prompts. Code generation, API design, Go functions, bash scripts, technical explanations. Real variety.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Prompt&lt;/th&gt;
&lt;th&gt;Baseline (tok/s)&lt;/th&gt;
&lt;th&gt;+ ngram-mod (tok/s)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;BST implementation&lt;/td&gt;
&lt;td&gt;88.3&lt;/td&gt;
&lt;td&gt;94.2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;K8s operator explanation&lt;/td&gt;
&lt;td&gt;88.3&lt;/td&gt;
&lt;td&gt;88.3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPU monitoring script&lt;/td&gt;
&lt;td&gt;88.3&lt;/td&gt;
&lt;td&gt;87.6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;REST API design&lt;/td&gt;
&lt;td&gt;88.3&lt;/td&gt;
&lt;td&gt;88.2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GGUF parser in Go&lt;/td&gt;
&lt;td&gt;88.3&lt;/td&gt;
&lt;td&gt;88.2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Parallelism explainer&lt;/td&gt;
&lt;td&gt;88.3&lt;/td&gt;
&lt;td&gt;88.1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Benchmark script&lt;/td&gt;
&lt;td&gt;88.2&lt;/td&gt;
&lt;td&gt;88.2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Helm chart design&lt;/td&gt;
&lt;td&gt;88.1&lt;/td&gt;
&lt;td&gt;88.2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Median&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;88.3&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;88.2&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

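&lt;p&gt;Zero improvement at the median. Only the BST prompt moved at all (94.2 vs 88.3 tok/s, about 7%); the other seven stayed within a percent of baseline. The 419 tok/s "speedup" was the n-gram cache memorizing repeated output patterns. With diverse prompts, there's nothing useful to cache.&lt;/p&gt;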
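
&lt;p&gt;The two ways of summarizing the same feature can be reproduced from the tables above. This is a small Python sketch using only the numbers already shown; the variable names are mine.&lt;/p&gt;

```python
from statistics import median

# Throughput figures (tok/s) copied from the two tables above.
repeated_prompt_runs = {1: 88.3, 2: 105.7, 3: 112.4, 5: 186.4, 8: 336.5, 10: 419.5}
baseline = [88.3, 88.3, 88.3, 88.3, 88.3, 88.3, 88.2, 88.1]
ngram_mod = [94.2, 88.3, 87.6, 88.2, 88.2, 88.1, 88.2, 88.2]

# Apparent speedup when the same prompt is replayed ten times.
repeat_speedup = repeated_prompt_runs[10] / repeated_prompt_runs[1]

# Speedup on diverse prompts, compared at the median.
diverse_speedup = median(ngram_mod) / median(baseline)

print(f"repeated prompt: {repeat_speedup:.2f}x")
print(f"diverse prompts: {diverse_speedup:.3f}x")
```

&lt;p&gt;Same server, same flags, same model. The only thing that changed between the two numbers is whether the prompts repeat.&lt;/p&gt;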

&lt;h2&gt;
  
  
  Same story on the dense model
&lt;/h2&gt;

&lt;p&gt;Qwen3-32B showed the same pattern. 20.4 tok/s baseline, 20.6 tok/s with ngram-simple. Within measurement noise.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Baseline&lt;/th&gt;
&lt;th&gt;+ ngram-simple&lt;/th&gt;
&lt;th&gt;+ ngram-mod&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Gemma 4 26B&lt;/td&gt;
&lt;td&gt;MoE&lt;/td&gt;
&lt;td&gt;88.3&lt;/td&gt;
&lt;td&gt;87.2 (-1.2%)&lt;/td&gt;
&lt;td&gt;88.2 (0%)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen3-32B&lt;/td&gt;
&lt;td&gt;Dense&lt;/td&gt;
&lt;td&gt;20.4&lt;/td&gt;
&lt;td&gt;20.6 (+1%)&lt;/td&gt;
&lt;td&gt;not tested&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Why it doesn't help on these GPUs
&lt;/h2&gt;

&lt;p&gt;The bottleneck on RTX 5060 Ti is memory bandwidth, not compute. Every token requires reading model weights from VRAM. Speculative decoding tries to batch multiple verification steps together, but when you're already saturating the memory bus during single-token generation, there's not enough idle compute for the speculative verification to pay for itself.&lt;/p&gt;

&lt;p&gt;This is different from high-end datacenter GPUs (A100, H100) where the compute-to-memory bandwidth ratio is much higher. An H100 has roughly 3,350 GB/s of memory bandwidth but nearly 2,000 TFLOPS of FP16 tensor compute (NVIDIA's with-sparsity figure; the dense number is about half that). That ratio means there's genuine idle compute at small batch sizes that speculative decoding can exploit. Consumer GPUs don't have that same headroom.&lt;/p&gt;
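
&lt;p&gt;A back-of-the-envelope check on the consumer-GPU side of that argument, as a Python sketch. The inputs are assumptions rather than measurements from this test: 448 GB/s is the published memory bandwidth of one RTX 5060 Ti, 20 GB is an approximate size for a 32B model at Q4_K_M, and the layers are assumed to run on one GPU at a time.&lt;/p&gt;

```python
def bandwidth_bound_tok_per_s(bandwidth_gb_s, weights_read_gb_per_token):
    """Upper bound on decode speed when every token has to stream the
    active weights from VRAM once."""
    return bandwidth_gb_s / weights_read_gb_per_token


# Assumed inputs, not measurements from this post:
#   448 GB/s - published memory bandwidth of one RTX 5060 Ti
#   20 GB    - approximate size of a dense 32B model at Q4_K_M
ceiling = bandwidth_bound_tok_per_s(448, 20)
print(f"bandwidth-bound ceiling: {ceiling:.1f} tok/s")
```

&lt;p&gt;Under those assumptions the ceiling lands in the low 20s of tok/s, the same neighborhood as the 20.4 tok/s I measured for Qwen3-32B. That is what you'd expect if weight reads dominate each token.&lt;/p&gt;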

&lt;p&gt;For MoE models specifically, there's an additional wrinkle. Each speculative token in a verification batch may activate different experts, which means more expert weight blocks need to be read. This reduces the batching advantage that speculative decoding relies on in dense models, where weight reads stay roughly constant regardless of batch size.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Caveat:&lt;/strong&gt; there are scenarios where n-gram spec decoding can help even on consumer hardware. If your model is partially CPU-offloaded (doesn't fit in VRAM), the PCIe bandwidth bottleneck is severe enough that speculative batching can provide real gains. And for highly repetitive or templated outputs (think structured JSON, boilerplate code), the n-gram cache hit rate goes way up. My testing focused on single-user inference with fully VRAM-resident models and diverse prompts.&lt;/p&gt;

&lt;h2&gt;
  
  
  What about EAGLE-3?
&lt;/h2&gt;

&lt;p&gt;I originally wanted to test EAGLE-3, which uses a trained draft head instead of n-gram lookup. Three problems:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;No EAGLE-3 draft model exists for Gemma 4 (no one has trained one)&lt;/li&gt;
&lt;li&gt;The llama.cpp EAGLE-3 PR (#18039) is still open and in draft as of April 5, 2026&lt;/li&gt;
&lt;li&gt;The PR's own benchmarks show MoE models getting roughly 0.89-1.06x on certain prompts, with some actually slower due to the expert activation overhead during batch verification&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Even with a trained draft head, the fundamental bandwidth constraint on consumer GPUs would remain.&lt;/p&gt;

&lt;h2&gt;
  
  
  What actually helps on consumer GPUs
&lt;/h2&gt;

&lt;p&gt;If you're running local LLMs on consumer hardware, here's what actually moves the needle:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Flash attention&lt;/strong&gt;: Already standard, significant memory savings&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;KV cache quantization&lt;/strong&gt;: q4_0 or q8_0 reduces cache memory pressure without meaningful quality loss&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MoE over dense&lt;/strong&gt;: Gemma 4 activates ~4B parameters per token vs Qwen3-32B's 32B. That's the primary driver of the throughput difference, though MoE routing overhead means the speedup isn't a clean 8x ratio.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-GPU split&lt;/strong&gt;: Doubles your available memory bandwidth, which is the actual bottleneck&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context size tuning&lt;/strong&gt;: Smaller context = less KV cache = more VRAM headroom&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The benchmarking lesson
&lt;/h2&gt;

&lt;p&gt;The biggest takeaway wasn't about speculative decoding. It was about benchmark methodology.&lt;/p&gt;

&lt;p&gt;If I'd only tested with repeated prompts, I would have reported a 4.75x speedup and been completely wrong. The n-gram cache is doing something real, but only in a narrow scenario where outputs are highly repetitive or templated. For interactive chat, coding assistance, or any workload with diverse inputs, it provides no benefit on this hardware.&lt;/p&gt;

&lt;p&gt;Be skeptical of speculative decoding benchmarks that don't disclose their prompt diversity. And if you see someone reporting huge n-gram gains, check if they're running the same prompt over and over.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try it yourself
&lt;/h2&gt;

&lt;p&gt;Everything I tested runs on Kubernetes via LLMKube. The InferenceService CRD's &lt;code&gt;extraArgs&lt;/code&gt; field makes it trivial to swap between configs without touching your deployment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;inference.llmkube.dev/v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;InferenceService&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gemma4-spec-bench&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;modelRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gemma4-26b-a4b&lt;/span&gt;
  &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ghcr.io/ggml-org/llama.cpp:server-cuda&lt;/span&gt;
  &lt;span class="na"&gt;contextSize&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8192&lt;/span&gt;
  &lt;span class="na"&gt;flashAttention&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;extraArgs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--spec-type"&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ngram-mod"&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--draft-max"&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;64"&lt;/span&gt;
  &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;gpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;LLMKube is open source, Apache 2.0: &lt;a href="https://github.com/defilantech/llmkube" rel="noopener noreferrer"&gt;github.com/defilantech/llmkube&lt;/a&gt;&lt;/p&gt;

</description>
      <category>llm</category>
      <category>kubernetes</category>
      <category>gpu</category>
      <category>ai</category>
    </item>
    <item>
      <title>Google Released Gemma 4 Yesterday. I Had It Fixing Real Bugs by Lunch.</title>
      <dc:creator>Christopher Maher</dc:creator>
      <pubDate>Fri, 03 Apr 2026 16:34:48 +0000</pubDate>
      <link>https://dev.to/defilan/google-released-gemma-4-yesterday-i-had-it-fixing-real-bugs-by-lunch-cp0</link>
      <guid>https://dev.to/defilan/google-released-gemma-4-yesterday-i-had-it-fixing-real-bugs-by-lunch-cp0</guid>
      <description>&lt;p&gt;Google released Gemma 4 yesterday. By the time I went to bed, I had it deployed on my home lab, running real coding benchmarks at 96 tokens per second.&lt;/p&gt;

&lt;p&gt;The catch: no official llama.cpp image supported the &lt;code&gt;gemma4&lt;/code&gt; architecture yet. The stock CUDA images crash with &lt;code&gt;unknown model architecture: 'gemma4'&lt;/code&gt;. So I built it from source, on the same Kubernetes cluster that serves inference.&lt;/p&gt;

&lt;p&gt;This post is about what it took to go from "model dropped" to "running in production" in about two hours on consumer hardware.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Setup
&lt;/h2&gt;

&lt;p&gt;My home inference server (I call it ShadowStack):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;2x NVIDIA RTX 5060 Ti (16GB each, 32GB total VRAM)&lt;/li&gt;
&lt;li&gt;AMD Ryzen 9 7900X, 64GB DDR5&lt;/li&gt;
&lt;li&gt;Ubuntu 24.04, MicroK8s&lt;/li&gt;
&lt;li&gt;NVIDIA driver 590.48.01 (CUDA 13.1)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Everything is managed by &lt;a href="https://github.com/defilantech/llmkube" rel="noopener noreferrer"&gt;LLMKube&lt;/a&gt;, a Kubernetes operator I built for running llama.cpp inference. One CRD to define the model, one CRD to define the service, the operator handles the rest.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1: The Architecture Problem
&lt;/h2&gt;

&lt;p&gt;First attempt, I tried the &lt;code&gt;server-cuda13&lt;/code&gt; image (CUDA 13 build of llama.cpp):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;llama_model_load: error loading model: error loading model architecture: unknown model architecture: 'gemma4'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Gemma 4 architecture hadn't shipped in any released llama.cpp build yet. The support was only in HEAD.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 2: Build From HEAD On-Cluster
&lt;/h2&gt;

&lt;p&gt;I have a Kaniko build pipeline on the cluster from a previous project (TurboQuant benchmarking). I wrote a Dockerfile that clones llama.cpp HEAD and builds with CUDA targeting SM 86 (Ampere) and SM 120 (Blackwell):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;nvidia/cuda:12.8.0-devel-ubuntu24.04&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;builder&lt;/span&gt;

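&lt;span class="c"&gt;# Assumption: the CUDA devel base image does not ship git or cmake, so install the build tools first&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;apt-get update &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; apt-get &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; git cmake build-essential

&lt;span class="k"&gt;RUN &lt;/span&gt;git clone &lt;span class="nt"&gt;--depth&lt;/span&gt; 1 https://github.com/ggml-org/llama.cpp.git /build/llama.cpp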

&lt;span class="k"&gt;WORKDIR&lt;/span&gt;&lt;span class="s"&gt; /build/llama.cpp&lt;/span&gt;

&lt;span class="k"&gt;RUN &lt;/span&gt;&lt;span class="nb"&gt;ln&lt;/span&gt; &lt;span class="nt"&gt;-s&lt;/span&gt; /usr/local/cuda/lib64/stubs/libcuda.so &lt;span class="se"&gt;\
&lt;/span&gt;          /usr/local/cuda/lib64/stubs/libcuda.so.1
&lt;span class="k"&gt;ENV&lt;/span&gt;&lt;span class="s"&gt; LIBRARY_PATH=/usr/local/cuda/lib64/stubs:${LIBRARY_PATH}&lt;/span&gt;

&lt;span class="k"&gt;RUN &lt;/span&gt;cmake &lt;span class="nt"&gt;-B&lt;/span&gt; build &lt;span class="se"&gt;\
&lt;/span&gt;    &lt;span class="nt"&gt;-DGGML_CUDA&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ON &lt;span class="se"&gt;\
&lt;/span&gt;    &lt;span class="nt"&gt;-DCMAKE_CUDA_ARCHITECTURES&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"86;120"&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;    &lt;span class="nt"&gt;-DCMAKE_BUILD_TYPE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;Release &lt;span class="se"&gt;\
&lt;/span&gt;    &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; cmake &lt;span class="nt"&gt;--build&lt;/span&gt; build &lt;span class="nt"&gt;--target&lt;/span&gt; llama-server &lt;span class="nt"&gt;-j&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;nproc&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A Kaniko Job on the cluster built this in about 15 minutes and pushed it to my local container registry. The same cluster that runs inference also builds its own inference server. No external CI needed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 3: Deploy
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;llmkube deploy gemma4-26b &lt;span class="nt"&gt;--gpu&lt;/span&gt; &lt;span class="nt"&gt;--accelerator&lt;/span&gt; cuda &lt;span class="nt"&gt;--gpu-count&lt;/span&gt; 2 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--source&lt;/span&gt; https://huggingface.co/Trilogix1/gemma-4-26B-A4B-it-GGUF/resolve/main/gemma-4-26B-A4B-it-Q4_K_M.gguf &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--image&lt;/span&gt; registry.defilan.net/llama-server-latest:gemma4 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--flash-attn&lt;/span&gt; &lt;span class="nt"&gt;--jinja&lt;/span&gt; &lt;span class="nt"&gt;--context&lt;/span&gt; 32768
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The model is 15.6 GB at Q4_K_M. With both GPUs, that leaves about 16 GB for KV cache. Plenty for 32K context.&lt;/p&gt;

&lt;p&gt;The operator downloaded the model, created the Deployment with the right GPU flags, set up health probes, and exposed an OpenAI-compatible endpoint. From the deploy command to the first inference request was about 3 minutes (mostly model download time).&lt;/p&gt;

&lt;h2&gt;
  
  
  The Numbers
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Single Request
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Generation&lt;/td&gt;
&lt;td&gt;96 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Prompt processing&lt;/td&gt;
&lt;td&gt;128 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Model size (Q4_K_M)&lt;/td&gt;
&lt;td&gt;15.6 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Active parameters per token&lt;/td&gt;
&lt;td&gt;4B (MoE)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Under Load (4 concurrent workers, 2 minutes)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Aggregate throughput&lt;/td&gt;
&lt;td&gt;170 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total requests&lt;/td&gt;
&lt;td&gt;110&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Error rate&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;P50 latency&lt;/td&gt;
&lt;td&gt;~2s per request&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For context, the generic benchmarks floating around say Gemma 4 26B-A4B "exceeds 40 tok/s on consumer hardware." We're doing 96 tok/s on a single request and 170 tok/s aggregate under concurrent load. The dual-GPU split and the MoE architecture (only 4B parameters active per token) make this model surprisingly fast.&lt;/p&gt;

&lt;h2&gt;
  
  
  Real Coding Benchmarks
&lt;/h2&gt;

&lt;p&gt;I didn't just run "hello world" tests. I fed it actual bug reports from my own project and asked it to generate fixes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Bug: GPU Rolling Update Deadlock
&lt;/h3&gt;

&lt;p&gt;The issue: Kubernetes rolling updates deadlock on GPU workloads because the new pod can't schedule (old pod holds GPUs) and the old pod won't terminate (waiting for new pod to be Ready).&lt;/p&gt;

&lt;p&gt;Gemma 4's response: correctly identified that GPU workloads should use &lt;code&gt;Recreate&lt;/code&gt; strategy instead of &lt;code&gt;RollingUpdate&lt;/code&gt;, with a conditional check on GPU count. Showed the chain-of-thought reasoning, considered edge cases, and verified against the pattern before outputting.&lt;/p&gt;

&lt;p&gt;Time: 10.6 seconds for a 1024-token response including the full reasoning chain.&lt;/p&gt;
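
&lt;p&gt;The fix boils down to one conditional. Here is a minimal Python sketch of the idea; the real change belongs in the operator's Go code, and &lt;code&gt;deployment_strategy&lt;/code&gt; is a hypothetical name.&lt;/p&gt;

```python
def deployment_strategy(gpu_count):
    """Pick a Kubernetes Deployment strategy type.

    GPU pods use Recreate so the old pod releases its GPUs before the
    new pod is scheduled. CPU-only pods keep the default RollingUpdate.
    """
    if gpu_count:
        return "Recreate"
    return "RollingUpdate"


print(deployment_strategy(2))
print(deployment_strategy(0))
```

&lt;p&gt;The cost of &lt;code&gt;Recreate&lt;/code&gt; is a short gap in availability during updates, which is the only option anyway when there are no spare GPUs for a second pod.&lt;/p&gt;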

&lt;h3&gt;
  
  
  Bug: Stale Endpoints After Deletion
&lt;/h3&gt;

&lt;p&gt;The issue: deleting an InferenceService leaves orphaned Kubernetes Endpoints.&lt;/p&gt;

&lt;p&gt;Gemma 4's response: generated a complete &lt;code&gt;UnregisterEndpoint&lt;/code&gt; method with DNS name sanitization, Service and Endpoints deletion, &lt;code&gt;NotFound&lt;/code&gt; error handling, and logging. Production-quality Go code on the first try.&lt;/p&gt;

&lt;p&gt;Time: 11.1 seconds.&lt;/p&gt;

&lt;h3&gt;
  
  
  Code Generation: Ginkgo BDD Tests
&lt;/h3&gt;

&lt;p&gt;I asked it to write tests following an existing pattern in the codebase. It generated 4 correct test cases with &lt;code&gt;BeforeEach&lt;/code&gt; setup, proper assertions, and the right Gomega matchers. Used &lt;code&gt;ContainElements&lt;/code&gt; for present checks and &lt;code&gt;NotTo(ContainElement())&lt;/code&gt; for absent checks, matching the exact conventions from the rest of the test suite.&lt;/p&gt;

&lt;p&gt;Time: 12.3 seconds.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Actually Means
&lt;/h2&gt;

&lt;p&gt;I'm not claiming Gemma 4 replaces Claude or GPT-4. It doesn't. The reasoning is shallower on complex multi-step problems, and it occasionally cuts off mid-response at the token limit.&lt;/p&gt;

&lt;p&gt;What I am claiming: the gap between "Google releases a new model" and "it's running on your hardware fixing real bugs" has shrunk to hours, not weeks. The pieces are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;GGUF quantization appears on HuggingFace within hours of a model release&lt;/li&gt;
&lt;li&gt;llama.cpp HEAD usually has architecture support on day one (the tokenizer and template fixes were already committed)&lt;/li&gt;
&lt;li&gt;Kaniko or similar tools let you build from source on-cluster without a separate CI pipeline&lt;/li&gt;
&lt;li&gt;A Kubernetes operator (in my case, LLMKube) lets you deploy with one command and get health checks, metrics, and an OpenAI-compatible API&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is the same workflow regardless of whether the model is Gemma 4, Qwen3.5, Llama, or whatever ships next week. The infrastructure is model-agnostic.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Hardware Math
&lt;/h2&gt;

&lt;p&gt;This entire setup cost about $2,400:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;2x RTX 5060 Ti: ~$800&lt;/li&gt;
&lt;li&gt;Ryzen 9 7900X + motherboard + RAM + SSD + case + PSU: ~$1,600&lt;/li&gt;
&lt;/ul&gt;

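&lt;p&gt;Running 24/7, the system draws about 50-60W idle and 500-600W under full inference load. At $0.12/kWh, that's roughly $30-50/month in electricity for unlimited inference if the box is busy most of the day. Full load around the clock works out to about $43-52/month, and pure idle to about $4-5.&lt;/p&gt;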
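
&lt;p&gt;The electricity math is easy to redo for your own rates. A small Python sketch, assuming a 30-day month and a constant draw; the 55W and 550W inputs are just midpoints of the ranges quoted above.&lt;/p&gt;

```python
def monthly_cost_usd(watts, usd_per_kwh=0.12, hours=24 * 30):
    """Electricity cost of a constant power draw over a 30-day month."""
    kwh = watts / 1000 * hours
    return kwh * usd_per_kwh


# Midpoints of the idle and full-load ranges quoted above.
for label, watts in [("idle", 55), ("full load", 550)]:
    print(f"{label}: ${monthly_cost_usd(watts):.2f}/month")
```

&lt;p&gt;Swap in your own &lt;code&gt;usd_per_kwh&lt;/code&gt; and a realistic duty cycle to get a number for your setup.&lt;/p&gt;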

&lt;p&gt;Compare to API costs: at OpenAI's pricing for a comparable model, 110 requests in 2 minutes would cost roughly $5-10. Scale that to continuous use and the hardware pays for itself in a month or two.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;p&gt;LLMKube is open source (Apache 2.0): &lt;a href="https://github.com/defilantech/llmkube" rel="noopener noreferrer"&gt;github.com/defilantech/llmkube&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you have a GPU and a Kubernetes cluster (even a single-node K3s or MicroK8s), you can deploy any GGUF model with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;helm &lt;span class="nb"&gt;install &lt;/span&gt;llmkube llmkube/llmkube
llmkube deploy llama-3.1-8b &lt;span class="nt"&gt;--gpu&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For Gemma 4 specifically, you'll need a custom llama.cpp image until the official builds ship with &lt;code&gt;gemma4&lt;/code&gt; architecture support. The Dockerfile above works.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Benchmarks run on April 2, 2026 on ShadowStack (2x RTX 5060 Ti, 32GB VRAM, Blackwell SM 12.0, CUDA 13.1, driver 590.48.01). Gemma 4 26B-A4B-it Q4_K_M via llama.cpp built from HEAD commit f851fa5a.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>llm</category>
      <category>homelab</category>
      <category>ai</category>
    </item>
    <item>
      <title>I Tested TurboQuant KV Cache Compression on Consumer GPUs. Here's What Actually Happened.</title>
      <dc:creator>Christopher Maher</dc:creator>
      <pubDate>Mon, 30 Mar 2026 15:12:24 +0000</pubDate>
      <link>https://dev.to/defilan/i-tested-turboquant-kv-cache-compression-on-consumer-gpus-heres-what-actually-happened-beg</link>
      <guid>https://dev.to/defilan/i-tested-turboquant-kv-cache-compression-on-consumer-gpus-heres-what-actually-happened-beg</guid>
      <description>&lt;p&gt;I spent this weekend testing TurboQuant KV cache compression on my home lab Kubernetes cluster. The paper (ICLR 2026, Google Research) promises up to 4.57x compression of the KV cache with minimal quality loss. That sounded like exactly what I needed. I'm always bumping up against VRAM limits trying to run larger models or longer contexts on consumer hardware.&lt;/p&gt;

&lt;p&gt;Here's what I found: it works, but there are real tradeoffs nobody's talking about yet.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem: KV Cache Eats Your VRAM
&lt;/h2&gt;

&lt;p&gt;If you've run LLMs locally, you know the drill. You load a 32B model that fits in 20GB of VRAM, set the context to 32K, and suddenly you're at 28GB. The model weights didn't change. It's the KV cache growing linearly with context length.&lt;/p&gt;

&lt;p&gt;For every token in the context, the model stores key and value vectors for every attention head at every layer. In FP16, that adds up fast. A 32B model at 32K context can burn through 8+ GB of VRAM just for the KV cache.&lt;/p&gt;

&lt;p&gt;TurboQuant's approach is to apply a Walsh-Hadamard Transform (WHT) rotation to KV cache vectors before quantizing them to 3 bits. The rotation "gaussianizes" the distribution, making scalar quantization much more effective. The result is TQ3_0: 3-bit values instead of 16-bit, for a theoretical 4.57x compression. That ratio implies about 3.5 bits per element once storage overhead is counted, since 16 / 3.5 is about 4.57.&lt;/p&gt;
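
&lt;p&gt;The arithmetic behind those numbers, as a Python sketch. The model shape is an assumption (64 layers, 8 KV heads, head dimension 128, typical of a 32B-class model with grouped-query attention), and 3.5 bits per element is the effective rate that a 4.57x ratio over FP16 implies, not a figure taken from the paper.&lt;/p&gt;

```python
def kv_cache_gib(layers, kv_heads, head_dim, context_tokens, bits_per_element):
    """Size of the KV cache in GiB: one key and one value vector per
    KV head, per layer, per token."""
    elements = 2 * layers * kv_heads * head_dim * context_tokens
    return elements * bits_per_element / 8 / 2**30


# Assumed shape for a 32B-class model (hypothetical, check your model card):
# 64 layers, 8 KV heads, head dimension 128, 32K tokens of context.
fp16 = kv_cache_gib(64, 8, 128, 32 * 1024, bits_per_element=16)
tq3 = kv_cache_gib(64, 8, 128, 32 * 1024, bits_per_element=3.5)

print(f"FP16 KV cache:    {fp16:.2f} GiB")
print(f"3.5-bit KV cache: {tq3:.2f} GiB ({fp16 / tq3:.2f}x smaller)")
```

&lt;p&gt;The cache scales linearly with &lt;code&gt;context_tokens&lt;/code&gt;, so halving the context halves both numbers.&lt;/p&gt;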

&lt;h2&gt;
  
  
  My Setup
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Hardware&lt;/strong&gt;: ShadowStack, my home inference server&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;2x NVIDIA RTX 5060 Ti (16GB GDDR7 each, 32GB total)&lt;/li&gt;
&lt;li&gt;AMD Ryzen 9 7900X, 64GB DDR5&lt;/li&gt;
&lt;li&gt;Ubuntu 24.04, MicroK8s&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Software&lt;/strong&gt;: &lt;a href="https://github.com/defilantech/llmkube" rel="noopener noreferrer"&gt;LLMKube&lt;/a&gt;, an open-source Kubernetes operator I built for managing llama.cpp inference workloads. It handles model downloads, GPU scheduling, multi-GPU sharding, health probes, and Prometheus metrics through Kubernetes CRDs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TurboQuant build&lt;/strong&gt;: I used the &lt;a href="https://github.com/animehacker/llama-turboquant" rel="noopener noreferrer"&gt;animehacker/llama-turboquant&lt;/a&gt; fork, which has working CUDA kernels for the WHT-based TQ3_0 type. This is a Stage 1 implementation (no QJL residual correction from the full paper). I built it with Kaniko directly on my cluster targeting SM 86 (Ampere) and SM 120 (Blackwell).&lt;/p&gt;

&lt;h3&gt;
  
  
  The Wrapper Entrypoint Pattern
&lt;/h3&gt;

&lt;p&gt;LLMKube's InferenceService CRD doesn't expose llama.cpp's &lt;code&gt;--cache-type-k&lt;/code&gt; and &lt;code&gt;--cache-type-v&lt;/code&gt; flags yet, so I built a custom Docker image with a wrapper entrypoint that injects the TurboQuant flags transparently:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;
&lt;span class="c"&gt;# entrypoint.sh - passes through all LLMKube args, appends TQ flags&lt;/span&gt;
&lt;span class="nv"&gt;TQ_CACHE_TYPE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;TQ_CACHE_TYPE&lt;/span&gt;&lt;span class="k"&gt;:-&lt;/span&gt;&lt;span class="nv"&gt;tq3_0&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="nv"&gt;TQ_ENABLED&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;TQ_ENABLED&lt;/span&gt;&lt;span class="k"&gt;:-&lt;/span&gt;&lt;span class="nv"&gt;true&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;TQ_ENABLED&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"true"&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
    &lt;/span&gt;&lt;span class="nb"&gt;exec &lt;/span&gt;llama-server &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$@&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;--cache-type-k&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;TQ_CACHE_TYPE&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;--cache-type-v&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;TQ_CACHE_TYPE&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="k"&gt;else
    &lt;/span&gt;&lt;span class="nb"&gt;exec &lt;/span&gt;llama-server &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$@&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="k"&gt;fi&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



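&lt;p&gt;Using &lt;code&gt;exec&lt;/code&gt; is important. It replaces the wrapper shell with llama-server, so llama-server runs as PID 1 and receives SIGTERM from Kubernetes directly instead of the signal stopping at the shell. That keeps signal handling and health probes working correctly.&lt;/p&gt;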
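
&lt;p&gt;To see what &lt;code&gt;exec&lt;/code&gt; changes, this small sketch prints the PID of a wrapper shell and then the PID of the command it launches, with and without &lt;code&gt;exec&lt;/code&gt;. It uses plain &lt;code&gt;sh&lt;/code&gt; as a stand-in for llama-server, and the function names are made up for illustration.&lt;/p&gt;

```shell
# Prints the wrapper shell's PID, then the PID of the command started via exec.
# Both lines match because exec replaces the shell instead of forking a child.
pids_with_exec() {
    sh -c 'echo $$; exec sh -c "echo \$\$"'
}

# Same thing without exec: the inner shell is forked as a child with a new PID.
# The trailing "true" keeps the shell from exec-ing its last command as an optimization.
pids_without_exec() {
    sh -c 'echo $$; sh -c "echo \$\$"; true'
}

echo "with exec:"
pids_with_exec
echo "without exec:"
pids_without_exec
```

&lt;p&gt;In a container, the wrapper script is PID 1, so the &lt;code&gt;exec&lt;/code&gt; variant is the one that leaves llama-server holding that PID.&lt;/p&gt;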

&lt;h2&gt;
  
  
  Benchmark Methodology
&lt;/h2&gt;

&lt;p&gt;Apples-to-apples. Same model weights, same context size, same concurrency. The only variable was the KV cache type (FP16 vs TQ3_0). Flash attention was enabled for all tests.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Throughput test&lt;/strong&gt;: 5 minutes of sustained load at 4 concurrent requests, 8K context.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Context sweep&lt;/strong&gt;: Deploy at each context size (2K through 131K, depending on the model), run a 2-minute stress test, record total VRAM across both GPUs via nvidia-smi.&lt;/p&gt;
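
&lt;p&gt;For the VRAM readings, something along these lines is enough to turn per-GPU numbers into one total. This is a sketch rather than the exact script used for the sweep: the helper name is hypothetical, and the sample readings fed to it below are made up so the snippet runs without a GPU.&lt;/p&gt;

```shell
# Sum per-GPU memory readings (MiB, one per line on stdin) and print the total in GiB.
total_vram_gib() {
    awk '{ total += $1 } END { printf "%.1f\n", total / 1024 }'
}

# On the GPU node, feed it from nvidia-smi:
#   nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits | total_vram_gib

# Illustrative input only (made-up readings for two GPUs):
printf '10240\n8192\n' | total_vram_gib
```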

&lt;p&gt;&lt;strong&gt;Models tested&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Llama 3.1 8B (Q5_K_M), small model with lots of headroom&lt;/li&gt;
&lt;li&gt;Qwen 2.5 14B (Q5_K_M), medium model that fills one GPU&lt;/li&gt;
&lt;li&gt;Qwen 2.5 32B (Q4_K_M), large model that requires both GPUs&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Results: Throughput
&lt;/h2&gt;

&lt;p&gt;This is where TurboQuant hurts.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Variant&lt;/th&gt;
&lt;th&gt;Gen tok/s&lt;/th&gt;
&lt;th&gt;Prompt tok/s&lt;/th&gt;
&lt;th&gt;Requests (5min)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Llama 8B&lt;/td&gt;
&lt;td&gt;FP16 cache&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;50.0&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;565.5&lt;/td&gt;
&lt;td&gt;771&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Llama 8B&lt;/td&gt;
&lt;td&gt;TQ3_0 cache&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;8.4&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;93.4&lt;/td&gt;
&lt;td&gt;74&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen 14B&lt;/td&gt;
&lt;td&gt;FP16 cache&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;28.1&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;122.0&lt;/td&gt;
&lt;td&gt;128&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen 14B&lt;/td&gt;
&lt;td&gt;TQ3_0 cache&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;5.3&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;63.4&lt;/td&gt;
&lt;td&gt;53&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen 32B&lt;/td&gt;
&lt;td&gt;FP16 cache&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;14.3&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;133.3&lt;/td&gt;
&lt;td&gt;108&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen 32B&lt;/td&gt;
&lt;td&gt;TQ3_0 cache&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;5.5&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;85.5&lt;/td&gt;
&lt;td&gt;53&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Generation throughput dropped 5-6x on the 8B and 14B models and about 2.6x on the 32B. Prompt processing dropped roughly 1.6-6x depending on the model, with the 8B hit hardest. This is consistent with what the PR benchmarks showed on CPU, but I expected Blackwell's tensor cores to help more than they did. The animehacker CUDA kernels were optimized for Ampere (SM 86), not Blackwell (SM 120), so there's likely performance left on the table.&lt;/p&gt;
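
&lt;p&gt;The per-model slowdown is the FP16 rate divided by the TQ3_0 rate. A minimal sketch that recomputes it from the generation column of the table (the function name is made up for illustration):&lt;/p&gt;

```shell
# Slowdown factor = baseline rate / TQ3_0 rate, rounded to one decimal.
slowdown() {
    awk -v base="$1" -v tq="$2" 'BEGIN { printf "%.1fx\n", base / tq }'
}

# Generation tok/s pairs (FP16, TQ3_0) from the throughput table.
echo "Llama 8B: $(slowdown 50.0 8.4)"
echo "Qwen 14B: $(slowdown 28.1 5.3)"
echo "Qwen 32B: $(slowdown 14.3 5.5)"
```

&lt;p&gt;Swapping in the prompt tok/s column gives the prompt-processing slowdown the same way.&lt;/p&gt;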

&lt;h2&gt;
  
  
  Results: VRAM Usage
&lt;/h2&gt;

&lt;p&gt;This is where it gets interesting.&lt;/p&gt;

&lt;h3&gt;
  
  
  Llama 3.1 8B, Context Sweep
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Context&lt;/th&gt;
&lt;th&gt;FP16 VRAM (total)&lt;/th&gt;
&lt;th&gt;TQ3_0 VRAM (total)&lt;/th&gt;
&lt;th&gt;Savings&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;4K&lt;/td&gt;
&lt;td&gt;6.4 GB&lt;/td&gt;
&lt;td&gt;10.1 GB&lt;/td&gt;
&lt;td&gt;-58% (worse)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8K&lt;/td&gt;
&lt;td&gt;6.9 GB&lt;/td&gt;
&lt;td&gt;14.3 GB&lt;/td&gt;
&lt;td&gt;-107% (worse)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;16K&lt;/td&gt;
&lt;td&gt;8.0 GB&lt;/td&gt;
&lt;td&gt;22.8 GB&lt;/td&gt;
&lt;td&gt;-185% (worse)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;32K&lt;/td&gt;
&lt;td&gt;10.1 GB&lt;/td&gt;
&lt;td&gt;6.9 GB&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;31% better&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;65K&lt;/td&gt;
&lt;td&gt;14.3 GB&lt;/td&gt;
&lt;td&gt;8.4 GB&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;41% better&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;98K&lt;/td&gt;
&lt;td&gt;18.5 GB&lt;/td&gt;
&lt;td&gt;9.8 GB&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;47% better&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;131K&lt;/td&gt;
&lt;td&gt;22.7 GB&lt;/td&gt;
&lt;td&gt;11.2 GB&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;51% better&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Qwen 2.5 14B, Context Sweep
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Context&lt;/th&gt;
&lt;th&gt;FP16 VRAM (total)&lt;/th&gt;
&lt;th&gt;TQ3_0 VRAM (total)&lt;/th&gt;
&lt;th&gt;Savings&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;4K&lt;/td&gt;
&lt;td&gt;11.1 GB&lt;/td&gt;
&lt;td&gt;16.7 GB&lt;/td&gt;
&lt;td&gt;-50% (worse)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8K&lt;/td&gt;
&lt;td&gt;11.9 GB&lt;/td&gt;
&lt;td&gt;23.0 GB&lt;/td&gt;
&lt;td&gt;-93% (worse)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;16K&lt;/td&gt;
&lt;td&gt;13.4 GB&lt;/td&gt;
&lt;td&gt;11.0 GB&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;18% better&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;32K&lt;/td&gt;
&lt;td&gt;16.6 GB&lt;/td&gt;
&lt;td&gt;11.8 GB&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;29% better&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;65K&lt;/td&gt;
&lt;td&gt;22.8 GB&lt;/td&gt;
&lt;td&gt;13.7 GB&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;40% better&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Qwen 2.5 32B, Context Sweep
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Context&lt;/th&gt;
&lt;th&gt;FP16 VRAM (total)&lt;/th&gt;
&lt;th&gt;TQ3_0 VRAM (total)&lt;/th&gt;
&lt;th&gt;Savings&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;2K&lt;/td&gt;
&lt;td&gt;19.9 GB&lt;/td&gt;
&lt;td&gt;23.7 GB&lt;/td&gt;
&lt;td&gt;-19% (worse)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4K&lt;/td&gt;
&lt;td&gt;20.5 GB&lt;/td&gt;
&lt;td&gt;27.9 GB&lt;/td&gt;
&lt;td&gt;-36% (worse)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8K&lt;/td&gt;
&lt;td&gt;21.6 GB&lt;/td&gt;
&lt;td&gt;19.8 GB&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;8% better&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;16K&lt;/td&gt;
&lt;td&gt;23.7 GB&lt;/td&gt;
&lt;td&gt;20.3 GB&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;14% better&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;32K&lt;/td&gt;
&lt;td&gt;28.0 GB&lt;/td&gt;
&lt;td&gt;21.4 GB&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;24% better&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  The Surprise: TQ Uses MORE VRAM at Small Contexts
&lt;/h2&gt;

&lt;p&gt;I wasn't expecting this. Below each model's crossover point, TQ3_0 used more VRAM than the FP16 baseline. Sometimes dramatically more. Llama 8B at 16K context used 22.8 GB with TQ vs 8.0 GB with FP16.&lt;/p&gt;

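&lt;p&gt;My theory: the WHT rotation machinery has a fixed overhead (lookup tables, rotation matrices, codebooks) that gets allocated regardless of context size. When the KV cache is small, this overhead dwarfs the compression savings. One thing the theory doesn't explain: below the crossover, TQ3_0 usage still grows with context (10.1 GB to 22.8 GB for Llama 8B between 4K and 16K), which a purely fixed overhead wouldn't do on its own. The crossover point where TQ starts winning varies by model:&lt;/p&gt;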

&lt;ul&gt;
&lt;li&gt;Llama 8B: around 32K context&lt;/li&gt;
&lt;li&gt;Qwen 14B: around 16K context&lt;/li&gt;
&lt;li&gt;Qwen 32B: around 8K context&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Larger models cross over earlier because their per-token KV cache is larger (more layers, more attention heads), so the compression pays off sooner.&lt;/p&gt;

&lt;h2&gt;
  
  
  When Is TurboQuant Worth It?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Use TQ3_0 when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You need 32K+ context on consumer GPUs&lt;/li&gt;
&lt;li&gt;You're hitting VRAM limits and can't afford more hardware&lt;/li&gt;
&lt;li&gt;Throughput isn't critical (batch processing, RAG with long documents, analysis tasks)&lt;/li&gt;
&lt;li&gt;You're running a large model (32B+) where the crossover point is lower&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Don't use TQ3_0 when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Context is below your model's crossover point, anywhere from 8K to 32K in my tests (you'll actually use more VRAM)&lt;/li&gt;
&lt;li&gt;You need interactive throughput (the 2.6-6x generation penalty makes chat unusable)&lt;/li&gt;
&lt;li&gt;You're on Blackwell and want optimal performance (wait for SM 120-optimized kernels)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The sweet spot in my testing was Qwen 32B at 32K context. Baseline uses 28 GB, which leaves only 4 GB under my 32 GB ceiling, so one more concurrent request could OOM. TQ drops it to 21.4 GB, leaving over 10 GB of headroom for parallel slots or longer contexts.&lt;/p&gt;
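
&lt;p&gt;The savings and headroom numbers are simple to recompute for your own ceiling. A minimal sketch using the Qwen 32B values at 32K context from the sweep table (function names are made up for illustration):&lt;/p&gt;

```shell
# Percent VRAM saved relative to the FP16 baseline (negative means TQ3_0 used more).
savings_pct() {
    awk -v fp16="$1" -v tq="$2" 'BEGIN { printf "%.0f\n", (fp16 - tq) / fp16 * 100 }'
}

# Headroom left under a VRAM ceiling, in GB.
headroom_gb() {
    awk -v ceiling="$1" -v used="$2" 'BEGIN { printf "%.1f\n", ceiling - used }'
}

# Qwen 32B at 32K context with a 32 GB ceiling.
echo "savings:  $(savings_pct 28.0 21.4)%"
echo "headroom: FP16 $(headroom_gb 32 28.0) GB, TQ3_0 $(headroom_gb 32 21.4) GB"
```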

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;The throughput penalty is the main blocker. The animehacker CUDA kernels use a fused MMVQ approach that avoids dequantization during attention, but the WHT butterfly transform still runs 160 integer ops per element in registers. On Blackwell with its new SM architecture, these kernels likely aren't hitting optimal occupancy.&lt;/p&gt;
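
&lt;p&gt;For a sense of where a number like 160 can come from: a fast Walsh-Hadamard transform of size n runs log2(n) butterfly stages, and each stage does n/2 butterflies of one add and one subtract. For n = 32 that is 160 operations for the whole block. The block size of 32 is an assumption here, not something confirmed from the kernel source, and the function name is made up. A sketch of the count:&lt;/p&gt;

```shell
# Count add/subtract operations in a fast Walsh-Hadamard transform of size n
# (n must be a power of two): log2(n) stages, each with n/2 butterflies of 2 ops.
wht_butterfly_ops() {
    n=$1
    stages=0
    len=1
    while [ "$len" -lt "$n" ]; do
        len=$(( len * 2 ))
        stages=$(( stages + 1 ))
    done
    echo $(( stages * (n / 2) * 2 ))
}

# Assumed block size of 32 elements.
wht_butterfly_ops 32
```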

&lt;p&gt;Things I'm watching:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/ggml-org/llama.cpp/pull/21089" rel="noopener noreferrer"&gt;PR #21089&lt;/a&gt; on ggml-org/llama.cpp, the only open upstream PR for TurboQuant (CPU-only for now)&lt;/li&gt;
&lt;li&gt;Whether &lt;code&gt;ggerganov&lt;/code&gt; engages with it. If he requests changes rather than closing, there's a good chance it eventually lands.&lt;/li&gt;
&lt;li&gt;SM 120-optimized CUDA kernels. Blackwell has new instructions that could close the throughput gap.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For LLMKube, I'm planning to add &lt;code&gt;cacheTypeK&lt;/code&gt; and &lt;code&gt;cacheTypeV&lt;/code&gt; fields to the InferenceService CRD so users can configure this without the wrapper entrypoint hack. Also an &lt;code&gt;extraArgs&lt;/code&gt; escape hatch for any llama.cpp flag we don't have a typed field for yet.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It Yourself
&lt;/h2&gt;

&lt;p&gt;All the benchmarking infrastructure is in the &lt;a href="https://github.com/defilantech/llmkube" rel="noopener noreferrer"&gt;LLMKube&lt;/a&gt; repo. The operator is open source (Apache 2.0) and handles the full lifecycle: model downloads, GPU scheduling, multi-GPU sharding, health probes, and Prometheus metrics. If you have a GPU cluster and want to test TurboQuant:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Build the custom image from &lt;code&gt;animehacker/llama-turboquant&lt;/code&gt; with &lt;code&gt;-DGGML_CUDA=ON&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Set &lt;code&gt;spec.image&lt;/code&gt; on your InferenceService to point at it&lt;/li&gt;
&lt;li&gt;The wrapper entrypoint handles the rest&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you run these benchmarks on different hardware (A100, RTX 3090, etc.), I'd love to see the numbers. Drop a comment or find me on the &lt;a href="https://discord.gg/5GavYFPBBr" rel="noopener noreferrer"&gt;LLMKube Discord&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Benchmarks run on 2026-03-30 on ShadowStack (2x RTX 5060 Ti, 32GB VRAM, Blackwell SM 12.0, CUDA 13.0).&lt;/em&gt;&lt;/p&gt;

</description>
      <category>llm</category>
      <category>kubernetes</category>
      <category>gpu</category>
      <category>ai</category>
    </item>
  </channel>
</rss>
