<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Christopher Maher</title>
    <description>The latest articles on DEV Community by Christopher Maher (@defilan).</description>
    <link>https://dev.to/defilan</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3828578%2Fd03de6fc-1dcb-419b-b336-0d9c7d86f7cc.jpeg</url>
      <title>DEV Community: Christopher Maher</title>
      <link>https://dev.to/defilan</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/defilan"/>
    <language>en</language>
    <item>
      <title>TurboQuant on a MacBook Pro: two findings the upstream discussion missed</title>
      <dc:creator>Christopher Maher</dc:creator>
      <pubDate>Tue, 28 Apr 2026 16:38:41 +0000</pubDate>
      <link>https://dev.to/defilan/turboquant-on-a-macbook-pro-two-findings-the-upstream-discussion-missed-5ae7</link>
      <guid>https://dev.to/defilan/turboquant-on-a-macbook-pro-two-findings-the-upstream-discussion-missed-5ae7</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://llmkube.com/blog/turboquant-m5-max-long-context" rel="noopener noreferrer"&gt;llmkube.com/blog/turboquant-m5-max-long-context&lt;/a&gt;. Cross-posted here for the dev.to audience.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A 7-hour overnight bench on an M5 Max, two findings I haven't seen in the upstream community thread, and two PRs back to the LLMKube operator to make TurboQuant a first-class citizen of the InferenceService CRD.&lt;/p&gt;




&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;A TurboQuant-enabled &lt;code&gt;llama-server&lt;/code&gt; on Apple Silicon &lt;strong&gt;runs Qwen3.6-35B-A3B Q8 at up to 1M-token context&lt;/strong&gt; on a 128 GB MacBook Pro M5 Max. Standard &lt;code&gt;f16&lt;/code&gt; KV cache OOMs at 256K. Two findings worth quoting:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;At 128K+ context, the 3-bit KV cache (&lt;code&gt;turbo3&lt;/code&gt;) matches or beats the 8-bit cache (&lt;code&gt;q8_0&lt;/code&gt;) on prompt processing.&lt;/strong&gt; Smaller cache means less memory bandwidth pressure during attention, and the throughput gap that exists at short context flips by ~128K depth.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;turbo3&lt;/code&gt; and &lt;code&gt;turbo4&lt;/code&gt; split by workload phase.&lt;/strong&gt; Long-context &lt;strong&gt;prefill&lt;/strong&gt; favors &lt;code&gt;turbo3&lt;/code&gt; (~27% faster than &lt;code&gt;turbo4&lt;/code&gt; at 256K). Long-context &lt;strong&gt;decode&lt;/strong&gt; favors &lt;code&gt;turbo4&lt;/code&gt; (~11% faster than &lt;code&gt;turbo3&lt;/code&gt; at 256K). They are not interchangeable — different attention bottlenecks dominate during prefill and decode.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We built &lt;a href="https://github.com/TheTom/llama-cpp-turboquant" rel="noopener noreferrer"&gt;TheTom's &lt;code&gt;feature/turboquant-kv-cache&lt;/code&gt; fork of llama.cpp&lt;/a&gt; for Metal, validated on M5 Max, and took two PRs back to LLMKube to make TurboQuant first-class on the InferenceService CRD.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why KV cache, why now
&lt;/h2&gt;

&lt;p&gt;If you're running coding agents locally — single-model or architect+editor combos — the binding constraint isn't model weights. It's KV cache.&lt;/p&gt;

&lt;p&gt;Weights you can quantize once, store on disk, and forget. KV cache is generated &lt;strong&gt;per token of context&lt;/strong&gt; at inference time, sized by the model's depth and head dimensions, and held in working memory the entire session. A 35B-class model with &lt;code&gt;flash-attn&lt;/code&gt; on uses roughly &lt;strong&gt;256 KB of fp16 KV per token&lt;/strong&gt;. That sounds small until you do the multiplication:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Context&lt;/th&gt;
&lt;th&gt;fp16 KV&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;32K&lt;/td&gt;
&lt;td&gt;~8 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;64K&lt;/td&gt;
&lt;td&gt;~16 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;128K&lt;/td&gt;
&lt;td&gt;~32 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;256K&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~64 GB&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;512K&lt;/td&gt;
&lt;td&gt;~128 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1M&lt;/td&gt;
&lt;td&gt;~256 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
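&lt;p&gt;To sanity-check those rows yourself, the arithmetic is just context tokens × 256 KB (a quick sketch using the ~256 KB/token figure above):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# fp16 KV footprint: context tokens x ~256 KB/token (figure quoted above)
for ctx in 32768 65536 131072 262144 524288 1048576; do
  awk -v t="$ctx" 'BEGIN { printf "%8d tokens: %6.1f GB fp16 KV\n", t, t * 256 / 1024 / 1024 }'
done
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;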

&lt;p&gt;A 128 GB MacBook with &lt;code&gt;flash-attn&lt;/code&gt; and &lt;code&gt;mlock&lt;/code&gt; on can fit one 35B model at 128K with f16 KV, just barely. 256K doesn't fit. Co-resident two-model setups (architect + editor) don't fit at all past 64K.&lt;/p&gt;

&lt;p&gt;Standard &lt;code&gt;q8_0&lt;/code&gt; quantization halves the KV footprint with sub-1% perplexity penalty. That gets you to 256K with a single model on the Mac.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TurboQuant&lt;/strong&gt; (&lt;a href="https://research.google/blog/turboquant-redefining-ai-efficiency-with-extreme-compression/" rel="noopener noreferrer"&gt;Google Research, ICLR 2026&lt;/a&gt;, &lt;a href="https://arxiv.org/abs/2504.19874" rel="noopener noreferrer"&gt;arxiv:2504.19874&lt;/a&gt;) compresses further. Randomized Walsh-Hadamard transforms decorrelate KV blocks before scalar quantization, hitting &lt;strong&gt;~3.25 bits per value&lt;/strong&gt; (&lt;code&gt;turbo3&lt;/code&gt;) or &lt;strong&gt;~4.25 bits per value&lt;/strong&gt; (&lt;code&gt;turbo4&lt;/code&gt;) with attention-fidelity loss inside the noise floor of normal sampling variance.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Cache type&lt;/th&gt;
&lt;th&gt;bits/value&lt;/th&gt;
&lt;th&gt;Compression vs fp16&lt;/th&gt;
&lt;th&gt;KV at 256K&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;f16&lt;/td&gt;
&lt;td&gt;16.0&lt;/td&gt;
&lt;td&gt;1.0×&lt;/td&gt;
&lt;td&gt;~64 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;q8_0&lt;/td&gt;
&lt;td&gt;8.0&lt;/td&gt;
&lt;td&gt;2.0×&lt;/td&gt;
&lt;td&gt;~32 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;turbo4&lt;/td&gt;
&lt;td&gt;4.25&lt;/td&gt;
&lt;td&gt;3.8×&lt;/td&gt;
&lt;td&gt;~17 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;turbo3&lt;/td&gt;
&lt;td&gt;3.25&lt;/td&gt;
&lt;td&gt;4.9×&lt;/td&gt;
&lt;td&gt;~13 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
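&lt;p&gt;The compressed rows follow the same arithmetic, scaled by bits-per-value over 16 (a sketch using the bits/value column above):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# KV at 256K per cache type: ~64 GB fp16 scaled by bits/16
for entry in f16:16.0 q8_0:8.0 turbo4:4.25 turbo3:3.25; do
  awk -v e="$entry" 'BEGIN { split(e, a, ":"); printf "%-7s %5.2f bits: %5.1f GB at 256K\n", a[1], a[2], 64 * a[2] / 16 }'
done
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;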

&lt;p&gt;Upstream discussion at &lt;a href="https://github.com/ggml-org/llama.cpp/discussions/20969" rel="noopener noreferrer"&gt;ggml-org/llama.cpp#20969&lt;/a&gt;. It hasn't landed in mainline llama.cpp yet; support is arriving in per-backend forks. &lt;strong&gt;TheTom's fork&lt;/strong&gt; is the Metal-supporting variant.&lt;/p&gt;




&lt;h2&gt;
  
  
  The bench
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;llama-bench&lt;/code&gt; from TheTom's fork build, single Qwen3.6-35B-A3B Q8 model, sweep across cache types and KV-depths.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./build/bin/llama-bench &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-m&lt;/span&gt; Qwen3.6-35B-A3B-Q8_0.gguf &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-ctk&lt;/span&gt; turbo3 &lt;span class="nt"&gt;-ctv&lt;/span&gt; turbo3 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; 0 &lt;span class="nt"&gt;-d&lt;/span&gt; 8192 &lt;span class="nt"&gt;-d&lt;/span&gt; 32768 &lt;span class="nt"&gt;-d&lt;/span&gt; 131072 &lt;span class="nt"&gt;-d&lt;/span&gt; 262144 &lt;span class="nt"&gt;-d&lt;/span&gt; 524288 &lt;span class="nt"&gt;-d&lt;/span&gt; 1048576 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; 512 &lt;span class="nt"&gt;-n&lt;/span&gt; 128 &lt;span class="nt"&gt;-ngl&lt;/span&gt; 99 &lt;span class="nt"&gt;-fa&lt;/span&gt; 1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--threads&lt;/span&gt; 6 &lt;span class="nt"&gt;--batch-size&lt;/span&gt; 2048 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-r&lt;/span&gt; 3 &lt;span class="nt"&gt;-o&lt;/span&gt; md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;-d N&lt;/code&gt; pre-allocates N tokens of KV cache before measuring throughput. Each number is the mean of 3 repetitions. The Metal agent was stopped during the run to keep the memory budget clean. The 1M cell on &lt;code&gt;turbo3&lt;/code&gt; alone took several hours of wall-clock time; the full sweep ran ~7 hours overnight.&lt;/p&gt;




&lt;h2&gt;
  
  
  The numbers
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Generation throughput (tok/s)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Depth&lt;/th&gt;
&lt;th&gt;f16&lt;/th&gt;
&lt;th&gt;q8_0&lt;/th&gt;
&lt;th&gt;turbo3&lt;/th&gt;
&lt;th&gt;turbo4&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;89.4&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;87.4&lt;/td&gt;
&lt;td&gt;79.5&lt;/td&gt;
&lt;td&gt;79.7&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8K&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;84.2&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;79.2&lt;/td&gt;
&lt;td&gt;72.2&lt;/td&gt;
&lt;td&gt;71.2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;32K&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;72.6&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;67.8&lt;/td&gt;
&lt;td&gt;61.5&lt;/td&gt;
&lt;td&gt;61.8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;64K&lt;/td&gt;
&lt;td&gt;60.7&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;128K&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;44.4&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;40.7&lt;/td&gt;
&lt;td&gt;36.0&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;37.7&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;256K&lt;/td&gt;
&lt;td&gt;OOM&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;26.6&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;22.9&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;25.5&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;512K&lt;/td&gt;
&lt;td&gt;OOM&lt;/td&gt;
&lt;td&gt;OOM&lt;/td&gt;
&lt;td&gt;13.3&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;16.0&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1M&lt;/td&gt;
&lt;td&gt;OOM&lt;/td&gt;
&lt;td&gt;OOM&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;6.51&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;OOM&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Prompt processing throughput (tok/s)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Depth&lt;/th&gt;
&lt;th&gt;f16&lt;/th&gt;
&lt;th&gt;q8_0&lt;/th&gt;
&lt;th&gt;turbo3&lt;/th&gt;
&lt;th&gt;turbo4&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2962&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;2948&lt;/td&gt;
&lt;td&gt;2904&lt;/td&gt;
&lt;td&gt;2854&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8K&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2098&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1623&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1653&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1439&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;32K&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1063&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;802&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;784&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;678&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;128K&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;321&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;245&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;253&lt;/strong&gt; ← turbo3 ≥ q8_0&lt;/td&gt;
&lt;td&gt;206&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;256K&lt;/td&gt;
&lt;td&gt;OOM&lt;/td&gt;
&lt;td&gt;124&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;128&lt;/strong&gt; ← turbo3 &amp;gt; q8_0&lt;/td&gt;
&lt;td&gt;101&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;512K&lt;/td&gt;
&lt;td&gt;OOM&lt;/td&gt;
&lt;td&gt;OOM&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;66&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;56&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1M&lt;/td&gt;
&lt;td&gt;OOM&lt;/td&gt;
&lt;td&gt;OOM&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;30.1&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;OOM&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Full grid is final. Bench ran 8h 20m wall-clock.&lt;/p&gt;




&lt;h2&gt;
  
  
  Finding 1: turbo3 beats q8_0 at long context
&lt;/h2&gt;

&lt;p&gt;The framing in the upstream discussion is approximately &lt;em&gt;"turbo3 trades a small (~10%) generation throughput hit for ~2.5× more KV memory headroom."&lt;/em&gt; That's true at short context. At long context, &lt;strong&gt;the trade flips&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;At 128K depth, f16 wins prefill at 321 tok/s, but &lt;strong&gt;turbo3 at 253 tok/s edges out q8_0 at 245 tok/s&lt;/strong&gt;. At 256K (where f16 OOMs), &lt;strong&gt;turbo3 at 128 tok/s beats q8_0 at 124 tok/s&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;What's happening: at 35B-class model size with deep contexts, the GPU spends most of its time during attention reading KV cache from memory rather than computing on it. Smaller cache → less bandwidth pressure → throughput recovers, even though there's more dequantization work per access. The break-even is somewhere between 32K and 128K on M5 Max.&lt;/p&gt;

&lt;p&gt;For coding-agent workloads where context grows monotonically across a session, &lt;strong&gt;this is the regime that matters&lt;/strong&gt;. You're spending most of your tokens at 32K+ depth, not at depth 0.&lt;/p&gt;




&lt;h2&gt;
  
  
  Finding 2: turbo3 and turbo4 split by workload phase
&lt;/h2&gt;

&lt;p&gt;The 25% extra bits per value in &lt;code&gt;turbo4&lt;/code&gt; (4.25 vs 3.25 bits) buys you something specific, and what it buys depends on the phase.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prefill (prompt processing) at long context:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Depth&lt;/th&gt;
&lt;th&gt;turbo3 pp&lt;/th&gt;
&lt;th&gt;turbo4 pp&lt;/th&gt;
&lt;th&gt;turbo3 advantage&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;8K&lt;/td&gt;
&lt;td&gt;1653&lt;/td&gt;
&lt;td&gt;1439&lt;/td&gt;
&lt;td&gt;+15%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;32K&lt;/td&gt;
&lt;td&gt;784&lt;/td&gt;
&lt;td&gt;678&lt;/td&gt;
&lt;td&gt;+16%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;128K&lt;/td&gt;
&lt;td&gt;253&lt;/td&gt;
&lt;td&gt;206&lt;/td&gt;
&lt;td&gt;+23%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;256K&lt;/td&gt;
&lt;td&gt;128&lt;/td&gt;
&lt;td&gt;101&lt;/td&gt;
&lt;td&gt;+27%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;512K&lt;/td&gt;
&lt;td&gt;66&lt;/td&gt;
&lt;td&gt;56&lt;/td&gt;
&lt;td&gt;+18%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Smaller cache means less data to read per attention step; during prefill the GPU pulls huge contiguous batches through attention, and the bandwidth-bound regime favors &lt;code&gt;turbo3&lt;/code&gt; cleanly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Decode (generation) at long context:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Depth&lt;/th&gt;
&lt;th&gt;turbo3 tg&lt;/th&gt;
&lt;th&gt;turbo4 tg&lt;/th&gt;
&lt;th&gt;turbo4 advantage&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;128K&lt;/td&gt;
&lt;td&gt;36.0&lt;/td&gt;
&lt;td&gt;37.7&lt;/td&gt;
&lt;td&gt;+5%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;256K&lt;/td&gt;
&lt;td&gt;22.9&lt;/td&gt;
&lt;td&gt;25.5&lt;/td&gt;
&lt;td&gt;+11%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;512K&lt;/td&gt;
&lt;td&gt;13.3&lt;/td&gt;
&lt;td&gt;16.0&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+20%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;During decode the dequantization overhead per access matters more than total bytes read. &lt;code&gt;turbo4&lt;/code&gt;'s simpler representation (4.25 bits has less complex quantization geometry than 3.25 bits) wins at the per-token attention pass — and the gap &lt;strong&gt;widens&lt;/strong&gt; with depth.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Practical implications by workload:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Workload shape&lt;/th&gt;
&lt;th&gt;Cache type&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Aider/OpenCode coding agents (deep context, lots of generated tokens)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;turbo4&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Wins decode at depth&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RAG-heavy / batch question answering (heavy prefill, short answers)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;turbo3&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Wins prefill at depth&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pure context-window maximization (1M context)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;turbo3&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Only cache type that fits at 1M&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Short-context interactive (≤32K)&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;f16&lt;/code&gt; if it fits, else &lt;code&gt;q8_0&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Both turbos are ~10% slower&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This isn't a framing the upstream community discussion has surfaced clearly. Different bottleneck regimes for different phases, and the right cache type depends on which phase dominates your workload.&lt;/p&gt;
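&lt;p&gt;Translated into server flags, the split looks roughly like this (a sketch; we're assuming the fork's &lt;code&gt;llama-server&lt;/code&gt; accepts the same &lt;code&gt;-ctk&lt;/code&gt;/&lt;code&gt;-ctv&lt;/code&gt; values as its &lt;code&gt;llama-bench&lt;/code&gt;, so check &lt;code&gt;--help&lt;/code&gt; on your build):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Decode-heavy agentic coding at depth: favor turbo4
./build/bin/llama-server -m Qwen3.6-35B-A3B-Q8_0.gguf \
  -c 262144 -ngl 99 -fa 1 -ctk turbo4 -ctv turbo4

# Prefill-heavy RAG / batch question answering: favor turbo3
./build/bin/llama-server -m Qwen3.6-35B-A3B-Q8_0.gguf \
  -c 262144 -ngl 99 -fa 1 -ctk turbo3 -ctv turbo3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;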




&lt;h2&gt;
  
  
  What this enables on a MacBook
&lt;/h2&gt;

&lt;p&gt;Three concrete capabilities:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;256K context for two co-resident coding models.&lt;/strong&gt; turbo3 KV at 256K (~13 GB) plus 37 GB Qwen3.6 weights, alongside Devstral-Small-2-24B at the same context with comparable footprint, totals ~88 GB (arithmetic sketched after this list). Under the 100 GB practical budget.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;1M context for batch / agentic workloads.&lt;/strong&gt; turbo3 KV at 1M is ~52 GB. We measured &lt;strong&gt;30 tok/s prefill, 6.5 tok/s decode at 1M&lt;/strong&gt; on Qwen3.6-35B-A3B Q8. Slow — a 4K-token agent response at 1M context is ~10 minutes wall-clock — but &lt;strong&gt;it works&lt;/strong&gt;. Overnight agentic batches that need the full context window are feasible. As far as we can tell, nobody else has demonstrated this on Apple Silicon yet.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;More headroom for non-attention buffers.&lt;/strong&gt; Cutting KV by 5× makes batch buffers, prefix cache, and draft models for speculative decoding actually composable.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
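&lt;p&gt;The arithmetic behind the two-model budget in item 1, as a sketch (the Devstral weight and KV figures are approximate):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Co-resident budget sketch; Devstral figures are approximate
awk 'BEGIN {
  qwen_w = 37; qwen_kv = 13     # Qwen3.6-35B-A3B Q8 weights + turbo3 KV at 256K (from above)
  dev_w  = 25; dev_kv  = 13     # Devstral-Small-2-24B Q8 weights + comparable KV (approx.)
  printf "co-resident total: %d GB of a ~100 GB practical budget\n", qwen_w + qwen_kv + dev_w + dev_kv
}'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;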




&lt;h2&gt;
  
  
  Caveats
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;TheTom's fork is research-grade.&lt;/strong&gt; Pinned to commit &lt;code&gt;11a241d0d&lt;/code&gt;; rebases needed as upstream moves.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LLMKube's metal-runtime can't drive turbo3/turbo4 yet&lt;/strong&gt; because of &lt;a href="https://github.com/defilantech/LLMKube/issues/349" rel="noopener noreferrer"&gt;#349&lt;/a&gt; and &lt;a href="https://github.com/defilantech/LLMKube/issues/350" rel="noopener noreferrer"&gt;#350&lt;/a&gt;. &lt;a href="https://github.com/defilantech/LLMKube/pull/353" rel="noopener noreferrer"&gt;PR #353&lt;/a&gt; closes #350; #349 is next.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No perplexity numbers in this run.&lt;/strong&gt; Throughput and memory ceilings only. The +1% perplexity penalty for turbo3 in the upstream discussion is on Qwen 3.5 — we'll re-run on Qwen 3.6 in a follow-up.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Single hardware sample.&lt;/strong&gt; M5 Max only. Crossover point and prefill/decode split likely shift with memory bandwidth (614 GB/s on M5 Max) and GPU core count.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  What we contributed back
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/defilantech/LLMKube/pull/351" rel="noopener noreferrer"&gt;LLMKube PR #351&lt;/a&gt;&lt;/strong&gt; (merged): &lt;code&gt;cacheTypeCustomK&lt;/code&gt;/&lt;code&gt;cacheTypeCustomV&lt;/code&gt; on &lt;code&gt;InferenceServiceSpec&lt;/code&gt;. Closes &lt;a href="https://github.com/defilantech/LLMKube/issues/282" rel="noopener noreferrer"&gt;#282&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/defilantech/LLMKube/pull/353" rel="noopener noreferrer"&gt;LLMKube PR #353&lt;/a&gt;&lt;/strong&gt; (open): metal-agent respawns on ISVC spec drift; honors &lt;code&gt;replicas: 0&lt;/code&gt;. Closes &lt;a href="https://github.com/defilantech/LLMKube/issues/350" rel="noopener noreferrer"&gt;#350&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Issues filed:&lt;/strong&gt; &lt;a href="https://github.com/defilantech/LLMKube/issues/349" rel="noopener noreferrer"&gt;#349&lt;/a&gt;, &lt;a href="https://github.com/defilantech/LLMKube/issues/350" rel="noopener noreferrer"&gt;#350&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Comment going to llama.cpp discussion #20969&lt;/strong&gt; with the M5 Max numbers and the prefill/decode split.&lt;/li&gt;
&lt;/ul&gt;
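&lt;p&gt;For a sense of where those fields land, a hypothetical spec snippet (the &lt;code&gt;cacheTypeCustomK&lt;/code&gt;/&lt;code&gt;cacheTypeCustomV&lt;/code&gt; names come from PR #351; the apiVersion and every other field here is illustrative, not the actual LLMKube schema):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Hypothetical: apiVersion and surrounding fields are illustrative only;
# the two cacheTypeCustom* fields are what PR #351 added.
kubectl apply -f - &amp;lt;&amp;lt;'EOF'
apiVersion: llmkube.dev/v1alpha1      # illustrative
kind: InferenceService
metadata:
  name: qwen36-turbo3
spec:
  cacheTypeCustomK: turbo3
  cacheTypeCustomV: turbo3
EOF
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;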




&lt;h2&gt;
  
  
  How to try it yourself
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Build TheTom's fork&lt;/span&gt;
git clone https://github.com/TheTom/llama-cpp-turboquant.git
&lt;span class="nb"&gt;cd &lt;/span&gt;llama-cpp-turboquant
git checkout feature/turboquant-kv-cache
cmake &lt;span class="nt"&gt;-B&lt;/span&gt; build &lt;span class="nt"&gt;-DGGML_METAL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ON &lt;span class="nt"&gt;-DCMAKE_BUILD_TYPE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;Release
cmake &lt;span class="nt"&gt;--build&lt;/span&gt; build &lt;span class="nt"&gt;-j&lt;/span&gt;

&lt;span class="c"&gt;# 2. Run the bench (turbo3 and turbo4 separately to see the split)&lt;/span&gt;
./build/bin/llama-bench &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-m&lt;/span&gt; /path/to/your/model.gguf &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-ctk&lt;/span&gt; turbo3 &lt;span class="nt"&gt;-ctv&lt;/span&gt; turbo3 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; 0 &lt;span class="nt"&gt;-d&lt;/span&gt; 32768 &lt;span class="nt"&gt;-d&lt;/span&gt; 131072 &lt;span class="nt"&gt;-d&lt;/span&gt; 262144 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; 512 &lt;span class="nt"&gt;-n&lt;/span&gt; 128 &lt;span class="nt"&gt;-ngl&lt;/span&gt; 99 &lt;span class="nt"&gt;-fa&lt;/span&gt; 1 &lt;span class="nt"&gt;-r&lt;/span&gt; 3 &lt;span class="nt"&gt;-o&lt;/span&gt; md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Memory ceiling depends on your unified-memory budget; sub-64 GB Macs probably can't reach 256K with a 35B-class model at any cache type. On M3 Pro/Max-class machines, a 13B model at 128K with turbo3 is the more realistic target.&lt;/p&gt;
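&lt;p&gt;To estimate your own ceiling before committing to a long run (a sketch; the 256 KB/token fp16 figure is the 35B-class number quoted earlier and will differ for other models):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Rough max-context estimate: (memory budget - weights) / per-token KV at chosen bits
awk -v budget_gb=100 -v weights_gb=37 -v bits=3.25 'BEGIN {
  kv_kb_per_tok = 256 * bits / 16                    # fp16 is ~256 KB/token for this model
  free_kb = (budget_gb - weights_gb) * 1024 * 1024
  printf "~%.0fK tokens of context headroom\n", free_kb / kv_kb_per_tok / 1024
}'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;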

&lt;p&gt;For NVIDIA: &lt;a href="https://github.com/spiritbuun/llama-cpp-turboquant-cuda" rel="noopener noreferrer"&gt;@spiritbuun's CUDA fork&lt;/a&gt; is the equivalent path.&lt;/p&gt;




&lt;h2&gt;
  
  
  Open invitation
&lt;/h2&gt;

&lt;p&gt;If you have non-M5-Max Apple Silicon (M2 Pro/Max, M3 Ultra, M4 Max) and want to run the same bench, &lt;strong&gt;we want your numbers&lt;/strong&gt;. The crossover point and the prefill/decode split likely shift with memory bandwidth.&lt;/p&gt;

&lt;p&gt;Drop results in &lt;a href="https://github.com/ggml-org/llama.cpp/discussions/20969" rel="noopener noreferrer"&gt;llama.cpp discussion #20969&lt;/a&gt; or open an issue on &lt;a href="https://github.com/defilantech/llmkube" rel="noopener noreferrer"&gt;defilantech/llmkube&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>kubernetes</category>
      <category>opensource</category>
    </item>
    <item>
      <title>62.2% on Aider Polyglot from a MacBook Pro. Then the other model we tried scored 4%. Here's what actually happened, with a working cost loop attached.</title>
      <dc:creator>Christopher Maher</dc:creator>
      <pubDate>Mon, 27 Apr 2026 06:24:59 +0000</pubDate>
      <link>https://dev.to/defilan/628-on-aider-polyglot-from-a-macbook-pro-then-the-other-model-we-tried-scored-4-heres-what-17ed</link>
      <guid>https://dev.to/defilan/628-on-aider-polyglot-from-a-macbook-pro-then-the-other-model-we-tried-scored-4-heres-what-17ed</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://llmkube.com/blog/m5-max-aider-polyglot-and-finops" rel="noopener noreferrer"&gt;llmkube.com/blog/m5-max-aider-polyglot-and-finops&lt;/a&gt;. Cross-posted here for the dev.to audience.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A 24-hour Aider Polyglot run, a follow-up bench that blew up in interesting ways, and a working &lt;code&gt;$/MTok&lt;/code&gt; number from a Kubernetes operator that scrapes Apple Silicon power live. Two open-source PRs landed today to make all of this reproducible on any M-series Mac.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;This is a coding-model benchmark on locally-served weights, plus a FinOps story.&lt;/strong&gt; Every benchmark number traces to results files we can show you. Every cost number traces to a CSV captured by InferCost during the run. The point is the methodology and the tooling; the model rankings are along for the ride.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Qwen3.6-35B-A3B Q8&lt;/strong&gt; (Tongyi Lab, Apache 2.0) hit &lt;strong&gt;62.2% on Aider Polyglot&lt;/strong&gt; (pass_rate_2, n=225/225) running locally on a MacBook Pro M5 Max via LLMKube's Metal Agent. That places it above Claude Sonnet 4 with 32k thinking budget (61.3%), o1-high (61.7%), DeepSeek R1 original (56.9%), and Claude 3.5 Sonnet (51.6%) on the official Aider leaderboard. It also beats every published Qwen-family entry on the Polyglot board.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Devstral-Small-2-2512 Q8&lt;/strong&gt; (Mistral, Apache 2.0) hit &lt;strong&gt;4% on Aider Polyglot diff format&lt;/strong&gt;, &lt;strong&gt;8% on Aider Polyglot whole format&lt;/strong&gt;, and &lt;strong&gt;81.7% on HumanEval+ (164 problems, all passed standard)&lt;/strong&gt;. Same model. 20× swing. Benchmark numbers don't transfer across harnesses, and you should never quote one without naming the other.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;InferCost ran the whole time.&lt;/strong&gt; The new Apple Silicon collector (shipped in &lt;a href="https://github.com/defilantech/infercost/releases/tag/v0.3.0" rel="noopener noreferrer"&gt;InferCost v0.3.0&lt;/a&gt;) reconciled &lt;code&gt;$0.18/hr&lt;/code&gt; against the &lt;code&gt;apple-m5-max&lt;/code&gt; CostProfile, with InferCost's reading agreeing with the LLMKube agent's direct gauge within &lt;code&gt;1.6 W&lt;/code&gt; mean delta over the Qwen window. First widely-published &lt;code&gt;$/MTok&lt;/code&gt; number for an Apple Silicon LLM workload that traces to a real Prometheus scrape.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Two releases shipped alongside this post&lt;/strong&gt; make all of it reproducible on your own Mac: &lt;a href="https://github.com/defilantech/llmkube/releases/tag/v0.7.2" rel="noopener noreferrer"&gt;LLMKube v0.7.2&lt;/a&gt; (Apple power gauges via powermetrics, security-hardened sudoers, and a one-command &lt;code&gt;make install-powermetrics-sudo&lt;/code&gt;) and &lt;a href="https://github.com/defilantech/infercost/releases/tag/v0.3.0" rel="noopener noreferrer"&gt;InferCost v0.3.0&lt;/a&gt; (Metal collector, condition reporting, sample CostProfile).&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  1. The hardware and what's special about it
&lt;/h2&gt;

&lt;p&gt;The bench machine is a MacBook Pro M5 Max, 2026 model:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Spec&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;GPU&lt;/td&gt;
&lt;td&gt;40-core integrated, Metal 4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CPU&lt;/td&gt;
&lt;td&gt;18-core (6 P-cores, 12 E-cores)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Unified memory&lt;/td&gt;
&lt;td&gt;128 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory bandwidth&lt;/td&gt;
&lt;td&gt;614 GB/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OS&lt;/td&gt;
&lt;td&gt;macOS 25.4 (Darwin)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Price&lt;/td&gt;
&lt;td&gt;About $4,500 fully configured&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Source: &lt;a href="https://www.apple.com/newsroom/2026/03/apple-debuts-m5-pro-and-m5-max-to-supercharge-the-most-demanding-pro-workflows/" rel="noopener noreferrer"&gt;Apple newsroom&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The 614 GB/s bandwidth is the constraint that decides everything that follows. For a dense 24B model at Q8, you need to read about 25 GB per generated token, so the upper bound is &lt;code&gt;614 / 25 = 24.56 t/s&lt;/code&gt; and we measured 24 t/s, within 2.3% of the wall. For a MoE like Qwen3.6-35B-A3B, only the ~3B active parameters are read per token, so the wall is ~200 t/s and you actually get to choose how to spend the bandwidth. That's the whole story behind why MoE feels fast on a Mac.&lt;/p&gt;
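&lt;p&gt;The same back-of-the-envelope as a one-liner (a sketch of the arithmetic above):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Decode ceiling = memory bandwidth / bytes read per generated token
awk 'BEGIN {
  bw = 614                                  # GB/s on M5 Max
  printf "dense 24B at Q8:  %.1f t/s ceiling (we measured ~24)\n", bw / 25
  printf "MoE, ~3B active:  ~%.0f t/s ceiling\n", bw / 3
}'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;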

&lt;p&gt;Stack: LLMKube v0.7.x with the Metal Agent feature branch from PR #334 cherry-picked in (now main), &lt;code&gt;llama-server&lt;/code&gt; from llama.cpp Metal, and a kind cluster on the same host for the K8s control plane. InferCost was running locally via &lt;code&gt;go run ./cmd/main.go&lt;/code&gt;, pointed at the LLMKube agent's &lt;code&gt;/metrics&lt;/code&gt; endpoint via a new &lt;code&gt;--metal-endpoint&lt;/code&gt; flag.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. Qwen3.6-35B-A3B Q8 on Aider Polyglot
&lt;/h2&gt;

&lt;p&gt;The Qwen3.6 family includes a dense 27B and an MoE variant at 35B total / 3B active per token. We ran the MoE quantized to Q8_0 (~36 GB on disk, fits comfortably in 128 GB unified memory with room for KV cache and the rest of macOS).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://aider.chat/docs/leaderboards/" rel="noopener noreferrer"&gt;Aider Polyglot&lt;/a&gt; is a 225-problem benchmark across C++, Go, Java, JavaScript, Python, and Rust, designed to keep top frontier coding LLMs in the 5-50% range. Each model gets two attempts per problem: a single-shot solve, and a second attempt after seeing the failed test output. The headline metric is &lt;code&gt;pass_rate_2&lt;/code&gt;, the percentage of problems that passed all tests within those two attempts.&lt;/p&gt;

&lt;p&gt;Aider was driven from inside a Docker container (&lt;code&gt;aider-benchmark&lt;/code&gt; image) talking to llama-server via &lt;code&gt;OPENAI_API_BASE=http://host.docker.internal:&amp;lt;port&amp;gt;/v1&lt;/code&gt;. Edit format was &lt;code&gt;diff&lt;/code&gt; (Aider's standard for capable models). Threads = 4. The model id we passed to LiteLLM was &lt;code&gt;openai/Qwen3.6-35B-A3B-Q8_0.gguf&lt;/code&gt;, the basename llama-server reports.&lt;/p&gt;
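&lt;p&gt;For reference, the shape of the invocation (a sketch that assumes aider's standard &lt;code&gt;benchmark/benchmark.py&lt;/code&gt; entrypoint; the run name and port are placeholders, the model id and flags are the ones described above):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Inside the aider-benchmark container; run name and port are placeholders
export OPENAI_API_BASE=http://host.docker.internal:${PORT}/v1
export OPENAI_API_KEY=dummy     # LiteLLM wants a key set; llama-server ignores it by default
./benchmark/benchmark.py qwen36-m5max-polyglot \
  --model openai/Qwen3.6-35B-A3B-Q8_0.gguf \
  --edit-format diff --threads 4
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;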

&lt;p&gt;The full run accumulated &lt;strong&gt;49.9 hours of inference time&lt;/strong&gt; (summed across the 4 parallel threads) over about 24 hours of real time, plus a follow-up resume cycle to handle a runaway-reasoning failure mode. More on that in §3.&lt;/p&gt;

&lt;h3&gt;
  
  
  The headline result
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;pass_rate_2 = 62.2%&lt;/code&gt; (140 of 225), &lt;code&gt;pass_rate_1 = 34.7%&lt;/code&gt; (78 of 225)&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Verified against the official &lt;a href="https://github.com/Aider-AI/aider/blob/main/aider/website/_data/polyglot_leaderboard.yml" rel="noopener noreferrer"&gt;Aider Polyglot leaderboard yaml&lt;/a&gt; pulled today, here's where that lands among the published baselines:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;pass_rate_2&lt;/th&gt;
&lt;th&gt;format&lt;/th&gt;
&lt;th&gt;model&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;88.0%&lt;/td&gt;
&lt;td&gt;diff&lt;/td&gt;
&lt;td&gt;gpt-5 (high)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;84.9%&lt;/td&gt;
&lt;td&gt;diff&lt;/td&gt;
&lt;td&gt;o3-pro (high)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;81.3%&lt;/td&gt;
&lt;td&gt;diff&lt;/td&gt;
&lt;td&gt;o3 (high)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;79.6%&lt;/td&gt;
&lt;td&gt;diff&lt;/td&gt;
&lt;td&gt;grok-4 (high)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;72.0%&lt;/td&gt;
&lt;td&gt;diff&lt;/td&gt;
&lt;td&gt;Claude Opus 4 (32k thinking)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;71.4%&lt;/td&gt;
&lt;td&gt;diff&lt;/td&gt;
&lt;td&gt;DeepSeek R1 (0528)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;64.9%&lt;/td&gt;
&lt;td&gt;diff&lt;/td&gt;
&lt;td&gt;Claude 3.7 Sonnet (32k thinking)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;64.0%&lt;/td&gt;
&lt;td&gt;architect&lt;/td&gt;
&lt;td&gt;DeepSeek R1 + Claude 3.5 Sonnet&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;62.2%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;diff&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Qwen3.6-35B-A3B Q8 (this run, M5 Max, Apache 2.0, ours)&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;61.7%&lt;/td&gt;
&lt;td&gt;diff&lt;/td&gt;
&lt;td&gt;o1-2024-12-17 (high)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;61.3%&lt;/td&gt;
&lt;td&gt;diff&lt;/td&gt;
&lt;td&gt;Claude Sonnet 4 (32k thinking)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;60.4%&lt;/td&gt;
&lt;td&gt;diff&lt;/td&gt;
&lt;td&gt;Claude 3.7 Sonnet (no thinking)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;59.6%&lt;/td&gt;
&lt;td&gt;diff&lt;/td&gt;
&lt;td&gt;Qwen3 235B A22B (no think, Alibaba API)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;56.9%&lt;/td&gt;
&lt;td&gt;diff&lt;/td&gt;
&lt;td&gt;DeepSeek R1 (original)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;56.4%&lt;/td&gt;
&lt;td&gt;diff&lt;/td&gt;
&lt;td&gt;Claude Sonnet 4 (no thinking)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;51.6%&lt;/td&gt;
&lt;td&gt;diff&lt;/td&gt;
&lt;td&gt;Claude 3.5 Sonnet&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;40.0%&lt;/td&gt;
&lt;td&gt;diff&lt;/td&gt;
&lt;td&gt;Qwen3 32B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8.0%&lt;/td&gt;
&lt;td&gt;diff&lt;/td&gt;
&lt;td&gt;Qwen2.5-Coder-32B&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The defensible reads:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Beats Claude Sonnet 4 with 32k thinking budget by &lt;strong&gt;0.9 percentage points&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Beats o1-high by &lt;strong&gt;0.5 percentage points&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Beats DeepSeek R1 original by &lt;strong&gt;5.3 percentage points&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Beats Claude 3.5 Sonnet by &lt;strong&gt;10.6 percentage points&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Within &lt;strong&gt;2.7 points of Claude 3.7 Sonnet (32k thinking)&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Strongest open-weights Qwen-family number on the Polyglot leaderboard. Qwen3 32B sat at 40.0%, Qwen3 235B A22B at 59.6%. The 35B-A3B MoE quantization is doing real work for its size.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;What we are not claiming: that this beats Opus 4, GPT-5, o3, or DeepSeek V3.2-Exp Reasoner. Those all sit above us on the leaderboard. Qwen3.6 is in the same band as Sonnet 4 thinking, not in the band with o3-high or GPT-5.&lt;/p&gt;

&lt;h3&gt;
  
  
  Per-language
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Language&lt;/th&gt;
&lt;th&gt;n&lt;/th&gt;
&lt;th&gt;pass_1&lt;/th&gt;
&lt;th&gt;pass_2&lt;/th&gt;
&lt;th&gt;p2 %&lt;/th&gt;
&lt;th&gt;avg min/exercise&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;python&lt;/td&gt;
&lt;td&gt;34&lt;/td&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;td&gt;25&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;73.5%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;4.8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;javascript&lt;/td&gt;
&lt;td&gt;49&lt;/td&gt;
&lt;td&gt;20&lt;/td&gt;
&lt;td&gt;35&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;71.4%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;5.3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;go&lt;/td&gt;
&lt;td&gt;39&lt;/td&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;24&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;61.5%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;6.2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;rust&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;td&gt;17&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;56.7%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;9.2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;cpp&lt;/td&gt;
&lt;td&gt;26&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;53.8%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;21.7&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;java&lt;/td&gt;
&lt;td&gt;47&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;25&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;53.2%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;31.9&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Two things worth noting. First, Python (73.5%) and JavaScript (71.4%) look like clean Sonnet-3.5-thinking territory on the languages most developers actually use Aider for. Second, Java at 31.9 minutes per exercise on average is inflated by the runaway-reasoning case described next. Strip the outlier and Java's average is in line with C++.&lt;/p&gt;




&lt;h2&gt;
  
  
  3. The runaway-reasoning failure mode (and the resume that closed it out)
&lt;/h2&gt;

&lt;p&gt;About 21 hours into the run, the container got stuck on a Java exercise that consumed 80 minutes of wall time without writing a new result file or producing meaningful output. The log mtime stayed frozen, the container stayed "Up," and the model was clearly deep in a reasoning loop with no exit strategy. We stopped the container manually at &lt;strong&gt;n=223/225&lt;/strong&gt; and recorded the runaway-reasoning failure mode as a real characteristic of hybrid-thinking MoE models on agentic harnesses.&lt;/p&gt;

&lt;p&gt;The next night, we &lt;strong&gt;resumed via Aider's official &lt;code&gt;--cont&lt;/code&gt; flag&lt;/strong&gt; against the same run directory. Two missing exercises (&lt;code&gt;rust/forth&lt;/code&gt; and &lt;code&gt;javascript/go-counting&lt;/code&gt;) ran in parallel under &lt;code&gt;--threads 4&lt;/code&gt; and completed in about 6 minutes each. Both failed both attempts. Final result: &lt;strong&gt;n=225/225&lt;/strong&gt;, &lt;strong&gt;pass_rate_2 = 62.2%&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The headline ticked &lt;strong&gt;down&lt;/strong&gt; by 0.6 percentage points compared to the n=223 partial (62.8% → 62.2%) because the two missing exercises both failed. That's the most honest defense against any "stopped early to lock in a favorable number" critique: completing the run actually hurt us.&lt;/p&gt;

&lt;p&gt;If you reproduce this and see a similar hang, kill the container, run with &lt;code&gt;--cont&lt;/code&gt; later to fill in the gaps. The full data is healthier than a partial.&lt;/p&gt;




&lt;h2&gt;
  
  
  4. The other thing we wanted to test
&lt;/h2&gt;

&lt;p&gt;With Qwen3.6 in hand, the natural next move was a comparison candidate. The ideal contrast: a dense model purpose-built for agentic coding, not a general-purpose coder.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://huggingface.co/mistralai/Devstral-Small-2-24B-Instruct-2512" rel="noopener noreferrer"&gt;Devstral-Small-2-24B-Instruct-2512&lt;/a&gt; was the obvious pick. Mistral and All Hands AI co-trained it specifically for software-engineering agents, it's Apache 2.0 dense 24B, has a 256K context window, and Mistral published 68.0% SWE-Bench Verified for it (a real number on a real benchmark). Released November 2025, so 5 months old at time of writing. Architecture is the new "Ministral 3 with rope-scaling and Scalable-Softmax" stack from Mistral, structurally different from Devstral 1.x.&lt;/p&gt;

&lt;p&gt;We deployed it via the same LLMKube + Metal Agent path, kicked off Aider Polyglot with &lt;code&gt;--num-tests 25&lt;/code&gt; (random subset, fits a 4-hour window at Devstral's slower decode speed of ~24 t/s), edit format &lt;code&gt;diff&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Result: &lt;strong&gt;&lt;code&gt;pass_rate_2 = 4.0%&lt;/code&gt; (1 of 25)&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Almost wrote it off as broken. Then read the Aider results files more carefully:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;92% of responses were syntactically well-formed diffs.&lt;/li&gt;
&lt;li&gt;Zero exhausted context windows.&lt;/li&gt;
&lt;li&gt;Average 4.4 minutes per exercise (fast, not stuck).&lt;/li&gt;
&lt;li&gt;The model was producing valid-looking edit blocks, they were just semantically wrong.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The model wasn't broken. It was doing what it had been trained to do, which apparently wasn't this.&lt;/p&gt;




&lt;h2&gt;
  
  
  5. Investigation
&lt;/h2&gt;

&lt;p&gt;Three hypotheses, ordered by what we tried:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hypothesis 1: The diff format is the problem.&lt;/strong&gt; Aider supports &lt;code&gt;--edit-format whole&lt;/code&gt; (output complete files instead of diffs). Re-ran with whole format on the same 25-exercise subset.&lt;/p&gt;

&lt;p&gt;Result: &lt;code&gt;pass_rate_2 = 8.0%&lt;/code&gt; (2 of 25). Better, but not by much. Hypothesis weakly supported.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hypothesis 2: llama.cpp isn't handling Devstral 2's new architecture correctly.&lt;/strong&gt; Worth checking before declaring the model bad. We ran HumanEval+ via &lt;a href="https://github.com/evalplus/evalplus" rel="noopener noreferrer"&gt;evalplus&lt;/a&gt;, pointed at the same llama-server endpoint, with a function-level Python coding harness that doesn't require any agentic tool-call discipline. If llama.cpp's tokenizer or attention implementation was off, we'd see it here.&lt;/p&gt;

&lt;p&gt;Result: &lt;strong&gt;&lt;code&gt;HumanEval pass@1 = 85.4%&lt;/code&gt;, &lt;code&gt;HumanEval+ pass@1 = 81.7%&lt;/code&gt;&lt;/strong&gt; (164 problems, scored in &lt;code&gt;ganler/evalplus&lt;/code&gt; Linux container because macOS's &lt;code&gt;setrlimit(RLIMIT_AS)&lt;/code&gt; doesn't behave the way evalplus's sandbox expects).&lt;/p&gt;

&lt;p&gt;That landed Devstral 2 in the same band as the top open-source 24B coders for function-level Python. Architecture is fine. llama.cpp is fine. The model is genuinely capable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hypothesis 3: The harness is the variable.&lt;/strong&gt; We re-read Mistral's README:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Devstral 2 can also be used with the following scaffoldings:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Mistral Vibe (recommended)&lt;/li&gt;
&lt;li&gt;Cline&lt;/li&gt;
&lt;li&gt;Kilo Code&lt;/li&gt;
&lt;li&gt;Claude Code&lt;/li&gt;
&lt;li&gt;OpenHands&lt;/li&gt;
&lt;li&gt;SWE Agent&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;Aider is not on this list. Devstral 2 was trained on tool-call traces from agentic-coding harnesses that use multi-turn function calls, not Aider's single-prompt-with-diff edit format. The model was producing what its training distribution rewarded; Aider's harness was scoring it on a different distribution entirely.&lt;/p&gt;

&lt;p&gt;Mistral itself adds, in the same README:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;we advise everyone to use the Mistral AI API if the model is subpar with local serving&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That's an explicit caveat from the model authors. The 4% wasn't a model failure or a runtime failure. It was a harness-distribution mismatch, exactly the failure mode the README warned about.&lt;/p&gt;




&lt;h2&gt;
  
  
  6. Same model, three benchmarks, three answers
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Benchmark&lt;/th&gt;
&lt;th&gt;Devstral 2 score&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Aider Polyglot, diff format&lt;/td&gt;
&lt;td&gt;4.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Aider Polyglot, whole format&lt;/td&gt;
&lt;td&gt;8.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;HumanEval+ (with adversarial tests)&lt;/td&gt;
&lt;td&gt;81.7%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;HumanEval (base)&lt;/td&gt;
&lt;td&gt;85.4%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Twenty times difference in measured "performance" on the same model, same hardware, same temperature, same week. This is the lesson worth taking away from the entire bench session.&lt;/p&gt;

&lt;p&gt;If you publish a single benchmark number for any agentic coding model, you are publishing a story about that model's compatibility with one specific harness, not a story about the model's coding capability. The Devstral 2 4% on Aider does not mean Devstral 2 is bad at coding. The Devstral 2 81.7% on HumanEval+ does not mean Devstral 2 is good at agentic edits in your IDE. They are both true and they describe different things.&lt;/p&gt;

&lt;p&gt;If you want to evaluate a coding model, run it through the harness you actually use day to day. If you can't, then quote at least two benchmarks from different parts of the harness landscape (one function-level, one agentic) and let the reader see the spread.&lt;/p&gt;




&lt;h2&gt;
  
  
  7. InferCost was running the whole time
&lt;/h2&gt;

&lt;p&gt;While the benchmarks were producing accuracy numbers, &lt;a href="https://infercost.ai" rel="noopener noreferrer"&gt;InferCost&lt;/a&gt; was producing the cost numbers. The new Apple Silicon collector (shipped in &lt;a href="https://github.com/defilantech/infercost/releases/tag/v0.3.0" rel="noopener noreferrer"&gt;InferCost v0.3.0&lt;/a&gt;) was reconciling the &lt;code&gt;apple-m5-max&lt;/code&gt; CostProfile every 30 seconds against the LLMKube Metal Agent's &lt;code&gt;apple_power_combined_watts&lt;/code&gt; gauge.&lt;/p&gt;

&lt;p&gt;Specifically, two things were running in the background of every benchmark above:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A second LLMKube Metal Agent on port 9091 with &lt;code&gt;--apple-power-enabled&lt;/code&gt;, publishing the four new &lt;code&gt;apple_power_*_watts&lt;/code&gt; Prometheus gauges sourced from a sudo'd &lt;code&gt;powermetrics&lt;/code&gt; subprocess. Pinned-argv NOPASSWD sudoers entry to keep the privilege grant tight (security audit caught and fixed three findings before merge: argv pinning, bin override rejection, absolute &lt;code&gt;/usr/bin/sudo&lt;/code&gt; to defeat $PATH attacks).&lt;/li&gt;
&lt;li&gt;InferCost as a local controller, pointed at &lt;code&gt;:9091/metrics&lt;/code&gt; via the new &lt;code&gt;--metal-endpoint&lt;/code&gt; CLI flag, reconciling an &lt;code&gt;apple-m5-max&lt;/code&gt; CostProfile using the new Metal scraper and dispatcher.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Plus a tiny CSV poller that sampled both layers every 60 seconds, writing 388 rows of telemetry across the day.&lt;/p&gt;
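&lt;p&gt;Reproducing that pairing is two commands once the agent is listening on &lt;code&gt;:9091&lt;/code&gt; (a sketch, run from an InferCost checkout; the flag and metric names are the ones quoted above):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Sanity-check the power gauge the Metal Agent publishes, then point InferCost at it
curl -s http://localhost:9091/metrics | grep apple_power_combined_watts
go run ./cmd/main.go --metal-endpoint http://localhost:9091/metrics
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;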

&lt;p&gt;Per-window aggregates, captured live during the runs:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Window&lt;/th&gt;
&lt;th&gt;Duration&lt;/th&gt;
&lt;th&gt;Mean combined W&lt;/th&gt;
&lt;th&gt;Mean InferCost $/hr&lt;/th&gt;
&lt;th&gt;Agent ↔ InferCost Δ (mean)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Qwen3.6-35B-A3B Q8 (full Aider)&lt;/td&gt;
&lt;td&gt;200 min&lt;/td&gt;
&lt;td&gt;27.3 W&lt;/td&gt;
&lt;td&gt;$0.1775&lt;/td&gt;
&lt;td&gt;1.60 W&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Devstral 2, Aider diff&lt;/td&gt;
&lt;td&gt;32 min&lt;/td&gt;
&lt;td&gt;32.7 W&lt;/td&gt;
&lt;td&gt;$0.1773&lt;/td&gt;
&lt;td&gt;6.21 W&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Devstral 2, Aider whole&lt;/td&gt;
&lt;td&gt;29 min&lt;/td&gt;
&lt;td&gt;35.3 W&lt;/td&gt;
&lt;td&gt;$0.1774&lt;/td&gt;
&lt;td&gt;8.08 W&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Devstral 2, HumanEval+&lt;/td&gt;
&lt;td&gt;55 min&lt;/td&gt;
&lt;td&gt;29.0 W&lt;/td&gt;
&lt;td&gt;$0.1770&lt;/td&gt;
&lt;td&gt;0.90 W&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The "Agent ↔ InferCost Δ" column is the validation result. The agent reads powermetrics every second; InferCost samples the gauge during its 30-second reconcile loop. If they were deeply wrong about each other we'd see double-digit deltas. We don't. Across the four windows, mean delta ranged from 0.9 W to 8 W (the 8 W was during Aider whole format, which has bursty prefill that the 30-second reconcile sometimes catches mid-spike). For the longer, sustained windows the agreement stays within a watt or two.&lt;/p&gt;

&lt;p&gt;Here is what &lt;code&gt;kubectl get costprofile apple-m5-max -o yaml&lt;/code&gt; looked like during the run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;currentPowerDrawWatts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;39.13&lt;/span&gt;
  &lt;span class="na"&gt;hourlyCostUSD&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.1805&lt;/span&gt;
  &lt;span class="na"&gt;amortizationRatePerHour&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.17466&lt;/span&gt;
  &lt;span class="na"&gt;electricityCostPerHour&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.00341&lt;/span&gt;
  &lt;span class="na"&gt;conditions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;MetalReachable&lt;/span&gt;
    &lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;True"&lt;/span&gt;
    &lt;span class="na"&gt;reason&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;MetalHealthy&lt;/span&gt;
    &lt;span class="na"&gt;message&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Metal&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;agent&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;healthy&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;at&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;http://localhost:9091/metrics&lt;/span&gt;
              &lt;span class="s"&gt;(39.1W&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;combined;&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;gpu=37.3&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;cpu=1.8&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;ane=0.0)."&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Ready&lt;/span&gt;
    &lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;True"&lt;/span&gt;
    &lt;span class="na"&gt;reason&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;CostComputed&lt;/span&gt;
    &lt;span class="na"&gt;message&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Hourly&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;cost:&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;$0.1805&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;(amort:&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;$0.1747,&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;elec:&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;$0.0059)"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Not a screenshot. Not a slide. The actual reconcile output from a Kubernetes operator scraping a sudo'd &lt;code&gt;powermetrics&lt;/code&gt; subprocess on the same Mac that was running the benchmark.&lt;/p&gt;




&lt;h2&gt;
  
  
  8. The cost economics
&lt;/h2&gt;

&lt;p&gt;The $4,500 laptop, amortized over 3 years, with maintenance at 2% of the purchase price flat:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Amortization per hour: &lt;code&gt;$4,500 × 1.02 / 3 / 8760 = $0.17466/hr&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Electricity at 41 W and $0.08/kWh (Peninsula Light residential rate, WA): &lt;code&gt;0.041 × 0.08 = $0.00328/hr&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Total hourly: &lt;strong&gt;$0.178/hr, of which 98.1% is amortization and 1.9% is electricity&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;
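&lt;p&gt;The same arithmetic as something you can rerun with your own electricity rate (a sketch of the line items above):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Hourly cost breakdown from the bullets above
awk 'BEGIN {
  amort = 4500 * 1.02 / 3 / 8760     # purchase + 2% maintenance, 3-year straight-line, 24/7
  elec  = 0.041 * 0.08               # 41 W at $0.08/kWh
  printf "total $%.3f/hr  (amortization $%.5f + electricity $%.5f)\n", amort + elec, amort, elec
}'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;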

&lt;p&gt;That ratio is the most useful thing the bench taught us. The marginal cost of running an LLM on a laptop you already own is essentially the electricity, which on Apple Silicon is genuinely cheap. The amortized cost is the laptop existing at all, which you pay whether or not the model runs.&lt;/p&gt;

&lt;p&gt;Two &lt;code&gt;$/MTok&lt;/code&gt; numbers from the windows where the token poller was working correctly:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Window&lt;/th&gt;
&lt;th&gt;Total tokens&lt;/th&gt;
&lt;th&gt;$/MTok&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Devstral 2, Aider whole (sustained edits)&lt;/td&gt;
&lt;td&gt;158,614&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$0.30&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Devstral 2, HumanEval+ (sequential function calls)&lt;/td&gt;
&lt;td&gt;90,916&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$1.76&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Aider's whole-file edits keep the GPU producing tokens for longer continuous bursts, which spreads the fixed amortization across more output. HumanEval+ runs many short function-level problems with eval-script setup time between them, which inflates the per-token cost because the laptop is "active" but not generating.&lt;/p&gt;

&lt;p&gt;Stacked against &lt;a href="https://platform.claude.com/docs/en/about-claude/pricing" rel="noopener noreferrer"&gt;Anthropic's published 2026 pricing&lt;/a&gt; of $3/MT input + $15/MT output for Claude Sonnet 4.6, blended around $6 to $9 per million total tokens depending on input:output ratio:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Local Devstral 2 sustained at &lt;strong&gt;$0.30/MTok&lt;/strong&gt;: about &lt;strong&gt;30× cheaper&lt;/strong&gt; at the margin than cloud Sonnet 4.6.&lt;/li&gt;
&lt;li&gt;Local Devstral 2 with idle gaps at &lt;strong&gt;$1.76/MTok&lt;/strong&gt;: about &lt;strong&gt;5× cheaper&lt;/strong&gt; at the margin.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both ratios assume the laptop is running 24/7 for the 3-year amortization horizon. If you actually use the laptop 8 hours a day, the effective amortization-per-active-hour is 3× higher, which compresses the ratio. If you use it 2 hours a day, it's 12× higher and the ratio collapses. The InferCost &lt;code&gt;UsageReport&lt;/code&gt; CRD is built specifically to compute the active vs idle split over a billing period, which is the FinOps question that nobody else is answering for Apple Silicon.&lt;/p&gt;




&lt;h2&gt;
  
  
  9. What we shipped today, and how to use it
&lt;/h2&gt;

&lt;p&gt;Two releases shipped alongside this post, both of which were necessary to do the cost story above end to end:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/defilantech/llmkube/releases/tag/v0.7.2" rel="noopener noreferrer"&gt;LLMKube v0.7.2&lt;/a&gt;: Apple Silicon power gauges via powermetrics + one-command sudoers install&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Adds 4 new Prometheus gauges (&lt;code&gt;combined / gpu / cpu / ane&lt;/code&gt; watts) to the existing Metal Agent (&lt;a href="https://github.com/defilantech/llmkube/pull/334" rel="noopener noreferrer"&gt;PR #334&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Sourced from a sudo'd &lt;code&gt;powermetrics --samplers cpu_power,gpu_power -i 1000&lt;/code&gt; subprocess&lt;/li&gt;
&lt;li&gt;Opt-in via &lt;code&gt;--apple-power-enabled&lt;/code&gt; flag (defaults off)&lt;/li&gt;
&lt;li&gt;NOPASSWD sudoers fragment with &lt;strong&gt;pinned argv&lt;/strong&gt; for safe install (security audit caught and fixed three findings before merge: argv pinning, &lt;code&gt;--powermetrics-bin&lt;/code&gt; override rejection, absolute &lt;code&gt;/usr/bin/sudo&lt;/code&gt; to defeat $PATH substitution attacks)&lt;/li&gt;
&lt;li&gt;New &lt;code&gt;make install-powermetrics-sudo&lt;/code&gt; and &lt;code&gt;make uninstall-powermetrics-sudo&lt;/code&gt; targets (&lt;a href="https://github.com/defilantech/llmkube/pull/336" rel="noopener noreferrer"&gt;PR #336&lt;/a&gt;) so the privileged install is one command instead of a 5-line &lt;code&gt;sed&lt;/code&gt; + &lt;code&gt;visudo&lt;/code&gt; + &lt;code&gt;install&lt;/code&gt; shell incantation&lt;/li&gt;
&lt;li&gt;Coverage gap closed: extracted helper at 100% test coverage&lt;/li&gt;
&lt;li&gt;Zero impact on existing setups; without the flag, behavior is unchanged&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/defilantech/infercost/releases/tag/v0.3.0" rel="noopener noreferrer"&gt;InferCost v0.3.0&lt;/a&gt;: Apple Silicon (Metal) power collector&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Adds &lt;code&gt;internal/scraper/metal.go&lt;/code&gt; mirroring the existing DCGM scraper (&lt;a href="https://github.com/defilantech/infercost/pull/47" rel="noopener noreferrer"&gt;PR #47&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;New &lt;code&gt;MetalReachable&lt;/code&gt; condition with reasons &lt;code&gt;MetalHealthy / MetalNotConfigured / MetalScrapeError / MetalSamplerOff&lt;/code&gt; so operators on a Mac don't see "DCGM unreachable" messages&lt;/li&gt;
&lt;li&gt;A 10-line dispatcher in the CostProfile reconciler keys off whether &lt;code&gt;MetalEndpoint&lt;/code&gt; is set plus &lt;code&gt;looksApple(gpuModel)&lt;/code&gt; (a minimal sketch follows after this list)&lt;/li&gt;
&lt;li&gt;New &lt;code&gt;apple-m5-max.yaml&lt;/code&gt; sample CostProfile and updated &lt;code&gt;apple-m2-ultra.yaml&lt;/code&gt; with real setup steps&lt;/li&gt;
&lt;li&gt;8 controller tests + 5 scraper tests; existing DCGM tests untouched&lt;/li&gt;
&lt;/ul&gt;
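
&lt;p&gt;Here's the minimal sketch promised above. Only &lt;code&gt;MetalEndpoint&lt;/code&gt; and &lt;code&gt;looksApple&lt;/code&gt; are real names from the PR; the &lt;code&gt;powerScraper&lt;/code&gt; interface, the concrete types, and the prefix check inside &lt;code&gt;looksApple&lt;/code&gt; are invented for illustration, and the real dispatcher lives inside the CostProfile reconciler rather than a free function:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;package scraper

import "strings"

// powerScraper is an invented interface for this sketch; InferCost's real
// scraper types live in internal/scraper.
type powerScraper interface {
    CollectWatts() (float64, error)
}

type dcgmScraper struct{}                   // stand-in for the existing DCGM scraper
type metalScraper struct{ endpoint string } // stand-in for internal/scraper/metal.go

func (dcgmScraper) CollectWatts() (float64, error)  { return 0, nil }
func (metalScraper) CollectWatts() (float64, error) { return 0, nil }

// looksApple mirrors the helper named in the PR; the exact check is a guess.
func looksApple(gpuModel string) bool {
    return strings.HasPrefix(strings.ToLower(gpuModel), "apple")
}

// Metal only when a MetalEndpoint is configured AND the profile's GPU model
// looks like Apple Silicon; everything else falls through to DCGM.
func resolveScraper(metalEndpoint, gpuModel string) powerScraper {
    if metalEndpoint != "" &amp;amp;&amp;amp; looksApple(gpuModel) {
        return metalScraper{endpoint: metalEndpoint}
    }
    return dcgmScraper{}
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;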

&lt;p&gt;If you have a MacBook Pro M5 (or M3/M4 Max with enough memory), the full install is now five short steps:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Install llama.cpp (needed by the Metal Agent for serving GGUF weights)&lt;/span&gt;
brew &lt;span class="nb"&gt;install &lt;/span&gt;llama.cpp

&lt;span class="c"&gt;# 2. Install LLMKube via Helm&lt;/span&gt;
helm repo add llmkube https://defilantech.github.io/llmkube
helm &lt;span class="nb"&gt;install &lt;/span&gt;llmkube llmkube/llmkube &lt;span class="nt"&gt;--version&lt;/span&gt; 0.7.2

&lt;span class="c"&gt;# 3. Build + install the Metal Agent and grant powermetrics access&lt;/span&gt;
git clone https://github.com/defilantech/llmkube &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;cd &lt;/span&gt;llmkube
make install-metal-agent          &lt;span class="c"&gt;# builds + installs the launchd service&lt;/span&gt;
make install-powermetrics-sudo    &lt;span class="c"&gt;# one-command pinned-argv NOPASSWD sudoers install&lt;/span&gt;

&lt;span class="c"&gt;# 4. Restart the agent with --apple-power-enabled in your launchd plist&lt;/span&gt;
&lt;span class="c"&gt;#    (edit ~/Library/LaunchAgents/com.llmkube.metal-agent.plist, then reload)&lt;/span&gt;
launchctl unload ~/Library/LaunchAgents/com.llmkube.metal-agent.plist
launchctl load   ~/Library/LaunchAgents/com.llmkube.metal-agent.plist

&lt;span class="c"&gt;# 5. Deploy InferCost pointed at the agent and apply the sample CostProfile&lt;/span&gt;
helm repo add infercost https://defilantech.github.io/infercost
helm &lt;span class="nb"&gt;install &lt;/span&gt;infercost infercost/infercost &lt;span class="nt"&gt;--version&lt;/span&gt; 0.3.0 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set&lt;/span&gt; metal.endpoint&lt;span class="o"&gt;=&lt;/span&gt;http://localhost:9090/metrics
kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; https://raw.githubusercontent.com/defilantech/infercost/main/config/samples/costprofiles/apple-m5-max.yaml

&lt;span class="c"&gt;# Watch the live reconcile&lt;/span&gt;
kubectl get costprofile apple-m5-max &lt;span class="nt"&gt;-o&lt;/span&gt; yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;make install-powermetrics-sudo&lt;/code&gt; step is the one privileged moment: sudo prompts you for your password, the make target validates the sudoers syntax with &lt;code&gt;visudo -cf&lt;/code&gt; before installing, then echoes the granted command back so you can verify exactly what was authorized. The grant is scoped to &lt;code&gt;/usr/bin/powermetrics --samplers cpu_power\,gpu_power -i [0-9]*&lt;/code&gt; and nothing else. To remove it later, &lt;code&gt;make uninstall-powermetrics-sudo&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Edit &lt;code&gt;purchasePriceUSD&lt;/code&gt;, &lt;code&gt;electricity.ratePerKWh&lt;/code&gt;, and &lt;code&gt;nodeSelector&lt;/code&gt; in the CostProfile to match your reality.&lt;/p&gt;

&lt;p&gt;Both projects are open source and hungry for the kind of feedback that comes from running them on hardware we don't have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;LLMKube&lt;/strong&gt; (github.com/defilantech/llmkube). Kubernetes-native LLM serving operator. Runs llama.cpp and vLLM on NVIDIA, Metal Agent for Apple Silicon. Stars and &lt;code&gt;good-first-issue&lt;/code&gt; PRs both very welcome. The Metal Agent in particular benefits enormously from Mac-having developers running it through &lt;code&gt;--apple-power-enabled&lt;/code&gt;, finding the edge cases we missed, and filing issues.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;InferCost&lt;/strong&gt; (github.com/defilantech/infercost). Kubernetes-native AI FinOps. Cost attribution per workload, namespace, and model, with both NVIDIA (DCGM) and now Apple Silicon (this PR) power sources. The &lt;code&gt;UsageReport&lt;/code&gt; CRD is the next thing to push on; if you have a multi-Mac fleet or a mixed NVIDIA+Apple environment, we'd love to hear what reports would help your team.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  10. Reproducibility
&lt;/h2&gt;

&lt;p&gt;Every number in this post traces back to a file you can pull or a benchmark you can re-run.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;LLMKube: github.com/defilantech/llmkube, main branch at commit &lt;code&gt;58a94a7&lt;/code&gt; (PR #334 merged). Issue #335 closed.&lt;/li&gt;
&lt;li&gt;InferCost: github.com/defilantech/infercost, main branch at commit &lt;code&gt;422a4f0&lt;/code&gt; (PR #47 merged). Issue #46 closed.&lt;/li&gt;
&lt;li&gt;Aider Polyglot harness: github.com/Aider-AI/aider with &lt;a href="https://github.com/Aider-AI/polyglot-benchmark" rel="noopener noreferrer"&gt;polyglot-benchmark&lt;/a&gt; exercises.&lt;/li&gt;
&lt;li&gt;Aider Polyglot leaderboard: &lt;a href="https://github.com/Aider-AI/aider/blob/main/aider/website/_data/polyglot_leaderboard.yml" rel="noopener noreferrer"&gt;polyglot_leaderboard.yml on aider main&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;evalplus: github.com/evalplus/evalplus, scored via &lt;code&gt;ganler/evalplus&lt;/code&gt; container for the macOS &lt;code&gt;RLIMIT_AS&lt;/code&gt; workaround.&lt;/li&gt;
&lt;li&gt;Run scripts: &lt;code&gt;aider/run-aider-polyglot.sh&lt;/code&gt; (Qwen) and &lt;code&gt;aider/run-aider-devstral.sh&lt;/code&gt; (Devstral) on this host, both straightforward Bash that invoke the Aider docker container with the right model id and edit format.&lt;/li&gt;
&lt;li&gt;Power + cost telemetry: &lt;code&gt;/tmp/infercost-m5max-telemetry.csv&lt;/code&gt; (388 power samples) and &lt;code&gt;/tmp/infercost-m5max-tokens.csv&lt;/code&gt; (333 llama-server token-counter samples). Window markers (&lt;code&gt;# QWEN_RUN_END&lt;/code&gt;, &lt;code&gt;# DEVSTRAL_RUN_START&lt;/code&gt;, etc.) inline in the CSV.&lt;/li&gt;
&lt;li&gt;Sample CostProfile: &lt;code&gt;config/samples/costprofiles/apple-m5-max.yaml&lt;/code&gt; in the InferCost repo.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If a reproducer hits something different, please open an issue against whichever repo is the closest fit. The Apple Silicon path in particular is brand new, and the cohort of people who could give it a real workout is small but motivated.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's next
&lt;/h2&gt;

&lt;p&gt;A few things the data points to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The InferCost &lt;code&gt;UsageReport&lt;/code&gt; CRD needs a real multi-day test on a Mac running mixed inference + idle. The active vs idle split is the FinOps lever for local models, and we have one day of data; we want a month.&lt;/li&gt;
&lt;li&gt;Multi-Mac fleet support in InferCost (auto-discovery of LLMKube Metal Agents via label selector) would let teams deploy InferCost once and have it follow agents around. An issue tracking that work is already open.&lt;/li&gt;
&lt;li&gt;We benched Devstral 2 on Aider and HumanEval+. We did &lt;em&gt;not&lt;/em&gt; bench it on its native scaffold (Mistral Vibe / OpenHands / Cline). That comparison is the right one for a daily-driver evaluation and it's the next thing we'll publish.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you're running local LLM inference on your own hardware and care about either the serving side (LLMKube) or the cost side (InferCost), the easiest way to push these projects forward is to point them at your environment, file the issue you'd want to fix, and let us know what number would actually help your team.&lt;/p&gt;

&lt;p&gt;Both projects are Apache 2.0. Stars on &lt;a href="https://github.com/defilantech/llmkube" rel="noopener noreferrer"&gt;LLMKube&lt;/a&gt; and &lt;a href="https://github.com/defilantech/infercost" rel="noopener noreferrer"&gt;InferCost&lt;/a&gt; are appreciated and signal the kind of validation that helps prioritize the next round of work.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>ai</category>
      <category>llm</category>
      <category>opensource</category>
    </item>
    <item>
      <title>We ran Qwen3.6-27B on $800 of consumer GPUs, day one: llama.cpp vs vLLM</title>
      <dc:creator>Christopher Maher</dc:creator>
      <pubDate>Fri, 24 Apr 2026 03:06:36 +0000</pubDate>
      <link>https://dev.to/defilan/we-ran-qwen36-27b-on-800-of-consumer-gpus-day-one-llamacpp-vs-vllm-mg1</link>
      <guid>https://dev.to/defilan/we-ran-qwen36-27b-on-800-of-consumer-gpus-day-one-llamacpp-vs-vllm-mg1</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://llmkube.com/blog/qwen3-6-27b-bakeoff" rel="noopener noreferrer"&gt;llmkube.com/blog/qwen3-6-27b-bakeoff&lt;/a&gt;. Cross-posted here for the dev.to audience.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A Kubernetes-native bake-off on 2× RTX 5060 Ti, with reproducible manifests and a cost-per-token number neither cloud nor OSS FinOps tools will tell you.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;This is a runtime comparison, not a model evaluation.&lt;/strong&gt; Both llama.cpp and vLLM serve the same Qwen3.6-27B in every cell; we're measuring how the two serving stacks differ on identical work. Where cloud APIs enter in §8, it's on cost, not capability — this post makes no claim about whether Qwen3.6-27B "beats" GPT-4o or Claude on task quality.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Qwen3.6-27B&lt;/strong&gt; (Tongyi Lab, released 2026-04-21, Apache 2.0) runs on a pair of &lt;strong&gt;RTX 5060 Ti 16 GB&lt;/strong&gt; consumer cards via Kubernetes + LLMKube. Total hardware: about $800 street.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;vLLM wins throughput by 3 to 4×&lt;/strong&gt; at high concurrency thanks to NVFP4 and PagedAttention. &lt;strong&gt;llama.cpp plus TurboQuant wins context&lt;/strong&gt; — we served one 43K-token prompt end-to-end (a single captured sample; higher-concurrency cells timed out on our 300 s harness budget) on hardware where vLLM's in-memory cap is 16K.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost per million tokens is two numbers&lt;/strong&gt;, not one: &lt;strong&gt;$0.13 amortized&lt;/strong&gt; (full cost of ownership) and &lt;strong&gt;$0.010 marginal&lt;/strong&gt; (electricity during active serving). At 32.7% utilization over the bench window, the 13× gap between them is the real FinOps conversation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Everything is reproducible.&lt;/strong&gt; Manifests, harness, and &lt;code&gt;summary.csv&lt;/code&gt; at &lt;a href="https://github.com/defilantech/llmkube-bench" rel="noopener noreferrer"&gt;github.com/defilantech/llmkube-bench&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  1. Why we did this
&lt;/h2&gt;

&lt;p&gt;Two days ago, Tongyi Lab dropped Qwen3.6-27B with the claim it matches frontier agentic-coding models at the 27B parameter count. The community response was predictable: does this actually work locally, or is it another model that benchmarks well but nobody can run? (Note for readers comparing against Qwen3.6-35B-A3B: the 27B is the non-MoE sibling. None of the MoE-specific flags like &lt;code&gt;--cpu-moe&lt;/code&gt; apply here.)&lt;/p&gt;

&lt;p&gt;The ecosystem has a harder time answering "how should I serve it?" There are two dominant open-source inference runtimes for models like this, and they optimize for different things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;llama.cpp&lt;/strong&gt; — ubiquitous, GGUF-based, broad quantization support, runs on almost anything with a GPU. Adopted by the hobbyist and homelab crowd. Recently grew TurboQuant KV-cache compression (&lt;a href="https://github.com/ggml-org/llama.cpp/discussions/20969" rel="noopener noreferrer"&gt;ggml-org/llama.cpp#20969&lt;/a&gt;), pushing achievable context windows on small VRAM into territory nobody else touches.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;vLLM&lt;/strong&gt; — throughput-focused, PagedAttention, continuous batching, FP8/NVFP4 on recent NVIDIA. The production serving runtime for teams running real traffic, targeting data center hardware.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The ecosystem answers "which should I use" with vibes and forum posts. We wanted numbers — from the same hardware, same model, same day the model dropped. If a 27B-class model can genuinely run on a pair of $400 GPUs, the practical question for anyone thinking about on-prem inference is which runtime makes that hardware actually worth something.&lt;/p&gt;

&lt;p&gt;So we benchmarked both, published every configuration, and then turned the token counts into dollars using our companion tool &lt;a href="https://infercost.ai" rel="noopener noreferrer"&gt;InferCost&lt;/a&gt;, so the "is it cheaper than the cloud?" question has an honest answer rather than the usual founder-math.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Hardware and the constraint
&lt;/h2&gt;

&lt;p&gt;The node running this bench is &lt;strong&gt;shadowstack&lt;/strong&gt; — a microk8s cluster on a single box:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;GPUs&lt;/td&gt;
&lt;td&gt;2× NVIDIA GeForce RTX 5060 Ti 16 GB (Blackwell GB206)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPU memory&lt;/td&gt;
&lt;td&gt;15.48 GiB usable per card after driver reserve (30.96 GiB aggregate)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OS&lt;/td&gt;
&lt;td&gt;Ubuntu 24.04.3 LTS, kernel 6.17.0-oem&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Kubernetes&lt;/td&gt;
&lt;td&gt;MicroK8s v1.32.13&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Orchestration&lt;/td&gt;
&lt;td&gt;LLMKube operator (chart 0.7.0) + NVIDIA GPU Operator + DCGM exporter&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Street price&lt;/td&gt;
&lt;td&gt;about $400/card × 2&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The 5060 Ti is a &lt;strong&gt;Blackwell consumer GPU with native FP4 hardware&lt;/strong&gt;. That is load-bearing. Without NVFP4, the 27B class is out of reach. At BF16 the model would need about 55 GB, at FP8 about 28 GB, at NVFP4 about 14 GB. Only the last one fits 2× 16 GB with room for activations and KV cache.&lt;/p&gt;
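
&lt;p&gt;Those footprint numbers are straight bytes-per-parameter arithmetic. A quick sketch (it ignores embedding tables and quantization scale overhead, which is why the real checkpoints land a little higher):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;package main

import "fmt"

func main() {
    const params = 27e9 // Qwen3.6-27B, dense

    for _, f := range []struct {
        name string
        bits float64
    }{
        {"BF16", 16}, {"FP8", 8}, {"NVFP4", 4},
    } {
        gb := params * f.bits / 8 / 1e9
        fmt.Printf("%-5s ~%.0f GB of weights\n", f.name, gb)
    }
    // BF16 ~54 GB, FP8 ~27 GB, NVFP4 ~14 GB: only the last leaves room for
    // activations and KV cache inside 2× 15.48 GiB.
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;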

&lt;p&gt;&lt;strong&gt;The VRAM budget is the whole story.&lt;/strong&gt; On enterprise hardware (H100, A100, even the 3090 that the community's "qwen 27B on a 3090" discourse is built on), most of this bake-off's complexity disappears. On 2× 16 GB consumer cards you are constantly one configuration flag away from an out-of-memory crash, and the runtime that lets you navigate that wins real users.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. The first attempt that didn't work
&lt;/h2&gt;

&lt;p&gt;Our original target was &lt;code&gt;Qwen/Qwen3.5-27B-FP8&lt;/code&gt; (Qwen's official FP8 safetensors, the model everyone was excited about). On paper: 28 GB weights, TP=2, about 14 GB per shard. Should fit.&lt;/p&gt;

&lt;p&gt;It doesn't. Qwen's 27B-class FP8 release is a &lt;strong&gt;VLM&lt;/strong&gt; — the checkpoint includes a vision encoder that stays resident in VRAM whether or not you ever send an image. Three successive mitigations on vLLM, each measured against the crash logs:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Default config.&lt;/strong&gt; OOM during &lt;code&gt;profile_run&lt;/code&gt; on the vision encoder:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CUDA out of memory. Tried to allocate 576.00 MiB.
GPU 0 has a total capacity of 15.48 GiB of which 175.19 MiB is free.
This process has 15.30 GiB memory in use.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2. &lt;code&gt;--limit-mm-per-prompt image=0,video=0&lt;/code&gt;, &lt;code&gt;maxModelLen&lt;/code&gt; 16K, &lt;code&gt;max-num-batched-tokens&lt;/code&gt; 4K.&lt;/strong&gt; Skipped multimodal dummy inputs during profile. The vision encoder weights stay resident. OOM now at &lt;code&gt;determine_available_memory&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Tried to allocate 1.19 GiB.
GPU 0 has 1.02 GiB free.
This process has 14.45 GiB in use.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;3. &lt;code&gt;--gpu-memory-utilization 0.95&lt;/code&gt;, &lt;code&gt;PYTORCH_ALLOC_CONF=expandable_segments:True&lt;/code&gt;.&lt;/strong&gt; Pushed against the wall:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Tried to allocate 32.00 MiB.
GPU 0 has 3.19 MiB free.
This process has 15.47 GiB in use.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;15.47 of 15.48 GiB. No knob left. &lt;strong&gt;Qwen3.5-27B-FP8 cannot be served via vLLM on 2× 16 GB consumer cards in any configuration we found.&lt;/strong&gt; A 3090 or 4090 (24 GB) would have considerably more headroom for the vision encoder plus KV cache (we didn't reproduce on one, but it's plausible the default config would fit there). That's a real hardware-sizing footnote to the "run 27B locally" discourse: a pair of 16 GB cards is not automatically enough.&lt;/p&gt;

&lt;p&gt;Then Qwen3.6-27B dropped, and within 24 hours the community had published &lt;strong&gt;NVFP4&lt;/strong&gt; quants that halve the weight footprint again. That is the pivot that made this bench possible.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Method
&lt;/h2&gt;

&lt;p&gt;Both runtimes run Qwen3.6-27B, served via LLMKube as a Kubernetes Deployment with OpenAI-compatible endpoints, and are benchmarked against each other on identical workloads. All manifests live in the public repo.&lt;/p&gt;

&lt;h3&gt;
  
  
  llama.cpp candidate
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Source&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;unsloth/Qwen3.6-27B-GGUF&lt;/code&gt; Q4_K_M (~17 GB)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Parallelism&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;split-mode=layer&lt;/code&gt; across both GPUs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;KV cache&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;TurboQuant&lt;/strong&gt; &lt;code&gt;tbqp3&lt;/code&gt; (keys) + &lt;code&gt;tbq3&lt;/code&gt; (values) — about 3 bits/element&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Max context&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;65,536&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Image&lt;/td&gt;
&lt;td&gt;AmesianX's TurboQuant fork v1.5.2, built from source (Kaniko manifest in the bench repo; retarget to your own registry to reproduce)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Flash attention&lt;/td&gt;
&lt;td&gt;on&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Parallel slots&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;16 for short patterns&lt;/strong&gt; (chat, coding, agentic), &lt;strong&gt;1 for long-context patterns&lt;/strong&gt; (&lt;code&gt;long_context&lt;/code&gt;, &lt;code&gt;long_context_extreme&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;TurboQuant is AmesianX's llama.cpp fork implementing the KV-cache compression algorithm from &lt;a href="https://arxiv.org/pdf/2504.19874" rel="noopener noreferrer"&gt;Google Research's TurboQuant paper&lt;/a&gt;. The scheme is asymmetric: QJL correction (&lt;code&gt;tbqp*&lt;/code&gt;) is applied to keys only, because keys feed the Q·K inner products, while values only pass through a softmax-weighted sum. Our own internal benchmarks show about 60% KV-cache reduction vs f16 at the same context, which is the table stakes for pushing context on small VRAM.&lt;/p&gt;

&lt;p&gt;The slot count asymmetry matters and we want to be upfront about it: llama.cpp divides &lt;code&gt;--ctx-size&lt;/code&gt; by &lt;code&gt;--parallel&lt;/code&gt; to get per-slot context. With &lt;code&gt;parallelSlots=16&lt;/code&gt; and 65K total context, each slot gets 4 K tokens, which is enough for chat/coding/agentic prompts but rejects 5 K+ long-context requests. Dropping to &lt;code&gt;parallelSlots=1&lt;/code&gt; gives every request the full 65 K, at the cost of serving concurrent long-context requests from a queue. Readers should treat llama.cpp's &lt;code&gt;long_context&lt;/code&gt; c=16/c=64 numbers as queue-behavior measurements, not throughput measurements.&lt;/p&gt;

&lt;h3&gt;
  
  
  vLLM candidate
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Source&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;sakamakismile/Qwen3.6-27B-NVFP4&lt;/code&gt; (~14 GB)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Parallelism&lt;/td&gt;
&lt;td&gt;tensor-parallel (TP=2)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Quantization&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;compressed-tensors&lt;/code&gt; wrapping NVFP4 (Blackwell-native 4-bit float)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;KV cache&lt;/td&gt;
&lt;td&gt;FP8 E4M3 (8 bits)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Max context&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;16,384&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Attention backend&lt;/td&gt;
&lt;td&gt;FLASHINFER&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CUDA graphs&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;disabled&lt;/strong&gt; (&lt;code&gt;--enforce-eager&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Prefix caching&lt;/td&gt;
&lt;td&gt;on&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Chunked prefill&lt;/td&gt;
&lt;td&gt;on&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Image&lt;/td&gt;
&lt;td&gt;&lt;code&gt;vllm/vllm-openai:latest&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Two forced choices here deserve a note:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;--enforce-eager&lt;/code&gt;&lt;/strong&gt; because CUDA graph capture for NVFP4 plus VLM weights plus KV cache exhausts the 15.48 GiB budget before KV init even starts. Skipping graph capture costs about 10 to 15% throughput, which becomes part of the fair comparison: on this hardware class vLLM gives up one of its own optimizations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;maxModelLen: 16384&lt;/code&gt;&lt;/strong&gt; is not "the model's ceiling". It is what fits after NVFP4 weights (14 GB / 2 = 7 GB/shard), vision encoder (~2 GB), KV cache at FP8, and activations. 32K OOMs during profile; 16K fits with about 1 GiB headroom.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Workloads
&lt;/h3&gt;

&lt;p&gt;Five patterns × four concurrency levels per runtime:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Pattern&lt;/th&gt;
&lt;th&gt;Shape&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;chat&lt;/td&gt;
&lt;td&gt;128-in / 256-out, 20 prompts&lt;/td&gt;
&lt;td&gt;Interactive baseline&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;coding&lt;/td&gt;
&lt;td&gt;1K-in / 1K-out, 20 prompts&lt;/td&gt;
&lt;td&gt;Typical code-gen turn&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;long_context&lt;/td&gt;
&lt;td&gt;~5K-in / 1K-out, 10 prompts&lt;/td&gt;
&lt;td&gt;Code review, RAG-heavy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;long_context_extreme&lt;/td&gt;
&lt;td&gt;~43K-in / 1K-out, 10 prompts&lt;/td&gt;
&lt;td&gt;vLLM's 16K cap cannot attempt this&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;agentic&lt;/td&gt;
&lt;td&gt;4K shared prefix + 512 delta / 512-out, 20 prompts&lt;/td&gt;
&lt;td&gt;Stresses prefix caching&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Concurrency &lt;code&gt;1, 4, 16, 64&lt;/code&gt;. Per cell: 2 min warmup (discarded) + 5 min measurement. Temperature 0, seed 42, streaming on.&lt;/p&gt;

&lt;p&gt;The full workload matrix is 40 cells (5 × 4 × 2 runtimes). We run 36 of them. &lt;code&gt;long_context_extreme&lt;/code&gt; is not attempted on vLLM because its 16K cap would reject every prompt before submission. That asymmetry is one of the bake-off's findings, not a methodology gap.&lt;/p&gt;
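
&lt;p&gt;For concreteness, this is roughly the request shape the harness sends to either runtime's OpenAI-compatible endpoint. It's a hedged sketch, not the harness code; the endpoint URL, model id, and prompt are placeholders:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;package main

import (
    "bytes"
    "encoding/json"
    "fmt"
    "net/http"
)

func main() {
    // Placeholder endpoint and model id; the real harness targets the LLMKube
    // Service for whichever runtime is under test.
    body, _ := json.Marshal(map[string]any{
        "model":       "qwen36-27b",
        "messages":    []map[string]string{{"role": "user", "content": "Write a binary search in Go."}},
        "temperature": 0,    // deterministic sampling
        "seed":        42,   // fixed seed, per the method above
        "stream":      true, // streaming on, so TTFT and ITL can be timestamped per chunk
        "max_tokens":  1024,
    })
    resp, err := http.Post("http://localhost:8080/v1/chat/completions", "application/json", bytes.NewReader(body))
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()
    fmt.Println("status:", resp.Status) // the real harness reads the SSE stream and records per-token timings
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;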

&lt;h2&gt;
  
  
  5. Results: throughput and latency
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Single-request latency (c=1)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;pattern&lt;/th&gt;
&lt;th&gt;llama.cpp TTFT p50&lt;/th&gt;
&lt;th&gt;vLLM TTFT p50&lt;/th&gt;
&lt;th&gt;Winner&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;chat&lt;/td&gt;
&lt;td&gt;208 ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;157 ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;vLLM&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;coding&lt;/td&gt;
&lt;td&gt;413 ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;106 ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;vLLM&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;agentic&lt;/td&gt;
&lt;td&gt;911 ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;409 ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;vLLM&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;long_context (5K)&lt;/td&gt;
&lt;td&gt;2,279 ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;581 ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;vLLM&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;vLLM is faster at single-request latency across the board, typically 2 to 4× on prefill-heavy patterns. llama.cpp plus TurboQuant pays a prefill tax: compressing the KV cache to about 3 bits per element is memory-cheap and compute-expensive. On short prompts the gap is narrow; on long prompts it opens up.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Quantization caveat:&lt;/strong&gt; these numbers compare Q4_K_M (llama.cpp) against NVFP4 (vLLM). They are not the same quantization, and on this hardware there is no apples-to-apples option: llama.cpp doesn't ship an NVFP4 runtime, and Q4_K_M has no vLLM implementation. We've filled out a side-by-side output-quality check in &lt;a href="https://github.com/defilantech/llmkube-bench/blob/main/docs/QUALITY-GATE.md" rel="noopener noreferrer"&gt;QUALITY-GATE.md&lt;/a&gt; so readers can judge whether the two quants produce comparable answers at this parameter count. Read the speed numbers as "at each runtime's native quant on this hardware," not "at identical model quality."&lt;/p&gt;

&lt;h3&gt;
  
  
  Throughput under load (c=64)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;pattern&lt;/th&gt;
&lt;th&gt;llama.cpp tok/s&lt;/th&gt;
&lt;th&gt;vLLM tok/s&lt;/th&gt;
&lt;th&gt;Ratio&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;chat&lt;/td&gt;
&lt;td&gt;94&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;345&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;3.7×&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;coding&lt;/td&gt;
&lt;td&gt;133 (60% success)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;377&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;2.8×&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;agentic&lt;/td&gt;
&lt;td&gt;72&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;262&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;3.6×&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This is vLLM's home turf. PagedAttention plus continuous batching turn 64 concurrent requests into about 90% GPU utilization; llama.cpp's slot-based scheduling (even with 16 parallel slots) serializes far more aggressively. The coding c=64 drop to 60% success on llama.cpp is KV cache saturation: with 16 slots and about 2K of per-slot context, heavy coding prompts overflow.&lt;/p&gt;

&lt;h3&gt;
  
  
  Inter-token latency
&lt;/h3&gt;

&lt;p&gt;Stable and tight on both runtimes. Median ITL:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;llama.cpp:&lt;/strong&gt; 49 to 175 ms/token across patterns and concurrencies&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;vLLM:&lt;/strong&gt; 64 to 67 ms/token across patterns and concurrencies (remarkably flat, because continuous batching amortizes decode across the batch)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The llama.cpp ITL spread widens at high concurrency as slot contention kicks in. vLLM's is basically a constant, which is what makes it good for conversational workloads where you care about per-token cadence.&lt;/p&gt;

&lt;h3&gt;
  
  
  The honest version
&lt;/h3&gt;

&lt;p&gt;vLLM wins the throughput axis. That's a real result, not a function of tuning. On 2× 16 GB consumer hardware with Qwen3.6-27B, &lt;strong&gt;if you're trying to maximize requests per second, vLLM is the answer&lt;/strong&gt;, and it wins while giving up about 10 to 15% of its own throughput to &lt;code&gt;--enforce-eager&lt;/code&gt; (disabled CUDA graphs were required to fit VRAM). The NVFP4 kernels on Blackwell, PagedAttention's batching, and continuous prefill scheduling all compound even with that handicap.&lt;/p&gt;

&lt;p&gt;Except…&lt;/p&gt;

&lt;h2&gt;
  
  
  6. Results: context
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The 5K baseline
&lt;/h3&gt;

&lt;p&gt;Both runtimes serve &lt;code&gt;long_context&lt;/code&gt; (about 5K input tokens, 1K output) at c=1 in about 13 seconds end-to-end. llama.cpp measures 20 tok/s, vLLM 19 tok/s. &lt;strong&gt;Near parity&lt;/strong&gt; at this context size.&lt;/p&gt;

&lt;p&gt;At higher concurrency the story differs because we configured llama.cpp with &lt;code&gt;parallelSlots=1&lt;/code&gt; to give every request the full 65K context (required for the extreme pattern, see below). Concurrency c=16 and c=64 on llama.cpp show queue saturation: the harness sends 16 or 64 concurrent requests, but the server processes them serially. That's not a throughput measurement, it's a queue measurement. On production llama.cpp with &lt;code&gt;parallelSlots=16&lt;/code&gt; and a smaller per-request context, short-prompt throughput would match our earlier numbers, but then you can't serve 43K prompts.&lt;/p&gt;

&lt;h3&gt;
  
  
  Which brings us to the real test
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;long_context_extreme: a roughly 43,000-token prompt in, 1024 tokens out.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;vLLM, as configured here, can't attempt this.&lt;/strong&gt; Its &lt;code&gt;maxModelLen&lt;/code&gt; is 16K, set that way because 32K OOMs during graph capture on this hardware. A 43K-token request is rejected before it reaches inference. We did not explore &lt;code&gt;--swap-space&lt;/code&gt; CPU offload, which in principle could trade a lot of latency for more context; that's a follow-up. Out of the box on 2× 16 GB consumer cards with Qwen3.6-27B NVFP4, we did not find an in-memory configuration that serves 43K.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;llama.cpp plus TurboQuant served it.&lt;/strong&gt; One sample captured at c=16 end-to-end:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Prompt tokens: about 43,000&lt;/li&gt;
&lt;li&gt;Prefill time (TTFT): &lt;strong&gt;186 seconds&lt;/strong&gt; (3.1 min)&lt;/li&gt;
&lt;li&gt;Decode rate: &lt;strong&gt;171 ms/token&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Output: 1024 tokens in about 175 seconds&lt;/li&gt;
&lt;li&gt;Total wall time: about 6 minutes per request&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is not fast. It's not meant to be fast. What it is, is &lt;strong&gt;possible&lt;/strong&gt;. TurboQuant's roughly 3-bit KV cache makes the memory math work where FP16 or FP8 KV can't. On the same hardware, at the same moment, one runtime cannot attempt the workload and the other completes it.&lt;/p&gt;

&lt;p&gt;The higher-concurrency cells for this pattern hit our harness's 300s per-request timeout because decode plus prefill combined exceeds 300s. Bumping the harness timeout to 600s would capture all four c-levels cleanly; that's a follow-up. The c=1 and c=16 samples are enough to prove the capability.&lt;/p&gt;

&lt;h3&gt;
  
  
  The real tradeoff
&lt;/h3&gt;

&lt;p&gt;Throughput versus context is the tradeoff, not "vLLM is better" or "llama.cpp is better". On this hardware:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Production chat, interactive coding, short agentic loops&lt;/strong&gt; (≤ 8K context): &lt;strong&gt;vLLM.&lt;/strong&gt; 3 to 4× throughput, lower TTFT, better ITL stability.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Long-document review, RAG with full-file context, overnight batch agentic on 40K+ codebases&lt;/strong&gt; (&amp;gt; 16K context): &lt;strong&gt;llama.cpp plus TurboQuant.&lt;/strong&gt; Slower per token, but it's the only runtime that serves the workload at all.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For many real workloads the answer is "run both." vLLM for the chat endpoint, llama.cpp for the batch endpoint that processes whole PRs overnight.&lt;/p&gt;

&lt;h2&gt;
  
  
  7. What it costs
&lt;/h2&gt;

&lt;p&gt;Throughput numbers are interesting. Dollars per token are what actually get budgets approved.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://infercost.ai" rel="noopener noreferrer"&gt;InferCost&lt;/a&gt; is our companion tool: a Kubernetes operator that reads real-time GPU power draw from DCGM, combines it with hardware amortization and electricity rates declared on a &lt;code&gt;CostProfile&lt;/code&gt; CR, and computes the real cost of inference. It discovers inference pods by the &lt;code&gt;inference.llmkube.dev/model&lt;/code&gt; label LLMKube stamps on each Deployment, scrapes each pod's &lt;code&gt;/metrics&lt;/code&gt; endpoint directly (no Prometheus required), and writes cost attribution into a &lt;code&gt;UsageReport&lt;/code&gt; custom resource.&lt;/p&gt;

&lt;p&gt;Here's a live &lt;code&gt;UsageReport&lt;/code&gt; status from shadowstack, captured after a 10-minute mixed workload:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="s"&gt;$ kubectl -n bench get usagereport bench-window -o yaml&lt;/span&gt;
&lt;span class="nn"&gt;...&lt;/span&gt;
&lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;period&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2026-04-23"&lt;/span&gt;
  &lt;span class="na"&gt;periodStart&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2026-04-23T00:00:00Z"&lt;/span&gt;
  &lt;span class="na"&gt;periodEnd&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;   &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2026-04-23T21:21:42Z"&lt;/span&gt;
  &lt;span class="na"&gt;inputTokens&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  &lt;span class="m"&gt;638&lt;/span&gt;
  &lt;span class="na"&gt;outputTokens&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;12400&lt;/span&gt;
  &lt;span class="na"&gt;activeEnergyKWh&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;     &lt;span class="m"&gt;0.645&lt;/span&gt;
  &lt;span class="na"&gt;activeHoursInPeriod&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;4.53&lt;/span&gt;
  &lt;span class="na"&gt;totalHoursInPeriod&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  &lt;span class="m"&gt;21.36&lt;/span&gt;
  &lt;span class="na"&gt;utilizationPercent&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  &lt;span class="m"&gt;21.20&lt;/span&gt;
  &lt;span class="na"&gt;estimatedCostUSD&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;             &lt;span class="m"&gt;0.83&lt;/span&gt;
  &lt;span class="na"&gt;costPerMillionTokens&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;         &lt;span class="m"&gt;63.79&lt;/span&gt;   &lt;span class="c1"&gt;# amortized&lt;/span&gt;
  &lt;span class="na"&gt;marginalCostPerMillionTokens&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  &lt;span class="m"&gt;3.96&lt;/span&gt;   &lt;span class="c1"&gt;# electricity during active serving&lt;/span&gt;
  &lt;span class="na"&gt;byModel&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;     &lt;span class="s"&gt;qwen36-27b-llamacpp&lt;/span&gt;
    &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;bench&lt;/span&gt;
    &lt;span class="na"&gt;inputTokens&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  &lt;span class="m"&gt;638&lt;/span&gt;
    &lt;span class="na"&gt;outputTokens&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;12400&lt;/span&gt;
    &lt;span class="na"&gt;costPerMillionTokens&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;63.79&lt;/span&gt;
    &lt;span class="na"&gt;estimatedCostUSD&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.83&lt;/span&gt;
  &lt;span class="na"&gt;byNamespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;bench&lt;/span&gt;
    &lt;span class="na"&gt;tokenCount&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;13038&lt;/span&gt;
    &lt;span class="na"&gt;estimatedCostUSD&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.83&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The numbers look alarming at first: &lt;strong&gt;$63.79/MTok amortized&lt;/strong&gt; for a tiny workload against a day's worth of hardware amortization. That's the point. At 21.2% utilization over this window, amortized is &lt;strong&gt;16× higher than marginal&lt;/strong&gt;. Scale up the utilization and the amortized number drops toward the marginal one; that's what the bench window numbers below capture.&lt;/p&gt;

&lt;p&gt;The full bench window (Apr 23, 2026, 00:00 UTC → 10:07 UTC, ~10 hours), from &lt;code&gt;summary.csv&lt;/code&gt; cross-referenced with the &lt;code&gt;CostProfile&lt;/code&gt; spec:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Total input tokens&lt;/td&gt;
&lt;td&gt;2,518,242&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total output tokens&lt;/td&gt;
&lt;td&gt;1,233,143&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total tokens&lt;/td&gt;
&lt;td&gt;3,751,385&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Active GPU energy&lt;/td&gt;
&lt;td&gt;0.459 kWh&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Utilization (active hours / wall-clock hours)&lt;/td&gt;
&lt;td&gt;32.7%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total dollar cost (amortization + electricity)&lt;/td&gt;
&lt;td&gt;$0.50&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Hardware amortization on the &lt;code&gt;CostProfile&lt;/code&gt; spec: 2× RTX 5060 Ti at $480 each = $960, 3-year useful life, 5% annual maintenance. Electricity $0.08/kWh, PUE 1.0.&lt;/p&gt;
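
&lt;p&gt;As a sanity check, both headline numbers can be recomputed from those inputs plus the bench-window table above. This is an illustrative recomputation, not InferCost's code; it treats maintenance as 5% of the purchase price per year and leaves idle electricity out, which is why the amortized figure lands about a cent under the reported $0.13:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;package main

import "fmt"

func main() {
    // Inputs from the CostProfile spec and the bench-window table above.
    const (
        hardwareUSD     = 960.0     // 2× RTX 5060 Ti at $480 each
        lifeYears       = 3.0       // useful life
        maintFraction   = 0.05      // assumed: 5% of purchase price per year
        ratePerKWh      = 0.08      // electricity, PUE 1.0
        windowHours     = 10.12     // Apr 23 00:00 → 10:07 UTC
        activeEnergyKWh = 0.459     // active GPU energy over the window
        totalTokens     = 3751385.0 // input + output tokens over the window
    )

    // Amortization accrues every wall-clock hour, serving or not.
    amortPerHour := (hardwareUSD/lifeYears + hardwareUSD*maintFraction) / (365 * 24)

    activeElectricityUSD := activeEnergyKWh * ratePerKWh
    amortizedUSD := amortPerHour*windowHours + activeElectricityUSD // idle electricity omitted
    mtok := totalTokens / 1e6

    fmt.Printf("amortized: $%.2f over the window, $%.3f/MTok\n", amortizedUSD, amortizedUSD/mtok)
    fmt.Printf("marginal:  $%.3f over the window, $%.4f/MTok\n", activeElectricityUSD, activeElectricityUSD/mtok)
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;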

&lt;h3&gt;
  
  
  The two numbers
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;th&gt;Which question it answers&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;&lt;code&gt;costPerMillionTokens&lt;/code&gt;&lt;/strong&gt; (amortized)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$0.13&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;"What did my hardware cost per token I served today?"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;code&gt;marginalCostPerMillionTokens&lt;/code&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$0.010&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;"What did the electricity actually cost to generate those tokens?"&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Both numbers are correct. They answer different questions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Amortized $0.13/MTok&lt;/strong&gt; spreads the full cost of hardware ownership (amortization, idle electricity, active electricity) across whatever tokens you served today. It tells you the answer to "was today's inference worth what we paid for the hardware?" At 32.7% utilization, you're leaving about two-thirds of the compute capacity you already bought idle, and the amortized rate reflects that.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Marginal $0.010/MTok&lt;/strong&gt; includes only the electricity drawn during active serving. It answers "what did these specific tokens cost me beyond what I'd be paying anyway?", the relevant comparison when cloud APIs only bill marginally.&lt;/p&gt;

&lt;p&gt;The 13× gap between them is the entire FinOps conversation. At 100% utilization the two numbers converge; at low utilization they diverge by more than an order of magnitude. Neither is the "right" number. They describe different things.&lt;/p&gt;

&lt;h2&gt;
  
  
  8. Cloud comparison
&lt;/h2&gt;

&lt;p&gt;Cloud APIs bill marginally. That's how they work: no inference, no invoice. So the fair comparison against on-prem is &lt;strong&gt;marginal versus marginal&lt;/strong&gt;. Cloud prices below are &lt;strong&gt;output token pricing&lt;/strong&gt; on public pricing pages as of April 2026; check each provider for current rates and input-vs-output splits.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Provider / Model&lt;/th&gt;
&lt;th&gt;Output $/MTok&lt;/th&gt;
&lt;th&gt;On-prem ratio (marginal)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;shadowstack marginal&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$0.010&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1×&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OpenAI GPT-4o&lt;/td&gt;
&lt;td&gt;$10.00&lt;/td&gt;
&lt;td&gt;1,000× cheaper on-prem&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Google Gemini 2.5 Pro&lt;/td&gt;
&lt;td&gt;$10.00&lt;/td&gt;
&lt;td&gt;1,000× cheaper on-prem&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Anthropic Claude Opus 4.5&lt;/td&gt;
&lt;td&gt;$25.00&lt;/td&gt;
&lt;td&gt;2,500× cheaper on-prem&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Those ratios are almost offensive. They're also the upper bound — the &lt;strong&gt;ceiling of savings if you saturated this hardware&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The floor, at the bench window's 32.7% utilization (i.e., our actual mixed-workload cost over ten hours), uses the amortized number:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Provider / Model&lt;/th&gt;
&lt;th&gt;Output $/MTok&lt;/th&gt;
&lt;th&gt;On-prem ratio (amortized at 32.7%)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;shadowstack amortized&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$0.13&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1×&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OpenAI GPT-4o&lt;/td&gt;
&lt;td&gt;$10.00&lt;/td&gt;
&lt;td&gt;77× cheaper on-prem&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Google Gemini 2.5 Pro&lt;/td&gt;
&lt;td&gt;$10.00&lt;/td&gt;
&lt;td&gt;77× cheaper on-prem&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Anthropic Claude Opus 4.5&lt;/td&gt;
&lt;td&gt;$25.00&lt;/td&gt;
&lt;td&gt;192× cheaper on-prem&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Even the worst case, amortized cost at 32.7% utilization, is &lt;strong&gt;77× cheaper than GPT-4o or Gemini 2.5 Pro&lt;/strong&gt; on output tokens. Against Claude Opus 4.5 (Anthropic's flagship large-frontier model), on-prem is 192× cheaper dollar for dollar. Those ratios do narrow on a blended input-plus-output basis, but the direction doesn't change.&lt;/p&gt;

&lt;p&gt;For context on the hardware investment: $960 of GPUs pays for itself in Opus 4.5 output tokens at roughly &lt;strong&gt;38.4 million tokens of traffic&lt;/strong&gt;. At a modest 100K output tokens a day that's about a year; at 1M output tokens a day (a small agentic coding team), it's under six weeks. Against GPT-4o or Gemini 2.5 Pro the break-even point is 96M output tokens: ~2.6 years at 100K/day, ~3 months at 1M/day. Input tokens are cheaper on every cloud model, so a realistic blended workload stretches those numbers modestly, but not by an order of magnitude.&lt;/p&gt;
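
&lt;p&gt;The break-even arithmetic is simple enough to check yourself. A small sketch using the output-token prices from the tables above; it ignores the roughly $0.01/MTok of electricity, which doesn't move the answer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;package main

import "fmt"

func main() {
    const hardwareUSD = 960.0 // the two-GPU street price

    clouds := []struct {
        name          string
        outputPerMTok float64
    }{
        {"GPT-4o / Gemini 2.5 Pro", 10.00},
        {"Claude Opus 4.5", 25.00},
    }

    for _, c := range clouds {
        breakEvenMTok := hardwareUSD / c.outputPerMTok
        fmt.Printf("%s: hardware paid off after %.1fM output tokens\n", c.name, breakEvenMTok)
        for _, perDay := range []float64{100_000, 1_000_000} {
            fmt.Printf("  at %.0f output tokens/day: %.0f days\n", perDay, breakEvenMTok*1e6/perDay)
        }
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;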

&lt;p&gt;This math is why enterprises with serious inference budgets are re-examining on-prem. It's not about paranoia or data residency (though those help). It's that the marginal economics on modern consumer GPUs, with the right runtime, genuinely work.&lt;/p&gt;

&lt;h2&gt;
  
  
  9. Reproduce it yourself
&lt;/h2&gt;

&lt;p&gt;Everything is in the public repo: &lt;a href="https://github.com/defilantech/llmkube-bench" rel="noopener noreferrer"&gt;github.com/defilantech/llmkube-bench&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Requires: K8s cluster with LLMKube v0.7+, 2× NVIDIA 16+ GB, DCGM exporter,&lt;/span&gt;
&lt;span class="c"&gt;# hf-token Secret in the bench namespace.&lt;/span&gt;
git clone https://github.com/defilantech/llmkube-bench.git
&lt;span class="nb"&gt;cd &lt;/span&gt;llmkube-bench
make &lt;span class="nb"&gt;install&lt;/span&gt;                                      &lt;span class="c"&gt;# Python deps via uv&lt;/span&gt;
make bench &lt;span class="nv"&gt;RESULTS_DIR&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;results/&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; +%F&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="nt"&gt;-myhw&lt;/span&gt;   &lt;span class="c"&gt;# ~3-4 hours for full matrix&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's the workstation path. The bench also runs &lt;strong&gt;fully in-cluster&lt;/strong&gt; — a Kaniko Job builds the harness image, a bench-runner Job with a scoped ServiceAccount orchestrates the runtime swaps, results land on a hostPath volume. See &lt;code&gt;manifests/bench-runner/README.md&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Every number in this post traces to a row in &lt;code&gt;results/2026-04-23-shadowstack/summary.csv&lt;/code&gt;. Every manifest, every image digest, every Prometheus snapshot is committed.&lt;/p&gt;

&lt;h2&gt;
  
  
  10. What's next
&lt;/h2&gt;

&lt;p&gt;A few things we'd do differently on the next bench:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Raise the harness per-request timeout&lt;/strong&gt; from 300s to 600s so &lt;code&gt;long_context_extreme&lt;/code&gt; at higher concurrencies captures cleanly. The one sample we got is defensible; four clean samples would be better.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test with Qwen's own FP4 release&lt;/strong&gt; once they ship one. The &lt;code&gt;sakamakismile&lt;/code&gt; community NVFP4 has been solid for the throughput measurements, but an official Qwen FP4 would remove a variable from the methodology.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-node llama.cpp&lt;/strong&gt; would close the long-context throughput gap. Splitting layers across 4 GPUs instead of 2 gives per-shard VRAM headroom for higher &lt;code&gt;--parallel&lt;/code&gt; settings and cuts the TurboQuant prefill time roughly in half.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But the big-picture answer is already here. On $800 of consumer GPUs, you can serve the same day's flagship open-source model, either at throughput whose marginal cost undercuts cloud APIs by orders of magnitude or at context lengths this hardware class had no business reaching. And InferCost shows you the honest dollar math instead of the misleading single-number dashboards you'd get from every "AI observability" tool on the market.&lt;/p&gt;

&lt;p&gt;If you want to follow along:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/defilantech/llmkube" rel="noopener noreferrer"&gt;github.com/defilantech/llmkube&lt;/a&gt; — the Kubernetes operator running both runtimes in this bench&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/defilantech/infercost" rel="noopener noreferrer"&gt;github.com/defilantech/infercost&lt;/a&gt; — the cost attribution controller producing the $/MTok numbers&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/defilantech/llmkube-bench" rel="noopener noreferrer"&gt;github.com/defilantech/llmkube-bench&lt;/a&gt; — the full reproducible bench&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://twitter.com/defilan" rel="noopener noreferrer"&gt;@defilan on X&lt;/a&gt; — where the threads go&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If this was useful, star the repos. If it was wrong about something, open an issue; the goal is accurate numbers, not winning arguments.&lt;/p&gt;

&lt;p&gt;— Chris&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>ai</category>
      <category>llm</category>
      <category>opensource</category>
    </item>
    <item>
      <title>LLMKube Now Deploys Any Inference Engine, Not Just llama.cpp</title>
      <dc:creator>Christopher Maher</dc:creator>
      <pubDate>Wed, 08 Apr 2026 01:03:15 +0000</pubDate>
      <link>https://dev.to/defilan/llmkube-now-deploys-any-inference-engine-not-just-llamacpp-fpm</link>
      <guid>https://dev.to/defilan/llmkube-now-deploys-any-inference-engine-not-just-llamacpp-fpm</guid>
      <description>&lt;p&gt;LLMKube started as a Kubernetes operator for llama.cpp. You define a Model, define an InferenceService, and the controller handles GPU scheduling, health probes, model downloads, and Prometheus metrics. It works well for GGUF models.&lt;/p&gt;

&lt;p&gt;But llama.cpp isn't the only inference engine. vLLM has PagedAttention. TGI has continuous batching. PersonaPlex does real-time voice AI. Triton serves multi-framework models. Locking the operator to one runtime limits what you can deploy.&lt;/p&gt;

&lt;p&gt;v0.6.0 changes that with pluggable runtime backends.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;Before v0.6.0, the controller's &lt;code&gt;constructDeployment()&lt;/code&gt; was hardcoded to llama.cpp. Container name, image, command-line args, health probes, model provisioning, everything assumed llama.cpp. If you wanted to deploy vLLM, you had to create a manual Kubernetes Deployment outside of LLMKube.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Fix
&lt;/h2&gt;

&lt;p&gt;A &lt;code&gt;RuntimeBackend&lt;/code&gt; interface that each inference engine implements:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;RuntimeBackend&lt;/span&gt; &lt;span class="k"&gt;interface&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;ContainerName&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;
    &lt;span class="n"&gt;DefaultImage&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;
    &lt;span class="n"&gt;DefaultPort&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="kt"&gt;int32&lt;/span&gt;
    &lt;span class="n"&gt;BuildArgs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;isvc&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;modelPath&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;port&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;
    &lt;span class="n"&gt;BuildProbes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;port&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;startup&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;liveness&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;readiness&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;NeedsModelInit&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="kt"&gt;bool&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The controller calls &lt;code&gt;resolveBackend(isvc)&lt;/code&gt; based on the &lt;code&gt;runtime&lt;/code&gt; field in the CRD, then delegates all container configuration to the backend. llama.cpp is the default. New runtimes register in a simple switch statement.&lt;/p&gt;
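
&lt;p&gt;A minimal sketch of what that registration looks like. The concrete backend type names and container names are illustrative rather than the exact ones in the repo, the interface is trimmed to one method, and the real &lt;code&gt;resolveBackend&lt;/code&gt; takes the InferenceService object rather than a bare string:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;package controller

// Trimmed for the sketch; the full RuntimeBackend interface is shown above.
type RuntimeBackend interface {
    ContainerName() string
}

// Illustrative backend types; the real ones implement the full interface.
type llamaCppBackend struct{}
type vllmBackend struct{}
type personaPlexBackend struct{}

func (llamaCppBackend) ContainerName() string    { return "llama-server" }
func (vllmBackend) ContainerName() string        { return "vllm" }
func (personaPlexBackend) ContainerName() string { return "personaplex" }

// resolveBackend picks a backend from the CRD's runtime field.
// An empty runtime falls through to llama.cpp, the historical default.
func resolveBackend(runtime string) RuntimeBackend {
    switch runtime {
    case "vllm":
        return vllmBackend{}
    case "personaplex":
        return personaPlexBackend{}
    default:
        return llamaCppBackend{}
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;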

&lt;h2&gt;
  
  
  Testing It: PersonaPlex on Kubernetes
&lt;/h2&gt;

&lt;p&gt;To prove the architecture works, I deployed NVIDIA's PersonaPlex on my home lab. PersonaPlex is a 7B speech-to-speech model based on Moshi. It listens and talks at the same time. Sub-300ms latency for interruptions. Completely different from llama.cpp: PyTorch runtime, WebSocket-based health checks, model downloaded via HuggingFace token.&lt;/p&gt;

&lt;p&gt;The InferenceService CRD:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;inference.llmkube.dev/v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;InferenceService&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;personaplex&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;voice-ai&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;modelRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;personaplex-7b&lt;/span&gt;
  &lt;span class="na"&gt;runtime&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;personaplex&lt;/span&gt;
  &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;registry.defilan.net/personaplex:7b-v1-4bit-cuda13&lt;/span&gt;
  &lt;span class="na"&gt;personaPlexConfig&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;quantize4Bit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="na"&gt;hfTokenSecretRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;hf-token&lt;/span&gt;
      &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;HF_TOKEN&lt;/span&gt;
  &lt;span class="na"&gt;endpoint&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8998&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;NodePort&lt;/span&gt;
  &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;gpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
    &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;32Gi"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;kubectl apply&lt;/code&gt; and it's running. The controller:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sets the container command to &lt;code&gt;python -m moshi.server&lt;/code&gt; (via the PersonaPlex backend's &lt;code&gt;CommandBuilder&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Configures TCP socket probes on port 8998 (PersonaPlex uses WebSockets, not HTTP /health)&lt;/li&gt;
&lt;li&gt;Injects &lt;code&gt;HF_TOKEN&lt;/code&gt; from a Kubernetes Secret and &lt;code&gt;NO_TORCH_COMPILE&lt;/code&gt; env var&lt;/li&gt;
&lt;li&gt;Skips the model download init container (model downloads at startup via HF Hub)&lt;/li&gt;
&lt;li&gt;Requests 1 GPU with 32Gi memory&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The result: real-time voice conversation running on a single RTX 5060 Ti, managed by the same operator that handles my llama.cpp text inference.&lt;/p&gt;
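
&lt;p&gt;If you want to sanity-check what the controller wired up, the rendered Deployment shows the TCP probe and env injection described above. A quick look (the Deployment name here is my assumption that the operator names it after the InferenceService):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Probe config: should show a tcpSocket check on 8998, not an HTTP GET
kubectl -n voice-ai get deployment personaplex -o yaml | grep -A3 tcpSocket

# Env wiring: HF_TOKEN (from the Secret) and NO_TORCH_COMPILE should both be present
kubectl -n voice-ai get deployment personaplex \
  -o jsonpath='{.spec.template.spec.containers[0].env[*].name}'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
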

&lt;h2&gt;
  
  
  Built-in vLLM Runtime
&lt;/h2&gt;

&lt;p&gt;vLLM is probably the most requested inference engine in the Kubernetes ecosystem. v0.6.0 ships it as a first-class runtime with typed &lt;code&gt;VLLMConfig&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;inference.llmkube.dev/v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;InferenceService&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;vllm-tinyllama&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;modelRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;tinyllama-1b&lt;/span&gt;
  &lt;span class="na"&gt;runtime&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;vllm&lt;/span&gt;
  &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;vllm/vllm-openai:cu130-nightly&lt;/span&gt;
  &lt;span class="na"&gt;skipModelInit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;vllmConfig&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;maxModelLen&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2048&lt;/span&gt;
    &lt;span class="na"&gt;dtype&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;float16&lt;/span&gt;
    &lt;span class="na"&gt;hfTokenSecretRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;hf-token&lt;/span&gt;
      &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;HF_TOKEN&lt;/span&gt;
  &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;gpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
    &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8Gi"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The controller generates the right args (&lt;code&gt;--model&lt;/code&gt;, &lt;code&gt;--tensor-parallel-size&lt;/code&gt;, &lt;code&gt;--max-model-len&lt;/code&gt;, &lt;code&gt;--quantization&lt;/code&gt;, &lt;code&gt;--dtype&lt;/code&gt;), configures HTTP &lt;code&gt;/health&lt;/code&gt; probes on port 8000, and injects HF_TOKEN from a Secret. I tested this on my cluster with TinyLlama-1.1B and got a working OpenAI-compatible endpoint in under two minutes.&lt;/p&gt;
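
&lt;p&gt;Poking at it is just the standard OpenAI API dance. The Service name below assumes the operator names it after the InferenceService (check &lt;code&gt;kubectl get svc&lt;/code&gt; for the real one); vLLM's OpenAI server listens on port 8000:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl port-forward svc/vllm-tinyllama 8000:8000   # leave running in one terminal

# in another terminal: list the served model id, then send a chat request
curl -s http://localhost:8000/v1/models | jq -r '.data[].id'

# use whatever id /v1/models reported as the "model" value
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "tinyllama-1b", "messages": [{"role": "user", "content": "Say hello"}], "max_tokens": 32}' \
  | jq -r '.choices[0].message.content'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
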

&lt;h2&gt;
  
  
  Built-in TGI Runtime
&lt;/h2&gt;

&lt;p&gt;HuggingFace's Text Generation Inference also ships as a built-in runtime. TGI downloads models directly from HuggingFace Hub, so &lt;code&gt;skipModelInit&lt;/code&gt; isn't even needed. The &lt;code&gt;TGIConfig&lt;/code&gt; supports quantization methods (bitsandbytes, gptq, awq, eetq), max token limits, and dtype.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Generic Runtime
&lt;/h2&gt;

&lt;p&gt;Not every inference engine needs first-class support. The &lt;code&gt;generic&lt;/code&gt; runtime lets you deploy any container:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;runtime&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;generic&lt;/span&gt;
  &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-custom-server:latest&lt;/span&gt;
  &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/app/serve"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--port"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8080"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="na"&gt;containerPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8080&lt;/span&gt;
  &lt;span class="na"&gt;skipModelInit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;probeOverrides&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;startup&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;tcpSocket&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8080&lt;/span&gt;
      &lt;span class="na"&gt;failureThreshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;60&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You provide the image, args, probes, and env. The controller handles GPU scheduling, service creation, and lifecycle management.&lt;/p&gt;

&lt;h2&gt;
  
  
  Per-Runtime Autoscaling
&lt;/h2&gt;

&lt;p&gt;Each runtime defines its default HPA metric via the &lt;code&gt;HPAMetricProvider&lt;/code&gt; interface. When you enable autoscaling without specifying a metric, the controller picks the right one for your runtime:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;llama.cpp: &lt;code&gt;llamacpp:requests_processing&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;vLLM: &lt;code&gt;vllm:num_requests_running&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;TGI: &lt;code&gt;tgi:queue_size&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No more hardcoded metric names.&lt;/p&gt;
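
&lt;p&gt;If you want to see which metric the controller actually picked for a given service, the generated HPA spells it out (assuming your cluster is on the autoscaling/v2 API):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Print the pods metric each HPA scales on
kubectl get hpa -A \
  -o custom-columns='NAME:.metadata.name,METRIC:.spec.metrics[0].pods.metric.name'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
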

&lt;h2&gt;
  
  
  Adding Your Own Runtime
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;docs/adding-a-runtime.md&lt;/code&gt; documents the full process: implement the &lt;code&gt;RuntimeBackend&lt;/code&gt; interface, optionally add &lt;code&gt;CommandBuilder&lt;/code&gt;, &lt;code&gt;EnvBuilder&lt;/code&gt;, or &lt;code&gt;HPAMetricProvider&lt;/code&gt;, register in the switch statement, add your CRD config struct, and run &lt;code&gt;make manifests generate&lt;/code&gt;. The pattern is established with five working examples.&lt;/p&gt;

&lt;h2&gt;
  
  
  Everything Else in v0.6.0
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;CUDA 13 default image for RTX 50-series and Qwen3.5 support&lt;/li&gt;
&lt;li&gt;Custom GPU layer splits for multi-GPU sharding&lt;/li&gt;
&lt;li&gt;Helm image registry/repository separation for air-gapped deployments&lt;/li&gt;
&lt;li&gt;Grafana inference metrics dashboard (tokens/sec, queue depth, KV cache, reconcile health)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;imagePullSecrets&lt;/code&gt; on InferenceService for private registries&lt;/li&gt;
&lt;li&gt;HPA autoscaling for InferenceService&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;Triton Inference Server and Ollama as built-in runtimes. Better Model controller support for non-GGUF formats (HuggingFace repo IDs as sources). And potentially Kubernetes-native voice AI pipelines combining PersonaPlex with LLMKube-managed reasoning models.&lt;/p&gt;

&lt;p&gt;GitHub: &lt;a href="https://github.com/defilantech/llmkube" rel="noopener noreferrer"&gt;https://github.com/defilantech/llmkube&lt;/a&gt;&lt;/p&gt;

</description>
      <category>llm</category>
      <category>opensource</category>
      <category>ai</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>I tested speculative decoding on my home GPU cluster. Here's why it didn't help.</title>
      <dc:creator>Christopher Maher</dc:creator>
      <pubDate>Mon, 06 Apr 2026 03:51:51 +0000</pubDate>
      <link>https://dev.to/defilan/i-tested-speculative-decoding-on-my-home-gpu-cluster-heres-why-it-didnt-help-3ej6</link>
      <guid>https://dev.to/defilan/i-tested-speculative-decoding-on-my-home-gpu-cluster-heres-why-it-didnt-help-3ej6</guid>
      <description>&lt;p&gt;I spent Saturday night testing n-gram speculative decoding on consumer GPUs. The claim: speculative decoding can speed up LLM inference by 2-3x by predicting future tokens and verifying them in parallel.&lt;/p&gt;

&lt;p&gt;I wanted to see if that holds up on real hardware running diverse workloads. For the most part, it doesn't. But the journey was worth it, and I caught a benchmarking pitfall that I think a lot of people are falling into.&lt;/p&gt;

&lt;h2&gt;
  
  
  The setup
&lt;/h2&gt;

&lt;p&gt;My home lab runs Kubernetes on a machine called ShadowStack. Two NVIDIA RTX 5060 Ti GPUs (16GB VRAM each, 32GB total). I use LLMKube, an open source K8s operator I built, to manage LLM inference workloads with llama.cpp.&lt;/p&gt;

&lt;p&gt;For this test I deployed two models:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Gemma 4 26B-A4B&lt;/strong&gt;: Google's Mixture of Experts model. 26B total params but only ~4B active per token. Runs at 88 tok/s on my setup.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Qwen3-32B&lt;/strong&gt;: A dense 32B model. All parameters active per token. Runs at 20 tok/s.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both running Q4_K_M quantization, flash attention enabled, 8K context, split across both GPUs.&lt;/p&gt;

&lt;p&gt;Quick note on why the MoE model is so much faster: Gemma 4 only activates a fraction of its parameters per token, so there's way less weight data to read from VRAM on each forward pass. MoE routing overhead eats into some of that advantage, but it's still a huge win on bandwidth-constrained hardware.&lt;/p&gt;
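
&lt;p&gt;A rough back-of-envelope makes the gap concrete. These are my illustrative numbers, not measurements: assume Q4_K_M lands around 0.6 bytes per parameter and that every generated token streams the active weights from VRAM:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# bytes read per token, dense 32B vs ~4B-active MoE (very rough)
awk 'BEGIN {
  dense = 32e9 * 0.6 / 1e9;  moe = 4e9 * 0.6 / 1e9
  printf "dense: %.1f GB/token   moe: %.1f GB/token   ratio: %.1fx\n", dense, moe, dense / moe
}'
# ~8x fewer bytes per token for the MoE; the observed gap is 88/20 = 4.4x,
# with the rest plausibly eaten by routing overhead and work that does not scale with weights
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
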

&lt;h2&gt;
  
  
  What I tested
&lt;/h2&gt;

&lt;p&gt;llama.cpp has built-in n-gram speculative decoding. No draft model needed, you just pass a few flags:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nt"&gt;--spec-type&lt;/span&gt; ngram-mod
&lt;span class="nt"&gt;--draft-max&lt;/span&gt; 64
&lt;span class="nt"&gt;--draft-min&lt;/span&gt; 48
&lt;span class="nt"&gt;--spec-ngram-size-n&lt;/span&gt; 24
&lt;span class="nt"&gt;--spec-ngram-size-m&lt;/span&gt; 48
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;How it works: llama.cpp builds an n-gram lookup table from the recent context (both the input prompt and generated output so far). When it spots a pattern it's seen before, it speculatively drafts the next several tokens and verifies them in a single forward pass. If the predictions are right, you get multiple tokens for the cost of one.&lt;/p&gt;

&lt;p&gt;Important: this is specifically n-gram speculative decoding, not draft-model approaches like EAGLE-3 or Medusa. Those use a separate trained model to generate speculations. N-gram lookup is simpler and doesn't require any extra model files.&lt;/p&gt;

&lt;p&gt;With LLMKube, switching between configs is just updating the &lt;code&gt;extraArgs&lt;/code&gt; field in the InferenceService CRD and letting the operator restart the pod:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;modelRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gemma4-26b-a4b&lt;/span&gt;
  &lt;span class="na"&gt;extraArgs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--spec-type"&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ngram-mod"&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--draft-max"&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;64"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I tested two variants: &lt;code&gt;ngram-simple&lt;/code&gt; (basic lookup) and &lt;code&gt;ngram-mod&lt;/code&gt; (the variant recommended for MoE models in the llama.cpp docs).&lt;/p&gt;

&lt;h2&gt;
  
  
  The result that fooled me
&lt;/h2&gt;

&lt;p&gt;My first test ran the same prompt 10 times in a row. The numbers looked incredible:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Run&lt;/th&gt;
&lt;th&gt;tok/s&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1 (cold)&lt;/td&gt;
&lt;td&gt;88.3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;105.7&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;112.4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;186.4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;336.5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;419.5&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Almost 5x speedup by run 10. I was ready to write a very different article.&lt;/p&gt;

&lt;p&gt;Then I ran 8 different prompts. Code generation, API design, Go functions, bash scripts, technical explanations. Real variety.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Prompt&lt;/th&gt;
&lt;th&gt;Baseline (tok/s)&lt;/th&gt;
&lt;th&gt;+ ngram-mod (tok/s)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;BST implementation&lt;/td&gt;
&lt;td&gt;88.3&lt;/td&gt;
&lt;td&gt;94.2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;K8s operator explanation&lt;/td&gt;
&lt;td&gt;88.3&lt;/td&gt;
&lt;td&gt;88.3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPU monitoring script&lt;/td&gt;
&lt;td&gt;88.3&lt;/td&gt;
&lt;td&gt;87.6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;REST API design&lt;/td&gt;
&lt;td&gt;88.3&lt;/td&gt;
&lt;td&gt;88.2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GGUF parser in Go&lt;/td&gt;
&lt;td&gt;88.3&lt;/td&gt;
&lt;td&gt;88.2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Parallelism explainer&lt;/td&gt;
&lt;td&gt;88.3&lt;/td&gt;
&lt;td&gt;88.1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Benchmark script&lt;/td&gt;
&lt;td&gt;88.2&lt;/td&gt;
&lt;td&gt;88.2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Helm chart design&lt;/td&gt;
&lt;td&gt;88.1&lt;/td&gt;
&lt;td&gt;88.2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Median&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;88.3&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;88.2&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Zero improvement. The 419 tok/s "speedup" was the n-gram cache memorizing repeated output patterns. With diverse prompts, there's nothing useful to cache.&lt;/p&gt;

&lt;h2&gt;
  
  
  Same story on the dense model
&lt;/h2&gt;

&lt;p&gt;Qwen3-32B showed the same pattern. 20.4 tok/s baseline, 20.6 tok/s with ngram-simple. Within measurement noise.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Baseline&lt;/th&gt;
&lt;th&gt;+ ngram-simple&lt;/th&gt;
&lt;th&gt;+ ngram-mod&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Gemma 4 26B&lt;/td&gt;
&lt;td&gt;MoE&lt;/td&gt;
&lt;td&gt;88.3&lt;/td&gt;
&lt;td&gt;87.2 (-1.2%)&lt;/td&gt;
&lt;td&gt;88.2 (0%)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen3-32B&lt;/td&gt;
&lt;td&gt;Dense&lt;/td&gt;
&lt;td&gt;20.4&lt;/td&gt;
&lt;td&gt;20.6 (+1%)&lt;/td&gt;
&lt;td&gt;not tested&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Why it doesn't help on these GPUs
&lt;/h2&gt;

&lt;p&gt;The bottleneck on RTX 5060 Ti is memory bandwidth, not compute. Every token requires reading model weights from VRAM. Speculative decoding tries to batch multiple verification steps together, but when you're already saturating the memory bus during single-token generation, there's not enough idle compute for the speculative verification to pay for itself.&lt;/p&gt;

&lt;p&gt;This is different from high-end datacenter GPUs (A100, H100) where the compute-to-memory bandwidth ratio is much higher. An H100 has roughly 3,350 GB/s memory bandwidth but nearly 2,000 TFLOPS of FP16 compute. That ratio means there's genuine idle compute at small batch sizes that speculative decoding can exploit. Consumer GPUs don't have that same headroom.&lt;/p&gt;
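
&lt;p&gt;Expressed as a ratio, using the H100 numbers above (a sketch of the intuition, not a proper roofline analysis):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# FLOPs available per byte streamed from HBM
awk 'BEGIN { printf "H100: ~%.0f FLOPs per byte\n", 2000e12 / 3350e9 }'
# ~600 FLOPs per byte of weights read means verifying a small batch of drafted
# tokens is close to free. On a card with a lower compute-to-bandwidth ratio,
# that verification work competes with the memory bus the decode loop is already saturating.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
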

&lt;p&gt;For MoE models specifically, there's an additional wrinkle. Each speculative token in a verification batch may activate different experts, which means more expert weight blocks need to be read. This reduces the batching advantage that speculative decoding relies on in dense models, where weight reads stay roughly constant regardless of batch size.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Caveat:&lt;/strong&gt; there are scenarios where n-gram spec decoding can help even on consumer hardware. If your model is partially CPU-offloaded (doesn't fit in VRAM), the PCIe bandwidth bottleneck is severe enough that speculative batching can provide real gains. And for highly repetitive or templated outputs (think structured JSON, boilerplate code), the n-gram cache hit rate goes way up. My testing focused on single-user inference with fully VRAM-resident models and diverse prompts.&lt;/p&gt;

&lt;h2&gt;
  
  
  What about EAGLE-3?
&lt;/h2&gt;

&lt;p&gt;I originally wanted to test EAGLE-3, which uses a trained draft head instead of n-gram lookup. Three problems:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;No EAGLE-3 draft model exists for Gemma 4 (no one has trained one)&lt;/li&gt;
&lt;li&gt;The llama.cpp EAGLE-3 PR (#18039) is still open and in draft as of April 5, 2026&lt;/li&gt;
&lt;li&gt;The PR's own benchmarks show MoE models getting roughly 0.89-1.06x on certain prompts, with some actually slower due to the expert activation overhead during batch verification&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Even with a trained draft head, the fundamental bandwidth constraint on consumer GPUs would remain.&lt;/p&gt;

&lt;h2&gt;
  
  
  What actually helps on consumer GPUs
&lt;/h2&gt;

&lt;p&gt;If you're running local LLMs on consumer hardware, here's what actually moves the needle:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Flash attention&lt;/strong&gt;: Already standard, significant memory savings&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;KV cache quantization&lt;/strong&gt;: q4_0 or q8_0 reduces cache memory pressure without meaningful quality loss&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MoE over dense&lt;/strong&gt;: Gemma 4 activates ~4B parameters per token vs Qwen3-32B's 32B. That's the primary driver of the throughput difference, though MoE routing overhead means the speedup isn't a clean 8x ratio.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-GPU split&lt;/strong&gt;: Doubles your available memory bandwidth, which is the actual bottleneck&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context size tuning&lt;/strong&gt;: Smaller context = less KV cache = more VRAM headroom&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The benchmarking lesson
&lt;/h2&gt;

&lt;p&gt;The biggest takeaway wasn't about speculative decoding. It was about benchmark methodology.&lt;/p&gt;

&lt;p&gt;If I'd only tested with repeated prompts, I would have reported a 4.75x speedup and been completely wrong. The n-gram cache is doing something real, but only in a narrow scenario where outputs are highly repetitive or templated. For interactive chat, coding assistance, or any workload with diverse inputs, it provides no benefit on this hardware.&lt;/p&gt;

&lt;p&gt;Be skeptical of speculative decoding benchmarks that don't disclose their prompt diversity. And if you see someone reporting huge n-gram gains, check if they're running the same prompt over and over.&lt;/p&gt;
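
&lt;p&gt;If you want to reproduce the diverse-prompt run, a minimal sketch of the loop looks like this. Endpoint, model name, and the prompt list are placeholders for whatever your deployment exposes; throughput comes from the OpenAI-style &lt;code&gt;usage.completion_tokens&lt;/code&gt; field and wall-clock time:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;#!/usr/bin/env bash
# One pass per *distinct* prompt, tok/s per request. No repeated prompts.
ENDPOINT="${ENDPOINT:-http://localhost:8080/v1/chat/completions}"
prompts=(
  "Implement a binary search tree in Python with insert and delete."
  "Explain how a Kubernetes operator reconcile loop works."
  "Write a bash script that alerts when GPU memory passes 90 percent."
  "Design a REST API for a todo service."
)
for p in "${prompts[@]}"; do
  start=$(date +%s.%N)
  resp=$(curl -s "$ENDPOINT" -H "Content-Type: application/json" \
    -d "{\"model\":\"default\",\"messages\":[{\"role\":\"user\",\"content\":\"$p\"}],\"max_tokens\":512}")
  end=$(date +%s.%N)
  tokens=$(echo "$resp" | jq '.usage.completion_tokens')
  awk -v t="$tokens" -v s="$start" -v e="$end" 'BEGIN { printf "%6.1f tok/s\n", t / (e - s) }'
done
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
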

&lt;h2&gt;
  
  
  Try it yourself
&lt;/h2&gt;

&lt;p&gt;Everything I tested runs on Kubernetes via LLMKube. The InferenceService CRD's &lt;code&gt;extraArgs&lt;/code&gt; field makes it trivial to swap between configs without touching your deployment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;inference.llmkube.dev/v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;InferenceService&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gemma4-spec-bench&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;modelRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gemma4-26b-a4b&lt;/span&gt;
  &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ghcr.io/ggml-org/llama.cpp:server-cuda&lt;/span&gt;
  &lt;span class="na"&gt;contextSize&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8192&lt;/span&gt;
  &lt;span class="na"&gt;flashAttention&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;extraArgs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--spec-type"&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ngram-mod"&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--draft-max"&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;64"&lt;/span&gt;
  &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;gpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;LLMKube is open source, Apache 2.0: &lt;a href="https://github.com/defilantech/llmkube" rel="noopener noreferrer"&gt;github.com/defilantech/llmkube&lt;/a&gt;&lt;/p&gt;

</description>
      <category>llm</category>
      <category>kubernetes</category>
      <category>gpu</category>
      <category>ai</category>
    </item>
    <item>
      <title>Google Released Gemma 4 Yesterday. I Had It Fixing Real Bugs by Lunch.</title>
      <dc:creator>Christopher Maher</dc:creator>
      <pubDate>Fri, 03 Apr 2026 16:34:48 +0000</pubDate>
      <link>https://dev.to/defilan/google-released-gemma-4-yesterday-i-had-it-fixing-real-bugs-by-lunch-cp0</link>
      <guid>https://dev.to/defilan/google-released-gemma-4-yesterday-i-had-it-fixing-real-bugs-by-lunch-cp0</guid>
      <description>&lt;p&gt;Google released Gemma 4 yesterday. By the time I went to bed, I had it deployed on my home lab, running real coding benchmarks at 96 tokens per second.&lt;/p&gt;

&lt;p&gt;The catch: no official llama.cpp image supported the &lt;code&gt;gemma4&lt;/code&gt; architecture yet. The stock CUDA images crash with &lt;code&gt;unknown model architecture: 'gemma4'&lt;/code&gt;. So I built it from source, on the same Kubernetes cluster that serves inference.&lt;/p&gt;

&lt;p&gt;This post is about what it took to go from "model dropped" to "running in production" in about two hours on consumer hardware.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Setup
&lt;/h2&gt;

&lt;p&gt;My home inference server (I call it ShadowStack):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;2x NVIDIA RTX 5060 Ti (16GB each, 32GB total VRAM)&lt;/li&gt;
&lt;li&gt;AMD Ryzen 9 7900X, 64GB DDR5&lt;/li&gt;
&lt;li&gt;Ubuntu 24.04, MicroK8s&lt;/li&gt;
&lt;li&gt;NVIDIA driver 590.48.01 (CUDA 13.1)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Everything is managed by &lt;a href="https://github.com/defilantech/llmkube" rel="noopener noreferrer"&gt;LLMKube&lt;/a&gt;, a Kubernetes operator I built for running llama.cpp inference. One CRD to define the model, one CRD to define the service, the operator handles the rest.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1: The Architecture Problem
&lt;/h2&gt;

&lt;p&gt;First attempt, I tried the &lt;code&gt;server-cuda13&lt;/code&gt; image (CUDA 13 build of llama.cpp):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;llama_model_load: error loading model: error loading model architecture: unknown model architecture: 'gemma4'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Gemma 4 architecture hadn't shipped in any released llama.cpp build yet. The support was only in HEAD.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 2: Build From HEAD On-Cluster
&lt;/h2&gt;

&lt;p&gt;I have a Kaniko build pipeline on the cluster from a previous project (TurboQuant benchmarking). I wrote a Dockerfile that clones llama.cpp HEAD and builds with CUDA targeting SM 86 (Ampere) and SM 120 (Blackwell):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;nvidia/cuda:12.8.0-devel-ubuntu24.04&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;builder&lt;/span&gt;

&lt;span class="k"&gt;RUN &lt;/span&gt;git clone &lt;span class="nt"&gt;--depth&lt;/span&gt; 1 https://github.com/ggml-org/llama.cpp.git /build/llama.cpp

&lt;span class="k"&gt;WORKDIR&lt;/span&gt;&lt;span class="s"&gt; /build/llama.cpp&lt;/span&gt;

&lt;span class="k"&gt;RUN &lt;/span&gt;&lt;span class="nb"&gt;ln&lt;/span&gt; &lt;span class="nt"&gt;-s&lt;/span&gt; /usr/local/cuda/lib64/stubs/libcuda.so &lt;span class="se"&gt;\
&lt;/span&gt;          /usr/local/cuda/lib64/stubs/libcuda.so.1
&lt;span class="k"&gt;ENV&lt;/span&gt;&lt;span class="s"&gt; LIBRARY_PATH=/usr/local/cuda/lib64/stubs:${LIBRARY_PATH}&lt;/span&gt;

&lt;span class="k"&gt;RUN &lt;/span&gt;cmake &lt;span class="nt"&gt;-B&lt;/span&gt; build &lt;span class="se"&gt;\
&lt;/span&gt;    &lt;span class="nt"&gt;-DGGML_CUDA&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ON &lt;span class="se"&gt;\
&lt;/span&gt;    &lt;span class="nt"&gt;-DCMAKE_CUDA_ARCHITECTURES&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"86;120"&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;    &lt;span class="nt"&gt;-DCMAKE_BUILD_TYPE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;Release &lt;span class="se"&gt;\
&lt;/span&gt;    &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; cmake &lt;span class="nt"&gt;--build&lt;/span&gt; build &lt;span class="nt"&gt;--target&lt;/span&gt; llama-server &lt;span class="nt"&gt;-j&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;nproc&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A Kaniko Job on the cluster built this in about 15 minutes and pushed it to my local container registry. The same cluster that runs inference also builds its own inference server. No external CI needed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 3: Deploy
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;llmkube deploy gemma4-26b &lt;span class="nt"&gt;--gpu&lt;/span&gt; &lt;span class="nt"&gt;--accelerator&lt;/span&gt; cuda &lt;span class="nt"&gt;--gpu-count&lt;/span&gt; 2 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--source&lt;/span&gt; https://huggingface.co/Trilogix1/gemma-4-26B-A4B-it-GGUF/resolve/main/gemma-4-26B-A4B-it-Q4_K_M.gguf &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--image&lt;/span&gt; registry.defilan.net/llama-server-latest:gemma4 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--flash-attn&lt;/span&gt; &lt;span class="nt"&gt;--jinja&lt;/span&gt; &lt;span class="nt"&gt;--context&lt;/span&gt; 32768
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The model is 15.6 GB at Q4_K_M. With both GPUs, that leaves about 16 GB for KV cache. Plenty for 32K context.&lt;/p&gt;

&lt;p&gt;The operator downloaded the model, created the Deployment with the right GPU flags, set up health probes, and exposed an OpenAI-compatible endpoint. From the deploy command to the first inference request was about 3 minutes (mostly model download time).&lt;/p&gt;

&lt;h2&gt;
  
  
  The Numbers
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Single Request
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Generation&lt;/td&gt;
&lt;td&gt;96 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Prompt processing&lt;/td&gt;
&lt;td&gt;128 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Model size (Q4_K_M)&lt;/td&gt;
&lt;td&gt;15.6 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Active parameters per token&lt;/td&gt;
&lt;td&gt;4B (MoE)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Under Load (4 concurrent workers, 2 minutes)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Aggregate throughput&lt;/td&gt;
&lt;td&gt;170 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total requests&lt;/td&gt;
&lt;td&gt;110&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Error rate&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;P50 latency&lt;/td&gt;
&lt;td&gt;~2s per request&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For context, the generic benchmarks floating around say Gemma 4 26B-A4B "exceeds 40 tok/s on consumer hardware." We're doing 96 tok/s on a single request and 170 tok/s aggregate under concurrent load. The dual-GPU split and the MoE architecture (only 4B parameters active per token) make this model surprisingly fast.&lt;/p&gt;

&lt;h2&gt;
  
  
  Real Coding Benchmarks
&lt;/h2&gt;

&lt;p&gt;I didn't just run "hello world" tests. I fed it actual bug reports from my own project and asked it to generate fixes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Bug: GPU Rolling Update Deadlock
&lt;/h3&gt;

&lt;p&gt;The issue: Kubernetes rolling updates deadlock on GPU workloads because the new pod can't schedule (old pod holds GPUs) and the old pod won't terminate (waiting for new pod to be Ready).&lt;/p&gt;

&lt;p&gt;Gemma 4's response: correctly identified that GPU workloads should use &lt;code&gt;Recreate&lt;/code&gt; strategy instead of &lt;code&gt;RollingUpdate&lt;/code&gt;, with a conditional check on GPU count. Showed the chain-of-thought reasoning, considered edge cases, and verified against the pattern before outputting.&lt;/p&gt;

&lt;p&gt;Time: 10.6 seconds for a 1024-token response including the full reasoning chain.&lt;/p&gt;

&lt;h3&gt;
  
  
  Bug: Stale Endpoints After Deletion
&lt;/h3&gt;

&lt;p&gt;The issue: deleting an InferenceService leaves orphaned Kubernetes Endpoints.&lt;/p&gt;

&lt;p&gt;Gemma 4's response: generated a complete &lt;code&gt;UnregisterEndpoint&lt;/code&gt; method with DNS name sanitization, Service and Endpoints deletion, &lt;code&gt;NotFound&lt;/code&gt; error handling, and logging. Production-quality Go code on the first try.&lt;/p&gt;

&lt;p&gt;Time: 11.1 seconds.&lt;/p&gt;

&lt;h3&gt;
  
  
  Code Generation: Ginkgo BDD Tests
&lt;/h3&gt;

&lt;p&gt;I asked it to write tests following an existing pattern in the codebase. It generated 4 correct test cases with &lt;code&gt;BeforeEach&lt;/code&gt; setup, proper assertions, and the right Gomega matchers. Used &lt;code&gt;ContainElements&lt;/code&gt; for present checks and &lt;code&gt;NotTo(ContainElement())&lt;/code&gt; for absent checks, matching the exact conventions from the rest of the test suite.&lt;/p&gt;

&lt;p&gt;Time: 12.3 seconds.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Actually Means
&lt;/h2&gt;

&lt;p&gt;I'm not claiming Gemma 4 replaces Claude or GPT-4. It doesn't. The reasoning is shallower on complex multi-step problems, and it occasionally cuts off mid-response at the token limit.&lt;/p&gt;

&lt;p&gt;What I am claiming: the gap between "Google releases a new model" and "it's running on your hardware fixing real bugs" has shrunk to hours, not weeks. The pieces are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;GGUF quantization appears on HuggingFace within hours of a model release&lt;/li&gt;
&lt;li&gt;llama.cpp HEAD usually has architecture support on day one (the tokenizer and template fixes were already committed)&lt;/li&gt;
&lt;li&gt;Kaniko or similar tools let you build from source on-cluster without a separate CI pipeline&lt;/li&gt;
&lt;li&gt;A Kubernetes operator (in my case, LLMKube) lets you deploy with one command and get health checks, metrics, and an OpenAI-compatible API&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is the same workflow regardless of whether the model is Gemma 4, Qwen3.5, Llama, or whatever ships next week. The infrastructure is model-agnostic.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Hardware Math
&lt;/h2&gt;

&lt;p&gt;This entire setup cost about $2,400:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;2x RTX 5060 Ti: ~$800&lt;/li&gt;
&lt;li&gt;Ryzen 9 7900X + motherboard + RAM + SSD + case + PSU: ~$1,600&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Running 24/7, the system draws about 50-60W idle and 500-600W under full inference load. At $0.12/kWh, that's roughly $30-50/month in electricity for unlimited inference.&lt;/p&gt;
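
&lt;p&gt;The math behind that range, roughly. The duty cycle is the variable: the low end is a box that spends part of the day loaded, the high end is pegged around the clock:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# monthly electricity = average draw (kW) x 730 h x $/kWh
awk 'BEGIN {
  rate = 0.12; hours = 730
  printf "idle 24/7               (0.055 kW): $%.0f/month\n", 0.055 * hours * rate
  printf "loaded ~60%% of the time (0.35 kW):  $%.0f/month\n", 0.35 * hours * rate
  printf "pegged 24/7             (0.55 kW):  $%.0f/month\n", 0.55 * hours * rate
}'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
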

&lt;p&gt;Compare to API costs: at OpenAI's pricing for a comparable model, 110 requests in 2 minutes would cost roughly $5-10. Scale that to continuous use and the hardware pays for itself in a month or two.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;p&gt;LLMKube is open source (Apache 2.0): &lt;a href="https://github.com/defilantech/llmkube" rel="noopener noreferrer"&gt;github.com/defilantech/llmkube&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you have a GPU and a Kubernetes cluster (even a single-node K3s or MicroK8s), you can deploy any GGUF model with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;helm &lt;span class="nb"&gt;install &lt;/span&gt;llmkube llmkube/llmkube
llmkube deploy llama-3.1-8b &lt;span class="nt"&gt;--gpu&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For Gemma 4 specifically, you'll need a custom llama.cpp image until the official builds ship with &lt;code&gt;gemma4&lt;/code&gt; architecture support. The Dockerfile above works.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Benchmarks run on April 2, 2026 on ShadowStack (2x RTX 5060 Ti, 32GB VRAM, Blackwell SM 12.0, CUDA 13.1, driver 590.48.01). Gemma 4 26B-A4B-it Q4_K_M via llama.cpp built from HEAD commit f851fa5a.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>llm</category>
      <category>homelab</category>
      <category>ai</category>
    </item>
    <item>
      <title>I Tested TurboQuant KV Cache Compression on Consumer GPUs. Here's What Actually Happened.</title>
      <dc:creator>Christopher Maher</dc:creator>
      <pubDate>Mon, 30 Mar 2026 15:12:24 +0000</pubDate>
      <link>https://dev.to/defilan/i-tested-turboquant-kv-cache-compression-on-consumer-gpus-heres-what-actually-happened-beg</link>
      <guid>https://dev.to/defilan/i-tested-turboquant-kv-cache-compression-on-consumer-gpus-heres-what-actually-happened-beg</guid>
      <description>&lt;p&gt;I spent this weekend testing TurboQuant KV cache compression on my home lab Kubernetes cluster. The paper (ICLR 2026, Google Research) promises up to 4.57x compression of the KV cache with minimal quality loss. That sounded like exactly what I needed. I'm always bumping up against VRAM limits trying to run larger models or longer contexts on consumer hardware.&lt;/p&gt;

&lt;p&gt;Here's what I found: it works, but there are real tradeoffs nobody's talking about yet.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem: KV Cache Eats Your VRAM
&lt;/h2&gt;

&lt;p&gt;If you've run LLMs locally, you know the drill. You load a 32B model that fits in 20GB of VRAM, set the context to 32K, and suddenly you're at 28GB. The model weights didn't change. It's the KV cache growing linearly with context length.&lt;/p&gt;

&lt;p&gt;For every token in the context, the model stores key and value vectors for every attention head at every layer. In FP16, that adds up fast. A 32B model at 32K context can burn through 8+ GB of VRAM just for the KV cache.&lt;/p&gt;
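
&lt;p&gt;The arithmetic is worth internalizing. A sketch with Llama-3.1-8B-shaped numbers (32 layers, 8 KV heads of dimension 128 thanks to GQA); bigger models scale up with layer and head counts:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# per-token KV cache = 2 (K and V) x layers x kv_heads x head_dim x bytes/element
layers=32; kv_heads=8; head_dim=128; bytes=2   # f16
per_token=$((2 * layers * kv_heads * head_dim * bytes))
echo "per token:   $((per_token / 1024)) KiB"                          # 128 KiB
echo "at 32K ctx:  $((per_token * 32768 / 1024 / 1024 / 1024)) GiB"    # 4 GiB
echo "at 131K ctx: $((per_token * 131072 / 1024 / 1024 / 1024)) GiB"   # 16 GiB
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
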

&lt;p&gt;TurboQuant's approach is to apply a Walsh-Hadamard Transform (WHT) rotation to KV cache vectors before quantizing them to 3 bits. The rotation "gaussianizes" the distribution, making scalar quantization much more effective. The result is TQ3_0: roughly 3 bits per element (closer to 3.5 once per-block scales are counted) instead of 16, which is where the theoretical 4.57x compression figure comes from.&lt;/p&gt;

&lt;h2&gt;
  
  
  My Setup
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Hardware&lt;/strong&gt;: ShadowStack, my home inference server&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;2x NVIDIA RTX 5060 Ti (16GB GDDR7 each, 32GB total)&lt;/li&gt;
&lt;li&gt;AMD Ryzen 9 7900X, 64GB DDR5&lt;/li&gt;
&lt;li&gt;Ubuntu 24.04, MicroK8s&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Software&lt;/strong&gt;: &lt;a href="https://github.com/defilantech/llmkube" rel="noopener noreferrer"&gt;LLMKube&lt;/a&gt;, an open-source Kubernetes operator I built for managing llama.cpp inference workloads. It handles model downloads, GPU scheduling, multi-GPU sharding, health probes, and Prometheus metrics through Kubernetes CRDs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TurboQuant build&lt;/strong&gt;: I used the &lt;a href="https://github.com/animehacker/llama-turboquant" rel="noopener noreferrer"&gt;animehacker/llama-turboquant&lt;/a&gt; fork, which has working CUDA kernels for the WHT-based TQ3_0 type. This is a Stage 1 implementation (no QJL residual correction from the full paper). I built it with Kaniko directly on my cluster targeting SM 86 (Ampere) and SM 120 (Blackwell).&lt;/p&gt;

&lt;h3&gt;
  
  
  The Wrapper Entrypoint Pattern
&lt;/h3&gt;

&lt;p&gt;LLMKube's InferenceService CRD doesn't have a &lt;code&gt;--cache-type&lt;/code&gt; flag yet, so I built a custom Docker image with a wrapper entrypoint that injects the TurboQuant flags transparently:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;
&lt;span class="c"&gt;# entrypoint.sh - passes through all LLMKube args, appends TQ flags&lt;/span&gt;
&lt;span class="nv"&gt;TQ_CACHE_TYPE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;TQ_CACHE_TYPE&lt;/span&gt;&lt;span class="k"&gt;:-&lt;/span&gt;&lt;span class="nv"&gt;tq3_0&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="nv"&gt;TQ_ENABLED&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;TQ_ENABLED&lt;/span&gt;&lt;span class="k"&gt;:-&lt;/span&gt;&lt;span class="nv"&gt;true&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;TQ_ENABLED&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"true"&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
    &lt;/span&gt;&lt;span class="nb"&gt;exec &lt;/span&gt;llama-server &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$@&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;--cache-type-k&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;TQ_CACHE_TYPE&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;--cache-type-v&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;TQ_CACHE_TYPE&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="k"&gt;else
    &lt;/span&gt;&lt;span class="nb"&gt;exec &lt;/span&gt;llama-server &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$@&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="k"&gt;fi&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Using &lt;code&gt;exec&lt;/code&gt; is important. It makes llama-server PID 1 so Kubernetes health probes and signal handling work correctly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Benchmark Methodology
&lt;/h2&gt;

&lt;p&gt;Apples-to-apples. Same model weights, same context size, same concurrency. The only variable was the KV cache type (FP16 vs TQ3_0). Flash attention was enabled for all tests.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Throughput test&lt;/strong&gt;: 5 minutes of sustained load at 4 concurrent requests, 8K context.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Context sweep&lt;/strong&gt;: Deploy at each context size (4K through 131K), run a 2-minute stress test, record VRAM via nvidia-smi.&lt;/p&gt;
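
&lt;p&gt;The sweep itself is nothing fancy. A simplified sketch (the deploy flags follow the llmkube CLI, and the &lt;code&gt;sleep&lt;/code&gt; stands in for the 2-minute stress run):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# redeploy at each context size, load it, record total VRAM across both GPUs
for ctx in 4096 8192 16384 32768 65536 98304 131072; do
  llmkube deploy llama-3.1-8b --gpu --context "$ctx"   # real runs also pin image and model source
  sleep 120                                            # stand-in for the stress test
  vram=$(nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits | paste -sd+ - | bc)
  echo "ctx=$ctx total_vram=${vram} MiB"
done
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
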

&lt;p&gt;&lt;strong&gt;Models tested&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Llama 3.1 8B (Q5_K_M), small model with lots of headroom&lt;/li&gt;
&lt;li&gt;Qwen 2.5 14B (Q5_K_M), medium model that fills one GPU&lt;/li&gt;
&lt;li&gt;Qwen 2.5 32B (Q4_K_M), large model that requires both GPUs&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Results: Throughput
&lt;/h2&gt;

&lt;p&gt;This is where TurboQuant hurts.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Variant&lt;/th&gt;
&lt;th&gt;Gen tok/s&lt;/th&gt;
&lt;th&gt;Prompt tok/s&lt;/th&gt;
&lt;th&gt;Requests (5min)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Llama 8B&lt;/td&gt;
&lt;td&gt;FP16 cache&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;50.0&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;565.5&lt;/td&gt;
&lt;td&gt;771&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Llama 8B&lt;/td&gt;
&lt;td&gt;TQ3_0 cache&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;8.4&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;93.4&lt;/td&gt;
&lt;td&gt;74&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen 14B&lt;/td&gt;
&lt;td&gt;FP16 cache&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;28.1&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;122.0&lt;/td&gt;
&lt;td&gt;128&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen 14B&lt;/td&gt;
&lt;td&gt;TQ3_0 cache&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;5.3&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;63.4&lt;/td&gt;
&lt;td&gt;53&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen 32B&lt;/td&gt;
&lt;td&gt;FP16 cache&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;14.3&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;133.3&lt;/td&gt;
&lt;td&gt;108&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen 32B&lt;/td&gt;
&lt;td&gt;TQ3_0 cache&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;5.5&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;85.5&lt;/td&gt;
&lt;td&gt;53&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Generation throughput dropped 5-6x across all models. Prompt processing dropped roughly 2-6x depending on model size. This is consistent with what the PR benchmarks showed on CPU, but I expected Blackwell's tensor cores to help more than they did. The animehacker CUDA kernels were optimized for Ampere (SM 86), not Blackwell (SM 120), so there's likely performance left on the table.&lt;/p&gt;

&lt;h2&gt;
  
  
  Results: VRAM Usage
&lt;/h2&gt;

&lt;p&gt;This is where it gets interesting.&lt;/p&gt;

&lt;h3&gt;
  
  
  Llama 3.1 8B, Context Sweep
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Context&lt;/th&gt;
&lt;th&gt;FP16 VRAM (total)&lt;/th&gt;
&lt;th&gt;TQ3_0 VRAM (total)&lt;/th&gt;
&lt;th&gt;Savings&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;4K&lt;/td&gt;
&lt;td&gt;6.4 GB&lt;/td&gt;
&lt;td&gt;10.1 GB&lt;/td&gt;
&lt;td&gt;-58% (worse)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8K&lt;/td&gt;
&lt;td&gt;6.9 GB&lt;/td&gt;
&lt;td&gt;14.3 GB&lt;/td&gt;
&lt;td&gt;-107% (worse)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;16K&lt;/td&gt;
&lt;td&gt;8.0 GB&lt;/td&gt;
&lt;td&gt;22.8 GB&lt;/td&gt;
&lt;td&gt;-185% (worse)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;32K&lt;/td&gt;
&lt;td&gt;10.1 GB&lt;/td&gt;
&lt;td&gt;6.9 GB&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;31% better&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;65K&lt;/td&gt;
&lt;td&gt;14.3 GB&lt;/td&gt;
&lt;td&gt;8.4 GB&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;41% better&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;98K&lt;/td&gt;
&lt;td&gt;18.5 GB&lt;/td&gt;
&lt;td&gt;9.8 GB&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;47% better&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;131K&lt;/td&gt;
&lt;td&gt;22.7 GB&lt;/td&gt;
&lt;td&gt;11.2 GB&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;51% better&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Qwen 2.5 14B, Context Sweep
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Context&lt;/th&gt;
&lt;th&gt;FP16 VRAM (total)&lt;/th&gt;
&lt;th&gt;TQ3_0 VRAM (total)&lt;/th&gt;
&lt;th&gt;Savings&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;4K&lt;/td&gt;
&lt;td&gt;11.1 GB&lt;/td&gt;
&lt;td&gt;16.7 GB&lt;/td&gt;
&lt;td&gt;-50% (worse)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8K&lt;/td&gt;
&lt;td&gt;11.9 GB&lt;/td&gt;
&lt;td&gt;23.0 GB&lt;/td&gt;
&lt;td&gt;-93% (worse)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;16K&lt;/td&gt;
&lt;td&gt;13.4 GB&lt;/td&gt;
&lt;td&gt;11.0 GB&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;18% better&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;32K&lt;/td&gt;
&lt;td&gt;16.6 GB&lt;/td&gt;
&lt;td&gt;11.8 GB&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;29% better&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;65K&lt;/td&gt;
&lt;td&gt;22.8 GB&lt;/td&gt;
&lt;td&gt;13.7 GB&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;40% better&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Qwen 2.5 32B, Context Sweep
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Context&lt;/th&gt;
&lt;th&gt;FP16 VRAM (total)&lt;/th&gt;
&lt;th&gt;TQ3_0 VRAM (total)&lt;/th&gt;
&lt;th&gt;Savings&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;2K&lt;/td&gt;
&lt;td&gt;19.9 GB&lt;/td&gt;
&lt;td&gt;23.7 GB&lt;/td&gt;
&lt;td&gt;-19% (worse)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4K&lt;/td&gt;
&lt;td&gt;20.5 GB&lt;/td&gt;
&lt;td&gt;27.9 GB&lt;/td&gt;
&lt;td&gt;-36% (worse)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8K&lt;/td&gt;
&lt;td&gt;21.6 GB&lt;/td&gt;
&lt;td&gt;19.8 GB&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;8% better&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;16K&lt;/td&gt;
&lt;td&gt;23.7 GB&lt;/td&gt;
&lt;td&gt;20.3 GB&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;14% better&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;32K&lt;/td&gt;
&lt;td&gt;28.0 GB&lt;/td&gt;
&lt;td&gt;21.4 GB&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;24% better&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  The Surprise: TQ Uses MORE VRAM at Small Contexts
&lt;/h2&gt;

&lt;p&gt;I wasn't expecting this. Below each model's crossover point, TQ3_0 consistently used more VRAM than the FP16 baseline, sometimes dramatically more. Llama 8B at 16K context used 22.8 GB with TQ vs 8.0 GB with FP16.&lt;/p&gt;

&lt;p&gt;My theory: the WHT rotation machinery has a fixed overhead (lookup tables, rotation matrices, codebooks) that gets allocated regardless of context size. When the KV cache is small, this overhead dwarfs the compression savings. The crossover point where TQ starts winning varies by model:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Llama 8B: around 32K context&lt;/li&gt;
&lt;li&gt;Qwen 14B: around 16K context&lt;/li&gt;
&lt;li&gt;Qwen 32B: around 8K context&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Larger models cross over earlier because their per-token KV cache is larger (more layers, more attention heads), so the compression pays off sooner.&lt;/p&gt;

&lt;h2&gt;
  
  
  When Is TurboQuant Worth It?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Use TQ3_0 when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You need 32K+ context on consumer GPUs&lt;/li&gt;
&lt;li&gt;You're hitting VRAM limits and can't afford more hardware&lt;/li&gt;
&lt;li&gt;Throughput isn't critical (batch processing, RAG with long documents, analysis tasks)&lt;/li&gt;
&lt;li&gt;You're running a large model (32B+) where the crossover point is lower&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Don't use TQ3_0 when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Context is under 16K (you'll actually use more VRAM)&lt;/li&gt;
&lt;li&gt;You need interactive throughput (the 5x penalty makes chat unusable)&lt;/li&gt;
&lt;li&gt;You're on Blackwell and want optimal performance (wait for SM 120-optimized kernels)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The sweet spot in my testing was Qwen 32B at 32K context. Baseline uses 28 GB, which is dangerously close to my 32 GB ceiling. One concurrent request could OOM. TQ drops it to 21.4 GB, leaving over 10 GB of headroom for parallel slots or longer contexts.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;The throughput penalty is the main blocker. The animehacker CUDA kernels use a fused MMVQ approach that avoids dequantization during attention, but the WHT butterfly transform still runs 160 integer ops per element in registers. On Blackwell with its new SM architecture, these kernels likely aren't hitting optimal occupancy.&lt;/p&gt;

&lt;p&gt;Things I'm watching:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/ggml-org/llama.cpp/pull/21089" rel="noopener noreferrer"&gt;PR #21089&lt;/a&gt; on ggml-org/llama.cpp, the only open upstream PR for TurboQuant (CPU-only for now)&lt;/li&gt;
&lt;li&gt;Whether &lt;code&gt;ggerganov&lt;/code&gt; engages with it. If he requests changes rather than closing, it'll eventually land.&lt;/li&gt;
&lt;li&gt;SM 120-optimized CUDA kernels. Blackwell has new instructions that could close the throughput gap.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For LLMKube, I'm planning to add &lt;code&gt;cacheTypeK&lt;/code&gt; and &lt;code&gt;cacheTypeV&lt;/code&gt; fields to the InferenceService CRD so users can configure this without the wrapper entrypoint hack. Also an &lt;code&gt;extraArgs&lt;/code&gt; escape hatch for any llama.cpp flag we don't have a typed field for yet.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It Yourself
&lt;/h2&gt;

&lt;p&gt;All the benchmarking infrastructure is in the &lt;a href="https://github.com/defilantech/llmkube" rel="noopener noreferrer"&gt;LLMKube&lt;/a&gt; repo. The operator is open source (Apache 2.0) and handles the full lifecycle: model downloads, GPU scheduling, multi-GPU sharding, health probes, and Prometheus metrics. If you have a GPU cluster and want to test TurboQuant:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Build the custom image from &lt;code&gt;animehacker/llama-turboquant&lt;/code&gt; with &lt;code&gt;-DGGML_CUDA=ON&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Set &lt;code&gt;spec.image&lt;/code&gt; on your InferenceService to point at it&lt;/li&gt;
&lt;li&gt;The wrapper entrypoint handles the rest&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you run these benchmarks on different hardware (A100, RTX 3090, etc.), I'd love to see the numbers. Drop a comment or find me on the &lt;a href="https://discord.gg/5GavYFPBBr" rel="noopener noreferrer"&gt;LLMKube Discord&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Benchmarks run on 2026-03-30 on ShadowStack (2x RTX 5060 Ti, 32GB VRAM, Blackwell SM 12.0, CUDA 13.0).&lt;/em&gt;&lt;/p&gt;

</description>
      <category>llm</category>
      <category>kubernetes</category>
      <category>gpu</category>
      <category>ai</category>
    </item>
    <item>
      <title>The $0 Problem: Why Every Tool Says Your On-Prem Inference is Free</title>
      <dc:creator>Christopher Maher</dc:creator>
      <pubDate>Mon, 23 Mar 2026 15:49:13 +0000</pubDate>
      <link>https://dev.to/defilan/the-0-problem-why-every-tool-says-your-on-prem-inference-is-free-3mcb</link>
      <guid>https://dev.to/defilan/the-0-problem-why-every-tool-says-your-on-prem-inference-is-free-3mcb</guid>
      <description>&lt;p&gt;If you run LLMs on your own hardware, every cost tracking tool in the ecosystem has the same answer for what it costs: &lt;strong&gt;$0&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;OpenCost sees your GPU pods but has no concept of tokens. LiteLLM tracks tokens per user but hardcodes on-prem cost to zero. Langfuse traces requests but only prices cloud APIs. The FinOps Foundation's own working group explicitly says on-premises AI cost is "outside the scope."&lt;/p&gt;

&lt;p&gt;Meanwhile, your GPUs cost real money. The H100s draw 700 watts each. Your electricity bill is real. The three-year amortization on $280K of hardware is real. But no tool computes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;true cost per token = (hourly hardware amortization + electricity rate x GPU power draw) / tokens per hour
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We built InferCost to fix this.&lt;/p&gt;

&lt;h3&gt;
  
  
  What InferCost does
&lt;/h3&gt;

&lt;p&gt;InferCost is an open-source Kubernetes operator (Apache 2.0) that computes the true cost of running AI inference on your own hardware. It's a single controller pod. No database, no UI to host. It plugs into Prometheus and Grafana you already run.&lt;/p&gt;

&lt;p&gt;You declare your hardware economics in a CRD:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;finops.infercost.ai/v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;CostProfile&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gpu-cluster&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;hardware&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;gpuModel&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;NVIDIA&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;GeForce&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;RTX&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;5060&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Ti"&lt;/span&gt;
    &lt;span class="na"&gt;gpuCount&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
    &lt;span class="na"&gt;purchasePriceUSD&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;960&lt;/span&gt;
    &lt;span class="na"&gt;amortizationYears&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;
  &lt;span class="na"&gt;electricity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;ratePerKWh&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.08&lt;/span&gt;
    &lt;span class="na"&gt;pueFactor&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1.0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;InferCost reads real-time GPU power draw from DCGM, scrapes token counts from your inference engine (llama.cpp, vLLM), does the math, and tells you what your inference actually costs. Per model. Per team. Per token.&lt;/p&gt;

&lt;h3&gt;
  
  
  What we found on real hardware
&lt;/h3&gt;

&lt;p&gt;We deployed InferCost on a homelab running Qwen3-32B on 2x RTX 5060 Ti GPUs. Here are the real numbers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Hourly infrastructure cost&lt;/strong&gt;: $0.053 (amortization + electricity at actual GPU power draw)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost per million tokens&lt;/strong&gt;: $0.41 under sustained load&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monthly projected&lt;/strong&gt;: $38&lt;/li&gt;
&lt;/ul&gt;
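
&lt;p&gt;For the curious, that hourly figure roughly checks out by hand from the CostProfile values, if you assume an average combined draw of about 200 W across the two cards (the draw is my assumption here; InferCost reads the real value from DCGM):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Back-of-the-envelope check of the $0.053/hour figure above.
# The 200 W average combined draw is an assumption for illustration;
# InferCost reads the real draw from DCGM.
amortization_per_hour = 960 / (3 * 365 * 24)   # ~$0.037
electricity_per_hour = 0.200 * 0.08            # kW x $/kWh = $0.016
print(round(amortization_per_hour + electricity_per_hour, 3))   # 0.053
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
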

&lt;p&gt;Then we compared against cloud APIs (verified pricing as of March 2026):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Provider&lt;/th&gt;
&lt;th&gt;Cloud Cost&lt;/th&gt;
&lt;th&gt;On-Prem Cost&lt;/th&gt;
&lt;th&gt;Savings&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Claude Opus 4.6&lt;/td&gt;
&lt;td&gt;$9.82&lt;/td&gt;
&lt;td&gt;$0.62&lt;/td&gt;
&lt;td&gt;94%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-5.4&lt;/td&gt;
&lt;td&gt;$5.83&lt;/td&gt;
&lt;td&gt;$0.62&lt;/td&gt;
&lt;td&gt;89%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemini 2.5 Pro&lt;/td&gt;
&lt;td&gt;$3.84&lt;/td&gt;
&lt;td&gt;$0.62&lt;/td&gt;
&lt;td&gt;84%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-5.4-nano&lt;/td&gt;
&lt;td&gt;$0.41&lt;/td&gt;
&lt;td&gt;$0.62&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Cloud 34% cheaper&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;That last row matters. When the cheapest cloud model is actually cheaper than your hardware, InferCost tells you. The point is not to prove on-prem always wins. The point is to give you the real numbers so you can decide.&lt;/p&gt;

&lt;h3&gt;
  
  
  A note on how we calculate cost
&lt;/h3&gt;

&lt;p&gt;The $28/month on-prem number is your total infrastructure cost: hardware amortization plus electricity, running 24/7. Your GPUs cost money whether or not they're serving requests. The $0.41 per million tokens is the marginal cost during active inference (what each token costs when the system is busy).&lt;/p&gt;

&lt;p&gt;The savings comparison uses total infrastructure cost because that's the honest number. If your GPUs sit idle half the time, that idle time still costs you. This is the same logic as any hardware TCO calculation: you amortize the full purchase price, not just the hours you used it.&lt;/p&gt;

&lt;p&gt;This means your actual savings percentage depends on utilization. At high utilization (GPUs busy most of the day), the savings are dramatic. At low utilization, the math shifts toward cloud APIs for cheap models. InferCost shows you both realities so you can make the right call for each workload.&lt;/p&gt;
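
&lt;p&gt;A quick way to see where the crossover sits for your own workload: divide the fixed monthly cost by the cloud rate you'd otherwise pay. A minimal sketch with a placeholder cloud price:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Breakeven volume: above this many tokens per month the fixed on-prem cost
# wins; below it, the cheap cloud model does. The cloud rate is a placeholder.
monthly_infra_cost = 28.0          # $/month, paid whether GPUs are busy or idle
cloud_price_per_million = 0.50     # $/1M tokens, illustrative budget-model rate

breakeven_millions = monthly_infra_cost / cloud_price_per_million
print(f"on-prem wins above ~{breakeven_millions:.0f}M tokens/month")   # ~56M
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
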

&lt;h3&gt;
  
  
  The CLI
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;brew &lt;span class="nb"&gt;install &lt;/span&gt;defilantech/tap/infercost
&lt;span class="nv"&gt;$ &lt;/span&gt;infercost compare &lt;span class="nt"&gt;--monthly&lt;/span&gt;

PROVIDER    MODEL              CLOUD/MONTH  ON-PREM/MONTH  SAVINGS/MONTH
Anthropic   claude-opus-4-6    &lt;span class="nv"&gt;$409&lt;/span&gt;         &lt;span class="nv"&gt;$28&lt;/span&gt;            &lt;span class="nv"&gt;$381&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;93%&lt;span class="o"&gt;)&lt;/span&gt;
OpenAI      gpt-5.4            &lt;span class="nv"&gt;$242&lt;/span&gt;         &lt;span class="nv"&gt;$28&lt;/span&gt;            &lt;span class="nv"&gt;$214&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;88%&lt;span class="o"&gt;)&lt;/span&gt;
Google      gemini-2.5-pro     &lt;span class="nv"&gt;$159&lt;/span&gt;         &lt;span class="nv"&gt;$28&lt;/span&gt;            &lt;span class="nv"&gt;$131&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;82%&lt;span class="o"&gt;)&lt;/span&gt;
Google      gemini-2.5-flash   &lt;span class="nv"&gt;$40&lt;/span&gt;          &lt;span class="nv"&gt;$28&lt;/span&gt;            &lt;span class="nv"&gt;$12&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;30%&lt;span class="o"&gt;)&lt;/span&gt;
OpenAI      gpt-5.4-nano       &lt;span class="nv"&gt;$20&lt;/span&gt;          &lt;span class="nv"&gt;$28&lt;/span&gt;            -&lt;span class="nv"&gt;$8&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;cloud cheaper&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  What InferCost is NOT
&lt;/h3&gt;

&lt;p&gt;It is not a cloud API cost tracker. If you want to monitor your OpenAI bill, tools like Helicone and LangSmith do that well. InferCost solves a different problem: the cost of running inference on hardware you own, where the economics involve amortization schedules and electricity bills, not API invoices.&lt;/p&gt;

&lt;p&gt;It is also not locked to any specific inference stack. It works with LLMKube, but also with any Kubernetes deployment that runs llama.cpp or vLLM with Prometheus metrics exposed.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why open source
&lt;/h3&gt;

&lt;p&gt;The organizations that need on-prem cost tracking the most (healthcare, defense, finance, government) are the same ones that can't send cost data to a SaaS dashboard. They chose on-prem for data sovereignty. A cost tracking tool that phones home defeats the purpose.&lt;/p&gt;

&lt;p&gt;InferCost runs entirely in your cluster. Your cost data never leaves your infrastructure. Apache 2.0, no telemetry, no cloud dependency.&lt;/p&gt;

&lt;h3&gt;
  
  
  Get started
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install the CLI&lt;/span&gt;
brew &lt;span class="nb"&gt;install &lt;/span&gt;defilantech/tap/infercost

&lt;span class="c"&gt;# Or deploy via Helm&lt;/span&gt;
helm repo add infercost https://defilantech.github.io/infercost
helm &lt;span class="nb"&gt;install &lt;/span&gt;infercost infercost/infercost &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set&lt;/span&gt; dcgm.endpoint&lt;span class="o"&gt;=&lt;/span&gt;http://dcgm-exporter:9400/metrics
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GitHub&lt;/strong&gt;: &lt;a href="https://github.com/defilantech/infercost" rel="noopener noreferrer"&gt;github.com/defilantech/infercost&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Website&lt;/strong&gt;: &lt;a href="https://infercost.ai" rel="noopener noreferrer"&gt;infercost.ai&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Companion project&lt;/strong&gt;: &lt;a href="https://llmkube.com" rel="noopener noreferrer"&gt;LLMKube&lt;/a&gt; (K8s operator for LLM inference)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you're running inference on your own hardware and want to know what it actually costs, give it a try. Issues and PRs welcome.&lt;/p&gt;

</description>
      <category>finops</category>
      <category>ai</category>
      <category>kubernetes</category>
      <category>opensource</category>
    </item>
    <item>
      <title>llama.cpp on Kubernetes: The Guide I Wish Existed</title>
      <dc:creator>Christopher Maher</dc:creator>
      <pubDate>Tue, 17 Mar 2026 06:50:51 +0000</pubDate>
      <link>https://dev.to/defilan/llamacpp-on-kubernetes-the-guide-i-wish-existed-59nm</link>
      <guid>https://dev.to/defilan/llamacpp-on-kubernetes-the-guide-i-wish-existed-59nm</guid>
      <description>&lt;p&gt;It started at my kitchen table.&lt;/p&gt;

&lt;p&gt;I was spending an evening on my laptop, fascinated by how LLMs actually work under the hood. Not the API calls, not the chat interfaces, but the actual inference process. I installed Ollama on my Mac, pulled a model, and within a few hours I was completely hooked.&lt;/p&gt;

&lt;p&gt;If you've done this yourself, you know the feeling. A language model running on your own hardware. No API keys, no usage limits, no data leaving your network. Just you and the model.&lt;/p&gt;

&lt;p&gt;Ollama made it easy to get started, but I quickly wanted to understand what was happening underneath. That led me to llama.cpp, which Ollama uses under the hood, and that's where things really clicked. I could see exactly how the model was being loaded, how layers were offloaded to the GPU, how the inference loop worked. I went from curious to obsessed pretty quickly.&lt;/p&gt;

&lt;p&gt;But then the questions started piling up.&lt;/p&gt;

&lt;p&gt;How do I serve this to my team? How do I run multiple models? What happens when I want to use the NVIDIA GPUs on my Linux server AND the Metal GPU on my Mac? How do I monitor it? How do I manage model versions?&lt;/p&gt;

&lt;p&gt;I come from a DevOps background, so my brain immediately went to Kubernetes. I figured someone had already built this. And while there are some incredible tools out there (Ollama for single-machine use, vLLM for high-throughput NVIDIA clusters), nothing quite did what I wanted: a Kubernetes operator that treats LLM inference as a first-class workload across heterogeneous hardware, including Apple Silicon.&lt;/p&gt;

&lt;p&gt;So I started building &lt;a href="https://github.com/defilantech/LLMKube" rel="noopener noreferrer"&gt;LLMKube&lt;/a&gt;, an open-source Kubernetes operator for running LLMs with llama.cpp. I'm a big believer in open source, and I wanted this to be open source from day one. The best infrastructure tools are built by communities, not individuals. This guide is everything I've learned along the way.&lt;/p&gt;

&lt;h2&gt;
  
  
  What We're Building Toward
&lt;/h2&gt;

&lt;p&gt;By the end of this guide, you'll understand how to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Run llama.cpp on Kubernetes with proper lifecycle management&lt;/li&gt;
&lt;li&gt;Deploy models with a single command or a two-resource YAML&lt;/li&gt;
&lt;li&gt;Use NVIDIA GPUs with CUDA acceleration&lt;/li&gt;
&lt;li&gt;Use Apple Silicon Macs as GPU inference nodes in your cluster&lt;/li&gt;
&lt;li&gt;Split models across multiple GPUs for larger models&lt;/li&gt;
&lt;li&gt;Monitor everything with Prometheus and Grafana&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you just want to try it out quickly, skip ahead to the hands-on quickstart.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem with "Just Run llama.cpp"
&lt;/h2&gt;

&lt;p&gt;llama.cpp is an outstanding project. It runs on virtually any hardware, supports dozens of model architectures, and the GGUF format has become the standard for local inference. If you need to run one model on one machine, llama.cpp with llama-server is honestly all you need.&lt;/p&gt;

&lt;p&gt;The challenges show up when you want to operationalize it:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Model lifecycle.&lt;/strong&gt; You need to download models, verify their integrity, cache them so pods don't re-download 30GB files on every restart, and keep track of what's deployed where.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GPU scheduling.&lt;/strong&gt; If you have multiple models competing for limited GPU memory, you need something smarter than "first pod wins." Priority queues, memory budgets, and graceful handling of GPU contention all matter when you have real workloads.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Heterogeneous hardware.&lt;/strong&gt; This is the big one. Apple Silicon's Metal GPU can't be accessed from inside a container. Every Kubernetes-based LLM tool I found either ignored Macs entirely or ran them in CPU-only mode, which throws away the best part of the hardware. If you have a Mac Studio with an M4 Ultra sitting on your desk and a Linux server with NVIDIA GPUs in your closet, you shouldn't have to choose between them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Observability.&lt;/strong&gt; If you're already running Prometheus and Grafana (and if you're running Kubernetes, you probably are), you want inference metrics in the same stack as everything else. Tokens per second, prompt processing time, GPU utilization, model load times, all in one place.&lt;/p&gt;

&lt;h2&gt;
  
  
  How LLMKube Approaches This
&lt;/h2&gt;

&lt;p&gt;LLMKube adds two Custom Resource Definitions to your Kubernetes cluster:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Model&lt;/strong&gt; defines what you want to run: the GGUF source URL, quantization level, GPU requirements, and hardware preferences.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;InferenceService&lt;/strong&gt; defines how you want to run it: replicas, resource limits, endpoint configuration, and which Model to reference.&lt;/p&gt;

&lt;p&gt;The operator watches these resources and handles everything in between: downloading the model, creating deployments, configuring health checks, setting up llama-server with the right flags, exposing an OpenAI-compatible API, and cleaning up when you delete resources.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;inference.llmkube.dev/v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Model&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;llama-3-8b&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;source&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf&lt;/span&gt;
  &lt;span class="na"&gt;format&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gguf&lt;/span&gt;
  &lt;span class="na"&gt;quantization&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Q4_K_M&lt;/span&gt;
  &lt;span class="na"&gt;hardware&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;accelerator&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;cuda&lt;/span&gt;
    &lt;span class="na"&gt;gpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;count&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;inference.llmkube.dev/v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;InferenceService&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;llama-3-8b&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;modelRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;llama-3-8b&lt;/span&gt;
  &lt;span class="na"&gt;replicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
  &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2"&lt;/span&gt;
    &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;4Gi"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. The operator takes it from there.&lt;/p&gt;

&lt;h2&gt;
  
  
  My Actual Setup
&lt;/h2&gt;

&lt;p&gt;I want to be transparent about the hardware I run this on, because I think it's important for people to see that you don't need datacenter-grade equipment to make this work.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Shadowstack&lt;/strong&gt; is my primary inference server. It's a desktop PC I built specifically for this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AMD Ryzen 9 7900X (12 cores / 24 threads)&lt;/li&gt;
&lt;li&gt;64GB DDR5-6000&lt;/li&gt;
&lt;li&gt;2x NVIDIA RTX 5060 Ti (16GB VRAM each, 32GB total)&lt;/li&gt;
&lt;li&gt;Samsung 990 Pro 1TB NVMe&lt;/li&gt;
&lt;li&gt;Running MicroK8s as a single-node Kubernetes cluster&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Mac Studio&lt;/strong&gt; (M4 Ultra, 36GB unified memory) runs the Metal Agent, which lets Kubernetes orchestrate llama-server natively on macOS with full Metal GPU access.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mac Mini&lt;/strong&gt; handles other orchestration workloads.&lt;/p&gt;

&lt;p&gt;On Shadowstack, I run &lt;strong&gt;Qwen3 32B&lt;/strong&gt; with the model split across both 5060 Tis using tensor parallelism. On the Mac Studio, I run &lt;strong&gt;Qwen 30B-A3B&lt;/strong&gt; (a mixture-of-experts model that fits comfortably in 36GB of unified memory). Both are managed by the same LLMKube operator, using the same CRDs, visible through the same monitoring stack.&lt;/p&gt;

&lt;p&gt;Is 36GB of unified memory on the Mac Studio less than I wish I had? Sure. But it still runs a 30B MoE model for real workloads, and that's the point. You work with the hardware you have.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Metal Agent: Running Apple Silicon in Your Cluster
&lt;/h2&gt;

&lt;p&gt;This is the part that gets me the most excited, and the part that I haven't seen anyone else solve.&lt;/p&gt;

&lt;p&gt;Here's the core problem: Apple Silicon GPUs use Metal, not CUDA. Metal isn't accessible from inside a Docker container. So if you put a Mac in your Kubernetes cluster and deploy a pod to it, that pod can only use the CPU. Your M4 Ultra's GPU sits idle.&lt;/p&gt;

&lt;p&gt;The Metal Agent works around this by inverting the typical Kubernetes model. Instead of running inference inside a container, the Metal Agent runs as a native macOS daemon that:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Watches the Kubernetes API for InferenceService resources with &lt;code&gt;accelerator: metal&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Spawns llama-server natively on macOS with full Metal GPU access&lt;/li&gt;
&lt;li&gt;Registers the endpoint back into Kubernetes so other services can route to it&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;From the perspective of any other service in your cluster, the model running on your Mac looks like any other Kubernetes-managed endpoint. You can hit the same OpenAI-compatible API, the same health checks work, the same Prometheus metrics are exposed.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# On your Mac&lt;/span&gt;
brew &lt;span class="nb"&gt;install &lt;/span&gt;llama.cpp
llmkube-metal-agent &lt;span class="nt"&gt;--host-ip&lt;/span&gt; 192.168.1.x

&lt;span class="c"&gt;# From anywhere in the cluster&lt;/span&gt;
llmkube deploy qwen-30b-a3b &lt;span class="nt"&gt;--accelerator&lt;/span&gt; metal
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The same CRD that deploys a model on NVIDIA with CUDA deploys on Apple Silicon with Metal. Just change &lt;code&gt;accelerator: cuda&lt;/code&gt; to &lt;code&gt;accelerator: metal&lt;/code&gt;.&lt;/p&gt;
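
&lt;p&gt;If you want a feel for what that inversion looks like in code, here's a minimal sketch of the watch-and-spawn loop using the Kubernetes Python client. To be clear, this is not the Metal Agent's actual implementation; the resource plural, field locations, model path, and llama-server flags are assumptions for illustration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Minimal sketch of a watch-and-spawn loop; NOT the actual Metal Agent code.
# The resource plural, spec field locations, model path, and server flags
# below are assumptions for illustration.
import subprocess
from kubernetes import client, config, watch

config.load_kube_config()
api = client.CustomObjectsApi()

for event in watch.Watch().stream(
    api.list_cluster_custom_object,
    group="inference.llmkube.dev",
    version="v1alpha1",
    plural="inferenceservices",        # assumed plural for the CRD
):
    svc = event["object"]
    if event["type"] != "ADDED":
        continue
    if svc["spec"].get("accelerator") != "metal":   # assumed field location
        continue
    # Spawn llama-server natively so it gets full Metal GPU access.
    subprocess.Popen([
        "llama-server",
        "-m", f"/var/models/{svc['spec']['modelRef']}.gguf",   # assumed path
        "--port", "8080",
        "-ngl", "999",                 # offload all layers to the GPU
    ])
    # Step 3 (registering the endpoint back into Kubernetes) is omitted here.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
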

&lt;h2&gt;
  
  
  Multi-GPU: Splitting Models Across Cards
&lt;/h2&gt;

&lt;p&gt;If you want to run models larger than what fits on a single GPU, llama.cpp supports tensor parallelism across multiple GPUs on the same node. LLMKube automates this through the GPU sharding spec.&lt;/p&gt;

&lt;p&gt;On my Shadowstack box, Qwen3 32B (quantized to Q4_K_M, roughly 20GB) gets split across both 5060 Tis. Each GPU handles a portion of the model's layers, and llama.cpp coordinates the inference across both cards.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;hardware&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;accelerator&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;cuda&lt;/span&gt;
    &lt;span class="na"&gt;gpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;count&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
      &lt;span class="na"&gt;sharding&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;strategy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;layer&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The operator automatically calculates the tensor split ratios and passes the right flags to llama-server. On the dual 5060 Ti setup, I see consistent ~53 tokens/second for 3-8B models and solid performance on the 32B model with the split.&lt;/p&gt;
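
&lt;p&gt;For reference, a layer split across two identical cards boils down to a ratio over available VRAM and a couple of llama-server flags. Here's a rough sketch of what that flag construction looks like (LLMKube's actual output may differ):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Rough sketch of how a layer split across two equal cards maps to
# llama-server flags. LLMKube's actual flag construction may differ,
# and the model path is illustrative.
vram_gib = [16, 16]                                  # 2x RTX 5060 Ti
total = sum(vram_gib)
ratios = [round(v / total, 2) for v in vram_gib]     # [0.5, 0.5]

args = [
    "llama-server",
    "-m", "/models/qwen3-32b-q4_k_m.gguf",
    "--split-mode", "layer",
    "--tensor-split", ",".join(str(r) for r in ratios),
    "-ngl", "999",
]
print(" ".join(args))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
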

&lt;h2&gt;
  
  
  Hands-On: Try It in 10 Minutes
&lt;/h2&gt;

&lt;p&gt;You don't need my hardware to try this. Here's the quickest path from zero to running inference on Kubernetes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;A Kubernetes cluster (Minikube, kind, K3s, or any managed cluster)&lt;/li&gt;
&lt;li&gt;kubectl configured&lt;/li&gt;
&lt;li&gt;Helm 3&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Install LLMKube
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install the CLI&lt;/span&gt;
brew &lt;span class="nb"&gt;install &lt;/span&gt;defilantech/tap/llmkube

&lt;span class="c"&gt;# Add the Helm repo and install the operator&lt;/span&gt;
helm repo add llmkube https://defilantech.github.io/LLMKube
helm &lt;span class="nb"&gt;install &lt;/span&gt;llmkube llmkube/llmkube &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--namespace&lt;/span&gt; llmkube-system &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--create-namespace&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Deploy Your First Model
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Deploy Phi-4 Mini (3.8B params, from the built-in catalog)&lt;/span&gt;
llmkube deploy phi-4-mini
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That single command creates both the Model and InferenceService resources. The operator downloads the GGUF file, spins up a pod with llama-server, and exposes an OpenAI-compatible API. You can also deploy any GGUF model by providing a &lt;code&gt;--source&lt;/code&gt; URL pointing to HuggingFace or any HTTP endpoint.&lt;/p&gt;

&lt;h3&gt;
  
  
  Query It
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Port-forward and test&lt;/span&gt;
kubectl port-forward svc/phi-4-mini 8080:8080 &amp;amp;

curl http://localhost:8080/v1/chat/completions &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "messages": [
      {"role": "user", "content": "What is Kubernetes in one sentence?"}
    ],
    "max_tokens": 100
  }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Use It With the OpenAI SDK
&lt;/h3&gt;

&lt;p&gt;Since the API is OpenAI-compatible, you can point any OpenAI SDK client at it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:8080/v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;not-needed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;phi-4-mini&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Hello!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This works with LangChain, LlamaIndex, and anything else that speaks the OpenAI API.&lt;/p&gt;

&lt;h3&gt;
  
  
  Add GPU Acceleration
&lt;/h3&gt;

&lt;p&gt;If you have an NVIDIA GPU available in your cluster:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;llmkube deploy llama-3.1-8b &lt;span class="nt"&gt;--gpu&lt;/span&gt; &lt;span class="nt"&gt;--gpu-count&lt;/span&gt; 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The difference is dramatic. On an NVIDIA L4 in GKE, prompt processing goes from 29 tok/s (CPU) to 1,026 tok/s (GPU). Token generation jumps from 4.6 tok/s to 64 tok/s. That's roughly a 14x speedup on generation and 35x on prompt processing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Air-Gapped Deployments
&lt;/h2&gt;

&lt;p&gt;Early in my career, I worked in medical IT. That experience gave me an appreciation for environments where data simply cannot leave the network. Healthcare, defense, finance, government: these industries have strict compliance requirements that make cloud-hosted AI a non-starter.&lt;/p&gt;

&lt;p&gt;LLMKube supports air-gapped deployment through PVC-based model sources with SHA256 integrity verification:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;source&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pvc://model-storage/models/llama-3-8b-q4.gguf&lt;/span&gt;
  &lt;span class="na"&gt;sha256&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;a1b2c3d4e5f6...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You stage models to a PersistentVolumeClaim, provide the checksum, and the operator verifies integrity before deploying. No outbound network calls, no container registry pulls at runtime, no data leaving your network.&lt;/p&gt;
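
&lt;p&gt;Producing that checksum before you stage the file is quick: &lt;code&gt;sha256sum&lt;/code&gt; on Linux does it, or a few lines of Python if you're scripting the staging step:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Compute the sha256 you paste into the Model spec, reading in chunks so a
# 20-30 GB GGUF file never has to fit in memory.
import hashlib
import sys

h = hashlib.sha256()
with open(sys.argv[1], "rb") as f:
    for chunk in iter(lambda: f.read(1024 * 1024), b""):
        h.update(chunk)
print(h.hexdigest())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
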

&lt;p&gt;This is an area where I think llama.cpp really shines for Kubernetes deployments. The GGUF format is a single file. There's no Python dependency tree, no model sharding across dozens of files, no runtime downloads of tokenizers. You put one file on a PVC, point a CRD at it, and you're running.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where LLMKube Fits (and Where It Doesn't)
&lt;/h2&gt;

&lt;p&gt;I want to be honest about this, because there are great tools in this space and picking the right one matters.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If you need maximum throughput for high-concurrency workloads (50+ simultaneous users), use vLLM or SGLang.&lt;/strong&gt; They use PagedAttention, continuous batching, and other optimizations that llama.cpp doesn't have. At scale, vLLM delivers significantly higher request throughput. That's just the reality.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If you just need to run one model on one machine, use Ollama.&lt;/strong&gt; It's simpler, it's elegant, and it handles the single-machine case better than a Kubernetes operator ever will.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LLMKube is for the space in between.&lt;/strong&gt; You have a Kubernetes cluster. You have a mix of hardware (maybe NVIDIA GPUs, maybe Apple Silicon, maybe both). You want Kubernetes-native lifecycle management with CRDs, GitOps workflows, and your inference metrics in the same Prometheus/Grafana stack as everything else. You care about air-gapped deployments, GPU scheduling, and model versioning. You're serving a team or a set of internal workloads, not a public-facing API with thousands of concurrent users.&lt;/p&gt;

&lt;p&gt;If that sounds like your situation, LLMKube might be what you're looking for. If it doesn't, I genuinely hope one of the other tools solves your problem. We all benefit from this ecosystem getting better.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;LLMKube is open source (Apache 2.0) and actively developed. Some things I'm excited about on the roadmap:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Edge deployment support&lt;/strong&gt; for lightweight Kubernetes distributions like K3s and MicroK8s&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AMD GPU support (ROCm)&lt;/strong&gt; with a community contributor already testing on Framework hardware with a Ryzen AI Max+ 395&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;llmkube chat&lt;/code&gt;&lt;/strong&gt; for testing models directly from the CLI without needing curl&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I'll be honest about one thing that comes up a lot: multi-node distributed inference. llama.cpp has an RPC backend that can split a model across machines over ethernet, and I've been watching it closely. The reality is that over consumer networking (1GbE, 2.5GbE), the performance hit from network round-trips makes it marginal for interactive use. Jeff Geerling tested a four-node Framework cluster and got 0.7 tok/s on Llama 405B. The tech is improving, but today my advice is to scale vertically first. Get a bigger GPU or more unified memory before trying to split across machines. If the RPC backend matures to the point where it's genuinely usable over ethernet, LLMKube will support it, but I'm not going to promise something that isn't ready.&lt;/p&gt;

&lt;p&gt;If any of this is interesting to you, I'd love to hear from you. The project is at &lt;a href="https://github.com/defilantech/LLMKube" rel="noopener noreferrer"&gt;github.com/defilantech/LLMKube&lt;/a&gt;, and we have a &lt;a href="https://discord.gg/Ktz85RFHDv" rel="noopener noreferrer"&gt;Discord&lt;/a&gt; where I hang out and talk about this stuff regularly.&lt;/p&gt;

&lt;p&gt;If you hit issues, open a GitHub issue. If you want to contribute, check the issues labeled &lt;code&gt;good-first-issue&lt;/code&gt;. And if you just want to say hi, that's cool too.&lt;/p&gt;

&lt;p&gt;Thanks for reading. I hope this saves you some of the time I spent figuring all this out.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>ai</category>
      <category>opensource</category>
      <category>devops</category>
    </item>
  </channel>
</rss>
