DEV Community: Jeff Geiser

Q8_0 isn't slow because of swap

Jeff Geiser — Tue, 19 May 2026 13:33:47 +0000

A complete quantization benchmark for Llama 3.1 8B on Apple M4 16GB — speed and perplexity

I’ve been building an account intelligence model: a fine-tuned system that pulls from Salesforce, Confluence, Slack and some internal systems and gathers everything worth knowing about a customer account into structured JSON. It’s the kind of thing that normally takes a while to put together by hand, and it can’t be generated from within any one of those systems. Hoping to make this more efficient. It’s a very isolated use case, but an interesting one. I think these highly specialized local models will likely dominate enterprise architectures.. I think..

Part of that project is deciding which local models are viable in production, and specifically which quantization level makes sense for the 7B distilled model I’m eventually releasing. I decided to use a basic Mac Mini as the reference architecture.

I built an automated benchmark harness, ran Llama 3.1 8B through all 11 quantization levels on an Apple M4 Mac Mini with 16GB unified memory, and measured three things per quant: token generation speed, perplexity on Wikitext-2, and swap behavior.

I expected a smooth quality/speed tradeoff curve. I got a cliff, a plateau, and a finding I had to re-run twice to believe - thought it was an error..

Q8_0 — the highest-quality quantization I tested — produced 0.13 tokens per second.

Not because it was swapping. The re-run shows swap_any: false. No swap at all. The model fits cleanly in 16GB unified memory. It just doesn't run fast.
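For anyone who wants a similar check, here is a minimal sketch of how a swap_any flag could be derived. It is not the harness from this post. It assumes the samples are raw lines from `sysctl vm.swapusage` on macOS (the line format in the comment is an assumption based on typical output), and it assumes swap_any means "used swap rose above the pre-run baseline at any sample". The function names are hypothetical.

```javascript
// Parse the "used" figure out of a `sysctl vm.swapusage` line, in megabytes.
// Assumed format:
// "vm.swapusage: total = 2048.00M  used = 512.25M  free = 1535.75M  (encrypted)"
function parseSwapUsedMB(line) {
  const match = /used = ([\d.]+)([KMG])/.exec(line);
  if (!match) throw new Error("unrecognized vm.swapusage line: " + line);
  const scale = { K: 1 / 1024, M: 1, G: 1024 }[match[2]];
  return parseFloat(match[1]) * scale;
}

// Hypothetical definition of swap_any: true if any sample taken during the
// run shows more swap in use than the baseline captured before the run.
function swapAny(baselineLine, sampleLines, toleranceMB = 1) {
  const baseline = parseSwapUsedMB(baselineLine);
  return sampleLines.some((line) => parseSwapUsedMB(line) > baseline + toleranceMB);
}
```

Comparing against a baseline matters because a machine that was already holding some swap before the run would otherwise look like a swapping benchmark.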

The explanation: 8.5GB of 8-bit weights saturates the M4's unified memory bandwidth during inference. The GPU shows 80% utilization — not because it's computing, but because it's moving weights from memory to compute units. At 8-bit precision, the weight transfer bottleneck dominates everything else. There's no swap to blame. The constraint is the memory bus.

This is an architectural property of the hardware, not a configuration problem. You can't fix it by closing other apps or adding more swap. Q8_0 on a 16GB M4 is slow by design.

What swaps and what doesn’t

Five of eleven quants hit disk during the benchmark. The clean ones — no swap, usable speed:

IQ3_XS — 3.5 GB — 13.0 tok/s
Q3_K_M — 4.0 GB — 10.1 tok/s
Q4_K_M — 4.9 GB — 19.7 tok/s ← sweet spot
Q6_K — 6.6 GB — 16.3 tok/s ← quality ceiling

Everything above Q6_K either swaps or hits the bandwidth wall. The jump from Q6_K to Q8_0 (6.6 GB → 8.5 GB) doesn’t just add size — it crosses the threshold where the memory bus can’t keep up.

One weird thing worth calling out: Q5_K_S hit 6.96 tok/s per watt — the best efficiency number in the entire set. But it’s swapping. Low GPU power (2.4W average) plus active swap means the work is happening on CPU and memory bandwidth, not GPU. A misleading number if you only look at efficiency without checking swap.
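One way to keep a number like that from topping a chart is to carry the swap flag alongside the efficiency figure. A minimal sketch, assuming hypothetical field names (`tokPerSec`, `avgGpuWatts`, `swapAny`) that are not taken from the harness:

```javascript
// Summarize one benchmark run. Field names are hypothetical.
function efficiencyReport(run) {
  const tokPerWatt = Math.round((run.tokPerSec / run.avgGpuWatts) * 100) / 100;
  return {
    quant: run.quant,
    tokPerWatt,
    // GPU watts undercount the real cost when the run is paging to disk,
    // so mark the figure as not comparable instead of ranking it.
    comparable: !run.swapAny,
  };
}

// Rank only the runs whose efficiency figure can be trusted.
function rankByEfficiency(runs) {
  return runs
    .map(efficiencyReport)
    .filter((report) => report.comparable)
    .sort((a, b) => b.tokPerWatt - a.tokPerWatt);
}
```

For Q5_K_S, 6.96 tok/s per watt at 2.4W implies roughly 16.7 tok/s (a derived figure, not one reported above). Feeding that in with `swapAny: true` yields the same 6.96, but flagged as not comparable.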

Now add perplexity

Speed tells you how fast the model runs. Perplexity tells you how much quality you’ve traded away to get there. Lower perplexity = better quality. Q8_0 is the baseline.

Q2_K — PPL 11.15 — +29% worse than Q8_0
Q3_K_M — PPL 9.21 — +6.4% worse
Q4_K_M — PPL 8.80 — +1.7% worse
Q5_K_M — PPL 8.73 — +0.9% worse
Q6_K — PPL 8.68 — +0.4% worse
Q8_0 — PPL 8.66 — baseline

The quality cliff is at Q3, not Q4. Q4_K_M to Q8_0 is essentially flat — a 1.7% perplexity difference you would not notice on any real task. Q3_K_M jumps to +6.4%, which is detectable. Q2_K at +29% is a last resort.

The practical decision range is Q4_K_M to Q6_K. The two quants in that band that ran clean, Q4_K_M and Q6_K, deliver 98–100% of Q8_0 quality at usable speed with no swap. (Q5_K_S sits in the same band but swapped.)

Q4_K_M gives you 98.3% of Q8_0 quality at 152× the speed.
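The arithmetic behind those comparisons is short enough to sketch. The figures are the ones listed in this post; the function names are mine. One caveat: recomputing from the rounded perplexities gives +1.6%, +0.8% and +0.2% for Q4_K_M, Q5_K_M and Q6_K, slightly below the percentages listed above. My assumption is that the listed percentages were computed from unrounded values.

```javascript
// Perplexity and speed figures as listed in this post.
const PPL = { Q2_K: 11.15, Q3_K_M: 9.21, Q4_K_M: 8.80, Q5_K_M: 8.73, Q6_K: 8.68, Q8_0: 8.66 };
const TOK_PER_SEC = { Q4_K_M: 19.7, Q6_K: 16.3, Q8_0: 0.13 };

// Percent perplexity increase over the baseline quant, to one decimal place.
function pplIncreasePct(quant, baseline = "Q8_0") {
  return Math.round((PPL[quant] / PPL[baseline] - 1) * 1000) / 10;
}

// Speed of one quant relative to another, rounded to a whole multiple.
function speedup(quant, baseline = "Q8_0") {
  return Math.round(TOK_PER_SEC[quant] / TOK_PER_SEC[baseline]);
}

for (const quant of Object.keys(PPL)) {
  console.log(quant, "+" + pplIncreasePct(quant) + "%");
}
console.log("Q4_K_M speedup over Q8_0:", speedup("Q4_K_M") + "x");
```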

The account intelligence model I'm building runs on two very different pieces of hardware depending on what it's doing.

The fine-tuning and eval work runs on a DGX Spark — Qwen2.5-32B in FP8, 64GB VRAM. That's not the model I'm releasing.

The distilled 7B model — the one anyone can run locally — needs to work on a Mac Mini, a single-GPU workstation, a team server. For that audience, shipping Q8_0 would be a mistake. Someone tries it on 16GB, gets 0.13 tok/s, and concludes the model is broken. That's a distribution failure I can prevent by choosing the right quantization before release.

Target: Q4_K_M GGUF. Fits in 16GB with room to spare. 19+ tok/s. 1.7% quality loss. No surprises at deployment.

What’s next

This benchmark covers one model on one hardware configuration. The next round is Qwen3.6 — a 27B dense model — on the same M4 setup. The questions are different: at 27B, the “safe” quant range on 16GB is much narrower. And the Qwen3 family has a thinking-mode variable (enable_thinking=false reliability) that adds a dimension Llama doesn’t have.

I’ll also (maybe) run the same harness on the DGX Spark for a direct comparison: what does a Mac Mini get you vs enterprise inference hardware on the same model family? Would be nice but it’s a shared machine so I need to be careful with the workloads.. might just spin up a vps/bmc..

Next post: the synthetic data generator for the account intelligence model — what broke in the first smoke test (62 schema validation errors on run 1), how the retry loop works, and what we learned from building 8 gold examples by hand from real customer data.

Had claude pump out some graphs:

Local LLMs in Production: Squeezing Qwen to Match Claude

Jeff Geiser — Tue, 19 May 2026 13:29:38 +0000

Lessons from the DGX Spark: Speed, VRAM, and the "Thinking" Problem

We have a DGX Spark at the office that everyone fights over, and I was dying to play with it. We had a simple goal: build an internal automation agent that peers into Salesforce, Confluence, and our internal APIs to generate workflows, pricing quotes, etc. Keep sensitive data local and, frankly, kill the API costs as much as possible.

But as you know, “running it locally” is not straightforward.. many times I wanted to just throw the key in .env and be done with it. Here’s what we learned from the trenches of model selection, VRAM management, and prompt tuning.

The “Thinking” Tax: Why We Pivoted from Qwen 3.6

The first instinct was to grab the newest shiny object: Qwen3.6-27B. It’s a beast on paper, but we ran into an immediate “personality” issue. The model has a heavy “scratchpad” style—it wants to think out loud before it gives you the answer.

For our use case—generating clean JSON for an internal UI—this was a disaster. It burned tokens and time on analysis we didn’t ask for. We tried enable_thinking=false, but it wasn’t consistent. We moved to Qwen3-30B-A3B and hit the same wall.

So! If you just need a model to follow tool calls and return a schema, “thinking” models can actually be a hindrance. You don’t need a philosopher; you need a clerk.

The Sweet Spot: Qwen2.5-32B-Instruct-fp8

We eventually landed on Qwen2.5-32B-Instruct-fp8.

The FP8 quantization allowed it to sit comfortably in the Spark’s VRAM, even with our embedding model (BGE-M3) running alongside it.

In head-to-head evals against Claude 3.5 Sonnet, the latency difference was a little surprising.

The Benchmarks (22 Paired Evals)

| Metric | Qwen2.5-32B (Local) | Claude 3.5 Sonnet (Cloud) |
| --- | --- | --- |
| TTFT | 1–2s | 9–35s |
| Response | Concise | 2.3x longer |

Claude is impressive—it adds citations and caveats that Qwen just doesn’t match—but for routine synthesis, 35 seconds for a first token is a non-starter for a snappy UI.

Closing the Quality Gap: The “Schema-First” Strategy

Qwen was fast, but it was “hallucination-prone”—dropping schema fields and making up URLs. To fix this, we stopped treating it like a chatbot and started treating it like a compiler.

Our Optimization Stack:

Temperature 0.1: Kill the creativity.

Schema-First Prompting: We moved the JSON structure to the very top of the prompt. We tell it how to output before we tell it what to do.

Hard Constraints: We added rules like empty section = [] and a strict Never fabricate command.

Zero Persona: We stripped all “You are a helpful assistant” fluff. It just gets in the way of the logic.
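To make the ordering concrete, here is a rough sketch of what a schema-first prompt builder might look like. It is not our production prompt. The function name, section labels and the two extra rules ("Use only facts present in SOURCES", "Output JSON only") are illustrative assumptions; the `empty section = []` and `Never fabricate` rules are the ones described above.

```javascript
// Build a schema-first prompt: output contract first, hard rules second,
// task and source material last. No persona line anywhere.
function buildSchemaFirstPrompt(schema, task, sources) {
  return [
    "OUTPUT FORMAT (return exactly one JSON object matching this structure):",
    JSON.stringify(schema, null, 2),
    "",
    "RULES:",
    "- empty section = []",
    "- Never fabricate. Use only facts present in SOURCES.",
    "- Output JSON only, no commentary.",
    "",
    "TASK:",
    task,
    "",
    "SOURCES:",
    sources.join("\n---\n"),
  ].join("\n");
}

// Sampling settings kept next to the prompt so they travel together.
const GENERATION_OPTIONS = { temperature: 0.1 };
```

A call like `buildSchemaFirstPrompt({ account: "", risks: [] }, "Summarize the account.", docs)` puts the JSON shape in front of the model before it sees a single word of the task.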

The Hardware Squeeze

One thing to watch if you’re running on a Spark: VRAM is a zero-sum game, obviously. Adding BGE-M3 for semantic search and multilingual support was non-negotiable for our data, but it made the memory overhead incredibly tight.

What’s Next?

We’re going to run a full eval on these changes to see if the prompt tuning is enough. If not, the next step is building a middleware layer to catch malformed JSON and trigger second calls. I’m also looking at putting Llama 3 through its paces to see if the tool-calling is more robust.
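That middleware doesn't exist yet, so treat the following as a sketch of the idea rather than what we'll ship. It assumes a hypothetical async `callModel(prompt)` that returns the raw model text, and it only checks for required top-level keys rather than doing full schema validation.

```javascript
// Parse model output and check it against a list of required top-level keys.
function validateOutput(text, requiredKeys) {
  let value;
  try {
    value = JSON.parse(text);
  } catch (err) {
    return { ok: false, errors: ["invalid JSON: " + err.message] };
  }
  if (value === null || typeof value !== "object" || Array.isArray(value)) {
    return { ok: false, errors: ["top-level value is not an object"] };
  }
  const errors = requiredKeys
    .filter((key) => !(key in value))
    .map((key) => "missing field: " + key);
  return errors.length === 0 ? { ok: true, value, errors } : { ok: false, errors };
}

// callModel is a hypothetical async function: (prompt) => Promise<string>.
async function generateWithRetry(callModel, prompt, requiredKeys, maxAttempts = 2) {
  let lastErrors = [];
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    // On a retry, feed the validation errors back so the model can correct them.
    const fullPrompt = lastErrors.length === 0
      ? prompt
      : prompt + "\n\nPrevious output was rejected:\n- " + lastErrors.join("\n- ");
    const result = validateOutput(await callModel(fullPrompt), requiredKeys);
    if (result.ok) return { value: result.value, attempts: attempt };
    lastErrors = result.errors;
  }
  throw new Error("no valid output after " + maxAttempts + " attempts: " + lastErrors.join("; "));
}
```

Capping attempts at two matches the "second call" idea: one retry with the errors attached, then fail loudly instead of looping.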

The Bottom Line: We’re closing the gap. We’ll use the local Qwen for the 90% “routine” synthesis and save the Claude API calls for the truly hard reasoning tasks.

WES: Why Tokens Per Watt Isn't Enough for Edge Inference

Jeff Geiser — Wed, 11 Mar 2026 16:33:17 +0000

Edge inference is still nascent.

I work at Zenlayer helping companies deploy compute in hard-to-reach places. Spin up a VM with Ollama, pull a model, and you're running inference in minutes. The infrastructure is there. The tooling is maturing. But the metrics for understanding what's actually happening on those nodes are still catching up.

I've also been building Wicklee in my weekend time — a sovereign GPU fleet monitor written in Rust with an embedded React dashboard. Running a mixed fleet of Apple Silicon and AMD CPU nodes, I kept running into the same problem: the standard metrics weren't telling me quite enough..

Everyone in AI talks about efficiency at the data center level. Jensen talks tokens per watt. Google reports Gemini in watt-hours. Microsoft targets 8-20x energy reductions per query. Great work — but these are hyperscaler metrics, built for environments with precision cooling, facilities teams, and controlled everything.

That's not edge inference.

Here's a scenario that'll be familiar if you're running a distributed fleet:

tok/s drops slightly
board power creeps up slightly
thermal state moves from Normal to Fair

You do get a drop in tokens/watt. But now what? Is it a blip? Is it meaningful? What do you chase?

Honest answer: hard to tell. Tokens/watt can't distinguish a legitimate workload increase from thermal throttling. They look identical in the number. One means the node is doing its job. The other means it's quietly degrading and inserting some wait states to prevent reaching a critical state. They require completely different responses.

And in practice, a 15% drift on one node in a six-node fleet at 2am looks like noise. You don't chase it. The node keeps running. Keeps burning power. Keeps delivering worse inference. Until something obvious breaks.

The data was there. Nothing put it in front of you.

In other cases, tokens/watt can stand still even while tok/s is dropping, as long as power draw is dropping with it. Efficiency looks stable, but throughput is actually falling.
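A small sketch of that blind spot: 100 tok/s at 10 W and 80 tok/s at 8 W both come out to 10 tok/s per watt. The check below flags exactly that case. The function name and both thresholds are illustrative assumptions, not anything Wicklee ships.

```javascript
// Flag the case where efficiency (tok/s per watt) stays flat while
// throughput falls. Thresholds are illustrative, not tuned values.
function hiddenThroughputDrop(prev, curr, { effTolerance = 0.05, minDrop = 0.10 } = {}) {
  const prevEff = prev.tokPerSec / prev.watts;
  const currEff = curr.tokPerSec / curr.watts;
  const effChange = Math.abs(currEff - prevEff) / prevEff;
  const throughputDrop = (prev.tokPerSec - curr.tokPerSec) / prev.tokPerSec;
  return effChange <= effTolerance && throughputDrop >= minDrop;
}

// Same tok/W (10), but throughput is down 20%.
console.log(hiddenThroughputDrop({ tokPerSec: 100, watts: 10 }, { tokPerSec: 80, watts: 8 }));
```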

So I wanted to add thermal state to the equation.

WES — the Wicklee Efficiency Score:
WES = tok/s ÷ (Watts_adjusted × ThermalPenalty)

The ThermalPenalty comes directly from what the device reports — on Apple Silicon that's IOPMCopyCPUPowerStatus via IOKit, on NVIDIA it's the nvmlDeviceGetCurrentClocksThrottleReasons() bitmask. Not temperature guesses. Not externally imposed thresholds. The hardware's own classification of its thermal condition, amplified into the score.

| Thermal State | Penalty |
| --- | --- |
| Normal | 1.0 |
| Fair | 1.25 |
| Serious | 1.75 |
| Critical | 2.0+ |

When thermals are clean, penalty is 1.0 and WES equals tokens/watt. When throttling starts, the penalty amplifies the drop — turning a subtle drift you'd dismiss as noise into something that screams at you.

Higher WES = better. Miles per gallon for inference.

Why the leaderboard is the real insight

WES on a single node is useful. The Wicklee fleet leaderboard is where it gets interesting.

Stack rank every node by WES. A thermally degraded node doesn't just show a number that drifted. It falls in the ranking. Drops below nodes it was beating yesterday. That positional change is impossible to miss — you don't need to be actively monitoring anything, you just notice your #1 node is now #3.

That's the moment tok/watt never creates.

WES surfaces the signal. The thermal panel explains the cause. Route requests to the top of the leaderboard and you're automatically routing away from degraded nodes without lifting a finger.
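The leaderboard logic itself is small. Here is a minimal sketch, not the Wicklee implementation: the node shape (`id`, `penalizedWES`) and the function names are assumptions for illustration.

```javascript
// Sort nodes by penalized WES, best first, and attach a 1-based rank.
function rankByWES(nodes) {
  return [...nodes]
    .sort((a, b) => b.penalizedWES - a.penalizedWES)
    .map((node, i) => ({ ...node, rank: i + 1 }));
}

// Compare two leaderboards and report nodes whose position got worse.
function rankDrops(previous, current) {
  const before = new Map(previous.map((n) => [n.id, n.rank]));
  return current
    .filter((n) => before.has(n.id) && n.rank > before.get(n.id))
    .map((n) => ({ id: n.id, from: before.get(n.id), to: n.rank }));
}

// Route to the top of the leaderboard.
function pickNode(leaderboard) {
  return leaderboard.length > 0 ? leaderboard[0].id : null;
}
```

Alerting on `rankDrops` rather than on the raw score is the point: a node going from #1 to #3 is a discrete event, where a 15% drift is easy to wave off.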

Real numbers from my fleet
Running llama3.2:3b via Ollama across hardware:

tok/s makes this look like a 6x gap. WES shows a 1,293x efficiency difference. The Ryzen is fast. It is not efficient.

Now throw the M2 into thermal throttling — WES drops from 181.5 to 83.6. Still #1 on the leaderboard. But the drop is visible. The thermal panel tells you why. WES made you notice. Thermal data gave you the diagnosis. They work together.

Raw WES vs Penalized WES

Wicklee reports WES two ways:

Raw WES: ThermalPenalty forced to 1.0. Hardware ceiling under clean conditions. Essentially tok/watt.

Penalized WES: live thermal penalty applied. Operational reality.

The gap is your Thermal Cost, the efficiency being lost to throttling right now:

Thermal Cost % = (Raw WES − Penalized WES) / Raw WES × 100

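By the formula, the throttled M2 (Raw WES 104.6, Penalized WES 83.6) is losing 20% of its potential efficiency to the Fair penalty alone. Against its clean-state WES of 181.5, the total drop is 54%, since tok/s fell and power draw rose along with the thermal state. Thermal Cost % is the number that drives action: not a raw temperature reading, not a wattage blip.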

Here's the implementation if you want to compute it yourself:

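```javascript
const THERMAL_PENALTIES = { Normal: 1.0, Fair: 1.25, Serious: 1.75, Critical: 2.0 };
const round1 = (x) => Math.round(x * 10) / 10;

// Sketch of computeWESPair, mirroring the Rust version (PUE omitted, i.e. 1.0).
function computeWESPair(tps, watts, thermalState) {
  const raw = round1(tps / watts);
  const penalized = round1(tps / (watts * THERMAL_PENALTIES[thermalState]));
  return { raw, penalized, thermalCostPct: Math.round((1 - penalized / raw) * 100) };
}

// Clean node:
computeWESPair(108.9, 0.6, "Normal"); // → { raw: 181.5, penalized: 181.5, thermalCostPct: 0 }

// Throttled node:
computeWESPair(94.1, 0.9, "Fair"); // → { raw: 104.6, penalized: 83.6, thermalCostPct: 20 }
```

And in Rust for the monitoring agent side:

```rust
pub struct WESResult {
    pub raw: Option<f64>,
    pub penalized: Option<f64>,
    pub thermal_cost_pct: Option<f64>,
}

pub fn compute_wes_pair(
    tps: Option<f64>, watts: f64, thermal_penalty: f64, pue: f64
) -> WESResult {
    // WES for a given penalty, rounded to one decimal place.
    // Returns None when there is no tok/s reading or no usable power figure.
    let compute = |p: f64| tps.and_then(|t| {
        let w = watts * pue; // adjusted watts: node power scaled by PUE
        if w <= 0.0 { return None; }
        Some((t / (w * p) * 10.0).round() / 10.0)
    });
    let raw = compute(1.0);
    let penalized = compute(thermal_penalty);
    let thermal_cost_pct = raw.zip(penalized)
        .map(|(r, p)| ((1.0 - p / r) * 100.0).round());
    WESResult { raw, penalized, thermal_cost_pct }
}
```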

WES is derived — compute it at render time from fields you're already collecting. No telemetry layer changes required.

How WES relates to existing work
Stanford and Together AI published "Intelligence per Watt" (IPW) last year — accuracy divided by power, measured offline against benchmarks. Solid research. It answers "what is this hardware capable of per watt?"

WES answers "what is it delivering right now?"

Raw WES and IPW are the same question from different vantage points — IPW from a benchmark lab, WES from live fleet telemetry. IPW tells you the ceiling. WES tells you how close you're running to it, under real thermal conditions, continuously.

What's coming in Wicklee

The Fleet WES Leaderboard is shipping soon: every node ranked by Penalized WES, Raw WES as a secondary column, Thermal Cost % visible at a glance.

After that, a series of benchmark posts:

Cross-platform WES benchmarks — Apple Silicon vs AMD CPU vs NVIDIA GPU, same model, same prompt. Raw WES per platform.
Thermal stress testing — deliberately inducing throttling and watching Raw vs Penalized WES diverge in real time.
Sustained load degradation — how long before each platform throttles, how fast does WES collapse when it does.
Edge enclosure testing — WES in a fanless case vs open air. Spoiler: not pretty.

Goal: a reproducible WES dataset across hardware. Not just a formula — empirical data behind it.
If you're running a local inference fleet, try computing your WES from the formula above and drop your numbers in the comments. Curious where different hardware lands.
Miles per gallon for inference. When you need a race car, gas be damned — go for it. But at the edge, efficiency wins.

Distributed Inference Observability gaps

Jeff Geiser — Fri, 16 Jan 2026 19:10:12 +0000

It seems that distributed inference observability has some gaps.

In terms of framing this, I am referring to inference deployments at the edge (or so called near edge).. pops close to end users. Let's say you are using ollama for some early testing and/or scaling but are using vllm in production.

Traditional monitoring platforms will report on GPU/CPU load, memory usage, network status, etc, etc.

However, other stuff is also happening:

GPU throttled - 100% utilization but clock speed dropped 33%
KV cache saturated causing some queue backlog
Time to first token spiked 200% from CPU contention
Another tenant's PCIe traffic impacted inference

Maybe call it contextual drift: hardware stresses that degrade inference performance in ways that are generally invisible to system metrics.

Most of the monitoring in the market is built for servers and takes a peek at intervals that may not make sense for inference:

token generation: 20-100 per second
cache saturation: spikes in seconds
thermal throttling: happens instantly

Traditional monitoring might see all of this as smooth if it only glances at the server every 30 seconds. But you also can't just grab data every 2 seconds, or you might contribute to some CPU scheduling pressure yourself.

So, if you are going to run both ollama (for dev/test or smaller loads) and vLLM for production they have completely different failure modes but traditional monitoring would treat them the same.

We also have a blind spot with regard to time to first token (ttft) and time per output token (tpot). We might show request latency spiking, but we need to know whether ttft spiked or tpot spiked..
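For Ollama, a rough version of that split can be derived from the timing stats in the final response of `/api/generate`: `load_duration`, `prompt_eval_duration`, `eval_count` and `eval_duration`, with durations in nanoseconds. The sketch below treats load plus prompt eval as TTFT, which is a server-side approximation that ignores network and queueing. The function names and the 2x spike factor are my assumptions.

```javascript
const NS_PER_MS = 1e6;

// Split one Ollama response's timing stats into TTFT and TPOT, in ms.
// Server-side approximation: network time and queueing are not included.
function splitLatency(stats) {
  const ttftNs = (stats.load_duration ?? 0) + (stats.prompt_eval_duration ?? 0);
  const tpotNs = stats.eval_count > 0 ? stats.eval_duration / stats.eval_count : null;
  return {
    ttftMs: ttftNs / NS_PER_MS,
    tpotMs: tpotNs === null ? null : tpotNs / NS_PER_MS,
  };
}

// Say which half of the latency moved, relative to a baseline sample.
function classifySpike(baseline, current, factor = 2) {
  const ttft = current.ttftMs > baseline.ttftMs * factor;
  const tpot = current.tpotMs !== null && baseline.tpotMs !== null
    && current.tpotMs > baseline.tpotMs * factor;
  if (ttft && tpot) return "both";
  if (ttft) return "ttft";
  if (tpot) return "tpot";
  return "none";
}
```

A "ttft" result points at prompt processing, model load or CPU contention, while "tpot" points at the decode loop, which is where throttling and memory pressure tend to show up.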

so, I am thinking about an open source project that would be a lightweight observability agent.. large companies will likely solve this by building a giant observability layer on top of their distributed inference solution -- but I think having a more bottoms up approach that can be deployed might make sense..

the observability agent would strive to:

have limited cpu impact/overhead
2 second sampling with some intelligent backoff
built in ttft/tpot splitting
contextual drift detection
works with vLLM/Prometheus and Ollama API stats
embedded DB storage (duckDB?) - no external dependencies
runs at edge.. maybe federates..
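To make "2 second sampling with some intelligent backoff" a bit more concrete, here is one possible shape for it. This is a sketch of the idea, nothing built yet: sample at the 2 second floor while metrics are moving, double the interval while they are quiet, and cap it at 30 seconds. The function names, the metric names and the 10% change threshold are all assumptions.

```javascript
// Pick the next sampling interval. Starts at the 2 s floor, backs off
// geometrically while readings are quiet, snaps back on any change.
function nextIntervalMs(currentMs, changed, { minMs = 2000, maxMs = 30000, factor = 2 } = {}) {
  if (changed) return minMs;
  return Math.min(maxMs, Math.max(minMs, currentMs * factor));
}

// A reading counts as "changed" if any metric moved by more than `threshold`
// (relative) since the previous sample. Metric names are hypothetical.
function hasChanged(prev, curr, threshold = 0.1) {
  return Object.keys(curr).some((key) => {
    const before = prev[key];
    if (before === undefined || before === 0) return curr[key] !== before;
    return Math.abs(curr[key] - before) / Math.abs(before) > threshold;
  });
}
```

With this, a GPU clock falling from 1500 to 1000 MHz counts as a change and pulls the agent back to 2 second sampling, while a quiet node costs one sample every 30 seconds.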

Curious to get feedback on where people are hitting observability gaps.. this is a new area for me to spend time on so curious about all feedback.

What are you doing to monitor vLLM and/or other inference engines?

What metrics do you wish you had?

Drop the war stories here.. thanks..

(apologies for lack of formatting.. maybe I will get better over time..)