This article was originally published on runaihome.com
TL;DR: Ollama v0.30 (May 13, 2026) promoted the MLX engine from a spring preview to the default Apple Silicon inference path, and the point releases through June added the parts that actually matter day-to-day: Gemma 4 QAT weights, Gemma 4 MTP speculative decoding (>2× on Macs), and better KV-cache reuse so repeated prompts skip re-prefill. If you're on the May preview build, upgrading to v0.30.10 is a free speed bump on an idle afternoon.
What you'll be able to do after this guide:
- Upgrade to Ollama v0.30.10 and confirm the MLX engine is actually running, not the old llama.cpp Metal fallback
- Pull the Gemma 4 QAT tags that fit your Mac's memory (1GB to 18GB) and run them at near-original quality
- Verify KV-cache reuse and MTP speculative decoding are active using
ollama psand real timing
Honest take: This isn't a new engine — it's the MLX preview we covered on June 2 finally stabilized and fed. The headline tok/s numbers haven't moved much since spring, but Gemma 4 QAT plus speculative decoding plus cache reuse together make a 32GB Mac feel meaningfully snappier on real multi-turn work. Upgrade, pull a
-it-qattag, and move on.
What the v0.30 line actually shipped
The MLX engine arrived in preview this spring as a swap of Ollama's Mac backend from llama.cpp's Metal path to Apple's MLX framework, which treats unified memory as the architectural primitive instead of an edge case. That preview nearly doubled decode speed — from ~58 to ~112 tok/s on an M4 Max running Qwen3.5-35B-A3B at int4 — but it was narrow: a handful of models, a hard 32GB-memory floor, and a "preview" label.
Ollama v0.30.0, released May 13, 2026, changed the framing. The release notes describe it as "improved compatibility and performance using llama.cpp" that augments the MLX engine on Apple Silicon, bringing support to a wider range of hardware. In plain terms: MLX is now the default fast path on capable Macs, and the llama.cpp side got broader GGUF support (Hugging Face models and your own fine-tunes) plus faster NVIDIA performance for everyone else.
The interesting work happened in the point releases:
| Version | Date | What it added |
|---|---|---|
| v0.30.0 | May 13, 2026 | MLX default on Apple Silicon; broader GGUF + Hugging Face model support; faster NVIDIA |
| v0.30.5 | early June 2026 | Fixed gemma4:12b floating-point exception crash; Gemma 4 MTP speculative decoding on Macs (>2× speedup)
|
| v0.30.8 | June 12, 2026 | Improved prompt caching for better KV-cache reuse |
| v0.30.9 | mid June 2026 | Cohere2Moe architecture support |
| v0.30.10 | June 17, 2026 | Command A and North family models on Apple Silicon MLX; llama.cpp updated to build 9672 |
If you installed the spring preview and never touched it, you're missing all four of those. None is a marketing bullet — they're the difference between "MLX is fast in a benchmark" and "MLX is fast on the thing I actually do."
Upgrade and verify it's really MLX
Upgrading is the easy part. On macOS, re-run the installer or use Homebrew:
$ brew upgrade ollama
$ ollama --version
ollama version is 0.30.10
The part people skip — and then wonder why nothing got faster — is confirming the MLX engine is the one doing the work. The MLX path activates on Macs with 32GB or more of unified memory. Below that, Ollama silently falls back to llama.cpp Metal with no error and no speed change. That silent fallback is the single most common "I upgraded and saw nothing" complaint, and it's not a bug — it's the documented memory floor.
To check which engine is live, load a model and read ollama ps:
$ ollama run gemma4:26b-it-qat ""
$ ollama ps
NAME ID SIZE PROCESSOR UNTIL
gemma4:26b-it-qat a1b2c3d4e5f6 16 GB 100% GPU 4 minutes from now
100% GPU means the model is fully on the GPU via the unified-memory path. If you see any CPU percentage on a model that should fit, you're either below the memory floor or the model spilled — close other apps and reload. The SIZE column also sanity-checks your quant: a 26B QAT model should report ~16GB, not ~30GB.
Gemma 4 QAT: the upgrade that changes which Mac is enough
The most useful thing v0.30 unlocked isn't raw speed — it's Google's Gemma 4 quantization-aware training (QAT) checkpoints, released June 5, 2026, now available as first-party Ollama tags. QAT simulates quantization during training instead of bolting it on afterward, which cuts memory roughly 72% versus BF16 while keeping near-original quality. We covered the full QAT memory map in the Gemma 4 QAT hardware update; here's the short version of what to pull:
$ ollama pull gemma4:e4b-it-qat # ~5 GB — fits a 16GB MacBook Air
$ ollama pull gemma4:12b-it-qat # ~7 GB — fits 16GB comfortably
$ ollama pull gemma4:26b-it-qat # ~15 GB — fits a 16GB Mac/GPU, barely
$ ollama pull gemma4:31b-it-qat # ~18 GB — needs 24GB+
| Gemma 4 QAT tag | Memory | What it fits |
|---|---|---|
gemma4:e2b-it-qat |
~1 GB | A phone, or any Mac |
gemma4:e4b-it-qat |
~5 GB | 8–16GB MacBook Air |
gemma4:12b-it-qat |
~7 GB | 16GB Mac / 8GB+ GPU |
gemma4:26b-it-qat |
~15 GB | 16GB Mac/GPU (tight) |
gemma4:31b-it-qat |
~18 GB | 24GB Mac/GPU |
The reason this matters: the 26B-A4B model now fits in ~15GB, which means a 16GB Mac that previously couldn't touch a 26B-class model runs one at near-full quality. Critical caveat carried over from the QAT release: don't hand-convert the Hugging Face QAT BF16 weights to Q4_0 yourself — the F16-vs-BF16 scale mismatch reintroduces the exact accuracy loss QAT was meant to avoid. Use the official Ollama -it-qat tags above, which are already converted correctly.
Speculative decoding and cache reuse: where v0.30 feels faster
Two changes in the point releases don't show up as a bigger headline tok/s number but change the lived experience.
Gemma 4 MTP speculative decoding (v0.30.5) uses multi-token-prediction draft heads to propose several tokens at once and verify them in a single pass — lossless output, but Ollama reports over a 2× speedup on Macs for Gemma 4. This is the same family of technique we broke down in why local LLMs got good in 2026: it doesn't raise the memory-bandwidth ceiling, it just wastes fewer trips to it.
KV-cache reuse (v0.30.8) is the quieter win. Before, sending a follow-up message in a long chat re-processed the entire prompt history (the prefill step) every turn. With improved prompt caching, an unchanged prefix is reused, so on a multi-turn conversation the second and later turns skip straight to generation. The bigger your system prompt and the longer your chat, the more time-to-first-token you save — on a long coding session with a 4K-token system prompt, that's the difference between a visible pause and an instant reply on every turn.
You won't see a flag for this. The way to confirm it's helping is crude but honest: time two identical follow-up prompts in the same session. The second should start streaming noticeably sooner because the shared prefix is already cached.
Real numbers, and the ceiling that didn't move
Here's what to actually expect, because "2× faster" is only true in specific places:
| Mac / model | Backend | Decode | Notes |
|---|---|---|---|
| M4 Max, Qwen3.5-35B-A3B int4 | MLX | ~112 tok/s | vs ~58 tok/s on the old Metal path (~93% gain) |
| M4 Max, optimized 7B | MLX | ~230 tok/s | small models show MLX's biggest lead |
| M3 Ultra, Gemma 4 27B Q4_K_M | MLX | ~30–42 tok/s | prefill ~700–900 tok/s |
| M3 Ultra, Qwen3.6 30B-A3B | MLX | >80 tok/s | MoE sparsity (3B active) is why it's 2× the dense 27B |
The pattern worth internalizing: MLX leads llama.cpp by **roughly 10–25% on most models, and up to 21–87% o
Top comments (0)