Ollama v0.30 on Apple Silicon: What the Stable MLX Release Actually Changed From the Preview

#ollama #applesilicon #mlx #localai

This article was originally published on runaihome.com

TL;DR: Ollama v0.30 (May 13, 2026) promoted the MLX engine from a spring preview to the default Apple Silicon inference path, and the point releases through June added the parts that actually matter day-to-day: Gemma 4 QAT weights, Gemma 4 MTP speculative decoding (>2× on Macs), and better KV-cache reuse so repeated prompts skip re-prefill. If you're on the May preview build, upgrading to v0.30.10 is a free speed bump on an idle afternoon.

What you'll be able to do after this guide:

Upgrade to Ollama v0.30.10 and confirm the MLX engine is actually running, not the old llama.cpp Metal fallback
Pull the Gemma 4 QAT tags that fit your Mac's memory (1GB to 18GB) and run them at near-original quality
Verify KV-cache reuse and MTP speculative decoding are active using ollama ps and real timing

Honest take: This isn't a new engine — it's the MLX preview we covered on June 2 finally stabilized and fed. The headline tok/s numbers haven't moved much since spring, but Gemma 4 QAT plus speculative decoding plus cache reuse together make a 32GB Mac feel meaningfully snappier on real multi-turn work. Upgrade, pull a -it-qat tag, and move on.

What the v0.30 line actually shipped

The MLX engine arrived in preview this spring as a swap of Ollama's Mac backend from llama.cpp's Metal path to Apple's MLX framework, which treats unified memory as the architectural primitive instead of an edge case. That preview nearly doubled decode speed — from ~58 to ~112 tok/s on an M4 Max running Qwen3.5-35B-A3B at int4 — but it was narrow: a handful of models, a hard 32GB-memory floor, and a "preview" label.

Ollama v0.30.0, released May 13, 2026, changed the framing. The release notes describe it as "improved compatibility and performance using llama.cpp" that augments the MLX engine on Apple Silicon, bringing support to a wider range of hardware. In plain terms: MLX is now the default fast path on capable Macs, and the llama.cpp side got broader GGUF support (Hugging Face models and your own fine-tunes) plus faster NVIDIA performance for everyone else.

The interesting work happened in the point releases:

Version	Date	What it added
v0.30.0	May 13, 2026	MLX default on Apple Silicon; broader GGUF + Hugging Face model support; faster NVIDIA
v0.30.5	early June 2026	Fixed `gemma4:12b` floating-point exception crash; Gemma 4 MTP speculative decoding on Macs (>2× speedup)
v0.30.8	June 12, 2026	Improved prompt caching for better KV-cache reuse
v0.30.9	mid June 2026	Cohere2Moe architecture support
v0.30.10	June 17, 2026	Command A and North family models on Apple Silicon MLX; llama.cpp updated to build 9672

If you installed the spring preview and never touched it, you're missing all four of those. None is a marketing bullet — they're the difference between "MLX is fast in a benchmark" and "MLX is fast on the thing I actually do."

Upgrade and verify it's really MLX

Upgrading is the easy part. On macOS, re-run the installer or use Homebrew:

$ brew upgrade ollama
$ ollama --version
ollama version is 0.30.10

The part people skip — and then wonder why nothing got faster — is confirming the MLX engine is the one doing the work. The MLX path activates on Macs with 32GB or more of unified memory. Below that, Ollama silently falls back to llama.cpp Metal with no error and no speed change. That silent fallback is the single most common "I upgraded and saw nothing" complaint, and it's not a bug — it's the documented memory floor.

To check which engine is live, load a model and read ollama ps:

$ ollama run gemma4:26b-it-qat ""
$ ollama ps
NAME                 ID              SIZE     PROCESSOR    UNTIL
gemma4:26b-it-qat    a1b2c3d4e5f6    16 GB    100% GPU     4 minutes from now

100% GPU means the model is fully on the GPU via the unified-memory path. If you see any CPU percentage on a model that should fit, you're either below the memory floor or the model spilled — close other apps and reload. The SIZE column also sanity-checks your quant: a 26B QAT model should report ~16GB, not ~30GB.

Gemma 4 QAT: the upgrade that changes which Mac is enough

The most useful thing v0.30 unlocked isn't raw speed — it's Google's Gemma 4 quantization-aware training (QAT) checkpoints, released June 5, 2026, now available as first-party Ollama tags. QAT simulates quantization during training instead of bolting it on afterward, which cuts memory roughly 72% versus BF16 while keeping near-original quality. We covered the full QAT memory map in the Gemma 4 QAT hardware update; here's the short version of what to pull:

$ ollama pull gemma4:e4b-it-qat    # ~5 GB  — fits a 16GB MacBook Air
$ ollama pull gemma4:12b-it-qat    # ~7 GB  — fits 16GB comfortably
$ ollama pull gemma4:26b-it-qat    # ~15 GB — fits a 16GB Mac/GPU, barely
$ ollama pull gemma4:31b-it-qat    # ~18 GB — needs 24GB+

Gemma 4 QAT tag	Memory	What it fits
`gemma4:e2b-it-qat`	~1 GB	A phone, or any Mac
`gemma4:e4b-it-qat`	~5 GB	8–16GB MacBook Air
`gemma4:12b-it-qat`	~7 GB	16GB Mac / 8GB+ GPU
`gemma4:26b-it-qat`	~15 GB	16GB Mac/GPU (tight)
`gemma4:31b-it-qat`	~18 GB	24GB Mac/GPU

The reason this matters: the 26B-A4B model now fits in ~15GB, which means a 16GB Mac that previously couldn't touch a 26B-class model runs one at near-full quality. Critical caveat carried over from the QAT release: don't hand-convert the Hugging Face QAT BF16 weights to Q4_0 yourself — the F16-vs-BF16 scale mismatch reintroduces the exact accuracy loss QAT was meant to avoid. Use the official Ollama -it-qat tags above, which are already converted correctly.

Speculative decoding and cache reuse: where v0.30 feels faster

Two changes in the point releases don't show up as a bigger headline tok/s number but change the lived experience.

Gemma 4 MTP speculative decoding (v0.30.5) uses multi-token-prediction draft heads to propose several tokens at once and verify them in a single pass — lossless output, but Ollama reports over a 2× speedup on Macs for Gemma 4. This is the same family of technique we broke down in why local LLMs got good in 2026: it doesn't raise the memory-bandwidth ceiling, it just wastes fewer trips to it.

KV-cache reuse (v0.30.8) is the quieter win. Before, sending a follow-up message in a long chat re-processed the entire prompt history (the prefill step) every turn. With improved prompt caching, an unchanged prefix is reused, so on a multi-turn conversation the second and later turns skip straight to generation. The bigger your system prompt and the longer your chat, the more time-to-first-token you save — on a long coding session with a 4K-token system prompt, that's the difference between a visible pause and an instant reply on every turn.

You won't see a flag for this. The way to confirm it's helping is crude but honest: time two identical follow-up prompts in the same session. The second should start streaming noticeably sooner because the shared prefix is already cached.

Real numbers, and the ceiling that didn't move

Here's what to actually expect, because "2× faster" is only true in specific places:

Mac / model	Backend	Decode	Notes
M4 Max, Qwen3.5-35B-A3B int4	MLX	~112 tok/s	vs ~58 tok/s on the old Metal path (~93% gain)
M4 Max, optimized 7B	MLX	~230 tok/s	small models show MLX's biggest lead
M3 Ultra, Gemma 4 27B Q4_K_M	MLX	~30–42 tok/s	prefill ~700–900 tok/s
M3 Ultra, Qwen3.6 30B-A3B	MLX	>80 tok/s	MoE sparsity (3B active) is why it's 2× the dense 27B