So Ollama dropped version 0.19 yesterday and I genuinely think most people are sleeping on how big this is. They rebuilt the entire Mac backend on top of Apple's MLX framework and the speed numbers are kind of absurd. We're talking 1,851 tokens per second on prefill and 134 tokens per second on decode. If those numbers don't mean anything to you, let me put it this way: that's roughly twice as fast as the previous version. On the same hardware. Same model. Just better software underneath.
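Quick back-of-envelope on what those rates feel like in practice. The tok/s figures come from the release; the prompt and reply sizes below are made-up examples I picked for illustration:

```python
# Back-of-envelope latency from the headline throughput numbers.
# The tok/s figures are from the release; the 8,000-token prompt
# and 500-token reply are hypothetical sizes for illustration.
prefill_tps = 1851   # tokens/s while ingesting the prompt
decode_tps = 134     # tokens/s while generating the reply

prompt_tokens = 8000
reply_tokens = 500

prefill_s = prompt_tokens / prefill_tps   # roughly the time to first token
decode_s = reply_tokens / decode_tps
print(f"prefill ~{prefill_s:.1f}s, decode ~{decode_s:.1f}s")
```

At half the prefill rate (roughly the old backend), that same hypothetical prompt would sit closer to 8.6 seconds before the first token appears, which is exactly the "why is it taking so long to start responding" feeling.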
I've been running local models on my MacBook for months now and the experience has always been this weird mix of "wow this actually works" and "ok why is it taking 15 seconds to start responding." That second part just got obliterated. The time-to-first-token improvement alone changes how it feels to use coding agents locally. When you're running something like Claude Code or OpenCode through Ollama and it responds in under a second instead of making you wait, that's not just a benchmark win, that's a workflow win. The kind of thing that makes you stop reaching for the API and start trusting your local setup.
Here's the deal with what they actually did. Apple has a machine learning framework called MLX that was built specifically for their unified memory architecture. If you're on Apple silicon, your CPU and GPU share the same memory pool, which means you don't have the overhead of copying data back and forth like you do on traditional setups. Ollama previously used llama.cpp under the hood, which is great and battle-tested, but it wasn't taking full advantage of what Apple's chips can actually do. MLX does. And now Ollama sits on top of it.
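If the unified memory part is abstract, here's a toy Python analogy (my own illustration, nothing to do with actual GPU code): a discrete GPU works off a copy of the data, unified memory is more like a zero-copy view of the same buffer.

```python
# Toy analogy for discrete vs unified memory (illustration only,
# not real GPU code): a copy goes stale, a shared view does not.
host = bytearray(b"weights")

device_copy = bytes(host)        # "discrete GPU": explicit transfer, a second copy
shared_view = memoryview(host)   # "unified memory": zero-copy view of the same bytes

host[0:1] = b"W"                 # the host updates the buffer

assert shared_view[0:1].tobytes() == b"W"   # the view sees the update immediately
assert device_copy[0:1] == b"w"             # the copy is stale until re-transferred
```

That re-transfer step is the overhead a unified-memory backend gets to skip.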
The M5 chips get an extra bonus too, because Ollama can now tap into the GPU Neural Accelerators that Apple added. So if you're on an M5 Pro or M5 Max, the performance gap compared to older silicon is even wider. But even on M4 hardware the improvement is real; people in the Hacker News thread are reporting noticeably faster responses on their existing machines after updating.
There's another thing they shipped that nobody seems to be talking about: NVFP4 support. This is NVIDIA's 4-bit floating point format for quantization, and it's kind of a big deal for a subtle reason. Most cloud inference providers are starting to use NVFP4 because it gives you better accuracy than integer quantization at similar memory savings. So when Ollama supports it locally, you're getting results that match what you'd get from a production API endpoint. Same quantization format means same model behavior. That matters a lot if you're developing locally and deploying to the cloud, because now your local testing environment actually matches production instead of being some approximation.
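To give a feel for what block-scaled 4-bit float quantization does, here's a deliberately simplified sketch. Real NVFP4 packs FP4 E2M1 values with an FP8 scale per 16-element block (plus a tensor-level scale); this toy just snaps values to the E2M1 magnitude grid with one scale per block, so treat it as a cartoon of the idea, not the actual format:

```python
# Toy sketch in the spirit of NVFP4 block quantization.
# Simplified on purpose: real NVFP4 stores FP4 E2M1 values with an
# FP8 scale per 16-element block; this uses one float scale per block.
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # representable FP4 magnitudes

def quantize_block(block):
    """Scale the block so its max magnitude maps to 6.0, snap to the grid."""
    amax = max(abs(x) for x in block) or 1.0
    scale = amax / 6.0
    q = []
    for x in block:
        mag = min(E2M1_GRID, key=lambda g: abs(g - abs(x) / scale))
        q.append(mag if x >= 0 else -mag)
    return q, scale

def dequantize_block(q, scale):
    return [v * scale for v in q]

weights = [0.03, -0.12, 0.47, 0.9, -0.31, 0.05, 0.0, -0.76]
q, s = quantize_block(weights)
restored = dequantize_block(q, s)
err = max(abs(a - b) for a, b in zip(weights, restored))
print(q, s, err)
```

The point of the floating-point grid is that representable values cluster near zero, where most weights live, which is where integer quantization at the same bit width wastes precision.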
The caching improvements are honestly what I'm most excited about though. If you're using Ollama with coding agents you know the pain of repeated system prompts eating into your context and slowing things down. The new version reuses cache across conversations and stores snapshots at smart points in the prompt, so when you branch off or start a new conversation with the same tools, it doesn't have to reprocess everything from scratch. For agentic workflows where you might have 15 tool calls in a single session, this adds up fast.
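Here's a toy sketch of the idea (my own illustration; a real engine snapshots KV-cache state inside the backend, this just memoizes token prefixes and counts how much work gets skipped):

```python
# Toy prefix cache: illustration of why repeated system prompts get
# cheap, NOT how Ollama implements it. A real engine snapshots
# KV-cache state; this just memoizes token prefixes.
import hashlib

class PrefixCache:
    def __init__(self):
        self.snapshots = {}   # prefix hash -> "processed" state
        self.hits = 0

    def _key(self, tokens):
        return hashlib.sha256(" ".join(tokens).encode()).hexdigest()

    def process(self, tokens):
        """Reuse the longest cached prefix, then process only the tail."""
        cut, state = 0, []
        for n in range(len(tokens), 0, -1):
            snap = self.snapshots.get(self._key(tokens[:n]))
            if snap is not None:
                cut, state = n, list(snap)
                self.hits += 1
                break
        state.extend(tokens[cut:])            # "process" the remaining tail
        # snapshot every prefix boundary so future prompts can branch off
        for n in range(1, len(tokens) + 1):
            self.snapshots[self._key(tokens[:n])] = tokens[:n]
        return len(tokens) - cut              # tokens actually recomputed

cache = PrefixCache()
system = ["you", "are", "a", "coding", "agent"]
first = cache.process(system + ["fix", "the", "bug"])
second = cache.process(system + ["add", "a", "test"])   # shares the system prefix
print(first, second, cache.hits)
```

The second conversation only pays for its three new tokens, not the shared system prompt, and with a multi-kilotoken system prompt and a dozen tool calls per session that discount is most of the prompt.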
Ok so the honest downside: you need 32GB of unified memory minimum to run this well. That's the recommended spec from Ollama themselves. If you bought the base model MacBook Air with 16GB or even 24GB, you're probably not going to have a great time running the bigger models they're showcasing, like Qwen 3.5 35B. This is one of those things where Apple's upselling on RAM at purchase time actually matters for real workloads, not just for having 47 Chrome tabs open.
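The 32GB figure roughly checks out if you do the napkin math yourself (these estimates are mine, not Ollama's):

```python
# Rough memory estimate for a 35B-parameter model at 4-bit
# quantization. My own back-of-envelope, not official numbers.
params = 35e9
bits_per_weight = 4.5          # ~4-bit weights plus per-block scale overhead
weights_gb = params * bits_per_weight / 8 / 1e9

kv_and_overhead_gb = 6         # KV cache, activations, runtime (rough guess)
total_gb = weights_gb + kv_and_overhead_gb
print(f"~{weights_gb:.0f} GB weights, ~{total_gb:.0f} GB total")
```

Around 20GB of weights before you even touch the KV cache, on a machine that also needs memory for the OS and your editor. That's why 24GB is tight and 32GB is the floor.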
The Hacker News discussion was interesting because there's a legitimate debate happening. Some people pointed out that Ollama's Go wrapper historically added something like 20-30% overhead compared to running llama.cpp directly. With the MLX switch that comparison changes completely, because the bottleneck was never really the wrapper, it was the inference backend not using the hardware properly. A few users benchmarked it and the raw MLX performance through Ollama is genuinely close to running MLX directly. That's impressive engineering.
What I think this actually means for the broader AI dev space is something that's been building for a while now. Local inference is becoming legitimate. Not just "oh cool I can chat with a model offline" legitimate, but "I can run my entire coding agent stack without an API key" legitimate. When you combine fast local inference with the fact that open source models like Qwen 3.5 and Llama are getting scary good, the calculus on whether you need a cloud API subscription starts shifting real fast. I run AI agents all day and honestly the only reason I still hit cloud APIs for some tasks is latency and context length. The latency gap just got a lot smaller.
For anyone who wants to try it, the setup is dead simple. Download Ollama 0.19 from their site, run ollama run qwen3.5:35b-a3b-coding-nvfp4, and you're off. They specifically tuned the sampling parameters for coding tasks on this model, which is a nice touch. If you're already using Ollama, just update and everything switches to the MLX backend automatically on Mac. No config changes needed.
I think we're going to look back at 2026 as the year local AI stopped being a hobby project and started being a real alternative to cloud inference for actual work. Between Apple making unified memory mainstream, NVIDIA pushing better quantization formats, and projects like Ollama making it all accessible through a single command, the pieces are falling into place faster than most people realize. Your MacBook is becoming an AI workstation, and that's not hype, that's just what the benchmarks show.