Alan West
Ollama Just Got 93% Faster on Mac. Here's How to Enable It.

My M4 Max was decoding Qwen3.5 at 58 tokens per second yesterday. Today it's doing 112. Same model, same hardware, same prompt. The only thing that changed was a single environment variable.

Ollama 0.19 shipped on March 31, 2026 with a preview of an MLX backend. MLX is Apple's own machine learning framework, designed from the ground up for unified memory. If you run local models on a Mac, this is the biggest performance jump since Metal support landed.

Here's how to enable it and what to expect.

What MLX Changes Under the Hood

Ollama has historically used llama.cpp's Metal backend for Apple Silicon inference. Metal works, but it treats the GPU as a discrete accelerator -- data moves between CPU memory and GPU memory even though, on Apple Silicon, they're the same physical memory. That copy overhead is wasted time.

Apple's MLX framework understands unified memory natively. Tensors live in a single address space. The CPU and GPU operate on the same data without copying. For large language models where the bottleneck is memory bandwidth, eliminating those copies translates directly into faster token generation.

The Ollama team's published numbers tell the story clearly:

| Metric | llama.cpp (Metal) | MLX backend | Improvement |
| --- | --- | --- | --- |
| Prefill | 1,154 tok/s | 1,810 tok/s | ~1.6x |
| Decode | 58 tok/s | 112 tok/s | ~1.93x (93%) |

Prefill is how fast the model processes your input prompt. Decode is how fast it generates output tokens. Both matter, but decode speed is what you feel -- it's the difference between text that streams smoothly and text that stutters.

A 93% improvement in decode speed means the model generates tokens almost twice as fast. For a response that used to take 10 seconds, you're now looking at roughly 5.
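The arithmetic is easy to sanity-check. A quick sketch, where the 580-token response length is an illustrative assumption rather than a measured value:

```shell
# Time to generate a fixed number of tokens at each decode rate.
# 580 tokens is an illustrative response length, not a measurement.
tokens=580
for rate in 58 112; do
  secs=$(awk -v t="$tokens" -v r="$rate" 'BEGIN { printf "%.1f", t / r }')
  echo "$rate tok/s -> $secs s"
done
```

At 58 tok/s that response takes about 10 seconds; at 112 tok/s, a little over 5.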

Step 1: Update Ollama

First, check your current version.

```shell
ollama --version
```

If you're below 0.19, update. On macOS, the simplest path is to download the latest release from ollama.com, or if you installed via Homebrew:

```shell
brew upgrade ollama
```

Verify you're on 0.19 or later:

```shell
ollama --version
# ollama version is 0.19.0
```

Step 2: Benchmark Your Baseline

Before enabling MLX, capture your current performance so you have a real before/after comparison. Run a generation and note the stats Ollama reports.

```shell
ollama run qwen3.5 --verbose <<< "Write a Python function that implements binary search on a sorted array. Include error handling and type hints."
```

The `--verbose` flag prints timing information after the response completes, including prompt eval rate (prefill) and eval rate (decode). Write those numbers down.

On my M4 Max with 64GB, the baseline looked like this:

```
prompt eval rate:    1,147.3 tokens/s
eval rate:           57.8 tokens/s
total duration:      4.2s
```
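If you plan to compare several runs, it helps to extract those two rates programmatically. A minimal sketch with the sample report inlined; in practice you would capture the real `--verbose` output of `ollama run` into the variable instead:

```shell
# Pull the two rates out of a saved --verbose report.
# Sample text inlined; replace with captured output from a real run.
report='prompt eval rate:    1,147.3 tokens/s
eval rate:           57.8 tokens/s'
prefill=$(printf '%s\n' "$report" | awk -F': *' '/^prompt eval rate/ {print $2}')
decode=$(printf '%s\n' "$report" | awk -F': *' '/^eval rate/ {print $2}')
echo "prefill: $prefill"
echo "decode:  $decode"
```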

Step 3: Enable the MLX Backend

The MLX backend is a preview feature in 0.19. It's not on by default. You enable it with an environment variable before starting the Ollama server.

```shell
# Stop any running Ollama instance
osascript -e 'quit app "Ollama"'

# Set the environment variable and restart
OLLAMA_MLX=1 ollama serve
```

If you run Ollama as a background service, you'll need to set the variable in your launch configuration. For a quick test, just run it in a terminal window.
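One way to make the flag stick for the menu-bar app, sketched here as an assumption rather than an official setup path: `launchctl setenv` registers an environment variable at the launchd level, so GUI apps started afterwards inherit it. The snippet is guarded so it no-ops on non-macOS systems:

```shell
# Hypothetical persistent setup for the Ollama menu-bar app.
# launchd-level env vars apply to GUI apps launched afterwards.
if [ "$(uname)" = "Darwin" ]; then
  launchctl setenv OLLAMA_MLX 1
  osascript -e 'quit app "Ollama"'
  open -a Ollama
  msg="OLLAMA_MLX registered with launchd"
else
  msg="launchctl setenv is macOS-only; skipped"
fi
echo "$msg"
```

To undo it later, `launchctl unsetenv OLLAMA_MLX` and restart the app.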

In a second terminal, pull a supported model and run the same benchmark:

```shell
ollama run qwen3.5 --verbose <<< "Write a Python function that implements binary search on a sorted array. Include error handling and type hints."
```

My results after enabling MLX:

```
prompt eval rate:    1,803.6 tokens/s
eval rate:           111.4 tokens/s
total duration:      2.3s
```

Same prompt, same model, same machine. Total duration nearly halved. The decode rate jumped from 57.8 to 111.4 tokens per second -- a 93% improvement that matches Ollama's published numbers almost exactly.
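You can verify that percentage directly from the two measured rates:

```shell
# Derive the decode speedup from the before/after eval rates.
before=57.8
after=111.4
gain=$(awk -v b="$before" -v a="$after" 'BEGIN { printf "%.0f", (a / b - 1) * 100 }')
echo "decode improvement: ${gain}%"
```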

Hardware Requirements

The MLX backend preview has a hard requirement: 32GB or more of unified memory. If you have a base M1/M2/M3/M4 with 8GB or 16GB, the MLX backend will not activate. This isn't an arbitrary gate -- MLX's memory management strategy needs headroom to avoid thrashing, and the models that benefit most from this backend are the ones that actually utilize significant memory bandwidth.

Here's how it maps to current hardware:

| Chip | Unified memory | MLX backend? | Expected decode gain |
| --- | --- | --- | --- |
| M1/M2 (8GB) | 8GB | No | N/A |
| M1/M2 (16GB) | 16GB | No | N/A |
| M3 Pro (36GB) | 36GB | Yes | ~60-80% |
| M4 Pro (48GB) | 48GB | Yes | ~80-90% |
| M4 Max (64GB) | 64GB | Yes | ~90-95% |
| M5 Max (128GB) | 128GB | Yes | ~90-95%+ |

On the M5 family (M5, M5 Pro, M5 Max), the gains are even more pronounced because those chips have GPU Neural Accelerators that MLX can leverage for both time-to-first-token and sustained generation speed. If you're on an M5 Max, Ollama with MLX is the fastest local inference setup available on any consumer hardware.
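If you're unsure where your machine lands, `sysctl hw.memsize` reports total unified memory on macOS. A small check against the 32GB floor, with a stand-in value so the threshold logic still runs on other systems:

```shell
# Check whether this machine clears the 32 GB unified-memory floor.
# sysctl hw.memsize is macOS-only; a stand-in value is used elsewhere.
if [ "$(uname)" = "Darwin" ]; then
  bytes=$(sysctl -n hw.memsize)
else
  bytes=$((64 * 1024 * 1024 * 1024))  # stand-in: pretend 64 GB
fi
gib=$((bytes / 1073741824))
if [ "$gib" -ge 32 ]; then
  echo "${gib} GiB unified memory: MLX preview eligible"
else
  echo "${gib} GiB unified memory: below the 32 GiB floor"
fi
```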

Which Models Work

This is where the preview limitations show. As of 0.19, MLX backend support is confirmed for Qwen3.5 models. Not every model in the Ollama library works with the MLX backend yet. If you try to run an unsupported model with OLLAMA_MLX=1, Ollama will silently fall back to the Metal backend -- you won't get an error, you just won't get the speed improvement.
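Because the fallback is silent, the only reliable signal is the measured decode rate. Ollama's REST API makes that easy to script: the `/api/generate` response includes `eval_count` (output tokens) and `eval_duration` (nanoseconds), and their ratio is the decode rate. A sketch with a sample response inlined; in practice you would pipe in `curl -s localhost:11434/api/generate -d '{"model":"qwen3.5","prompt":"hi","stream":false}'`:

```shell
# Compute decode rate from an /api/generate response.
# eval_count = output tokens, eval_duration = nanoseconds.
# Sample response inlined; replace with a real curl call.
response='{"eval_count": 230, "eval_duration": 2065000000}'
rate=$(printf '%s' "$response" | tr -d '{},"' | awk '
  { for (i = 1; i <= NF; i++) {
      if ($i == "eval_count:") count = $(i + 1)
      if ($i == "eval_duration:") dur = $(i + 1)
    } }
  END { printf "%.1f", count / dur * 1e9 }')
echo "decode rate: ${rate} tok/s"
```

Run it once without `OLLAMA_MLX=1` and once with it; if the two rates match, the model fell back to Metal.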

Ollama 0.20, which is already in development, adds Gemma 4 support to the MLX backend. The team is working through the model architectures one at a time, which makes sense -- each architecture needs a dedicated MLX implementation, and shipping a broken one would be worse than shipping none.

For now, if you primarily use Qwen3.5 (which is a strong choice for coding tasks), you get the full benefit immediately. If you rely on Llama, Mistral, or other architectures, you'll need to wait for subsequent releases.

The Honest Assessment

The speed improvement is real. MLX's unified memory approach is genuinely better suited to Apple Silicon than Metal's discrete-GPU-emulation model. The 93% decode improvement I measured matches the official numbers.

But the preview limitations are real too. Model support is narrow. The 32GB memory requirement cuts out every base-spec MacBook Air and most base-spec MacBook Pros. There's no fallback notification when a model doesn't support MLX -- you have to check your speeds manually to know if it activated.

None of that changes the recommendation. If you have a Mac with 32GB+ and you use Qwen3.5, enable MLX right now. The speed difference fundamentally changes how interactive local model usage feels. For everything else, wait for 0.20 and the expanding model support.

Ollama 0.19 is the first release that treats Apple Silicon as a first-class platform rather than a Metal-compatible GPU. That matters more than any single benchmark number.
