DEV Community

SleepyQuant
SleepyQuant

Posted on • Originally published at sleepyquant.rest

MLX vs llama.cpp on M1 Max with 35B Q8 — The Honest Benchmark

MLX vs llama.cpp on M1 Max with 35B Q8 — The Honest Benchmark

I tested both. Same machine (M1 Max 64 GB), same model (Qwen 3.6 35B-A3B Q8), same prompts, same generation lengths. llama.cpp came out about 30% faster on raw decode throughput. I stayed on MLX anyway.

This is the breakdown of what each gets right, where the speed gap actually shows up, and why I stayed on the slower one. Hopefully useful if you're picking your local inference stack from scratch.

Setup

  • Hardware: MacBook Pro M1 Max, 64 GB unified memory, 1 TB SSD, on macOS Sequoia
  • Model: mlx-community/Qwen3.6-35B-A3B-8bit for MLX, equivalent GGUF Q8_0 for llama.cpp (Qwen3.6-35B-A3B-Q8_0.gguf)
  • Prompts: 5 prompts, mix of short Q&A (50 tokens output) and longer content generation (500 tokens output), 3 runs each, warm-cache results
  • MLX setup: MLX_FORCE_FP16=1, wired_limit 45 GB, cache_limit 512 MB
  • llama.cpp setup: Metal backend enabled, --n-gpu-layers -1 (all on GPU), threads 8, context 8192

I'll caveat upfront: this is one machine, one model, one workload pattern. Your numbers will be in the same shape but different magnitudes. If you replicate and your numbers diverge significantly, I'd genuinely like to know.

Raw throughput

Metric MLX (fp16) llama.cpp (Metal)
Decode tok/s (steady state, 500-token gen) 26.22 34
Prefill tok/s (1k token prompt) ~190 ~245
Cold-start latency (first token) 1.8s 1.2s
Memory resident ~35 GB ~37 GB
Memory peak (under load) ~42 GB ~44 GB

llama.cpp wins all three speed measures. About 30% faster on decode, similar margin on prefill. Cold start is also faster — less Python/MLX import overhead, more direct Metal binding.

Memory usage is comparable. MLX edges out slightly because it doesn't keep a separate GGUF reader buffer. Not enough to be a deciding factor.

If your primary need is "generate as many tokens as fast as possible, batch workload, throughput-bound" — llama.cpp wins. Stop reading and switch.

Where the speed difference doesn't show up

In my actual day-to-day usage, the 30% speed gap doesn't translate to 30% better experience. Here's why.

Interactive chat: I read at maybe 5-7 tok/s of comprehension. Whether the model generates at 26 or 34 tok/s, I'm waiting for me, not the model. The gap is invisible.

Agent loops where I parse output: the bottleneck is round-trip time, not generation speed. A 100-token JSON response takes 3.8s on MLX vs 2.9s on llama.cpp. Both feel fast to me. The agent loop spends most of its time on tool execution and HTTP calls, not inference.

Sectional generation for long content (which I do because of MoE degeneration on long contexts): I generate 300-token sections. Each section takes ~11 seconds on MLX, ~8.5 on llama.cpp. Difference is 2.5 seconds per section, ~15 seconds total for a 6-section blog post. Imperceptible vs the time I spend reviewing and editing the output.

The speed gap matters when you're running batch inference (e.g., re-scoring a dataset, generating 10k synthetic examples). For interactive or agent workloads on a single user, it's largely cosmetic.

Where MLX wins (that's why I stayed)

These are the reasons I haven't switched, in rough priority order.

1. Python-native API. MLX is pip install mlx mlx-lm and you're calling Python functions directly. Tokenizers integrate cleanly with HuggingFace patterns I already know. Chat templates work through standard tokenizer.apply_chat_template. No subprocess, no IPC, no llama.cpp server.

llama.cpp has Python bindings (llama-cpp-python), but they're a layer over the C++ core. Some features lag the main project, tokenizer behavior occasionally diverges from upstream HF, and the bindings need to be rebuilt when you upgrade. Not deal-breakers, but friction adds up.

2. Quantization and format flexibility. MLX reads safetensors directly. I can swap from Q4 to Q8 to fp16 by changing the model path. No re-conversion step.

llama.cpp uses GGUF, which is a different format. To switch quants, you either download a different GGUF (if someone published one) or convert from safetensors yourself with convert.py. For exotic models or custom finetunes, this is real overhead.

3. The MoE handling is genuinely better. Qwen 3.6 35B-A3B is a Mixture-of-Experts model. MLX's MoE routing implementation has been measurably more stable for me on long generations. llama.cpp had a few weeks early in 2026 where Qwen MoE inference would silently produce different outputs run-to-run because of router determinism bugs. Fixed now, but it shook my confidence.

4. Sectional generation pipeline already built. I have a working setup for sectional gen that integrates with my agent stack. Switching to llama.cpp would mean re-implementing the same flow. Re-implementation cost: probably 1-2 weeks, including testing and the inevitable bugs in the new setup. 30% speed gain doesn't pay back 2 weeks of work for an interactive workload.

5. Apple is investing in MLX directly. This is a soft factor, not a benchmark. But MLX is Apple's first-party ML framework for their silicon. Improvements compound faster when the chip designer is also writing the framework. llama.cpp is community-maintained, brilliant work, but not Apple-funded.

Where llama.cpp wins (and might still be the right pick)

To be fair to the project, here's where llama.cpp is clearly the better choice.

Batch inference at scale. If you're scoring 100k prompts overnight, the 30% speed advantage compounds. Save 8 hours on a 24-hour job.

Embedded/constrained environments. llama.cpp compiles to a tiny binary and runs on a wider range of hardware. If you're shipping a desktop app to users with mixed Macs, llama.cpp gives you broader compatibility.

Quantization research. llama.cpp's quant ecosystem (Q4_K_M, Q5_K_S, IQ-quants) is broader and more experimental than MLX. If you're testing new quant strategies, llama.cpp moves faster.

Cross-platform deployment. Need the same inference code to run on Linux, Mac, Windows, Android? llama.cpp does all four. MLX is Apple-only.

Independent of Python ecosystem. No GIL, no Python import dance. If you're already writing C++ or Rust, llama.cpp slots in cleaner.

What I'd recommend by use case

A short list, since this post has gotten long.

Use case Pick Why
Interactive chat on a Mac MLX Speed gap invisible, Python integration matters more
Agent loops on a Mac MLX Same as chat, plus MoE stability
Local API server (single user) Either Personal pick. Toss-up.
Batch dataset scoring llama.cpp 30% speed gap × N tokens = real hours saved
Cross-platform desktop app llama.cpp MLX is Apple-only
Embedded inference (mobile, edge) llama.cpp Smaller binary, fewer deps
Custom quantization R&D llama.cpp Broader quant ecosystem

If you're a Mac dev building an agent stack or chat app for yourself: MLX is the easier path. If you're shipping inference at scale or beyond Mac: llama.cpp.

Will I switch later?

Maybe. The condition I'm watching: if MLX's release cadence slows or if llama.cpp's Python bindings catch up on tokenizer behavior, the trade looks different.

I check in on both every couple of months. Last check was April 2026. Plan to check again July 2026. If anything material changes, I'll write the update.

If you've benchmarked these two on different hardware or a different model and your conclusion is different, I'd genuinely like to see your numbers. Reply on the post.

Come along for the ride — see me fall or thrive, whichever comes first.

Top comments (0)