
SomeOddCodeGuy

Posted on • Originally published at someoddcodeguy.dev

Llama.cpp's New MTP on MacOS

So I decided to test out the new MTP (multi-token prediction) support in llama.cpp on Metal using my M2 Ultra, and figured I'd toss the results up here. This isn't meant to show the maximum t/s you can get on Mac hardware; I'd have run it on the M5 Max or M3 Ultra if that were the case. My goal is to see what percentage gains we might expect across the various --spec-draft-n-max values, which I could do on any of the devices.

MTP Test Runs

  • Hardware (M2 Ultra Mac Studio, 192GB unified memory)
  • Model (Qwen3.6-35B-A3B UD-Q8_K_XL, an MoE)
  • llama.cpp build (b9196)
  • Settings: --seed 42, --no-cache-prompt, thinking disabled, a single prompt repeated 3x per setting
  • Workload: RAG against a Wikipedia article (no code, since everyone else is benchmarking code)
  • In the tables below, n (or n_max) is the --spec-draft-n-max value

Token Generation

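Config            | Mean t/s | Speedup | Mean acceptance | Spread across runs (± t/s)
------------------|----------|---------|-----------------|---------------------------
No MTP (baseline) | 68.07    | 1.00x   | n/a             | ±0.02
n=2               | 73.04    | 1.07x   | 86.16%          | ±1.2
n=3               | 76.00    | 1.12x   | 78.29%          | ±0.3
n=4               | 77.68    | 1.14x   | 76.72%          | ±4.1
n=5               | 74.68    | 1.10x   | 67.97%          | ±2.6
n=6               | 73.68    | 1.08x   | 66.26%          | ±5.4

The spread column is half the gap between the fastest and slowest of the three runs. Per-run numbers:

n_max | Run 1 t/s | Run 2 t/s | Run 3 t/s | Mean t/s | Run 1 acc | Run 2 acc | Run 3 acc | Mean acc
------|-----------|-----------|-----------|----------|-----------|-----------|-----------|---------
2     | 72.30     | 72.26     | 74.57     | 73.04    | 84.66%    | 84.66%    | 89.15%    | 86.16%
3     | 76.23     | 76.16     | 75.61     | 76.00    | 78.18%    | 78.18%    | 78.51%    | 78.29%
4     | 79.08     | 72.90     | 81.05     | 77.68    | 78.13%    | 70.66%    | 81.38%    | 76.72%
5     | 72.86     | 73.06     | 78.12     | 74.68    | 65.87%    | 65.87%    | 72.16%    | 67.97%
6     | 66.48     | 77.29     | 77.27     | 73.68    | 58.11%    | 70.34%    | 70.34%    | 66.26%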
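
If you want to rebuild the summary columns from your own runs, here's a minimal sketch of the arithmetic. The per-run speeds are copied from this post (the three baseline runs were 68.05, 68.06, and 68.09 t/s); the function name `summarize` and the dictionary layout are just my own illustration, and "spread" here is assumed to mean half the gap between the fastest and slowest run.

```python
from statistics import mean

# Per-run generation speeds in t/s, copied from this post.
runs = {
    "baseline": [68.05, 68.06, 68.09],
    "n=2": [72.30, 72.26, 74.57],
    "n=3": [76.23, 76.16, 75.61],
    "n=4": [79.08, 72.90, 81.05],
    "n=5": [72.86, 73.06, 78.12],
    "n=6": [66.48, 77.29, 77.27],
}


def summarize(runs, baseline_key="baseline"):
    """Return mean t/s, speedup over baseline, and half-range spread per config."""
    base = mean(runs[baseline_key])
    summary = {}
    for name, speeds in runs.items():
        avg = mean(speeds)
        summary[name] = {
            "mean": avg,
            "speedup": avg / base,
            # Half the distance between the fastest and slowest run.
            "spread": (max(speeds) - min(speeds)) / 2,
        }
    return summary


summary = summarize(runs)
for name, row in summary.items():
    print(f"{name:>8}  {row['mean']:6.2f} t/s  {row['speedup']:.2f}x  ±{row['spread']:.2f}")
```

Swap in your own run lists to get the same columns for a different machine or model.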

Prompt Processing

Config            | Mean PP t/s | Loss vs baseline
------------------|-------------|-----------------
No MTP (baseline) | 1015.34     | n/a
n=2               | 841.72      | -17.1%
n=3               | 842.80      | -17.0%
n=4               | 846.62      | -16.6%
n=5               | 834.57      | -17.8%
n=6               | 836.42      | -17.6%

Without MTP, my three baseline runs produced essentially identical numbers: 68.05, 68.06, and 68.09 t/s. But the moment I turned MTP on, runs at the same n_max value started drifting from each other, and the drift got worse as n_max went up. At n=3, the runs stayed within 0.6 t/s of each other. At n=6, the gap between best and worst hit 11 t/s. I don't have a definitive explanation, but my best guess is that MTP's batched verification step introduces enough floating-point ordering variance on Metal that generation paths diverge between otherwise-identical runs. That's why I'd lean toward n=3 even though n=4 has a slightly higher mean, since n=3 stayed reliably consistent.
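
One way to put a number on that "consistency over mean" preference is to rank each setting by its slowest run instead of its average. This is just a sketch of that idea using the per-run speeds from this post; the helper name `rank_by_worst_run` is made up for illustration, and three runs per setting is a small sample, so treat the ordering as a hint and not a verdict.

```python
# Per-run generation speeds in t/s for each spec-draft-n-max value, from this post.
runs = {
    "n=2": [72.30, 72.26, 74.57],
    "n=3": [76.23, 76.16, 75.61],
    "n=4": [79.08, 72.90, 81.05],
    "n=5": [72.86, 73.06, 78.12],
    "n=6": [66.48, 77.29, 77.27],
}


def rank_by_worst_run(runs):
    """Order configs so the one whose slowest run is fastest comes first."""
    return sorted(runs, key=lambda name: min(runs[name]), reverse=True)


ranking = rank_by_worst_run(runs)
for name in ranking:
    speeds = runs[name]
    print(f"{name}: worst {min(speeds):.2f} t/s, best {max(speeds):.2f} t/s")
```

On these numbers n=3 comes out first, with a worst run of 75.61 t/s, and n=6 comes out last at 66.48 t/s.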

Your mileage may vary on the numbers for your setup, but the prompt processing loss looks pretty static: it stayed between 16.6% and 17.8% no matter which n I picked.
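
Since MTP speeds up generation but slows down prompt processing, whether it helps overall depends on how much you generate relative to how much you feed in. Below is a rough sketch of that trade-off using the baseline and n=3 means from this post. It assumes total time is simply prompt tokens divided by PP speed plus output tokens divided by generation speed, with both speeds constant, which ignores things like context-length slowdown and prompt caching. The function names are my own.

```python
def total_seconds(prompt_tokens, output_tokens, pp_tps, tg_tps):
    """Estimated wall time: prompt processing plus token generation."""
    return prompt_tokens / pp_tps + output_tokens / tg_tps


def break_even_ratio(base_pp, base_tg, mtp_pp, mtp_tg):
    """Output-to-prompt token ratio above which the MTP config finishes sooner.

    Assumes MTP generation is faster than baseline (mtp_tg > base_tg).
    """
    extra_prompt_cost = 1 / mtp_pp - 1 / base_pp  # seconds lost per prompt token
    saved_per_output = 1 / base_tg - 1 / mtp_tg   # seconds saved per output token
    return extra_prompt_cost / saved_per_output


# Mean speeds from this post: baseline vs spec-draft-n-max 3.
BASE_PP, BASE_TG = 1015.34, 68.07
MTP_PP, MTP_TG = 842.80, 76.00

ratio = break_even_ratio(BASE_PP, BASE_TG, MTP_PP, MTP_TG)
print(f"n=3 breaks even at about {ratio:.2f} output tokens per prompt token")
```

With these means the ratio lands at roughly 0.13, so under this simple model an 8,000-token prompt would need somewhere around 1,000 generated tokens before n=3 comes out ahead.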

NOTE: I also built b9200, which is supposed to have the prompt processing improvement code merged in. My PP speed at n=3 was around 882 t/s, up from 842.80 on b9196, so not a huge jump.


For my full llama.cpp run command, I use this:

./llama-server -ngl 99 -c 65535 -fa on --spec-type draft-mtp --spec-draft-n-max 4 --model ~/models/MTP_Qwen3.6-35B-A3B-UD-Q8_K_XL/Qwen3.6-35B-A3B-UD-Q8_K_XL.gguf --mmproj ~/models/MTP_Qwen3.6-35B-A3B-UD-Q8_K_XL/mmproj-F32.gguf --image-min-tokens 2048 --image-max-tokens 8192 --parallel 1 --host 0.0.0.0 --jinja --port 5003

  • -ngl 99: A high number so that every layer gets offloaded to the GPU. Means the whole model should go into Metal / your GPU
  • -fa on: Specifying that Flash Attention should be on
  • --parallel 1: I don't do parallel prompts, since Mac just doesn't handle them well. The way llama.cpp handles cache checkpoints is affected by this setting, though, and I've noticed a slowdown when parallel is above 1 because of that, so I keep it at 1 to be safe
  • --image-min-tokens 2048 --image-max-tokens 8192: This enforces a higher quality on the vision portion of the model. I had another post where I mentioned that, but the quality with this set vs not is night and day. Just note that each model has its own acceptable settings
  • --jinja: Telling llama.cpp to use the jinja template that comes with the model. You want this on unless you know why you don't.
  • --host 0.0.0.0: A host of 0.0.0.0 is the same as "--listen" in some programs: it lets you connect to this instance of llama.cpp server from other computers on your network, if you want.
  • --port 5003: Sets the port to connect to; I specify it because I run multiple instances of llama.cpp at once, for different models.
  • -c 65535: The context size to load. I chose 65535 tokens
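
If you want to repeat the sweep across n values, here's a small sketch that builds one llama-server command per setting using only the flags from the command above. It isn't the script I used. The model path is a placeholder, the vision flags are left out for brevity, and it assumes the baseline is run by simply omitting --spec-type and --spec-draft-n-max. It only prints the commands; launching the server and sending the prompt is left to you.

```python
import shlex

MODEL_PATH = "/path/to/model.gguf"  # placeholder, point this at your own GGUF


def build_command(model_path, n_max=None, port=5003):
    """Build a llama-server argument list; n_max=None means no MTP (baseline)."""
    cmd = [
        "./llama-server",
        "-ngl", "99",
        "-c", "65535",
        "-fa", "on",
        "--parallel", "1",
        "--jinja",
        "--host", "0.0.0.0",
        "--port", str(port),
        "--model", model_path,
    ]
    if n_max is not None:
        # Assumption: MTP is toggled purely by adding these two flags.
        cmd += ["--spec-type", "draft-mtp", "--spec-draft-n-max", str(n_max)]
    return cmd


commands = {"baseline": build_command(MODEL_PATH)}
for n in range(2, 7):
    commands[f"n={n}"] = build_command(MODEL_PATH, n_max=n)

for name, cmd in commands.items():
    print(f"{name}: {shlex.join(cmd)}")
```

Keeping the argument list as a Python list also makes it easy to hand straight to subprocess later without worrying about shell quoting.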

NOTE: There's a warning that sending an image input while MTP is enabled can crash llama.cpp. I kept vision on when I ran all my tests, and have sent a couple of images in other conversations with it on and haven't seen the crash, but just a note in case you hit any issue there.
