iPhone on-device LLM: the GPU wins the sprint, the Neural Engine wins the marathon

#machinelearning #ios #apple #llm

The follow-up to my on-device runtime speed benchmark — because burst tok/s only tells half the story.

I benchmark on-device LLMs on iPhone, and while shipping real apps I kept noticing that the GPU runtimes start fast and then fade. So I measured decode rate over time, across 10 minutes of continuous generation. Same conditions for all runtimes: same model (Gemma 4 E2B, 4-bit), same iPhone 17 Pro (A19 Pro), each run from a cold start.

The GPU runtimes (MLX, LiteRT-LM) that crush the burst test collapse 50%+ within ~60 seconds as the phone heats. The slowest starter — the Apple Neural Engine (CoreML) — barely throttles and overtakes them.

Decode rate: cold burst vs sustained (Gemma 4 E2B 4-bit)

Runtime (compute)	Burst tok/s	Sustained tok/s (10 min)	Retained
CoreML / ANE	33	22	67%
MLX / GPU	48	18	38%
LiteRT-LM / GPU	56	27	48%

The GPU runtimes fall to 38% / 48% of their own peak after 10 minutes. The ANE holds 67% — ending up faster than MLX outright, and shrinking LiteRT's lead from +70% to +23%.
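The percentages above follow directly from the table. Here is a minimal Python sketch of the arithmetic, using only the burst and sustained figures quoted in this post; the helper names `retained_pct` and `lead_pct` are mine, not something from the repo.

```python
# Burst and sustained decode rates (tok/s), copied from the table above.
rates = {
    "CoreML / ANE": (33, 22),
    "MLX / GPU": (48, 18),
    "LiteRT-LM / GPU": (56, 27),
}


def retained_pct(burst, sustained):
    """Sustained rate as a percentage of the runtime's own burst rate."""
    return 100 * sustained / burst


def lead_pct(rate, baseline):
    """How much faster `rate` is than `baseline`, in percent (negative = slower)."""
    return 100 * (rate / baseline - 1)


ane_burst, ane_sustained = rates["CoreML / ANE"]
for name, (burst, sustained) in rates.items():
    print(
        f"{name:16s} retained {retained_pct(burst, sustained):5.1f}%  "
        f"burst vs ANE {lead_pct(burst, ane_burst):+6.1f}%  "
        f"sustained vs ANE {lead_pct(sustained, ane_sustained):+6.1f}%"
    )
```

The sustained column is where MLX flips from ahead of the ANE to behind it, while LiteRT-LM stays ahead by a much smaller margin.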

How fast they fade (vs each runtime's own peak)

Runtime (compute)   Time to -10%   Time to -25%   Floor
MLX / GPU           5 s            35 s           ~18 tok/s
LiteRT-LM / GPU     13 s           40 s           ~27 tok/s
CoreML / ANE        93 s           390 s          ~22 tok/s

The GPU runtimes are down 25% in well under a minute. The ANE takes over 90 seconds just to lose 10%.
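For anyone who wants to derive "time to -10%" style numbers from their own runs, here is a rough Python sketch. It assumes the run is already reduced to (seconds since start, tok/s) pairs and that a drop is measured against the highest rate seen in the run. The function name `time_to_drop` and the sample values are made up for illustration; they are not the measured data, and the repo's script may define the thresholds differently.

```python
def time_to_drop(samples, drop_fraction):
    """Return the first timestamp, at or after the peak, where the rate has
    fallen to (1 - drop_fraction) of the peak rate, or None if it never does.

    `samples` is a list of (seconds_since_start, tok_per_s) pairs sorted by time.
    """
    peak_index = max(range(len(samples)), key=lambda i: samples[i][1])
    threshold = samples[peak_index][1] * (1 - drop_fraction)
    for t, rate in samples[peak_index:]:
        if rate <= threshold:
            return t
    return None


# Illustrative samples only, NOT the measured curves.
samples = [(0, 40.0), (5, 48.0), (10, 45.0), (20, 42.0),
           (30, 38.0), (40, 35.0), (60, 30.0)]

print(time_to_drop(samples, 0.10))  # 20
print(time_to_drop(samples, 0.25))  # 40
print(time_to_drop(samples, 0.50))  # None
```

Scanning only from the peak onward keeps a slow warm-up at the start of a run from being counted as a drop.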

Why — it's a power story

iOS won't give you per-subsystem watts, and over a long run the phone throttles every backend down to the same sustainable thermal envelope — so iPhone battery-delta can't separate them. But the same model on Mac (M4 Max), where powermetrics exposes package power, shows the cause cleanly:

CoreML / ANE: 12.7 W
MLX / GPU: 24.7 W
llama.cpp / GPU: 24.5 W

The ANE path draws ~half the GPU's power at full decode. Lower power → heats slower → the thermally-constrained iPhone doesn't have to throttle it. Two independent GPU runtimes — Apple's MLX and Google's LiteRT-LM — collapsing the same way says this is a GPU-thermal property of the phone, not a quirk of one runtime.
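The "~half" claim is easy to check against the three Mac figures above. A tiny Python sketch, using only those numbers (the helper name `power_ratio` is mine):

```python
# Package power (W) on the M4 Max, copied from the list above.
power_w = {
    "CoreML / ANE": 12.7,
    "MLX / GPU": 24.7,
    "llama.cpp / GPU": 24.5,
}


def power_ratio(watts, reference_watts):
    """Fraction of the reference power that `watts` represents."""
    return watts / reference_watts


ane_w = power_w["CoreML / ANE"]
for name, watts in power_w.items():
    if name != "CoreML / ANE":
        print(f"ANE draws {power_ratio(ane_w, watts):.0%} of {name} power")
```

Both GPU runtimes land within a couple of percent of each other, so the ANE-to-GPU ratio barely depends on which GPU runtime is used as the reference.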

Takeaway: GPU for the sprint, ANE for the marathon

Quick chat reply → the GPU wins outright; burst speed is the experience.
Sustained load (long generation, agentic loops, batch/background jobs) → the GPU's burst lead largely evaporates. MLX ends up slower than the ANE; LiteRT keeps only a slim lead after shedding half its speed.
And the ANE draws ~half the power and leaves the GPU free for the rest of the app (rendering, camera, other ML).

That's the real case for running an LLM on the ANE, even though its peak decode is lower.

Caveats (so you can poke holes)

Burst = one cold ~128-token generation. Sustained = a 600 s re-prompt loop, decode rate from a rolling window. The in-loop re-prompt overhead shaves a little off the absolute rate (more for CoreML, whose prefill is relatively costlier), so I quote burst from the clean single shot.
iOS only exposes battery in 1% steps, and under sustained load the SoC pins every backend to the same thermal-budget power. The power figures are measured on a Mac (M4 Max, powermetrics, same model): they show the mechanism behind the iPhone throttling on a device where per-unit watts are observable. The throttle curves themselves are iPhone-measured and clean.
LiteRT-LM has no output-token cap, so its per-call generations run longer than the 128 tokens used for CoreML/MLX; that one run also happened to start at the fair rather than the nominal thermal state (its burst was unaffected).
CoreML-LLM uses sliding-window attention, part of why its decode cost stays flat (bounded context) — it trades some long-range recall for that.
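To make the "rolling window" in the first caveat concrete, here is a minimal Python sketch of one way to turn per-token timestamps into a windowed decode rate. The window length, the function name `rolling_decode_rate`, and the synthetic timestamps are assumptions for illustration; the post does not state the window actually used, and this sketch does not separate re-prompt (prefill) gaps from decode time.

```python
from collections import deque


def rolling_decode_rate(token_times, window_s=10.0):
    """Return (timestamp, tok/s) pairs, where tok/s is computed over the
    tokens whose timestamps fall inside the trailing `window_s` seconds.

    `token_times` is a sorted list of token arrival times in seconds.
    """
    window = deque()
    rates = []
    for t in token_times:
        window.append(t)
        # Drop timestamps that have fallen out of the trailing window.
        while t - window[0] > window_s:
            window.popleft()
        span = t - window[0]
        if span > 0:
            # n timestamps bound n - 1 inter-token intervals.
            rates.append((t, (len(window) - 1) / span))
    return rates


# Synthetic example: 5 s at 20 tok/s, then 5 s at 10 tok/s.
token_times = [i / 20 for i in range(100)] + [5 + i / 10 for i in range(50)]
rates = rolling_decode_rate(token_times, window_s=2.0)
print(f"early: {rates[50][1]:.1f} tok/s, late: {rates[-1][1]:.1f} tok/s")
```

A shorter window reacts faster to throttling but is noisier; a longer one smooths the curve and delays the apparent time at which a threshold is crossed.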

Repo (raw data + scripts)

Raw JSONL for all three runs and the script that draws these curves: https://github.com/john-rocky/apple-silicon-llm-bench

Does this match what you see on Android (Snapdragon Hexagon / Tensor NPU vs the GPU)? And for your workload — does the GPU's burst advantage or the NPU's endurance matter more?