On-device LLM on iPhone: which runtime is fastest? MLX vs llama.cpp vs LiteRT-LM vs CoreML

#ios #machinelearning #llm #swift

I want to run an LLM on iPhone.
But there are several runtimes and it's not obvious which to pick.

And I couldn't find many head-to-head benchmarks.

Runtime	In a nutshell
MLX	Apple charging into the on-device-LLM scene and pushing hard.
llama.cpp	The mature, battle-tested community standard for local LLMs.
LiteRT-LM	Gemma-4 only, but Google's heavyweight, finally deployed.
CoreML-LLM	Lets you use the Apple Neural Engine, which the GPU/Metal-dominated LLM world tends to overlook. I built it — can it even compete...?

Fine, let's just do it. On an iPhone 17 Pro (A19 Pro), I ran the same model on four on-device inference runtimes and measured decode speed and memory.

The conclusion:

"For local LLMs on iPhone, MLX by default."
"For Gemma 4 specifically, LiteRT-LM is unbeatable."

Conclusion first

Decode speed:
Qwen 3.5 2B is fastest on MLX (61 tok/s).
Gemma 4 E2B is a decisive win for LiteRT-LM (55 tok/s).
Memory:
On Qwen 3.5 2B, CoreML / ANE (Apple Neural Engine) wins by a landslide: it runs in just 241 MB (about 1/5 of MLX). Slowest on speed, though. Nice effort, CoreML. On Gemma 4 E2B, LiteRT-LM uses the least memory (641 MB).
Use-case recommendations at the end.

Test conditions

Item	Detail
Device	iPhone 17 Pro (A19 Pro / iOS 26.4.2)
Runtimes	MLX Swift / llama.cpp / LiteRT-LM / CoreML(ANE)
Models	Gemma 4 E2B, Qwen 3.5 2B (both ~4-bit)
Task	short-chat (128-token generation)
Aggregation	median of 3 cold runs
Metrics	decode tok/s (higher = better), peak memory MB (lower = better)
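
To make the aggregation concrete, here is a minimal sketch of how a "median of 3 runs" decode figure can be computed. The function names and the timings are hypothetical and chosen only for illustration; they are not taken from the benchmark harness.

```python
from statistics import median


def decode_tok_per_s(generated_tokens: int, decode_seconds: float) -> float:
    """Decode throughput: generated tokens divided by time spent decoding."""
    if decode_seconds <= 0:
        raise ValueError("decode_seconds must be positive")
    return generated_tokens / decode_seconds


def aggregate_runs(runs):
    """Median decode tok/s over several (tokens, seconds) runs."""
    return median(decode_tok_per_s(tokens, seconds) for tokens, seconds in runs)


# Hypothetical timings for three cold runs of a 128-token generation.
example_runs = [(128, 4.0), (128, 3.2), (128, 5.0)]
print(f"{aggregate_runs(example_runs):.1f} tok/s")
```

The median is used rather than the mean so that one unusually slow or fast cold run does not move the reported number.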

Result 1: decode speed (tok/s, higher is better)

Runtime	Gemma 4 E2B	Qwen 3.5 2B
🔴 LiteRT-LM	55.4 🏆	— (Gemma only)
🟣 MLX-Swift	47.5	61.2 🏆
🔵 llama.cpp	37.8	39.1
🟠 CoreML/ANE	33.4	27.9

For Gemma 4 E2B, LiteRT-LM dominates. It's Google's on-device runtime, running Gemma in its own .litertlm (INT4 QAT) format on the GPU — first-party model × first-party runtime optimization paying off. The Swift API was in development for ages; nice work, whoever shipped it.

Meanwhile, for Qwen 3.5 2B, MLX is fastest (61 tok/s). Apple is clearly competing seriously on local LLMs. LiteRT-LM's catalog of .litertlm models is Gemma-only, so it doesn't compete on Qwen.

Result 2: peak memory (MB, lower is better)

Runtime	Gemma 4 E2B	Qwen 3.5 2B
🔴 LiteRT-LM	641 🏆	—
🟣 MLX-Swift	2,900	1,279
🔵 llama.cpp	3,156	1,479
🟠 CoreML/ANE	1,187	241 🏆

On Qwen 3.5 2B, CoreML / ANE wins by a landslide at just 241 MB, about 1/5 of MLX (1,279 MB) and 1/6 of llama.cpp (1,479 MB). That comes from a chunked-MLKV approach (CoreML-LLM's Qwen35MLKVGenerator) that places the weights and KV cache on the ANE in chunks. On Gemma 4 E2B the lowest figure is LiteRT-LM's 641 MB, with CoreML / ANE second at 1,187 MB.

If you want to run a 2B-class model on a memory-constrained iPhone, or keep the model from competing with the rest of your app for memory, the ANE is a very strong option.
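
The memory ratios for Qwen 3.5 2B can be checked with a few lines of arithmetic. This sketch uses only the figures from the table above; the `fraction_of` helper is an illustrative name, not part of any runtime.

```python
# Peak memory in MB for Qwen 3.5 2B, copied from the table above.
peak_mb = {"MLX-Swift": 1279, "llama.cpp": 1479, "CoreML/ANE": 241}


def fraction_of(baseline: str, candidate: str = "CoreML/ANE") -> float:
    """Peak memory of `candidate` as a fraction of `baseline`'s peak memory."""
    return peak_mb[candidate] / peak_mb[baseline]


for name in ("MLX-Swift", "llama.cpp"):
    print(f"CoreML/ANE uses {fraction_of(name):.0%} of {name}'s peak memory")
```

That works out to roughly 19% of MLX and 16% of llama.cpp.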

Fairness notes

CoreML/ANE: the ANE is designed to favor memory and power efficiency over throughput. The first load triggers ANE compilation, so load time is longer. Decode speed is approximated from the number of generated pieces (≈ tokens).
LiteRT-LM: there's no max-output-tokens API, so it generates until EOS (≈ 458-token full response); the others cut off at 128. But decode is a rate, so the comparison still holds. Numbers come from LiteRT-LM's own benchmark counter (getBenchmarkInfo).
All are ~4-bit, but the quantization schemes differ slightly per runtime (MLX 4bit / GGUF Q4_K_M / LiteRT INT4-QAT / CoreML INT4-palettized / INT8).
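
One way to compare a 458-token run with 128-token runs like for like is to compute the rate over only the first 128 tokens. The sketch below assumes you have one timestamp per generated token, which is an assumption for illustration: the numbers in this article come from each runtime's own counters (for LiteRT-LM, getBenchmarkInfo), not from this code.

```python
from typing import List, Optional


def decode_rate(token_times: List[float], limit: Optional[int] = None) -> float:
    """Tokens per second between the first token and the last one counted.

    token_times: monotonic timestamps in seconds, one per generated token.
    limit: if set, only the first `limit` tokens are considered.
    """
    times = token_times[:limit] if limit is not None else token_times
    if len(times) < 2:
        raise ValueError("need at least two tokens to measure a rate")
    elapsed = times[-1] - times[0]
    if elapsed <= 0:
        raise ValueError("timestamps must be increasing")
    # The first timestamp marks the end of prefill, so N tokens span N-1 decode steps.
    return (len(times) - 1) / elapsed


# Hypothetical run: 458 tokens emitted a steady 20 ms apart.
long_run = [i * 0.02 for i in range(458)]
full_rate = decode_rate(long_run)
first_128_rate = decode_rate(long_run, limit=128)
print(f"full: {full_rate:.1f} tok/s, first 128: {first_128_rate:.1f} tok/s")
```

With a steady token cadence the two rates match, which is the sense in which a rate is independent of output length. On a real run, comparing the two values shows how much the longer generation affects the reported figure.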

Recommendations by use case

Just fast, general-purpose, lots of models → MLX Swift. Fastest on Qwen, easy from Swift, and mlx-community has tons of models. The first choice for local LLMs on Apple devices.
Gemma as fast as possible → LiteRT-LM. For the Gemma family, strongest on both speed and memory. Can't beat it.
Memory first (any device / coexisting with other features) → CoreML / ANE. 241 MB for Qwen 3.5 2B is exceptional. If you can tolerate the speed, it's the strongest for low memory and power.
Portability / run anywhere → llama.cpp. A large pool of GGUF models and support for nearly every platform. Not flashy, but solid.

Method and reproducibility

Every run was executed headlessly from a Mac via devicectl (no on-device tapping); models were side-loaded from the Mac. The raw result JSONL and charts are in the repo. One line = "1 runtime × 1 model × 1 device" — PRs welcome:

👉 https://github.com/john-rocky/apple-silicon-llm-bench
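
If you want to tabulate JSONL results of this shape yourself, a minimal sketch follows. The field names (`runtime`, `model`, `device`, `decode_tok_s`, `peak_mem_mb`) and the sample records are hypothetical; check the repo for the actual schema before relying on them.

```python
import json
from collections import defaultdict

# Two hypothetical records; field names are illustrative, not the repo's schema.
SAMPLE_JSONL = """\
{"runtime": "runtime-a", "model": "model-x", "device": "iPhone 17 Pro", "decode_tok_s": 50.0, "peak_mem_mb": 1000}
{"runtime": "runtime-b", "model": "model-x", "device": "iPhone 17 Pro", "decode_tok_s": 40.0, "peak_mem_mb": 500}
"""


def load_results(text: str) -> list:
    """Parse JSONL: one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def pivot(records: list, metric: str) -> dict:
    """Build {runtime: {model: value}} for one metric."""
    table = defaultdict(dict)
    for record in records:
        table[record["runtime"]][record["model"]] = record[metric]
    return dict(table)


records = load_results(SAMPLE_JSONL)
for runtime, row in pivot(records, "decode_tok_s").items():
    print(runtime, row)
```

Keeping one record per "runtime × model × device" combination means a new measurement is a single appended line, and the tables in this article are just pivots over those lines.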

I'll write a separate article on the behind-the-scenes of full measurement automation and the build fights (git-LFS / SwiftPM unsafe-flags / @preconcurrency, etc.).