I want to run an LLM on iPhone.
But there are several runtimes and it's not obvious which to pick.
And I couldn't find many head-to-head benchmarks.
| Runtime | In a nutshell |
|---|---|
| MLX | Apple charging into the on-device-LLM scene and pushing hard. |
| llama.cpp | The mature, battle-tested community standard for local LLMs. |
| LiteRT-LM | Gemma-4 only, but Google's heavyweight, finally deployed. |
| CoreML-LLM | Lets you use the Apple Neural Engine, which the GPU/Metal-dominated LLM world tends to overlook. I built it — can it even compete...? |
Fine, let's just do it. On an iPhone 17 Pro (A19 Pro), I ran the same model on four on-device inference runtimes and measured decode speed and memory.
The conclusion:
"For local LLMs on iPhone, MLX by default."
"For Gemma 4 specifically, LiteRT-LM is unbeatable."
Conclusion first
Decode speed:
- Qwen 3.5 2B is fastest on MLX (61 tok/s).
- Gemma 4 E2B is a decisive win for LiteRT-LM (55 tok/s).

Memory:
- CoreML / ANE (Apple Neural Engine) wins by a landslide on Qwen 3.5 2B: it runs in just 241 MB (about 1/5 of MLX). It's the slowest on speed, though. Nice effort, CoreML.

Use-case recommendations are at the end.
Test conditions
| Item | Detail |
|---|---|
| Device | iPhone 17 Pro (A19 Pro / iOS 26.4.2) |
| Runtimes | MLX Swift / llama.cpp / LiteRT-LM / CoreML(ANE) |
| Models | Gemma 4 E2B, Qwen 3.5 2B (both ~4-bit) |
| Task | short-chat (128-token generation) |
| Aggregation | median of 3 cold runs |
| Metrics | decode tok/s (higher = better), peak memory MB (lower = better) |
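
To make the "median of 3 cold runs" aggregation concrete, here is a minimal sketch of why a median is a reasonable choice for so few runs. The function name and the sample values are hypothetical placeholders, not part of the benchmark harness and not measured results.

```python
from statistics import mean, median


def summarize_cold_runs(tok_per_s_runs):
    """Return (median, mean) decode tok/s for a list of cold-run results."""
    if not tok_per_s_runs:
        raise ValueError("need at least one run")
    return median(tok_per_s_runs), mean(tok_per_s_runs)


# Illustrative placeholder values, NOT measurements from this benchmark:
# two similar runs plus one slow outlier (for example, a throttled run).
example_runs = [60.0, 61.0, 40.0]
med, avg = summarize_cold_runs(example_runs)
print(f"median={med:.1f} tok/s, mean={avg:.1f} tok/s")
```

With three runs, the median simply takes the middle value, so a single unusually slow or fast run does not move the reported number the way it would move the mean.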
Result 1: decode speed (tok/s, higher is better)
| Runtime | Gemma 4 E2B | Qwen 3.5 2B |
|---|---|---|
| 🔴 LiteRT-LM | 55.4 🏆 | — (Gemma only) |
| 🟣 MLX-Swift | 47.5 | 61.2 🏆 |
| 🔵 llama.cpp | 37.8 | 39.1 |
| 🟠 CoreML/ANE | 33.4 | 27.9 |
For Gemma 4 E2B, LiteRT-LM dominates. It's Google's on-device runtime, running Gemma in its own .litertlm (INT4 QAT) format on the GPU — first-party model × first-party runtime optimization paying off. The Swift API was in development for ages; nice work, whoever shipped it.
Meanwhile, for Qwen 3.5 2B, MLX is fastest (61 tok/s). Apple is clearly competing seriously on local LLMs. LiteRT-LM's catalog is Gemma-only (.litertlm), so it doesn't compete on Qwen.
Result 2: peak memory (MB, lower is better)
| Runtime | Gemma 4 E2B | Qwen 3.5 2B |
|---|---|---|
| 🔴 LiteRT-LM | 641 🏆 | — |
| 🟣 MLX-Swift | 2,900 | 1,279 |
| 🔵 llama.cpp | 3,156 | 1,479 |
| 🟠 CoreML/ANE | 1,187 | 241 🏆 |
CoreML / ANE wins by a landslide on Qwen 3.5 2B, at just 241 MB: about 1/5 of MLX (1,279 MB) and about 1/6 of llama.cpp (1,479 MB). That comes from a chunked-MLKV approach (CoreML-LLM's Qwen35MLKVGenerator) that chunks the weights and KV cache onto the ANE. On Gemma 4 E2B, LiteRT-LM is the leanest at 641 MB, with CoreML / ANE second at 1,187 MB.
If you want to run a 2B-class model on a memory-constrained iPhone, or keep the model from competing with the rest of your app for memory, the ANE is a very strong option.
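
As a quick check on the memory comparison, the ratios can be worked out directly from the Qwen 3.5 2B column of the table. This is just arithmetic on the published numbers; the variable names are my own.

```python
# Peak memory in MB for Qwen 3.5 2B, copied from the table above.
peak_mb = {"MLX-Swift": 1279, "llama.cpp": 1479, "CoreML/ANE": 241}

baseline = peak_mb["CoreML/ANE"]
# Fraction of each other runtime's peak memory that CoreML/ANE needs.
ratios = {
    name: baseline / mb
    for name, mb in peak_mb.items()
    if name != "CoreML/ANE"
}

for name, ratio in ratios.items():
    print(f"CoreML/ANE uses {ratio:.0%} of {name}'s peak memory")
```

That works out to roughly 19% of MLX's peak and roughly 16% of llama.cpp's.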
Fairness notes
- CoreML/ANE: the ANE is designed to favor memory and power efficiency over throughput. The first load triggers ANE compilation, so load time is longer. Decode speed is approximated from the number of generated pieces (≈ tokens).
- LiteRT-LM: there's no max-output-tokens API, so it generates until EOS (≈ 458-token full response); the others cut off at 128. But decode is a rate, so the comparison still holds. Numbers come from LiteRT-LM's own benchmark counter (`getBenchmarkInfo`).
- All are ~4-bit, but the quantization schemes differ slightly per runtime (MLX 4bit / GGUF Q4_K_M / LiteRT INT4-QAT / CoreML INT4-palettized / INT8).
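
The "decode is a rate" point from the LiteRT-LM note can be sketched as follows. The helper name and the 20 ms per-token figure are assumptions chosen for illustration, not values from the benchmark.

```python
def decode_tok_per_s(generated_tokens: int, decode_ms: int) -> float:
    """Decode throughput: generated tokens per second of decode time."""
    if decode_ms <= 0:
        raise ValueError("decode_ms must be positive")
    return generated_tokens * 1000 / decode_ms


# ASSUMPTION for illustration only: a constant 20 ms per generated token.
MS_PER_TOKEN = 20

capped = decode_tok_per_s(128, 128 * MS_PER_TOKEN)     # run cut off at 128 tokens
until_eos = decode_tok_per_s(458, 458 * MS_PER_TOKEN)  # run that continues to EOS
print(capped, until_eos)
```

Under the constant per-token assumption both runs report the same rate, even though one generates far more tokens. On real hardware, per-token time can drift as the context grows, so this is best read as a first-order approximation.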
Recommendations by use case
- Just fast, general-purpose, lots of models → MLX Swift. Fastest on Qwen, easy from Swift, and `mlx-community` has tons of models. The first choice for local LLMs on Apple devices.
- Gemma as fast as possible → LiteRT-LM. For the Gemma family, strongest on both speed and memory. Can't beat it.
- Memory first (any device / coexisting with other features) → CoreML / ANE. 241 MB is exceptional. If you can tolerate the speed, it's the strongest for low memory and power.
- Portability / run anywhere → llama.cpp. Plenty of GGUF assets, and it runs on every platform. Not flashy, but solid.
Method and reproducibility
Every run was executed headlessly from a Mac via devicectl (no on-device tapping); models were side-loaded from the Mac. The raw result JSONL and charts are in the repo. One line = "1 runtime × 1 model × 1 device" — PRs welcome:
👉 https://github.com/john-rocky/apple-silicon-llm-bench
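
For anyone who wants to work with the results, here is a minimal sketch of reading a one-measurement-per-line JSONL file. The field names (`runtime`, `model`, `device`, `decode_tok_s`, `peak_mem_mb`) are hypothetical and may not match the repo's actual schema; the sample values are the Qwen 3.5 2B numbers from the tables above.

```python
import json

# Hypothetical schema: the real field names in the repo may differ.
SAMPLE_JSONL = """\
{"runtime": "MLX-Swift", "model": "Qwen 3.5 2B", "device": "iPhone 17 Pro", "decode_tok_s": 61.2, "peak_mem_mb": 1279}
{"runtime": "llama.cpp", "model": "Qwen 3.5 2B", "device": "iPhone 17 Pro", "decode_tok_s": 39.1, "peak_mem_mb": 1479}
{"runtime": "CoreML/ANE", "model": "Qwen 3.5 2B", "device": "iPhone 17 Pro", "decode_tok_s": 27.9, "peak_mem_mb": 241}
"""


def load_results(text):
    """Parse JSONL text: one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


rows = load_results(SAMPLE_JSONL)
fastest = max(rows, key=lambda r: r["decode_tok_s"])
leanest = min(rows, key=lambda r: r["peak_mem_mb"])
print(f"fastest: {fastest['runtime']}, lowest memory: {leanest['runtime']}")
```

Keeping one record per runtime, model, and device means new measurements can be appended as single lines without touching existing data.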
I'll write a separate article on the behind-the-scenes of full measurement automation and the build fights (git-LFS / SwiftPM unsafe-flags / `@preconcurrency`, etc.).
Summary
- On-device LLM on iPhone: "MLX / LiteRT-LM for speed, CoreML/ANE for memory."
Hope it helps your local-LLM development!
Originally published in Japanese on Qiita. I do mobile AI / CoreML / ARKit development and write about it. GitHub / X
