DEV Community

Daisuke Majima

Posted on • Originally published at qiita.com

On-device LLM on iPhone: which runtime is fastest? MLX vs llama.cpp vs LiteRT-LM vs CoreML

I want to run an LLM on an iPhone, but there are several runtimes and it's not obvious which one to pick. I also couldn't find many head-to-head benchmarks.

| Runtime | In a nutshell |
| --- | --- |
| MLX | Apple's own framework, charging into the on-device LLM scene and pushing hard. |
| llama.cpp | The mature, battle-tested community standard for local LLMs. |
| LiteRT-LM | Gemma 4 only, but it's Google's heavyweight, and it has finally shipped. |
| CoreML-LLM | Lets you use the Apple Neural Engine, which the GPU/Metal-dominated LLM world tends to overlook. I built it. Can it even compete...? |

Fine, let's just do it. On an iPhone 17 Pro (A19 Pro), I ran the same model on four on-device inference runtimes and measured decode speed and memory.

The conclusion:

"For local LLMs on iPhone, MLX by default."
"For Gemma 4 specifically, LiteRT-LM is unbeatable."

Conclusion first

  • Decode speed:
    Qwen 3.5 2B is fastest on MLX (61 tok/s).
    Gemma 4 E2B is a decisive win for LiteRT-LM (55 tok/s).

  • Memory:
    CoreML / ANE (Apple Neural Engine) wins by a landslide on Qwen. It runs Qwen 3.5 2B in just 241 MB (about 1/5 of MLX). It's also the slowest runtime, though. Nice effort, CoreML.

  • Use-case recommendations at the end.

Test conditions

| Item | Detail |
| --- | --- |
| Device | iPhone 17 Pro (A19 Pro / iOS 26.4.2) |
| Runtimes | MLX Swift / llama.cpp / LiteRT-LM / CoreML (ANE) |
| Models | Gemma 4 E2B, Qwen 3.5 2B (both ~4-bit) |
| Task | short-chat (128-token generation) |
| Aggregation | median of 3 cold runs |
| Metrics | decode tok/s (higher = better), peak memory in MB (lower = better) |
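To make the aggregation concrete, here is a minimal Python sketch of how a decode rate and a median over three cold runs could be computed. The function name and the timing values are hypothetical illustrations, not the benchmark harness's actual code or measured data.

```python
from statistics import median


def decode_tok_per_s(generated_tokens: int, decode_seconds: float) -> float:
    """Decode throughput: tokens produced per second of decode time."""
    if decode_seconds <= 0:
        raise ValueError("decode_seconds must be positive")
    return generated_tokens / decode_seconds


# Hypothetical (tokens, seconds) pairs for three cold runs; NOT measured data.
cold_runs = [(128, 2.10), (128, 2.05), (128, 2.20)]

rates = [decode_tok_per_s(tokens, seconds) for tokens, seconds in cold_runs]
reported = median(rates)  # the middle one of the three per-run rates

print(f"per-run: {[round(r, 1) for r in rates]}, reported: {reported:.1f} tok/s")
```

With only three runs, the median simply picks the middle value, so a single unusually slow or fast run does not move the reported number.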

Result 1: decode speed (tok/s, higher is better)

| Runtime | Gemma 4 E2B | Qwen 3.5 2B |
| --- | --- | --- |
| 🔴 LiteRT-LM | 55.4 🏆 | n/a (Gemma only) |
| 🟣 MLX-Swift | 47.5 | 61.2 🏆 |
| 🔵 llama.cpp | 37.8 | 39.1 |
| 🟠 CoreML/ANE | 33.4 | 27.9 |

For Gemma 4 E2B, LiteRT-LM dominates. It's Google's on-device runtime, running Gemma in its own .litertlm (INT4 QAT) format on the GPU: a first-party model on a first-party runtime, and the optimization pays off. The Swift API was in development for ages; nice work, whoever shipped it.

Meanwhile, for Qwen 3.5 2B, MLX is fastest (61 tok/s). Apple is clearly competing seriously on local LLMs. LiteRT-LM's catalog is Gemma-only (.litertlm), so it doesn't compete on Qwen.
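To put the gaps in relative terms, the small Python sketch below divides each runtime's decode speed by llama.cpp's. The numbers are copied from the decode-speed results in this article; the choice of llama.cpp as the baseline and the helper name are my own assumptions for illustration.

```python
# Decode speeds in tok/s, copied from the decode-speed results in this article.
DECODE_TOK_S = {
    "Gemma 4 E2B": {
        "LiteRT-LM": 55.4,
        "MLX-Swift": 47.5,
        "llama.cpp": 37.8,
        "CoreML/ANE": 33.4,
    },
    "Qwen 3.5 2B": {
        "MLX-Swift": 61.2,
        "llama.cpp": 39.1,
        "CoreML/ANE": 27.9,
    },
}


def speedup_over(baseline: str, model: str) -> dict[str, float]:
    """Each runtime's decode speed as a multiple of the baseline runtime's."""
    speeds = DECODE_TOK_S[model]
    base = speeds[baseline]
    return {runtime: round(speed / base, 2) for runtime, speed in speeds.items()}


for model in DECODE_TOK_S:
    print(model, speedup_over("llama.cpp", model))
```

Read this way, LiteRT-LM is roughly 1.5x llama.cpp on Gemma 4 E2B, and MLX is roughly 1.6x llama.cpp on Qwen 3.5 2B.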

Result 2: peak memory (MB, lower is better)

| Runtime | Gemma 4 E2B | Qwen 3.5 2B |
| --- | --- | --- |
| 🔴 LiteRT-LM | 641 🏆 | n/a (Gemma only) |
| 🟣 MLX-Swift | 2,900 | 1,279 |
| 🔵 llama.cpp | 3,156 | 1,479 |
| 🟠 CoreML/ANE | 1,187 | 241 🏆 |

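On Qwen, CoreML / ANE wins by a landslide: Qwen 3.5 2B in just 241 MB, about 1/5 of MLX (1,279 MB) and about 1/6 of llama.cpp (1,479 MB). That comes from a chunked-MLKV approach (CoreML-LLM's Qwen35MLKVGenerator) that splits the weights and KV cache into chunks on the ANE. On Gemma 4 E2B, LiteRT-LM is the lightest at 641 MB, with CoreML / ANE second at 1,187 MB.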

If you want to run a 2B-class model on a memory-constrained iPhone, or keep the model from fighting the rest of your app for memory, the ANE is a very strong option.
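One way to use these numbers is to filter runtimes by a memory budget. The Python sketch below does that with the peak-memory figures from this article; the 1,500 MB budget and the helper name are hypothetical, and a real app's headroom depends on the device and on what else the app is doing.

```python
# Peak memory in MB, copied from the peak-memory results in this article.
PEAK_MEM_MB = {
    "Gemma 4 E2B": {
        "LiteRT-LM": 641,
        "MLX-Swift": 2900,
        "llama.cpp": 3156,
        "CoreML/ANE": 1187,
    },
    "Qwen 3.5 2B": {
        "MLX-Swift": 1279,
        "llama.cpp": 1479,
        "CoreML/ANE": 241,
    },
}


def runtimes_within(budget_mb: int, model: str) -> list[str]:
    """Runtimes whose measured peak memory fits the budget, smallest first."""
    fits = [(mb, rt) for rt, mb in PEAK_MEM_MB[model].items() if mb <= budget_mb]
    return [rt for _, rt in sorted(fits)]


budget_mb = 1500  # hypothetical budget, chosen only for illustration
for model in PEAK_MEM_MB:
    print(model, runtimes_within(budget_mb, model))
```

Under that assumed budget, all three Qwen options fit, but for Gemma 4 E2B only LiteRT-LM and CoreML / ANE do.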

Fairness notes

  • CoreML/ANE: the ANE is designed for memory and power efficiency rather than raw throughput. The first load triggers ANE compilation, so load time is longer. Decode speed is approximated from the number of generated pieces (≈ tokens).
  • LiteRT-LM: there's no max-output-tokens API, so it generates until EOS (≈ 458-token full response), while the others cut off at 128. Decode speed is a rate (tokens per second), though, so the comparison still holds. Numbers come from LiteRT-LM's own benchmark counter (getBenchmarkInfo).
  • All are ~4-bit, but the quantization schemes differ slightly per runtime (MLX 4bit / GGUF Q4_K_M / LiteRT INT4-QAT / CoreML INT4-palettized / INT8).

Recommendations by use case

  • Just fast, general-purpose, lots of models → MLX Swift. Fastest on Qwen, easy from Swift, and mlx-community has tons of models. The first choice for local LLMs on Apple devices.
  • Gemma as fast as possible → LiteRT-LM. For the Gemma family, strongest on both speed and memory. Can't beat it.
  • Memory first (any device / coexisting with other features) → CoreML / ANE. 241 MB is exceptional. If you can tolerate the speed, it's the strongest for low memory and power.
  • Portability / run anywhere → llama.cpp. A huge pool of GGUF models and support for practically every platform. Not flashy, but solid.

Method and reproducibility

Every run was executed headlessly from a Mac via devicectl (no on-device tapping), with the models side-loaded from the Mac. The raw result JSONL and charts are in the repo. One line = "1 runtime × 1 model × 1 device". PRs welcome:

👉 https://github.com/john-rocky/apple-silicon-llm-bench
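If you want to poke at results stored as JSONL, a rough Python sketch like the one below is enough. The field names (runtime, model, device, decode_tok_s, peak_mem_mb) are hypothetical and are not taken from the repo, so check the actual files for the real schema; the sample values are the Qwen numbers from this article.

```python
import json

# Two sample lines in a HYPOTHETICAL schema; the repo's real field names may differ.
SAMPLE_JSONL = """\
{"runtime": "MLX-Swift", "model": "Qwen 3.5 2B", "device": "iPhone 17 Pro", "decode_tok_s": 61.2, "peak_mem_mb": 1279}
{"runtime": "CoreML/ANE", "model": "Qwen 3.5 2B", "device": "iPhone 17 Pro", "decode_tok_s": 27.9, "peak_mem_mb": 241}
"""


def load_records(text: str) -> list[dict]:
    """Parse JSONL: one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def best_per_model(records: list[dict], key: str, higher_is_better: bool) -> dict[str, dict]:
    """For each model, keep the record with the best value for the given key."""
    best: dict[str, dict] = {}
    for rec in records:
        current = best.get(rec["model"])
        if current is None:
            best[rec["model"]] = rec
        elif higher_is_better and rec[key] > current[key]:
            best[rec["model"]] = rec
        elif not higher_is_better and rec[key] < current[key]:
            best[rec["model"]] = rec
    return best


records = load_records(SAMPLE_JSONL)
fastest = best_per_model(records, "decode_tok_s", higher_is_better=True)
lightest = best_per_model(records, "peak_mem_mb", higher_is_better=False)
print(fastest["Qwen 3.5 2B"]["runtime"], lightest["Qwen 3.5 2B"]["runtime"])
```

Because each line is an independent record, adding a new device or runtime is just appending lines, which is presumably why the format suits PR-based contributions.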

I'll write a separate article on the behind-the-scenes of full measurement automation and the build fights (git-LFS / SwiftPM unsafe-flags / @preconcurrency, etc.).

Summary

  • On-device LLM on iPhone: "MLX / LiteRT-LM for speed, CoreML/ANE for memory."

Hope it helps your local-LLM development!


Originally published in Japanese on Qiita. I do mobile AI / CoreML / ARKit development and write about it. GitHub / X
