A developer just ran a 400-billion-parameter large language model on an iPhone 17 Pro. Not on a server. Not through an API. Directly on the phone, with airplane mode on.
The software behind the demo is Flash-MoE, an open-source project by @anemll. It generates text at 0.6 tokens per second, roughly one word every two seconds. That's glacially slow compared to cloud inference. But the fact that it runs at all on a device with 12GB of RAM is a genuine engineering breakthrough, and it signals something much bigger for the future of mobile AI.
📊 The numbers: 400 billion parameters. 12GB of RAM. 0.6 tokens/second. The model requires a minimum of 200GB of memory when compressed — the iPhone has 6% of that. Flash-MoE bridges the gap by streaming model weights from SSD to GPU on demand.
This story hit Hacker News and sparked a heated debate about what "running" an LLM actually means, whether this is a stunt or a genuine preview of the future, and how far mobile hardware still needs to go.
What Happened: Flash-MoE on iPhone 17 Pro
The demo, posted by developer @anemll on Twitter, shows an iPhone 17 Pro running a 400B parameter Mixture of Experts (MoE) model entirely on-device. No cloud. No internet. Just the phone's A19 Pro chip and its internal flash storage.
The key insight: this isn't a dense 400B model. It's a Mixture of Experts architecture with 512 experts per layer, where only 4-10 experts are activated for each token. That means the phone never needs to hold all 400B parameters in memory at once — just the small fraction that's actively computing.
Here's how the system works:
- SSD-to-GPU streaming. Instead of loading the entire model into RAM (impossible with 12GB), Flash-MoE streams model weights from the phone's fast NVMe storage directly to the GPU as needed.
- Mixture of Experts routing. The MoE architecture determines which expert sub-networks are needed for each token, then loads only those experts from storage.
- Quantization. The model weights are aggressively compressed to reduce the data that needs to be transferred per expert.
- Expert prefetching. Techniques from Apple's 2023 "LLM in a Flash" research paper help predict which experts will be needed next, so they can be fetched from storage before they're required.
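The routing and streaming steps above can be sketched in a few lines of Python. This is a minimal illustration, not Flash-MoE's actual code: `ExpertCache`, `route`, and `fake_flash_read` are hypothetical names, the cache capacity of 16 is arbitrary, and the router scores are made up. Only the 512 experts per layer and the 4 active experts (the low end of the 4-10 range) come from the demo description.

```python
import heapq
from collections import OrderedDict

NUM_EXPERTS = 512   # experts per layer, from the article
TOP_K = 4           # low end of the article's 4-10 active experts


class ExpertCache:
    """Holds a bounded number of experts in RAM, evicting the least recently used."""

    def __init__(self, capacity, load_fn):
        self.capacity = capacity
        self.load_fn = load_fn          # hypothetical stand-in for a flash read
        self.resident = OrderedDict()   # expert_id -> weights
        self.flash_reads = 0

    def get(self, expert_id):
        if expert_id in self.resident:
            self.resident.move_to_end(expert_id)   # mark as recently used
            return self.resident[expert_id]
        self.flash_reads += 1
        weights = self.load_fn(expert_id)
        self.resident[expert_id] = weights
        if len(self.resident) > self.capacity:
            self.resident.popitem(last=False)      # evict the coldest expert
        return weights


def route(router_scores, k=TOP_K):
    """Return the ids of the k highest-scoring experts for one token."""
    return heapq.nlargest(k, range(len(router_scores)), key=router_scores.__getitem__)


def fake_flash_read(expert_id):
    # Placeholder: a real system would read quantized weights from storage here.
    return f"weights-for-expert-{expert_id}"


cache = ExpertCache(capacity=16, load_fn=fake_flash_read)

# Two consecutive tokens whose (made-up) router scores favour overlapping experts.
scores_a = [0.0] * NUM_EXPERTS
scores_b = [0.0] * NUM_EXPERTS
for expert, score in [(3, 0.9), (40, 0.8), (77, 0.7), (200, 0.6)]:
    scores_a[expert] = score
for expert, score in [(3, 0.9), (40, 0.8), (77, 0.7), (511, 0.6)]:
    scores_b[expert] = score

experts_a = route(scores_a)
experts_b = route(scores_b)
for expert in experts_a + experts_b:
    cache.get(expert)

print(experts_a, experts_b, cache.flash_reads)
# [3, 40, 77, 200] [3, 40, 77, 511] 5
```

The second token reuses three experts that are already resident, so it costs one flash read instead of four. That reuse is the behaviour the streaming design relies on.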
The Apple "LLM in a Flash" Connection
This demo builds directly on Apple's December 2023 research paper, "LLM in a Flash: Efficient Large Language Model Inference with Limited Memory", which laid out the theoretical framework for running models larger than available RAM by intelligently streaming data from flash storage.
The paper proposed two key innovations:
- Windowing — reusing recently activated neurons to reduce data transfer. Since consecutive tokens often activate similar experts, you can keep hot experts in RAM and only load cold ones from storage.
- Row-column bundling — reading larger, contiguous chunks from flash storage rather than many small random reads. Flash storage is fast for sequential reads but slow for random access.
Apple's research showed that these techniques could enable running models up to twice the size of available DRAM on Apple silicon. Flash-MoE pushes the approach much further, to a model roughly 17x larger than the iPhone's RAM, by combining it with MoE's inherent sparsity.
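Both ideas from the paper are simple enough to sketch with toy data. The snippet below is an illustration under stated assumptions, not the paper's implementation: `SlidingWindowLoader` and `bundle` are hypothetical names, the "units" can be read as neurons or experts, and the tiny matrices are made up.

```python
from collections import deque


class SlidingWindowLoader:
    """Windowing: keep units used by the last `window` tokens, load only what is missing."""

    def __init__(self, window):
        self.recent = deque(maxlen=window)   # one set of active ids per past token

    def step(self, active_ids):
        active = set(active_ids)
        resident = set().union(*self.recent)  # everything still held in RAM
        to_load = active - resident
        self.recent.append(active)            # the oldest token's set drops out
        return to_load


def bundle(up_proj, down_proj):
    """Row-column bundling: store column i of the up projection next to row i of
    the down projection, so one contiguous read fetches everything for unit i."""
    return [
        [row[i] for row in up_proj] + list(down_proj[i])
        for i in range(len(down_proj))
    ]


loader = SlidingWindowLoader(window=2)
loads = [loader.step(ids) for ids in ([1, 2, 3], [2, 3, 4], [3, 4, 5], [1, 5, 6])]
print(loads)

up_proj = [[1, 2], [3, 4], [5, 6]]      # toy 3 x 2 matrix
down_proj = [[7, 8, 9], [10, 11, 12]]   # toy 2 x 3 matrix
print(bundle(up_proj, down_proj))
# [[1, 3, 5, 7, 8, 9], [2, 4, 6, 10, 11, 12]]
```

With a window of two tokens, the four steps load 3, 1, 1, and 2 units, for 7 in total, where loading everything fresh each time would take 12.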
💡 Why MoE is the key: A dense 400B parameter model would need to load every parameter for every token. An MoE model with 512 experts per layer activates only 4-10 of them per token, which is less than 2% of the expert weights in each layer. Combined with SSD streaming, this makes the "impossible" merely very slow.
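The percentages quoted in this piece are easy to check. The short script below only redoes the arithmetic from the figures already given (512 experts, 4-10 active, 12GB of RAM, a 200GB compressed model); `active_fraction` is a helper name introduced here. It assumes all experts are the same size and ignores attention and other always-loaded weights.

```python
def active_fraction(active_experts, experts_per_layer):
    """Share of a layer's experts that a single token touches."""
    return active_experts / experts_per_layer


EXPERTS_PER_LAYER = 512
for active in (4, 10):
    share = active_fraction(active, EXPERTS_PER_LAYER)
    print(f"{active}/{EXPERTS_PER_LAYER} experts = {share:.2%}")

ram_gb, model_gb = 12, 200
print(f"RAM covers {ram_gb / model_gb:.0%} of the model")
print(f"Model is {model_gb / ram_gb:.1f}x the RAM")
# 4/512 experts = 0.78%
# 10/512 experts = 1.95%
# RAM covers 6% of the model
# Model is 16.7x the RAM
```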
Why On-Device AI Matters (Even When It's Slow)
1. Privacy Without Compromise
When an LLM runs on your device, your prompts never leave the phone. No data transmitted to servers. No retention policies. No third-party access. For sensitive queries — medical questions, financial planning, legal advice — this is transformative.
2. Offline Access
Cloud AI fails when you need it most — on a plane, in a dead zone, during a server outage. On-device AI works anywhere your phone does.
3. Zero Marginal Cost
Every cloud AI query has a cost — either per-token pricing or a subscription fee. On-device inference is free after the initial hardware investment.
4. Latency for Simple Tasks
For short, simple queries, on-device inference can actually be faster than cloud — no network round-trip, no queue, no cold start.
On-Device vs. Cloud: When Each Wins
| Factor | On-Device AI | Cloud AI |
|---|---|---|
| Privacy | ✅ Complete | ⚠️ Provider-dependent |
| Offline | ✅ Works anywhere | ❌ Requires internet |
| Cost per query | ✅ Free | ⚠️ Per-token/subscription |
| Speed (current) | ❌ 0.6 t/s for a 400B model | ✅ 50-200+ t/s |
| Model capability | ⚠️ Limited by RAM | ✅ No constraints |
| Context window | ❌ Severely limited | ✅ 100K-1M+ tokens |
For running AI locally on desktop hardware with more RAM and better GPUs, see our guide to running LLMs on your own hardware.
What This Enables: The On-Device Agent Future
The 0.6 t/s speed is a red herring. The real story is what happens when you combine these techniques with smaller, purpose-built models.
Local Siri Replacement
An on-device language model handling routine requests — timers, factual questions, summarizing notifications, drafting replies — without any server round-trip would be faster, more private, and more reliable than today's cloud-dependent Siri.
Private AI Assistants
Imagine an AI that reads your email, manages your calendar, and drafts responses — all without data ever leaving your phone.
On-Device Agents
The current generation of AI agents runs in the cloud. On-device agents that browse local files and interact with apps — all without a network connection — represent the next frontier. The best AI coding assistants already show what's possible with deep local context.
The Hardware Bottleneck: RAM Is Everything
The iPhone 17 Pro ships with 12GB of LPDDR5X RAM:
- A quantized 7B model needs ~4GB — fits comfortably
- A quantized 14B model needs ~8GB — tight but doable
- A quantized 70B model needs ~40GB — not on current iPhones
- A quantized 400B model needs ~200GB — hence the SSD streaming workaround
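The sizes in this list are consistent with roughly 4 bits per weight. The sketch below assumes exactly that; the article does not state the quantization level, so `bits_per_weight=4` is a guess, and `quantized_size_gb` is a helper name introduced here. The gap between these raw weight sizes and the rounded figures above would be headroom for things like activations and the KV cache.

```python
def quantized_size_gb(params_billion, bits_per_weight=4):
    """Weight storage in GB: 1e9 params * bits / 8 bits-per-byte / 1e9 bytes-per-GB."""
    return params_billion * bits_per_weight / 8


for params in (7, 14, 70, 400):
    print(f"{params}B at 4-bit: {quantized_size_gb(params):.1f} GB of weights")
# 7B at 4-bit: 3.5 GB of weights
# 14B at 4-bit: 7.0 GB of weights
# 70B at 4-bit: 35.0 GB of weights
# 400B at 4-bit: 200.0 GB of weights
```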
The real unlock for practical on-device AI isn't streaming 400B models from storage. It's Apple shipping iPhones with enough RAM to run 14B-30B models comfortably at usable speeds — rivaling today's Claude, ChatGPT, and Gemini for everyday tasks.
⚠️ The honest assessment: Running a 400B model at 0.6 t/s is a technical milestone, not a consumer feature. The real value is proving SSD-to-GPU expert streaming works on mobile. Apply this technique to a 7B or 14B MoE model and you get usable speeds with meaningful capability — entirely on-device.
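A rough way to see why smaller MoE models should stream faster: if every token has to wait on flash reads, decode speed can't exceed read bandwidth divided by the data streamed per token. The sketch below encodes only that ceiling. The function name and every number in it are hypothetical placeholders, not measurements of the iPhone or of Flash-MoE, and it ignores compute time and cache hits.

```python
def storage_bound_tokens_per_second(gb_streamed_per_token, read_gb_per_second):
    """Upper bound on decode speed when each token waits on flash reads."""
    return read_gb_per_second / gb_streamed_per_token


# Hypothetical inputs for illustration only, not measurements of any device.
READ_GB_PER_SECOND = 2.0
for label, gb_per_token in [("larger MoE", 3.0), ("smaller MoE", 0.1)]:
    ceiling = storage_bound_tokens_per_second(gb_per_token, READ_GB_PER_SECOND)
    print(f"{label}: at most {ceiling:.1f} tokens/s")
```

The ceiling scales inversely with the bytes fetched per token, so shrinking the experts or keeping more of them resident in RAM raises it directly.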
What Comes Next
Near-term (2026-2027): Apple ships increasingly capable on-device models through iOS updates. Small MoE models (7B-14B) run at 10-20 t/s on flagship phones.
Medium-term (2027-2028): iPhones ship with 16-24GB of RAM. On-device models handle most routine AI tasks at usable speeds. Cloud becomes the fallback, not the default.
Long-term (2028+): The phone becomes the primary AI compute platform for personal tasks. Your private data stays private by default, not by policy.
🔮 The bet: Within two years, your phone will run a 14B-parameter AI model at conversational speed, entirely offline. It won't match GPT-5.4 or Claude Opus on complex reasoning — but for 80% of daily AI use, it'll be indistinguishable. And it'll be free, private, and always available.
The Bottom Line
A 400B LLM running at 0.6 tokens per second on an iPhone is a proof of concept, not a product. But it proves something important: the technique works. SSD-to-GPU streaming, MoE sparsity, and flash-storage read optimizations can run models dramatically larger than available RAM on mobile hardware.
The practical implications are enormous. Not because anyone will chat with a 400B model on their phone — but because these same techniques, applied to smaller models, will deliver genuinely useful AI that runs entirely on-device. Private. Offline. Free.
Sources: @anemll on Twitter, Hacker News discussion, Apple "LLM in a Flash" research, WCCFTech coverage
