Back in 2005, when I was running the cyber café, I had my first real encounter with the gap between something working and something being useful. I set up a proxy cache to optimize bandwidth. It ran perfectly. The logs showed hits. And yet, the kids kept complaining that Counter-Strike lagged just as bad. The technical solution existed, but the real problem — real-time game latency — wasn't even close to solved.
When I read that Gemma 4 runs natively on iPhone with fully offline inference, the first thing I did was try to replicate it. I've got an iPhone 13 with 128GB nearly full — dog photos, Xcode builds I forgot to delete, three TestFlight versions of my own projects. Not exactly ideal hardware. But here we go.
Spoiler: it works. And the story of why that's not enough is exactly the same story as all mobile local inference since we first started trying.
LLM on-device on iPhone: what Google actually pulled off
Gemma 4 is the fourth generation of Google's open-source model family. What changed this time is the distribution of the model in quantized variants small enough to run on ARM mobile hardware with a Neural Engine — specifically Apple's A-series chips.
The setup uses the MediaPipe LLM Inference API, which Google released for iOS, and a Gemma 4 model quantized to INT4 or INT8 depending on the variant. The model that ends up on the device weighs around 2GB for the smallest variant. Not trivial when your phone has 4GB of RAM shared between the OS, apps, and the model itself.
What makes this technically interesting isn't that an LLM runs on mobile — that happened before, with LLaMA and Mistral variants. What's different here is:
- Official Google support — which brings serious tooling, not an experimental weekend port
- MediaPipe integration — which already has mature infrastructure for edge inference
- The timing — landing exactly when Apple is under maximum pressure for falling behind on AI
How I replicated it (including everything that went wrong)
First attempt: follow the official MediaPipe docs for iOS. The basic setup requires an Xcode project with MediaPipeTasksGenAI as an SPM dependency:
// Package.swift — add the MediaPipe dependency
.package(
    url: "https://github.com/google/mediapipe",
    // Check for the latest version before using this
    from: "0.10.14"
),
The model itself you download from Hugging Face — the gemma-4-it-gpu-int4 variant is what I tested. It's a 1.8GB download. With my home connection, that took exactly long enough for me to regret starting it halfway through and keep going anyway.
Once you have the model in the bundle (or a local path), the initialization is surprisingly clean:
import MediaPipeTasksGenAI
// Configure LLM options
let options = LlmInference.Options(modelPath: modelPath)
options.maxTokens = 1024
// maxTopK controls response diversity
options.maxTopK = 40
// Initialize inference — this takes several seconds
let llmInference = try LlmInference(options: options)
// Generate a response
let result = try llmInference.generateResponse(
inputText: "Explain what a transformer is in 3 lines"
)
print(result)
First real problem: model load time. On my iPhone 13, initializing the model context takes between 8 and 12 seconds. That's not a number you can hide behind a "processing your request" spinner. It's an eternity in mobile UX terms.
Second problem: generation speed. The INT4 variant generates around 8–12 tokens per second on my hardware. For short text, that's acceptable. For any response that needs more than 200 tokens, you're staring at a blinking cursor for 20 seconds. The average user closes the app in 10.
Third problem — and this is the most interesting one — response quality at that model size. Gemma 4 in the quantized mobile variant is not the same Gemma 4 you run on a server with an A100. The quantization, context trimming, low-memory optimizations: all of that gets paid for in quality. Not dramatically, but enough that the experience feels like talking to an assistant who took a sleeping pill.
The real gap: between 'runs' and 'useful for something concrete'
Here's where the story gets interesting, because this isn't unique to Gemma 4 or iPhone. It's the story of all local inference on devices since we first started trying.
The problem isn't technical in the classic sense. The numbers are there — the model runs, it generates tokens, it doesn't crash. The problem is that the use cases that justify having an LLM embedded in a mobile app are exactly the use cases that suffer most from hardware constraints:
- Conversational assistants: you need long context and fast response time. Both are limited.
- Offline text processing: here the use case is stronger — taking notes and summarizing them without internet, for example. At 8–12 tokens/sec with reasonable quality, this starts to make sense.
- Offline code completion: forget it. You need a larger model with specific code training for that.
- RAG over local documents: this is the most promising case. If you combine local inference with a well-configured MCP server, the privacy of local data becomes a concrete argument.
What became clear to me after two days playing with this: the use cases where mobile local inference genuinely wins are the cases where the model doesn't need to be very smart — text classification, entity extraction, simple sentiment analysis. For that, there are more efficient solutions than a 2GB LLM.
For the cases where you actually need real intelligence, tokens per second still aren't there.
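For that first lightweight class of tasks, Apple platforms already ship the tooling: the NaturalLanguage framework does on-device sentiment scoring with no model download at all. A minimal sketch using Apple's NLTagger (scores run from -1.0, negative, to 1.0, positive):

```swift
import NaturalLanguage

// On-device sentiment without an LLM: NLTagger assigns each
// paragraph a score between -1.0 (negative) and 1.0 (positive)
func sentimentScore(for text: String) -> Double {
    let tagger = NLTagger(tagSchemes: [.sentimentScore])
    tagger.string = text
    let (tag, _) = tagger.tag(at: text.startIndex,
                              unit: .paragraph,
                              scheme: .sentimentScore)
    return Double(tag?.rawValue ?? "0") ?? 0
}

// Runs in milliseconds and uses a few MB, no token budget involved
print(sentimentScore(for: "The app crashed twice before loading."))
```

If all you need is a polarity signal or a label, this class of API beats a 2GB LLM on every metric that matters on a phone.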
What this does to the Apple-as-AI-laggard argument
Here's the truly interesting part of this moment.
Apple has been getting hammered by analysts, tech press, users, everyone — saying it fell behind on AI. Siri is an embarrassment compared to Gemini or ChatGPT. Apple Intelligence arrived late and with fewer features than promised. The A-series chip has had a Neural Engine since the A11, but they underused it for years.
And now Google — not Apple — is the one demonstrating that Apple's hardware can run serious local inference.
That's an interesting strategic move. Google is basically saying: "the hardware you've been selling for years is good enough for what we built." Apple ends up in an uncomfortable position where its own silicon is being used as an argument for a competitor's ecosystem.
At the same time, this pressures Apple to show what it can do with full-stack control — something Google doesn't have. If anyone should be able to optimize local inference on iPhone better than anyone else, it's Apple. They have the compiler, the runtime, the hardware, and the operating system.
What we're watching is basically Google forcing Apple's hand. And for those of us building software, that means the window for Apple to ignore this is closing fast.
It also reminds me of something I learned when I was optimizing Docker images for production: size matters, but it matters relative to what it gives you. A 2GB model generating 10 tokens/sec has to justify that cost with use cases that genuinely can't go to the cloud. Privacy. Offline. Zero network latency.
Those use cases exist. But there are fewer of them than the hype implies.
Common mistakes when you try this
Mistake 1: Downloading the largest available model
There are larger Gemma 4 variants. On iPhone, don't use them. The memory limit for an iOS app under normal conditions is around 3–4GB before the system starts killing processes. A large model eats all that space and the OS kills your app before you finish loading.
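Before initializing anything, it's worth checking how much headroom the app actually has. A rough preflight sketch using `os_proc_available_memory()` (the ~2GB figure is the model size from earlier; the 1.5x safety factor is my guess, not a measured number):

```swift
import os

// Rough preflight: refuse to load the model without comfortable
// headroom, instead of letting the OS kill the app mid-load
let modelBytes: UInt64 = 2 * 1024 * 1024 * 1024
let available = UInt64(os_proc_available_memory())

if available < modelBytes * 3 / 2 {
    print("Not enough headroom: \(available / 1_048_576) MB free")
    // Fall back to a cloud call or a smaller variant here
} else {
    // Reasonably safe to initialize LlmInference
}
```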
Mistake 2: Not treating load time as a first-class citizen
Model onboarding has to happen in the background, with persistent state. You can't initialize the model context on every request. Load it once, keep the state, and if the system kills it for memory, reload it with a UI that communicates that clearly. If you over-engineer the agent architecture before solving this basic problem, you'll build something that's never usable.
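One way to sketch that "load once, keep the state" rule, reusing the `LlmInference` type from earlier (the actor and method names here are mine, not part of the MediaPipe API):

```swift
import MediaPipeTasksGenAI

// Hypothetical wrapper: loads the model once, off the main thread,
// and reuses the same context for every subsequent request
actor ModelStore {
    static let shared = ModelStore()
    private var llm: LlmInference?

    func inference(modelPath: String) throws -> LlmInference {
        if let llm { return llm }  // already loaded: reuse
        let options = LlmInference.Options(modelPath: modelPath)
        let loaded = try LlmInference(options: options)  // the 8–12s hit, paid once
        llm = loaded
        return loaded
    }

    // If the system evicts the model under memory pressure,
    // drop the reference and let the next call reload it
    func reset() { llm = nil }
}
```

Because it's an actor, concurrent callers serialize on the load instead of initializing two 2GB contexts at once.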
Mistake 3: Expecting cloud API quality
The quantized mobile model is not the same model. Lower your quality expectations one notch and design more directive prompts — less ambiguity, shorter contexts. The difference between a well-designed prompt for mobile inference and a generic one can be enormous.
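A concrete example of what "more directive" means in practice. Both prompts are my own illustration, not from Google's docs:

```swift
let noteText = "Met with the vendor, pricing still open, demo moved to Friday."

// Generic prompt: invites a long, rambling answer that a
// quantized mobile model generates slowly and imprecisely
let generic = "Tell me about this meeting note: \(noteText)"

// Directive prompt: fixed format, explicit length cap, no ambiguity
let directive = """
Summarize the note below in exactly 3 bullet points, \
max 12 words each. Output only the bullets.

Note: \(noteText)
"""
```

The directive version constrains both the output length (fewer tokens at 8–12 tokens/sec) and the room the model has to drift.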
Mistake 4: Not measuring tokens/sec on your target device
Every chip generation is different. What runs fine on an iPhone 15 Pro can be unusable on an iPhone 12. Measure on the most limited hardware in your target audience before committing to the feature.
// Measure generation speed — useful for deciding if the model is viable
let startTime = Date()
var tokenCount = 0
// Generate with a progress callback to count tokens and a
// completion callback to stop the clock only when generation ends
try llmInference.generateResponseAsync(
    inputText: prompt,
    progress: { partialResult, error in
        if let partial = partialResult {
            // Approximation: count words as a proxy for tokens
            tokenCount += partial.split(separator: " ").count
        }
    },
    completion: {
        let elapsed = Date().timeIntervalSince(startTime)
        let tokensPerSec = Double(tokenCount) / elapsed
        print("Approximate speed: \(tokensPerSec) tokens/sec")
    }
)
FAQ — Common questions about on-device LLM on iPhone
What's the minimum iPhone you need to run Gemma 4 offline?
Google's official guide indicates compatibility from iPhone 12 onward, which has the A14 Bionic chip with a 16-core Neural Engine. In practice, the experience is significantly better from iPhone 14 up. On iPhone 12 and 13, load times and generation speed are the most noticeable bottlenecks.
How much storage does Gemma 4 take on the device?
The INT4 quantized mobile variant takes approximately 1.8GB on disk. Add the MediaPipe runtime and your own app on top of that. You're looking at around 2.2–2.5GB of additional space on the device. Not trivial for users with 64GB or 128GB nearly full — like me.
Does data leave the device with local inference?
No. That's the core promise of on-device inference: the text you send to the model and the responses it generates never leave the hardware. There are no calls to any server. For use cases with sensitive data — medical notes, legal documents, private conversations — this is the strongest argument in favor of local inference.
Is it worth it compared to just calling the Gemini API?
It depends entirely on the use case. If the user needs internet anyway to use your app, the cloud API gives better quality, more speed, and zero storage overhead. On-device makes sense when: a) data privacy is critical, b) the use case is genuinely offline, or c) you want zero network latency for very short, simple interactions.
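Those three criteria collapse into a trivial routing check. A sketch, with type and field names I made up for illustration:

```swift
// Hypothetical routing based on the three criteria above
struct InferenceRequest {
    let containsSensitiveData: Bool
    let mustWorkOffline: Bool
    let expectedTokens: Int
}

enum Backend { case onDevice, cloudAPI }

func route(_ req: InferenceRequest) -> Backend {
    if req.containsSensitiveData { return .onDevice }  // (a) privacy
    if req.mustWorkOffline { return .onDevice }        // (b) offline
    if req.expectedTokens < 100 { return .onDevice }   // (c) short and latency-sensitive
    return .cloudAPI  // otherwise quality and speed favor the API
}
```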
Does this work on Android too?
Yes, and on Android the story is even more fragmented. MediaPipe LLM Inference API supports Android, but the variety of chipsets (Qualcomm, MediaTek, Google Tensor) makes performance vary much more than in Apple's controlled ecosystem. On Pixel 8 with Tensor G3, the numbers are similar to iPhone 13.
How different is this from Apple's Core ML?
Core ML is Apple's framework for on-device ML, but it's designed primarily for specific inference models (image classification, scoped NLP, object detection) — not generative LLMs. Apple Foundation Models in iOS 18 is the direct equivalent, but access is limited and the models are proprietary. MediaPipe with Gemma 4 is the first open-source option with full model control for iOS.
The hardware won the race the software hasn't finished yet
Gemma 4 running natively on iPhone is a real technical milestone. It's not empty hype. The numbers are there, the code runs, and the concrete use cases — though more limited than the headline suggests — exist.
But the most important thing about this moment isn't the model. It's that we're reaching the point where the hardware in the phones already in people's pockets is sufficient for serious local inference. That changes the conversation about privacy, about cloud dependency, about what kinds of apps are possible without internet.
And it puts enormous pressure on Apple to show what it can do when it controls every layer of the stack. If Google can do this with limited hardware access, imagine what Apple should be able to do with full access.
I'm still sitting here with my iPhone 13 with 128GB nearly full and Gemma 4 installed in an Xcode project that probably won't make it to production. But the next time I design an automation routine that processes sensitive text, I'll evaluate on-device before sending data to an external server.
The gap between 'runs' and 'useful' still exists. But it's closing faster than I expected.
If you've tested this on your hardware, tell me what numbers you got — tokens/sec vary quite a bit by chip generation and I want to build my own dataset of real-world performance.