I used to think “AI in apps” meant calling an API.
Then I tried running the model inside the app itself.
No network. No latency spikes. No sending user data anywhere.
That’s when things started to feel… different.
Why this shift actually matters
Most mobile AI apps today work like this:
User → App → API → Cloud Model → Response
Which means:
• unpredictable latency
• ongoing cost
• user data leaves the device
Now compare that with:
User → App → Local Model → Response
No round trips. No dependency.
That’s what Gemma 4 enables with its edge-optimized variants (E2B / E4B).
⚙️ How you actually run it on mobile
There are two real paths here. Everything else is noise.
- Android (Best path): System-level AI via AICore
If you’re targeting modern Android:
• The model runs as part of the system
• You don’t bundle anything heavy
• OS handles optimization (CPU/GPU scheduling)
👉 This is the cleanest architecture:
• smaller APK
• better performance
• less maintenance
- Cross-platform: MediaPipe / AI Edge (Android + iOS)
This is where most devs will start.
You:
• download a Gemma model (optimized format)
• run it locally via inference API
• stream responses into your UI
What the code actually looks like
Let’s keep it real: actual code, not pseudocode.
🔹 Android (Kotlin)
```kotlin
val llm = LlmInference.createFromOptions(
    appContext,
    LlmInferenceOptions.builder()
        .setModelPath("/data/user/0/app/files/gemma-4-E2B.litertlm")
        .setMaxTokens(256)
        .build()
)

// Run inference off the main thread
CoroutineScope(Dispatchers.IO).launch {
    val response = llm.generateResponse("Explain event-driven architecture simply")
    withContext(Dispatchers.Main) {
        textView.text = response
    }
}
```
👉 Important:
• Never run this on the main thread
• Keep responses streamed if possible
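Streaming is easy to wire up regardless of which inference API you use. Here is a minimal, API-agnostic sketch of the pattern: a small accumulator that appends each partial chunk and pushes the growing text to the UI layer, so users see tokens as they arrive instead of waiting for the full answer. The class and callback names are hypothetical, not part of any SDK.

```kotlin
// Hypothetical accumulator for a partial-result callback.
// Wire onPartialResult into whatever streaming listener your
// inference API exposes; onUpdate receives the text so far.
class StreamingAccumulator(private val onUpdate: (String) -> Unit) {
    private val buffer = StringBuilder()

    fun onPartialResult(chunk: String, done: Boolean) {
        buffer.append(chunk)
        onUpdate(buffer.toString())   // push the accumulated text to the UI
        if (done) buffer.setLength(0) // reset for the next prompt
    }
}
```

In a real app, make `onUpdate` post to the main thread (e.g. via `withContext(Dispatchers.Main)`), since the inference callbacks will arrive on a background thread.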
🔹 iOS (Swift)
```swift
// try! keeps the sketch short; handle the error properly in production
let llm = try! LlmInference(
    modelPath: "gemma-4-E2B.litertlm",
    maxTokens: 256
)

DispatchQueue.global(qos: .userInitiated).async {
    let response = try? llm.generateResponse(
        input: "Explain microservices vs monolith"
    )
    DispatchQueue.main.async {
        self.outputLabel.text = response
    }
}
```
👉 Same rule applies:
• background execution is mandatory
• UI must stay responsive
⚡ Performance reality (this is where most fail)
Let’s be honest — running LLMs on phones is not “free”.
- Model size is your first constraint
Even optimized models can weigh in at hundreds of MB.
👉 Practical approach:
• Default → E2B
• Optional → E4B (for high-end devices)
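The default-vs-optional split can be a one-line decision at startup. A minimal sketch, assuming you gate on total device RAM: the 6 GB threshold and the file names are illustrative assumptions to tune against real devices, not official cutoffs.

```kotlin
// Hypothetical device-tier check: default to the smaller E2B variant,
// opt into E4B only when the device has headroom.
// The 6 GB RAM threshold is an assumption, not an official cutoff.
fun pickModelVariant(totalRamBytes: Long): String {
    val sixGb = 6L * 1024 * 1024 * 1024
    return if (totalRamBytes >= sixGb) "gemma-4-E4B.litertlm"
           else "gemma-4-E2B.litertlm"
}
```

On Android you can read total RAM from `ActivityManager.MemoryInfo().totalMem` and feed it straight into this function.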
- First response latency matters more than speed
Users don’t care about tokens/sec.
They care about:
“How fast did I get the first answer?”
👉 Fix:
• warm up model with a tiny prompt
• preload when app is idle
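The warm-up trick is just "pay the cold-start cost before the user asks". A minimal, inference-API-agnostic sketch: the model is abstracted as a `(String) -> String` function, and the wrapper fires one throwaway prompt on a background thread. All names here are hypothetical.

```kotlin
// Hypothetical warm-up wrapper: run one tiny throwaway prompt in the
// background while the app is idle, so the first real request skips
// model-load and cache-fill costs.
class WarmableModel(private val generate: (String) -> String) {
    @Volatile var warmed = false
        private set

    // Returns the background thread (or null if already warm) so
    // callers can join it if they need to.
    fun warmUpAsync(): Thread? {
        if (warmed) return null
        return Thread {
            generate("Hi") // result discarded; we only pay the cold start
            warmed = true
        }.also { it.start() }
    }
}
```

Call `warmUpAsync()` from a lifecycle hook that fires when the app is idle (e.g. after first frame), never from a cold-start critical path.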
- GPU / Metal is not optional
If you rely only on CPU:
• performance drops hard
• battery drains faster
👉 Always enable:
• GPU backend (Android)
• Metal (iOS)
- Threading mistakes will break your app
If you run inference on UI thread:
• frame drops
• ANRs
• crashes
👉 Treat model inference like:
• network call
• or heavy computation
Privacy becomes a feature (finally)
This is the part most people underestimate.
When everything runs locally:
• user input stays on device
• no logs
• no external dependency
👉 This unlocks real use cases:
• private note summarization
• personal AI assistants
• sensitive chat analysis
• offline learning tools
App size strategy (critical decision)
This is where many implementations go wrong.
❌ Don’t do this
• bundle model inside APK/IPA
• force download during install
👉 You’ll kill install conversion.
✅ Do this instead
• download model after user opts in
• store in app-specific storage
• allow deletion / re-download
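Those three rules fit in one small class. A sketch, assuming you inject the transport (OkHttp, `DownloadManager`, whatever) as a function so the storage logic stays testable; `ModelStore` and its methods are hypothetical names, not an SDK API.

```kotlin
import java.io.File

// Hypothetical post-install model manager: download after opt-in,
// keep the file in app-specific storage, allow deletion.
class ModelStore(private val dir: File, private val fetch: (String) -> ByteArray) {
    fun modelFile(name: String) = File(dir, name)

    fun isDownloaded(name: String) = modelFile(name).exists()

    // Call only after the user has explicitly opted in.
    fun download(name: String, url: String): File {
        val file = modelFile(name)
        if (!file.exists()) {
            file.parentFile?.mkdirs()
            file.writeBytes(fetch(url))
        }
        return file
    }

    fun delete(name: String) = modelFile(name).delete()
}
```

On Android, pass `context.filesDir` as `dir` so the model lives in app-specific storage and is cleaned up on uninstall.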
🧠 Even better (Android)
If available:
• use system model (AICore)
👉 Zero model shipping
👉 Zero storage overhead
Where this actually makes sense
Not every app needs on-device AI.
But for these, it’s a serious advantage:
• EdTech (offline tutor, quizzes)
• Productivity (notes, summaries)
• Messaging (privacy-first features)
• Dating apps (local intelligence, no data leak)
⚠️ Hard truth
This is not magic.
Avoid if:
• targeting low-end devices
• need heavy multi-agent orchestration
• require massive context windows
🚀 What changed for me
After experimenting with this setup, one thing became clear:
The future of mobile AI isn’t “better APIs”.
It’s “fewer APIs”.
🔚 So we’re moving from:
“Send data → wait → receive response”
to:
“Compute locally → respond instantly”
And the teams that design for this early
will build products that feel fundamentally faster and more trustworthy.
👋 If you’re building in this space
I’m building an AI-powered learning system in public.
Sharing:
• what I build
• what breaks
• what actually scales
If that’s your space too → let’s connect.