Running AI Fully Offline on Mobile with Gemma 4 (Android + iOS)

I used to think “AI in apps” meant calling an API.
Then I tried running the model inside the app itself.

No network. No latency spikes. No sending user data anywhere.

That’s when things started to feel… different.

Why this shift actually matters

Most mobile AI apps today work like this:

User → App → API → Cloud Model → Response

Which means:
• unpredictable latency
• ongoing cost
• user data leaves the device

Now compare that with:

User → App → Local Model → Response

No round trips. No dependency.

That’s what Gemma 4 enables with its edge-optimized variants (E2B / E4B).

⚙️ How you actually run it on mobile

There are two real paths here. Everything else is noise.

  1. Android (Best path): System-level AI via AICore

If you’re targeting modern Android:
• The model runs as part of the system
• You don’t bundle anything heavy
• OS handles optimization (CPU/GPU scheduling)

👉 This is the cleanest architecture:
• smaller APK
• better performance
• less maintenance
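
A rough way to structure for that, if you support both paths: hide the model source behind one small interface so the rest of the app never knows whether a response came from the system model or from a file you downloaded. The names below are hypothetical (this is an architecture sketch, not an SDK API), and the AICore surface is still evolving, so check the current Android docs before wiring it in.

// Hypothetical sketch: prefer the system-managed model (AICore), fall back to
// a downloaded Gemma file served through MediaPipe. Names are illustrative only.
interface OnDeviceLlm {
    suspend fun generate(prompt: String): String
}

class LlmRouter(
    private val systemLlm: OnDeviceLlm?,  // backed by AICore when the device exposes it
    private val fileLlm: OnDeviceLlm      // backed by MediaPipe + a downloaded model
) {
    // One call site for the whole app; the transport stays an implementation detail
    fun active(): OnDeviceLlm = systemLlm ?: fileLlm
}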

  2. Cross-platform: MediaPipe / AI Edge (Android + iOS)

This is where most devs will start.

You:
• download a Gemma model (optimized format)
• run it locally via inference API
• stream responses into your UI

What the code actually looks like

Let’s keep it real: working code, not pseudocode.

🔹 Android (Kotlin)

// Gradle: implementation("com.google.mediapipe:tasks-genai:<version>")
import com.google.mediapipe.tasks.genai.llminference.LlmInference
import com.google.mediapipe.tasks.genai.llminference.LlmInference.LlmInferenceOptions
import kotlinx.coroutines.CoroutineScope
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.launch
import kotlinx.coroutines.withContext

// Loading the model is expensive: create the engine once (e.g. in a ViewModel),
// not per request.
val options = LlmInferenceOptions.builder()
    .setModelPath("/data/user/0/app/files/gemma-4-E2B.litertlm")
    .setMaxTokens(256)
    .build()

val llm = LlmInference.createFromOptions(appContext, options)

// Run inference off the main thread: generateResponse() blocks until the
// full answer is ready.
CoroutineScope(Dispatchers.IO).launch {
    val response = llm.generateResponse("Explain event-driven architecture simply")

    withContext(Dispatchers.Main) {
        textView.text = response
    }
}


👉 Important:
• Never run this on the main thread
• Keep responses streamed if possible
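
Streaming is supported by the same Android API: attach a result listener when you build the options and call the async variant instead of the blocking one. A minimal sketch, reusing the imports and model path from above (verify the listener and method names against your tasks-genai version):

// Streaming: register a result listener up front, then call the async variant.
// Partial chunks arrive as they are generated instead of one final string.
val streamingOptions = LlmInferenceOptions.builder()
    .setModelPath("/data/user/0/app/files/gemma-4-E2B.litertlm")
    .setMaxTokens(256)
    .setResultListener { partialResult, done ->
        // The listener may fire off the main thread: post before touching views
        textView.post {
            textView.append(partialResult)
            if (done) { /* e.g. hide the typing indicator */ }
        }
    }
    .build()

val streamingLlm = LlmInference.createFromOptions(appContext, streamingOptions)
streamingLlm.generateResponseAsync("Explain event-driven architecture simply")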

🔹 iOS (Swift)

import MediaPipeTasksGenai

// Resolve the downloaded model inside app-specific storage
let modelPath = FileManager.default
    .urls(for: .documentDirectory, in: .userDomainMask)[0]
    .appendingPathComponent("gemma-4-E2B.litertlm")
    .path

let options = LlmInference.Options(modelPath: modelPath)
options.maxTokens = 256

let llm = try LlmInference(options: options)

// Keep inference off the main thread; generateResponse() blocks until done
DispatchQueue.global(qos: .userInitiated).async {
    let response = try? llm.generateResponse(
        inputText: "Explain microservices vs monolith"
    )

    DispatchQueue.main.async {
        self.outputLabel.text = response
    }
}


👉 Same rule applies:
• background execution is mandatory
• UI must stay responsive

⚡ Performance reality (this is where most fail)

Let’s be honest — running LLMs on phones is not “free”.

  1. Model size is your first constraint

Even optimized models can easily run to hundreds of MB.

👉 Practical approach:
• Default → E2B
• Optional → E4B (for high-end devices)
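
One way to implement that default-vs-optional split is a crude capability check at first launch. The 6 GB RAM threshold below is my own assumption, not an official cutoff, and the E4B filename just mirrors the E2B naming from earlier; tune both against the devices you actually test on.

import android.app.ActivityManager
import android.content.Context

// Crude heuristic: only pick the larger E4B variant on devices with plenty of RAM.
fun pickModelVariant(context: Context): String {
    val am = context.getSystemService(Context.ACTIVITY_SERVICE) as ActivityManager
    val memInfo = ActivityManager.MemoryInfo().also { am.getMemoryInfo(it) }
    val totalGb = memInfo.totalMem / (1024.0 * 1024.0 * 1024.0)

    return if (totalGb >= 6.0) "gemma-4-E4B.litertlm" else "gemma-4-E2B.litertlm"
}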

  2. First response latency matters more than speed

Users don’t care about tokens/sec.
They care about:

“How fast did I get the first answer?”

👉 Fix:
• warm up model with a tiny prompt
• preload when app is idle
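
The warm-up itself can be one throwaway generation, fired while the app is idle and discarded, so the first real prompt doesn't pay the cold-start cost. A sketch, assuming the `llm` instance and coroutine imports from the earlier snippet:

// Warm-up: push one tiny, throwaway prompt through the engine so buffers and
// caches are hot before the user's first real question.
fun warmUp(llm: LlmInference) {
    CoroutineScope(Dispatchers.IO).launch {
        runCatching { llm.generateResponse("Hi") }  // result is discarded on purpose
    }
}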

  3. GPU / Metal is not optional

If you rely only on CPU:
• performance drops hard
• battery drains faster

👉 Always enable:
• GPU backend (Android)
• Metal (iOS)
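
On the Android side, newer MediaPipe releases let you state a backend preference when building the options. Treat the setter below as an assumption to verify against your tasks-genai version; on older releases the backend is determined by how the model was converted instead.

// Assumption: your tasks-genai version exposes setPreferredBackend().
// If it doesn't, backend choice is baked in at model-conversion time.
val gpuOptions = LlmInferenceOptions.builder()
    .setModelPath("/data/user/0/app/files/gemma-4-E2B.litertlm")
    .setMaxTokens(256)
    .setPreferredBackend(LlmInference.Backend.GPU)
    .build()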

  4. Threading mistakes will break your app

If you run inference on the UI thread:
• frame drops
• ANRs
• crashes

👉 Treat model inference like:
• network call
• or heavy computation
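
Concretely, that means wrapping inference exactly like you'd wrap a network call: a suspend function that always hops off the main thread, assuming the `llm` instance from the earlier snippet.

// Treat inference like any other heavy, blocking call: a suspend function
// that never runs on the main thread.
suspend fun ask(llm: LlmInference, prompt: String): String =
    withContext(Dispatchers.IO) {
        llm.generateResponse(prompt)
    }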

Privacy becomes a feature (finally)

This is the part most people underestimate.

When everything runs locally:
• user input stays on device
• no logs
• no external dependency

👉 This unlocks real use cases:
• private note summarization
• personal AI assistants
• sensitive chat analysis
• offline learning tools

App size strategy (critical decision)

This is where many implementations go wrong.

❌ Don’t do this
• bundle model inside APK/IPA
• force download during install

👉 You’ll kill install conversion.

✅ Do this instead
• download model after user opts in
• store in app-specific storage
• allow deletion / re-download
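
A sketch of that flow: after the user explicitly opts in, stream the model into app-specific storage (`filesDir`) and give them a way to delete it later. The URL is whatever endpoint you host the weights on; nothing here is specific to Gemma.

import android.content.Context
import java.io.File
import java.net.URL
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.withContext

// Download the model into app-specific storage after explicit opt-in.
suspend fun downloadModel(context: Context, modelUrl: String): File =
    withContext(Dispatchers.IO) {
        val target = File(context.filesDir, "gemma-4-E2B.litertlm")
        if (!target.exists()) {
            URL(modelUrl).openStream().use { input ->
                target.outputStream().use { output -> input.copyTo(output) }
            }
        }
        target
    }

// Let the user reclaim the space at any time.
fun deleteModel(context: Context): Boolean =
    File(context.filesDir, "gemma-4-E2B.litertlm").delete()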

🧠 Even better (Android)

If available:
• use system model (AICore)

👉 Zero model shipping
👉 Zero storage overhead

Where this actually makes sense

Not every app needs on-device AI.

But for these, it’s a serious advantage:
• EdTech (offline tutor, quizzes)
• Productivity (notes, summaries)
• Messaging (privacy-first features)
• Dating apps (local intelligence, no data leak)

⚠️ Hard truth

This is not magic.

Avoid if:
• targeting low-end devices
• need heavy multi-agent orchestration
• require massive context windows

🚀 What changed for me

After experimenting with this setup, one thing became clear:

The future of mobile AI isn’t “better APIs”
It’s “fewer APIs”

🔚 So we’re moving from:

“Send data → wait → receive response”

to:

“Compute locally → respond instantly”

And the teams that design for this early
will build products that feel fundamentally faster and more trustworthy.

👋 If you’re building in this space

I’m building an AI-powered learning system in public.

Sharing:
• what I build
• what breaks
• what actually scales

If that’s your space too → let’s connect.
