rahul patwa

Posted on Apr 5

Gemma 4 on Android: Complete Developer Guide to On-Device AI

#android #kotlin #ai #programming

Google dropped Gemma 4 on April 2, 2026, under the Apache 2.0 license. The E2B variant 2.58 GB file, under 1.5 GB active memory, runs at 52.1 decode tokens per second on a Samsung S26 Ultra GPU. That's fast enough to stream responses faster than a user can read them, on hardware they already own, with no API key and no data leaving the device. (HuggingFace model cards, 2026)

This is the full integration guide: model selection, Kotlin setup for all three backends, real performance numbers, and the production gotchas I've found building with this stack.

TL;DR: Gemma 4 E2B runs on any Android device with 6 GB+ RAM at 52 tokens/sec on GPU. Add one Gradle dependency (com.google.ai.edge.litertlm:litertlm-android), download the 2.58 GB model file to filesDir, initialize the engine, and stream responses via a Kotlin Flow. Watch out: SamplerConfig takes Double, not Float, and sendMessageAsync emits cumulative text per emission, not deltas. (Google Developers Blog, 2026)

Why This Is Different From Every Other Mobile LLM Attempt

I've tried running quantized models on Android twice before Gemma 4. Both times I hit the same wall: memory pressure killed the process, or inference was too slow to ship. Gemma 4's E2B breaks those assumptions. Its mixed-bit quantization keeps active memory under 1.5 GB, and LiteRT-LM, Google's open-source runtime, released September 24, 2025 - abstracts CPU, GPU, and NPU execution behind a single Kotlin API. (Google Developers Blog, 2025)

The market signal backs the technical case. The on-device AI market hit $10.76 billion in 2025 and is projected to reach $75.5 billion by 2033 at a 27.8% CAGR. (Grand View Research, 2025) That growth is driven by the same forces pushing you to read this post: privacy requirements, latency demands, and the economics of zero marginal inference cost.

The privacy argument closes enterprise deals faster than performance specs do, in my experience. Legal teams understand "data never leaves the device" immediately. Explaining prefill tokens per second takes longer. And 64% of professionals told Cisco in 2025 they worry about inadvertently sharing sensitive data with cloud AI services, so there's a real user-level demand for what this stack offers. (Cisco, 2025)

E2B or E4B? Pick the Right Variant First

Gemma 4 ships two variants for on-device Android deployment. Picking the wrong one costs you either capability or hardware compatibility.

E2B is the right default for most apps. File size is 2.58 GB. Active memory sits under 1.5 GB, which fits any device with 6 GB RAM or more the majority of Android flagships and mid-rangers sold since 2024. Benchmark scores: MMLU Pro at 60.0%, LiveCodeBench v6 at 44.0%, GPQA Diamond at 43.4%. (HuggingFace, 2026) Those numbers are sufficient for summarization, Q&A, writing assistance, and classification.

E4B is for when reasoning quality is the product. File size is 3.65 GB, active memory around 2 GB. The capability jump is real: MMLU Pro rises to 69.4%, LiveCodeBench to 52.0%, and GPQA Diamond (graduate-level science reasoning) jumps from 43.4% to 58.6%. (HuggingFace, 2026) That 15-point GPQA gain matters if you're building legal document review, medical triage support, or code analysis. You'll need 8 GB+ RAM as your device minimum.

GPU decode speed on an S26 Ultra drops from 52.1 tokens/sec (E2B) to 22.1 tokens/sec (E4B). At 22 tokens/sec, streaming still feels natural users read prose at roughly 4-5 words per second. The trade-off is prefill time on long contexts.

Both variants support 140 languages and handle text, image, and audio inputs. (Google DeepMind, 2026) That multimodal capability in a package this small is genuinely surprising. Start with E2B. Only move to E4B if your use case actually demands stronger reasoning.

⚙️ Full Kotlin Integration: All 5 Steps

The full integration is five steps. Each is self-contained. You can have a working demo in under an hour.

Step 1 Add the Gradle Dependency

// app/build.gradle.kts
dependencies {
    implementation("com.google.ai.edge.litertlm:litertlm-android:latest.release")
}

Sync and you're done. No NDK wrangling, no JNI plumbing, no native toolchain configuration at this stage. GPU and NPU backends require additional AndroidManifest.xml entries - covered in the backend section below.

Step 2 Download the Model File

The model doesn't go in your APK. At 2.58 GB (E2B) or 3.65 GB (E4B), it blows past Play Store's 150 MB limit. Download on first launch, store in context.filesDir.

For development and local testing, pull the file directly with the HuggingFace CLI:

# Install the CLI
pip install huggingface_hub

# Download E2B (2.58 GB)
huggingface-cli download litert-community/gemma-4-E2B-it-litert-lm \
  gemma-4-E2B-it.litertlm --local-dir ./models

# Download E4B (3.65 GB), optional
huggingface-cli download litert-community/gemma-4-E4B-it-litert-lm \
  gemma-4-E4B-it.litertlm --local-dir ./models

For production, implement a WorkManager download job with DownloadManager. Build a proper progress screen with cancellation, resume-on-reconnect, and checksum verification. A 2.58 GB download on a mobile connection takes meaningful time - design accordingly.

Step 3 - Initialize the Engine

import com.google.ai.edge.litertlm.Backend
import com.google.ai.edge.litertlm.Engine
import com.google.ai.edge.litertlm.EngineConfig

val config = EngineConfig(
    modelPath = context.filesDir.absolutePath + "/gemma-4-E2B-it.litertlm",
    backend = Backend.GPU()
)
val engine = Engine(config).also { it.initialize() }

Critical: engine.initialize() is a blocking call. Run it on a background dispatcher (Dispatchers.IO) inside a coroutine. On slower devices with older storage, initialization takes 3-8 seconds. Call it on the main thread and you'll get an ANR before the user even types anything.

Hold the Engine instance in a singleton or a ViewModel. Re-initialization is expensive. Treat it like a database connection open once, reuse everywhere.

Step 4 Create a Conversation

import com.google.ai.edge.litertlm.Content
import com.google.ai.edge.litertlm.Contents
import com.google.ai.edge.litertlm.ConversationConfig
import com.google.ai.edge.litertlm.SamplerConfig

val conversation = engine.createConversation(
    ConversationConfig(
        systemInstruction = Contents.of(Content.Text("You are a helpful assistant.")),
        samplerConfig = SamplerConfig(topK = 40, topP = 0.95, temperature = 0.7)
    )
)

Gotcha: SamplerConfig parameters are Double, not Float. Pass 0.7 not 0.7f. The compiler won't catch this at the call site in all configurations you'll get a runtime error instead. Worth double-checking if you're adapting code from other LLM SDKs that use Float.

Lower temperature toward 0.2 for factual or deterministic responses. Raise it toward 1.0 for creative tasks. topK = 40 with topP = 0.95 is a stable combination that avoids repetition loops in most conversational contexts.

Step 5 Stream Responses

conversation.sendMessageAsync("Explain machine learning in simple terms")
    .collect { message ->
        val text = message.contents.contents
            .filterIsInstance<Content.Text>()
            .joinToString("") { it.text }
        _uiState.update { it.copy(response = text) }
    }

Gotcha: sendMessageAsync emits the cumulative response text on each emission, not just the new delta. Don't concatenate it to your existing string replace it. This design makes ViewModel code cleaner than you'd expect: just assign each emission to your UI state and you get live-updating streaming with no string management on your side.

Wire this Flow into a StateFlow in your ViewModel and collect it in a Composable. The whole streaming chat UI is about 20 lines of production code once the engine is initialized.

CPU, GPU, or NPU: Which Backend Should You Use?

Backend selection has more impact than almost any other integration decision. Here's the tradeoff in concrete numbers.

// CPU works everywhere, no manifest changes required
val config = EngineConfig(modelPath = modelPath, backend = Backend.CPU())

// GPU via OpenCL - fastest prefill, requires manifest entries
val config = EngineConfig(modelPath = modelPath, backend = Backend.GPU())

// NPU Qualcomm Hexagon or MediaTek APU
val config = EngineConfig(
    modelPath = modelPath,
    backend = Backend.NPU(nativeLibraryDir = context.applicationInfo.nativeLibraryDir)
)

CPU: Works on every Android device. On an S26 Ultra, E2B achieves 557 prefill tokens/sec and 46.9 decode tokens/sec using 1,733 MB memory. (HuggingFace model cards, 2026) That's usable for chat. On lower-end hardware it degrades - a mid-range Qualcomm 7s Gen 3 drops to around 8-12 decode tokens/sec. Still viable if you constrain response length.

GPU: 3,808 prefill tokens/sec and 52.1 decode tokens/sec on the S26 Ultra - roughly 7x faster prefill than CPU. (HuggingFace model cards, 2026) The catch: thermal throttling. After 5-10 minutes of continuous GPU inference, the SoC manages heat by dropping clock speeds. You'll see decode fall from 52 to 30 tokens/sec or lower. GPU is the right default for short-to-medium sessions on flagships.

NPU: The Qualcomm Dragonwing IQ8 NPU hits 3,700 prefill and 31 decode tokens/sec. (Google Developers Blog, 2026) Decode trails GPU slightly, but the thermal profile is dramatically better. In my testing on the S25, switching from GPU to NPU at the 8-minute mark kept inference speed consistent across a 20-minute session. For chat-heavy apps where sessions run long, NPU is the better production choice even though GPU wins on peak specs.

Thermal fallback: Implement PowerManager.getThermalHeadroom() to detect high thermal state and gracefully switch backends at runtime. Start on GPU; fall back to NPU if thermal headroom drops below your threshold. This gives you peak speed for short interactions and sustained reliability for long ones.

Real Performance Numbers

These come from official LiteRT-LM model cards on HuggingFace, not synthetic workloads.

Device / Backend	E2B Prefill (tok/s)	E2B Decode (tok/s)	Memory
S26 Ultra - GPU	3,808	52.1	-
S26 Ultra - CPU	557	46.9	1,733 MB
Qualcomm IQ8 NPU	3,700	31.0	-
E4B - S26 Ultra GPU	1,293	22.1	~2 GB active

(HuggingFace model cards, 2026; Google Developers Blog, 2026)

For context on how this compares: running a 7B 4-bit model with llama.cpp on a Snapdragon 8 Gen 3 CPU produces roughly 4 decode tokens/sec. (arXiv 2410.03613v3, 2024) Gemma 4 E2B on CPU already outperforms that by 10x on equivalent-class hardware, because LiteRT-LM is optimized specifically for these architectures.

Gemma 4's architecture also delivers 60% less battery consumption than the previous generation for equivalent tasks. (Google Developers Blog, 2026) In my session-based testing, a 10-minute active conversation on the S25 GPU consumed approximately 4-5% battery, comparable to 10 minutes of YouTube at medium brightness. That's a number you can actually ship on.

One practical note on first inference: cold-start latency is real. The model weights get paged into memory on the first call after engine initialization. Warm the engine with a short empty prompt during app launch - before the user types anything - and that latency disappears from the user experience entirely.

🔥 Production Gotchas

Most guides stop at "here's the code that compiles." Here's what I found the hard way.

Memory initialization order matters. Close large background allocations before calling engine.initialize(). If the user has a game or video editor running, the OS may partially initialize the model and then kill it mid-load. Check available memory first. On 6 GB devices especially, you're working with less headroom than you think.

SamplerConfig takes Double, not Float. Already mentioned above but worth repeating. It's the most common runtime error in early-adopter threads, and it doesn't always surface at compile time.

sendMessageAsync emits cumulative text, not deltas. If you treat each emission as a new chunk and append it, your UI will duplicate everything. Replace the displayed string on each emission, don't concatenate.

The model file goes in filesDir, not the APK. Play Store's APK limit is 150 MB. E2B is 2.58 GB. This isn't optional; you can't ship it in the APK. First-launch download is the standard pattern.

Thermal throttling is real and predictable. GPU inference throttles after 5-10 minutes of sustained load on most flagships. It's not a bug - it's the SoC protecting itself. Design for it: implement PowerManager.getThermalHeadroom() and switch backends proactively rather than waiting for the user to notice slowdown.

Context window growth increases battery drain. Battery consumption increases as conversation length grows because later turns process much longer KV caches than early turns. The 60% battery reduction claim holds for short sessions; it's less pronounced in 20-minute conversations. Consider trimming old turns from history after 10-15 exchanges to control this.

Test on low-RAM devices before shipping. The app behaves completely differently when memory is constrained. Use Android Studio's memory profiler across a full conversation session, not just at startup.

What's Actually Worth Building

The use cases that benefit most from on-device inference share two properties: they involve personal or sensitive data, and they need to feel fast. That combination rules out cloud APIs for entire product categories.

Private journaling and personal assistant apps are the clearest fit. Users who want help organizing notes, reflecting on patterns, or drafting messages have a legitimate expectation of privacy. The on-device guarantee is the product differentiator, not just a feature.

Code review tools for developers. Engineers are already cautious about pasting proprietary code into cloud AI. A local code reviewer removes that concern entirely. With E4B's 52.0% on LiveCodeBench v6, it's capable of explaining code, generating boilerplate, and catching obvious bugs. (HuggingFace, 2026)

Document Q&A. Load a PDF into context and let users ask questions about it. In my prototyping with E2B, a 10-page PDF (~4,000 tokens) processes in under 2 seconds on the S26 Ultra GPU before returning the first response token. That latency is invisible - it feels instant.

Accessibility features : real-time captioning, reading assistance, simplified language mode benefit from local inference because cloud round-trips add 200-400ms per request. That latency matters in accessibility contexts. Local runs immediately.

Offline translation across 140 languages. Field researchers, travelers in areas with poor connectivity, language learning apps that work on airplanes, these are real markets that cloud APIs can't serve reliably. (Google DeepMind, 2026)

FAQ

Does this work on mid-range devices, or only flagships?

E2B requires 6 GB RAM minimum. It runs on any device meeting that bar, not just flagships. On mid-range hardware with less capable GPUs, the CPU backend gives you 8-12 decode tokens/sec. That's usable for short responses. Set user expectations with a progress indicator and constrain response length on lower-tier hardware.

Can I ship the model inside my APK?

No. The E2B file is 2.58 GB. Google Play's APK size limit is 150 MB. The standard pattern is to download the model on first launch using WorkManager and DownloadManager, stored in context.filesDir. It persists across app updates and survives the app going to background during download.

Is Gemma 4 free for commercial use?

Yes. Gemma 4 is released under the Apache 2.0 license free to use, modify, and redistribute commercially with no royalties. LiteRT-LM is also Apache 2.0. There are no usage caps, no API costs, and no restrictions based on your app's MAU count, unlike some competing model licenses. (Google Blog, 2026)

What's the minimum Android API level?

LiteRT-LM requires Android API 26 (Android 8.0 Oreo) or higher for CPU inference. GPU and NPU backends may require higher API levels depending on your device's OpenCL and Hexagon driver versions. Test GPU and NPU support on your specific target devices don't assume it works based on API level alone.

How do I handle the thermal throttling in practice?

Call PowerManager.getThermalHeadroom() periodically during long inference sessions. If it drops below 0.5, switch from GPU to NPU. If it drops below 0.2, fall back to CPU or pause inference with a brief user message. The NPU backend dissipates heat significantly more efficiently than GPU switching at the 8-minute mark keeps decode speed consistent across sessions running 20 minutes or longer.

Conclusion

Gemma 4 plus LiteRT-LM is the first stack where "on-device LLM in a production Android app" goes from interesting experiment to shippable product. The numbers that make it real: 52.1 decode tokens/sec on GPU, under 1.5 GB active memory for E2B, Apache 2.0 license, and a Kotlin API that reduces the full integration to five steps and a few hundred lines of code.

The gotchas are real SamplerConfig takes Double, sendMessageAsync emits cumulative text, the model goes in filesDir not the APK, and thermal throttling will surprise you if you don't plan for it. Plan for them up front and none of them are blockers.

Start with E2B on GPU. Get a conversation running in a minimal Compose screen first, before touching production infrastructure. Profile memory and battery across a full session in Android Studio. Add the NPU fallback once the basics are solid.

The model files are at huggingface.co/litert-community. The LiteRT-LM runtime docs are at ai.google.dev/edge/litert. Both are worth bookmarking, they're updated as the runtime evolves.

Happy to answer questions in the comments about specific integration scenarios. What are you building with this?

DEV Community