System Rationale

Gemma 4 on Mobile: Which Model to Load (E2B vs E4B) + Real Implementation Guide

Hey devs 👋
I’ve been hands-on with Gemma 4 since it dropped 4 days ago, and honestly the E2B and E4B variants are the first models that actually feel practical for real mobile apps.
Here’s the no-BS guide I wish I had: which model to load for your use case, plus exactly how to load it on Android, iOS, React Native, and the web.

  1. Which Gemma 4 model should you actually load?

E2B (≈5.1B total params, only 2.3B active thanks to Per-Layer Embeddings)
→ Your default for phones.
Use cases: offline tutor, smart replies, chat rephrasing, note summarization, safety filters, anything battery/RAM sensitive.
Cold start is fast and it runs smoothly on mid-range devices.
E4B (≈8B total, 4.5B effective)
→ Sweet spot for flagship phones, or when you need noticeably better reasoning plus native audio and image understanding.
Use cases: multimodal (photo → description), longer-context tasks, or when E2B feels a bit “light”.
26B A4B MoE or 31B
β†’ Skip these on mobile. Only for laptops, desktops, or server-side heavy lifting.

Rule of thumb I use: start with E2B. Only bump to E4B if users complain about quality or you need audio/image input.
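If you want that rule of thumb in code, here’s a rough Kotlin sketch. `DeviceProfile` and `pickVariant` are my own hypothetical names, and the RAM threshold is illustrative, not official guidance:

```kotlin
// Hypothetical helper: picks a Gemma 4 variant from device constraints.
// The 12 GB threshold is a guess at "flagship", not an official number.
data class DeviceProfile(val ramGb: Int, val needsAudioOrImage: Boolean)

fun pickVariant(profile: DeviceProfile): String = when {
    profile.needsAudioOrImage -> "E4B"  // audio/image input needs E4B
    profile.ramGb >= 12 -> "E4B"        // flagship headroom, better reasoning
    else -> "E2B"                       // safe default for everything else
}
```

In practice I’d also gate on a quick on-device benchmark, but this captures the decision.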

  2. How to actually load the model (the part that matters)

Android

Easiest path: AICore Developer Preview (system-wide Gemma 4, zero weights to ship).
Just call the ML Kit GenAI Prompt API, and Google handles hardware delegation (NPU/GPU).
For full control in your app: LiteRT-LM
Download the quantized .task file (4-bit) from HF
Use on-demand Play Asset Delivery so your APK stays <100 MB
Load in the background with coroutines → never block the UI
Use streaming callback so tokens appear live
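Putting the load-in-background + streaming steps together, here’s the shape of it in plain Kotlin. `LlmEngine` is a stand-in interface I made up (the real LiteRT-LM / MediaPipe API differs), and a real app would use a coroutine dispatcher rather than a raw thread:

```kotlin
import java.util.concurrent.CountDownLatch
import kotlin.concurrent.thread

// Stand-in for the real inference API; only the callback shape matters here.
interface LlmEngine {
    fun generate(prompt: String, onToken: (String) -> Unit, onDone: () -> Unit)
}

// Stub that "streams" a canned answer token by token.
class FakeEngine : LlmEngine {
    override fun generate(prompt: String, onToken: (String) -> Unit, onDone: () -> Unit) {
        "Hello from Gemma".split(" ").forEach { onToken("$it ") }
        onDone()
    }
}

// Run inference off the caller's thread, surface tokens live, return the full text.
fun streamCompletion(engine: LlmEngine, prompt: String, onToken: (String) -> Unit): String {
    val out = StringBuilder()
    val done = CountDownLatch(1)
    thread { // in a real app: viewModelScope.launch(Dispatchers.Default) { ... }
        engine.generate(prompt, { tok -> out.append(tok); onToken(tok) }, { done.countDown() })
    }
    done.await()
    return out.toString().trim()
}
```

The point is the split: the UI only ever sees `onToken` callbacks, so tokens render live while the heavy work stays off the main thread.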

iOS

MediaPipe LLM Inference API is the official way.
Convert to the MediaPipe task format → memory-map the weights → Metal/MPS acceleration.
Warm up the model during app idle time so first token feels instant.

React Native

Native TurboModule (Kotlin + Swift) is non-negotiable.
Keep the entire model + inference in native code.
Expose only generateResponse(prompt, options) and onToken events back to JS.
Never run inference on the JS thread: you will OOM and crash.
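Here’s a sketch of how narrow that native surface should be, Kotlin side. `GemmaModule` and `TokenListener` are hypothetical names; a real TurboModule would extend `ReactContextBaseJavaModule` and emit events over the RN event emitter, but the API shape is the point:

```kotlin
// Hypothetical TurboModule surface: one method in, one event stream out.
fun interface TokenListener { fun onToken(token: String) }

class GemmaModule {
    private val listeners = mutableListOf<TokenListener>()

    // JS subscribes to onToken events through this (via the RN event emitter in practice).
    fun addTokenListener(l: TokenListener) { listeners += l }

    // The ONLY method exposed to JS; model loading and inference stay native.
    fun generateResponse(prompt: String, options: Map<String, Any> = emptyMap()): String {
        val tokens = listOf("ok:", prompt)  // stubbed inference output
        tokens.forEach { t -> listeners.forEach { it.onToken(t) } }
        return tokens.joinToString(" ")
    }
}
```

Everything else (weights, sessions, KV cache) never crosses the bridge, which is what keeps the JS thread safe.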

Web

MediaPipe + WebGPU (works surprisingly well in Chrome).

Universal tips that saved my ass:

Always use a 4-bit quantized version (Q4_K_M or the LiteRT equivalent)
Never bundle the full model in the APK/IPA: download it on first user opt-in
Cap context at 4K–8K tokens for mobile (128K is possible but eats RAM)
Stream tokens. Always. Users hate staring at a blank screen.
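For the context cap, here’s the trimming logic I mean, as a rough Kotlin sketch. The ~4-chars-per-token ratio is a common heuristic for English text, not exact; a real app should count with the model’s actual tokenizer:

```kotlin
// Keep only the most recent chat turns that fit in the token budget.
// ~4 chars/token is a rough English heuristic; use the real tokenizer in production.
fun capContext(history: List<String>, maxTokens: Int = 4096): List<String> {
    val budgetChars = maxTokens * 4
    var used = 0
    val kept = ArrayDeque<String>()
    for (msg in history.asReversed()) {  // walk newest-first
        if (used + msg.length > budgetChars) break
        kept.addFirst(msg)               // re-insert in original order
        used += msg.length
    }
    return kept.toList()
}
```

Dropping the oldest turns first keeps the conversation coherent while holding RAM flat no matter how long the session runs.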

Security bonus: because E2B/E4B run 100% offline, user data (exam answers, private notes, photos) never touches your servers. Huge privacy win.
I’m using this exact stack right now for an offline-first tutor app and it’s buttery smooth.
Drop your use case below and I’ll tell you which variant + exact loading path I’d pick for it.
Useful resources (all fresh as of April 2026):

Official Gemma 4 announcement: https://blog.google/innovation-and-ai/technology/developers-tools/gemma-4/
Model card + sizes: https://ai.google.dev/gemma/docs/core/model_card_4
Full model overview (E2B/E4B details): https://ai.google.dev/gemma/docs/core
Android AICore + ML Kit guide: https://android-developers.googleblog.com/2026/04/AI-Core-Developer-Preview.html
LiteRT-LM mobile deployment: https://ai.google.dev/edge/litert-lm
Hugging Face E2B/E4B quantized models: https://huggingface.co/google/gemma-4-E2B-it

Who’s actually shipping Gemma 4 on device right now? Show me your stack 🙌
