Hey devs 👋
I've been hands-on with Gemma 4 since it dropped 4 days ago and honestly, the E2B and E4B variants are the first models that actually feel practical for real mobile apps.
Here's the no-BS guide I wish I had: which model to load for your use case, plus exactly how to load it on Android, iOS, React Native, and web.
- Which Gemma 4 model should you actually load?
E2B (≈5.1B total params, only 2.3B active thanks to Per-Layer Embeddings)
→ Your default for phones.
Use cases: offline tutor, smart replies, chat rephrasing, note summarization, safety filters, anything battery/RAM sensitive.
Cold start is fast, and it runs smoothly on mid-range devices.
E4B (≈8B total, 4.5B effective)
→ Sweet spot for flagship phones, or when you need noticeably better reasoning plus native audio + image understanding.
Use cases: multimodal (photo → description), longer-context tasks, or when E2B feels a bit "light".
26B A4B MoE or 31B
→ Skip these on mobile. Only for laptops, desktops, or server-side heavy lifting.
Rule of thumb I use: start with E2B. Only bump to E4B if users complain about quality or you need audio/image input.
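That rule of thumb is simple enough to encode directly. A minimal sketch; the helper name, `DeviceInfo` shape, and RAM threshold are my own illustrative assumptions, not anything official:

```typescript
// Pick a Gemma variant from coarse device info.
// The 6 GB threshold is an illustrative guess, not official guidance.
type GemmaVariant = "E2B" | "E4B";

interface DeviceInfo {
  ramGb: number;            // total device RAM
  needsMultimodal: boolean; // audio/image input required?
}

function pickVariant(d: DeviceInfo): GemmaVariant {
  if (d.ramGb < 6) return "E2B";       // mid-range phones: stay small
  if (d.needsMultimodal) return "E4B"; // audio/image understanding -> E4B
  return "E2B";                        // start small, bump only on quality complaints
}
```

In practice you'd feed this from whatever device-info API your platform exposes, and let a remote config flag override it per user.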
- How to actually load the model (the part that matters)
Android
Easiest path: AICore Developer Preview (system-wide Gemma 4, zero weights to ship).
Just call the ML Kit GenAI Prompt API; Google handles hardware delegation (NPU/GPU).
For full control in your app: LiteRT-LM
Download the quantized .task file (4-bit) from HF
Use on-demand Play Asset Delivery so your APK stays <100 MB
Load it in the background with coroutines so you never block the UI
Use a streaming callback so tokens appear live
iOS
MediaPipe LLM Inference API is the official way.
Convert to the MediaPipe task format → memory-map the weights → Metal/MPS acceleration.
Warm up the model during app idle time so first token feels instant.
React Native
Native TurboModule (Kotlin + Swift) is non-negotiable.
Keep the entire model + inference in native code.
Expose only generateResponse(prompt, options) and onToken events back to JS.
Never run inference on the JS thread; you will OOM and crash.
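Here's roughly what that JS-side surface can look like. `generateResponse` and `onToken` match what I described above; the wrapper class, listener mechanics, and option names are my own sketch, not a real module you can install:

```typescript
// Thin JS-side wrapper over a hypothetical native TurboModule.
// All inference stays in Kotlin/Swift; JS only wires callbacks.
type TokenListener = (token: string) => void;

interface GemmaNativeModule {
  // Resolves with the full response once generation finishes.
  generateResponse(prompt: string, options: { maxTokens: number }): Promise<string>;
  // Subscribes to streamed tokens; returns an unsubscribe function.
  addTokenListener(listener: TokenListener): () => void;
}

class GemmaClient {
  constructor(private native: GemmaNativeModule) {}

  // Stream tokens to the UI while generation runs, resolve with full text.
  async generate(prompt: string, onToken: TokenListener): Promise<string> {
    const unsubscribe = this.native.addTokenListener(onToken);
    try {
      return await this.native.generateResponse(prompt, { maxTokens: 256 });
    } finally {
      unsubscribe(); // never leak listeners across generations
    }
  }
}
```

The nice side effect of this shape: you can unit-test your JS layer with a mock `GemmaNativeModule` and never touch a device.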
Web
MediaPipe + WebGPU (works surprisingly well in Chrome).
Universal tips that saved my ass:
Always use 4-bit quantized version (Q4_K_M or LiteRT equivalent)
Never bundle the full model in the APK/IPA; download it on first user opt-in
Cap context at 4K–8K tokens for mobile (128K is possible but eats RAM)
Stream tokens. Always. Users hate staring at a blank screen.
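The streaming tip has the same shape on every platform: the engine pushes partial results through a callback and the UI appends as they arrive. A platform-neutral sketch, where `fakeEngine` stands in for whatever runtime you actually use (LiteRT-LM, MediaPipe, etc.):

```typescript
// Simulated engine: a real one yields tokens as the NPU/GPU produces them.
async function* fakeEngine(_prompt: string): AsyncGenerator<string> {
  for (const token of ["Sure", ",", " here", " you", " go", "."]) {
    yield token;
  }
}

// Accumulate tokens and re-render on every one; never wait for the full reply.
async function streamToUi(
  prompt: string,
  render: (soFar: string) => void
): Promise<string> {
  let text = "";
  for await (const token of fakeEngine(prompt)) {
    text += token;
    render(text); // first paint happens after token #1, not after the whole answer
  }
  return text;
}
```

The `render` callback is where your setState / Compose recomposition / DOM update goes; the pattern is identical everywhere.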
Security bonus: because E2B/E4B run 100% offline, user data (exam answers, private notes, photos) never touches your servers. Huge privacy win.
I'm using this exact stack right now for an offline-first tutor app and it's buttery smooth.
Drop your use case below and I'll tell you which variant + exact loading path I'd pick for it.
Useful resources (all fresh as of April 2026):
Official Gemma 4 announcement: https://blog.google/innovation-and-ai/technology/developers-tools/gemma-4/
Model card + sizes: https://ai.google.dev/gemma/docs/core/model_card_4
Full model overview (E2B/E4B details): https://ai.google.dev/gemma/docs/core
Android AICore + ML Kit guide: https://android-developers.googleblog.com/2026/04/AI-Core-Developer-Preview.html
LiteRT-LM mobile deployment: https://ai.google.dev/edge/litert-lm
Hugging Face E2B/E4B quantized models: https://huggingface.co/google/gemma-4-E2B-it
Who's actually shipping Gemma 4 on device right now? Show me your stack 🚀