This is a submission for the Gemma 4 Challenge: Write About Gemma 4
I shipped a Gemma 4 assistant on Android in 17 days. Voice, vision, RAG, eight device actions, everything offline once the model is on device. The project is called PocketClaw, and you can read about it here if that's what you came for.
This post is about the 5 things I had to figure out the hard way. Not in the flutter_gemma README. Not in Google's MediaPipe docs. Not in any of the half-dozen "run Gemma on Android" tutorials I read this week.
If you're about to ship Gemma 4 on Android, I hope this saves you a weekend.
📱 Companion post: How I built PocketClaw — a fully offline AI assistant on Android with Gemma 4 E2B. Demo video, architecture deep-dive, full source code.
1. Small models drop facts buried mid-prompt. Put what matters at the top.
I've been building agents on Claude and GPT-4 for about 18 months. Both of them handle a long system prompt fine. You can mix instructions and facts in any order, and the model figures out what's a fact versus what's a behavior rule.
Gemma 4 E2B doesn't.
My first system prompt for PocketClaw looked like this:
You are Claw. You run locally and offline. You are talking to Manoj Shetty.
Match your answer length to the question. Prefer plain answers over preambles.
Never restate the question. If unsure, say so briefly.
User asks "what's my name?" Claw answers "I do not know your name." Every time. The name is sitting right there in sentence three.
I spent about an hour staring at this. My theory after the fact is that "never restate the question" was acting as a dominant instruction, and the model generalized it to "don't reference any user context." That's the kind of overgeneralization a 2B-effective model does. Cloud LLMs don't.
The fix was structural, not lexical:
final namePart = (userName != null && userName.trim().isNotEmpty)
    ? "The name of the user is ${userName.trim()}.\n\n"
    : '';
final systemPreamble = '${namePart}You are Claw, ...';
Same fact. Moved to the first line of the prompt. On its own. Flat declarative sentence. Worked the first time I tried it.
The lesson generalizes. When you're working with a 2B-class model, anything you actually want the model to remember goes at the front of the prompt, in simple sentences, with no competing instructions in the same paragraph. From what I saw, the model pays far more attention to the opening than to the middle. Treat the system prompt like a slot-filling template, not a paragraph.
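Here is a minimal sketch of that slot-filling idea. The function name `buildSystemPrompt` and its `facts` and `rules` parameters are hypothetical, not part of PocketClaw or flutter_gemma. It only shows the ordering: facts first, one flat sentence per line, then a blank line, then the behavior rules.

```dart
/// Builds a system prompt with facts first, one sentence per line,
/// followed by a blank line and then the behavior rules.
/// Hypothetical helper for illustration only.
String buildSystemPrompt({
  required List<String> facts,
  required List<String> rules,
}) {
  // Drop empty entries so a missing fact never leaves a stray blank line.
  final cleanFacts =
      facts.map((f) => f.trim()).where((f) => f.isNotEmpty).toList();
  final cleanRules =
      rules.map((r) => r.trim()).where((r) => r.isNotEmpty).toList();

  final buffer = StringBuffer();
  for (final fact in cleanFacts) {
    buffer.writeln(fact);
  }
  if (cleanFacts.isNotEmpty) {
    buffer.writeln(); // blank line separates facts from rules
  }
  buffer.write(cleanRules.join(' '));
  return buffer.toString();
}
```

Called with `facts: ['The name of the user is Manoj Shetty.']`, the name lands on the first line by itself, which is the placement that worked for me.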
2. Vanilla RAG breaks on the queries users actually type.
If you've worked with RAG before, you know the textbook setup. Chunk the document, embed the chunks, store the vectors, embed the query at retrieval time, find the closest matches. Works great on benchmarks where the queries are specific.
It doesn't work on "summarise the document."
I caught this on Friday. I'd been heads-down for two weeks. I thought I had a working build. I uploaded a PDF I had lying around (a paper on edge LLMs), typed "summarise the document", hit send. Claw said "Summarize the document." back to me. Just that. Like an echo. I tried twice more with different phrasing. Same answer.
Eventually I typed "summarise llmaiedge.pdf" with the actual filename. Got a real summary.
The problem is obvious once you see it. "Summarise the document" has almost no semantic overlap with the document text. The PDF doesn't contain the words "summarise" or "the document." Cosine similarity returns nothing useful, so the retrieved chunk list comes back empty or close to it. Gemma gets the user's question with no actual document context attached. So it does what any LLM does when starved for context. It improvises from the only text it has, which here meant echoing my question back.
The fix I shipped is two heuristics deep:
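// `lower` is the lowercased user query; `hits` holds the similarity-search results.
final isGenericIntent = hits.length <= 1 && (
  lower.contains('summari') ||
  lower.contains('tldr') ||
  lower.contains('explain') ||
  lower.contains('describe') ||
  lower.contains('the document') ||
  lower.contains('the pdf')
);
if (isGenericIntent) {
  // Similarity search found next to nothing, so fall back to per-document chunks.
  hits = await RagService.instance.getDocStarts(
    conversationId: _conversation.id,
  );
}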
getDocStarts is a small fallback method. It runs searchSimilar once per indexed doc, using each doc's filename as the query. Filenames are rare, distinctive tokens, and every chunk in the vector store has the filename in metadata. So this pulls back real chunks regardless of how the user phrased their question.
One conditional and one fallback call. The difference between "Claw can summarise PDFs" and "Claw echoes your question back at you."
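If you want to test this logic without a device, the same idea can be sketched as plain Dart. Everything here is an assumption made for illustration: the `Chunk` class, the `Search` typedef, and the `retrieve` function are hypothetical names, and the real `RagService` and `searchSimilar` signatures may differ.

```dart
/// Hypothetical chunk type; the real vector-store type will differ.
class Chunk {
  final String docName;
  final String text;
  const Chunk(this.docName, this.text);
}

/// Assumed search signature: returns chunks ranked by similarity to [query].
typedef Search = Future<List<Chunk>> Function(String query);

const genericMarkers = [
  'summari', 'tldr', 'explain', 'describe', 'the document', 'the pdf',
];

/// True when the search came back nearly empty and the query looks like
/// a request about the document as a whole.
bool isGenericIntent(String query, int hitCount) {
  final lower = query.toLowerCase();
  return hitCount <= 1 && genericMarkers.any((m) => lower.contains(m));
}

/// Runs the normal search, then falls back to one search per document
/// using the filename as the query when the intent is generic.
Future<List<Chunk>> retrieve(
  String query,
  List<String> docNames,
  Search searchSimilar,
) async {
  var hits = await searchSimilar(query);
  if (isGenericIntent(query, hits.length)) {
    final fallback = <Chunk>[];
    for (final name in docNames) {
      fallback.addAll(await searchSimilar(name));
    }
    // Keep the original hits if the fallback finds nothing either.
    if (fallback.isNotEmpty) hits = fallback;
  }
  return hits;
}
```

Passing the search function in as a parameter keeps the heuristic testable with a fake store, which is how I would catch the "summarise the document" case before it reaches a user.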
If you're building RAG on Gemma 4 (or any small on-device model really), test it with generic queries before you ship. Your textbook similarity search will fail in ways that look like the model is broken.
3. The flutter_gemma plugin bundles ~33 MB of native libraries you probably don't use.
Stock APK for PocketClaw came out at 185 MB. That felt heavy.
When I unzipped it and looked at the native libs (arm64-v8a only, I was already shipping a single-arch build), this is what I saw:
26 MB  libllm_inference_engine_jni.so                    (needed)
24 MB  libLiteRtLm.so                                    (needed)
17 MB  libgemma_embedding_model_jni.so                   (not used, I'm using Gecko)
17 MB  libgecko_embedding_model_jni.so                   (needed)
14 MB  libmediapipe_tasks_vision_jni.so                  (needed, vision input)
14 MB  libmediapipe_tasks_vision_image_generator_jni.so  (not used)
10 MB  libimagegenerator_gpu.so                          (not used)
 8 MB  libLiteRtGpuAccelerator.so                        (needed)
 8 MB  libLiteRtWebGpuAccelerator.so                     (not used, Android has OpenCL)
 9 MB  libtext_chunker_jni.so                            (needed)
The image-generation libs belong to MediaPipe's Image Generator task, which does on-device text-to-image generation. PocketClaw only consumes images (vision input to multimodal Gemma). I'm never going to generate. The WebGPU accelerator is for browsers. Android uses OpenCL. None of it does anything on my target platform.
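Four exclude patterns in android/app/build.gradle.kts, inside the android { } block (the three unused libs flagged above, plus a WebGPU sampler lib that isn't in that listing):
packaging {
    jniLibs {
        // Strip native libs the app never loads: image generation and WebGPU.
        excludes.addAll(listOf(
            "**/libimagegenerator_gpu.so",
            "**/libmediapipe_tasks_vision_image_generator_jni.so",
            "**/libLiteRtWebGpuAccelerator.so",
            "**/libLiteRtTopKWebGpuSampler.so"
        ))
    }
}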
APK dropped from 185 MB to 152 MB. 33 MB cut. Vision input still works, embedder still works, inference still works.
If your use case is similar (chat + vision input + RAG, no image generation), copy these excludes. If your use case is different, say you actually want on-device image generation, leave the image-gen libs in. The point is, audit what your plugin pulls in and exclude what you don't use. flutter_gemma is built for general capability surface, not minimum-bytes-on-device.
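To make that audit repeatable, here is a rough sketch that ranks native libraries from the text printed by `unzip -l app.apk`. It assumes the usual Info-ZIP column layout (length, date, time, name). The `NativeLib` class and `nativeLibsFromListing` function are hypothetical names, and the sizes are the uncompressed ones that `unzip -l` reports, so they won't add up exactly to the APK size.

```dart
/// One native library entry from an APK listing.
class NativeLib {
  final String path;
  final int bytes;
  const NativeLib(this.path, this.bytes);

  double get megabytes => bytes / (1024 * 1024);
}

/// Parses `unzip -l` output and returns the `.so` entries, largest first.
/// Lines that do not look like "length date time name.so" are ignored,
/// which skips the header, separator, and total rows.
List<NativeLib> nativeLibsFromListing(String listing) {
  final row = RegExp(r'^\s*(\d+)\s+\S+\s+\S+\s+(\S+\.so)\s*$');
  final libs = <NativeLib>[];
  for (final line in listing.split('\n')) {
    final match = row.firstMatch(line);
    if (match != null) {
      libs.add(NativeLib(match.group(2)!, int.parse(match.group(1)!)));
    }
  }
  libs.sort((a, b) => b.bytes.compareTo(a.bytes));
  return libs;
}
```

Feed it the listing for a release APK, print the top ten, and check each name against the features you actually ship.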
There's a second-order point here that matters more. MediaPipe is the reason flutter_gemma is so big. It's also the reason it does vision and audio at all. The llama.cpp-based alternatives ship at 30-60 MB on Android but skip multimodal entirely. So the choice is really: 152 MB with vision, or 60 MB without. There's no free lunch where you get multimodal for the size of a text-only stack. Pick based on what your product actually needs.
4. Don't feed the 128K context window. Compact it.
Gemma 4 has a 128K context window. Sounds great in theory. In practice it's a footgun.
Every token in the prompt costs latency: prefill time grows with prompt length, and every generated token attends over the whole history. Every token also costs RAM for the KV cache. On a phone, both of those are tight. If you naively shove the whole chat history into context every turn, you'll find that turn 20 takes noticeably longer than turn 5, and turn 50 might OOM the app.
PocketClaw keeps a sliding window of the most recent 24 messages in their raw form. Anything older runs through a compaction pass:
- Extract explicit facts the user has stated ("I am X", "Remember Y", "My name is Z").
- Capture unresolved goals (keywords like "fix", "todo", "issue").
- Compile both into a single lightweight summary paragraph that gets prepended to the prompt as memory.
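The steps above can be sketched roughly like this. It is a simplified stand-in, not PocketClaw's actual implementation: the `Message` and `Compacted` types, the `compact` function, and the exact regular expressions are assumptions chosen to mirror the three bullets.

```dart
class Message {
  final String role; // 'user' or 'assistant'
  final String text;
  const Message(this.role, this.text);
}

class Compacted {
  final String memory; // summary of everything older than the window
  final List<Message> recent; // raw messages kept verbatim
  const Compacted(this.memory, this.recent);
}

// Assumed trigger phrases, based on the examples in the list above.
final factPattern =
    RegExp(r"\b(i am|i'm|my name is|remember)\b", caseSensitive: false);
final goalPattern = RegExp(r'\b(fix|todo|issue)\b', caseSensitive: false);

/// Keeps the last [window] messages raw and folds older user messages
/// into a single memory paragraph of stated facts and unresolved goals.
Compacted compact(List<Message> history, {int window = 24}) {
  if (history.length <= window) return Compacted('', history);

  final cut = history.length - window;
  final olderUserTexts = history
      .sublist(0, cut)
      .where((m) => m.role == 'user')
      .map((m) => m.text.trim())
      .toList();

  final facts = olderUserTexts.where(factPattern.hasMatch).toList();
  final goals = olderUserTexts.where(goalPattern.hasMatch).toList();

  final parts = <String>[
    if (facts.isNotEmpty) 'User stated: ${facts.join(' ')}',
    if (goals.isNotEmpty) 'Unresolved goals: ${goals.join(' ')}',
  ];
  return Compacted(parts.join(' '), history.sublist(cut));
}
```

The `memory` string is what gets prepended to the prompt, and `recent` is what goes in verbatim. Keyword matching is crude, but it costs no extra inference call, which matters on a phone.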
That's the chat part. The aggressive part is image handling.
A typical user-uploaded photo is roughly 1 MB. By my estimate, carrying it in the prompt costs around 30,000 tokens (the exact figure depends on how the runtime encodes images). That's about a quarter of the entire 128K context window for one image. If the user uploads three photos across a conversation, your context budget is in trouble.
So PocketClaw does this: when an image message slides past the 24-message boundary, the raw image bytes get deleted from memory. What's preserved is the assistant's prior textual description of that image:
String _imageMemoryFromAssistant({
  required String? imageName,
  required String assistantText,
}) {
  final label = imageName ?? 'uploaded image';
  return 'Assistant previously described $label as: '
      '${_shorten(assistantText, 1000)}';
}
So Claw still "remembers" what it saw, but only the description goes through the prompt. The roughly 30,000-token image becomes a description capped at 1,000 characters, which is a few hundred tokens at most. That's a reduction of about two orders of magnitude in image memory.
The kicker: this works really well for the kind of follow-up questions users actually ask. "What was in that photo I sent earlier?" is answerable from the description alone. The model rarely needs the pixels back. If it does, the user can re-upload.
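For completeness, here is one way the swap at the window boundary could look. The `ChatMessage` class, `compactImage`, and this version of `shorten` are hypothetical stand-ins. The real `_shorten` isn't shown in this post, so the truncation behavior below is a guess at something reasonable.

```dart
class ChatMessage {
  final String text;
  final String? imageName;
  final List<int>? imageBytes; // raw image data; null once compacted
  const ChatMessage(this.text, {this.imageName, this.imageBytes});
}

/// Caps [text] at [max] characters (max must be at least 1), ending with
/// an ellipsis when truncated. Counts UTF-16 code units, so it can split
/// an emoji at the cut point.
String shorten(String text, int max) {
  final trimmed = text.trim();
  if (trimmed.length <= max) return trimmed;
  return '${trimmed.substring(0, max - 1)}…';
}

/// Replaces an image message with a text-only memory built from the
/// assistant reply that described it. The returned message carries no
/// image bytes, so the original can be garbage collected.
ChatMessage compactImage(ChatMessage image, String assistantText) {
  final label = image.imageName ?? 'uploaded image';
  return ChatMessage(
    'Assistant previously described $label as: '
    '${shorten(assistantText, 1000)}',
  );
}
```

The design choice worth copying is that the compacted message is a new object with no reference to the bytes, so nothing downstream can accidentally put the image back into the prompt.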
The general pattern here is: don't think of the context window as "free space up to the limit." Think of it as a budget. Spend it on things the model needs for the current turn. Everything else gets compacted to a textual summary.
5. Native audio in flutter_gemma is gated on Gemma 3n, not Gemma 4.
This one I want you to know so you don't waste a day like I did.
Gemma 4 E2B's model card lists native audio as a supported modality. So I thought: cool, I'll skip the speech-to-text plugin entirely, feed raw audio bytes straight to Gemma, get a single multimodal call instead of an STT-then-LLM pipeline. Cleaner.
I dug into flutter_gemma v0.15.1 source to find the audio API. Saw this comment in the interface:
/// [supportAudio] — whether the model supports audio (Gemma 3n E4B only).
bool supportAudio = false,
The plugin's audio code path is gated on Gemma 3n. If you set supportAudio: true while loading Gemma 4 E2B, you'll either get a load error or a silent failure at inference time. The native side does support audio (the C++ engine handles it fine), but the Dart-side check rejects it for non-3n models.
So PocketClaw uses Android's system STT (the speech_to_text package, which wraps Android's SpeechRecognizer API). Side benefit: I get live transcription as the user is speaking. The text appears in the input field word by word while they're holding the mic. That's a noticeably better UX than the "hold the button, speak, release, wait three seconds while audio uploads and processes, see both your words and the AI response" pattern you'd get from on-model audio.
When (if?) flutter_gemma exposes audio for E2B, the path collapses. Until then, system STT plus text-mode Gemma is the right architecture.
The takeaway isn't "audio is broken." It's: read your plugin's source before you trust its capability flags. Especially for multimodal features that span Gemma versions. Just because the model can do something doesn't mean the plugin has wrapped it for your model.
Closing
Five patterns. None of them are in the README. None of them are in Google's docs. I learned all of them by shipping something real and watching it fail in interesting ways.
If you're building on Gemma 4 for Android, these will save you time. If you want to see all five running together in a real app, PocketClaw is fully open source, MIT licensed.
The thing I keep coming back to, after 17 days with Gemma 4 E2B on a mid-range Android phone, is how capable a 2B model can be when it's running fast on the user's device. The latency feels different from cloud. There's no perceived "AI thinking" delay because there's no network. It just answers at the speed of your phone.
That's worth optimizing for.
Editorial assistance from Claude. All code, decisions, bugs, and engineering choices in this post are mine.