DEV Community: Manoj Shetty

5 things `flutter_gemma` doesn't tell you about shipping Gemma 4 on Android

Manoj Shetty — Sun, 24 May 2026 04:59:41 +0000

This is a submission for the Gemma 4 Challenge: Write About Gemma 4

I shipped a Gemma 4 assistant on Android in 17 days. Voice, vision, RAG, eight device actions, everything offline once the model is on device. The project is called PocketClaw, and you can read about it in the companion post if that's what you came for.

This post is about the 5 things I had to figure out the hard way. Not in the flutter_gemma README. Not in Google's MediaPipe docs. Not in any of the half-dozen "run Gemma on Android" tutorials I read this week.

If you're about to ship Gemma 4 on Android, I hope this saves you a weekend.

📱 Companion post: How I built PocketClaw — a fully offline AI assistant on Android with Gemma 4 E2B. Demo video, architecture deep-dive, full source code.

1. Small models drop facts buried mid-prompt. Put what matters at the top.

I've been building agents on Claude and GPT-4 for about 18 months. Both of them handle a long system prompt fine. You can mix instructions and facts in any order, and the model figures out what's a fact versus what's a behavior rule.

Gemma 4 E2B doesn't.

My first system prompt for PocketClaw looked like this:

You are Claw. You run locally and offline. You are talking to Manoj Shetty.
Match your answer length to the question. Prefer plain answers over preambles.
Never restate the question. If unsure, say so briefly.

User asks "what's my name?" Claw answers "I do not know your name." Every time. The name is sitting right there in sentence three.

I spent about an hour staring at this. My theory after the fact is that "never restate the question" was acting as a dominant instruction, and the model generalized it to "don't reference any user context." That's the kind of overgeneralization a 2B-effective model does. Cloud LLMs don't.

The fix was structural, not lexical:

final namePart = (userName != null && userName.trim().isNotEmpty)
    ? "The name of the user is ${userName.trim()}.\n\n"
    : '';
final systemPreamble = '${namePart}You are Claw, ...';

Same fact. Moved to the first line of the prompt. On its own. Flat declarative sentence. Worked the first time I tried it.

The lesson generalizes. When you're working with a 2B-class model, anything you actually want the model to remember goes at the front of the prompt, in simple sentences, with no competing instructions in the same paragraph. The model is paying way more attention to the opening than to the middle. Treat the system prompt like a slot-filling template, not a paragraph.
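
Here is a minimal sketch of the slot-filling idea in Dart. The `buildSystemPrompt` helper and its `facts` / `rules` parameters are hypothetical names for illustration, not PocketClaw's actual code. Facts go first, one flat sentence per line, then a blank line, then the behavior rules.

```dart
/// Builds a system prompt with facts first, one flat sentence per line,
/// followed by a blank line and then the behavior rules.
/// `facts` and `rules` are hypothetical inputs for this sketch.
String buildSystemPrompt({
  required List<String> facts,
  required List<String> rules,
}) {
  // Drop empty entries so a missing value never leaves a blank slot.
  final cleanFacts =
      facts.map((f) => f.trim()).where((f) => f.isNotEmpty).toList();
  final cleanRules =
      rules.map((r) => r.trim()).where((r) => r.isNotEmpty).toList();

  final buffer = StringBuffer();
  if (cleanFacts.isNotEmpty) {
    // Facts open the prompt, each on its own line.
    buffer.writeln(cleanFacts.join('\n'));
    // Blank line separates facts from instructions.
    buffer.writeln();
  }
  // Rules follow as a single line of short sentences.
  buffer.write(cleanRules.join(' '));
  return buffer.toString();
}
```

Called with one fact and two rules, this puts the fact on line one, a blank line after it, and both rules on the last line. Adding a new fact means adding a list entry, not editing a paragraph.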

2. Vanilla RAG breaks on the queries users actually type.

If you've worked with RAG before, you know the textbook setup. Chunk the document, embed the chunks, store the vectors, embed the query at retrieval time, find the closest matches. Works great on benchmarks where the queries are specific.

It doesn't work on "summarise the document."

I caught this on Friday. I'd been heads-down for two weeks. I thought I had a working build. I uploaded a PDF I had lying around (a paper on edge LLMs), typed "summarise the document", hit send. Claw said "Summarize the document." back to me. Just that. Like an echo. I tried twice more with different phrasing. Same answer.

Eventually I typed "summarise llmaiedge.pdf" with the actual filename. Got a real summary.

The problem is obvious once you see it. "Summarise the document" has zero semantic overlap with the document text. The PDF doesn't contain the words "summarise" or "the document." Cosine similarity returns nothing useful. The retrieved chunk list is empty. Gemma gets the user's question with no actual document context attached. So it does what any LLM does when starved for context. It hallucinates a generic answer from its training data.

The fix I shipped is two heuristics deep:

final isGenericIntent = hits.length <= 1 && (
    lower.contains('summari') ||
    lower.contains('tldr') ||
    lower.contains('explain') ||
    lower.contains('describe') ||
    lower.contains('the document') ||
    lower.contains('the pdf')
);

if (isGenericIntent) {
    hits = await RagService.instance.getDocStarts(
        conversationId: _conversation.id,
    );
}

getDocStarts is a small fallback method. It runs searchSimilar once per indexed doc, using each doc's filename as the query. Filenames are rare distinctive tokens. Every chunk in the vector store has the filename in metadata. So this pulls back real chunks regardless of how the user phrased their question.

Two lines of conditional logic. The difference between "Claw can summarise PDFs" and "Claw echoes your question back at you."

If you're building RAG on Gemma 4 (or any small on-device model really), test it with generic queries before you ship. Your textbook similarity search will fail in ways that look like the model is broken.
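
The snippet above depends on `lower` and `hits` from the surrounding class. A self-contained version of the same check, which is easier to unit-test with generic queries, might look like this. The function signature and the `genericMarkers` name are assumptions for this sketch; the marker list mirrors the one shown above.

```dart
/// Substrings that suggest a request about the document as a whole.
/// Mirrors the markers in the post; extend it for your own users.
const genericMarkers = <String>[
  'summari', // matches both "summarise" and "summarize"
  'tldr',
  'explain',
  'describe',
  'the document',
  'the pdf',
];

/// Returns true when retrieval came back with at most one hit and the
/// query reads like a whole-document request.
bool isGenericIntent(String query, int hitCount) {
  if (hitCount > 1) return false;
  final lower = query.toLowerCase();
  return genericMarkers.any((marker) => lower.contains(marker));
}
```

Keeping the check as a pure function means a handful of generic phrasings can go into a test file and run on every build, before any model is loaded.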

3. The flutter_gemma plugin bundles ~33 MB of native libraries you probably don't use.

Stock APK for PocketClaw came out at 185 MB. That felt heavy.

When I unzipped it and looked at the native libs (arm64-v8a only, I was already shipping a single-arch build), this is what I saw:

26 MB  libllm_inference_engine_jni.so                     (needed)
24 MB  libLiteRtLm.so                                     (needed)
17 MB  libgemma_embedding_model_jni.so                    (don't use, using Gecko)
17 MB  libgecko_embedding_model_jni.so                    (needed)
14 MB  libmediapipe_tasks_vision_jni.so                   (needed, vision input)
14 MB  libmediapipe_tasks_vision_image_generator_jni.so   (NOT USED)
10 MB  libimagegenerator_gpu.so                           (NOT USED)
 9 MB  libtext_chunker_jni.so                             (needed)
 8 MB  libLiteRtGpuAccelerator.so                         (needed)
 8 MB  libLiteRtWebGpuAccelerator.so                      (NOT USED, Android has OpenCL)

The image-generation libs back MediaPipe's image generation task. PocketClaw only consumes images (vision input to multimodal Gemma) and never generates them. The WebGPU accelerator is for browsers; Android uses OpenCL. None of it does anything on my target platform. The excludes also drop libLiteRtTopKWebGpuSampler.so, a WebGPU sampler that isn't in the listing above.

Four lines in android/app/build.gradle.kts:

packaging {
    jniLibs {
        excludes.addAll(listOf(
            "**/libimagegenerator_gpu.so",
            "**/libmediapipe_tasks_vision_image_generator_jni.so",
            "**/libLiteRtWebGpuAccelerator.so",
            "**/libLiteRtTopKWebGpuSampler.so"
        ))
    }
}

APK dropped from 185 MB to 152 MB. 33 MB cut. Vision input still works, embedder still works, inference still works.

If your use case is similar (chat + vision input + RAG, no image generation), copy these excludes. If your use case is different, say you actually want on-device image generation, leave the image-gen libs in. The point is, audit what your plugin pulls in and exclude what you don't use. flutter_gemma is built for general capability surface, not minimum-bytes-on-device.
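
For the audit itself, a small script is enough. The sketch below assumes you have already unzipped the APK and pass the path to its lib/ folder; `listNativeLibs` is a hypothetical helper name, and it only uses `dart:io`.

```dart
import 'dart:io';

/// Lists .so files under [libDir] (the lib/ folder of an unzipped APK),
/// largest first, as "size-in-MB  file-name" lines.
List<String> listNativeLibs(String libDir) {
  final files = Directory(libDir)
      .listSync(recursive: true)
      .whereType<File>()
      .where((f) => f.path.endsWith('.so'))
      .toList()
    // Sort descending by file size so the heaviest libs come first.
    ..sort((a, b) => b.lengthSync().compareTo(a.lengthSync()));

  return [
    for (final f in files)
      '${(f.lengthSync() / (1024 * 1024)).toStringAsFixed(1)} MB  '
          '${f.uri.pathSegments.last}',
  ];
}
```

Run it before and after adding excludes and compare the two listings. Anything large that you can't tie to a feature you ship is a candidate to investigate.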

There's a second-order point here that matters more. MediaPipe is the reason flutter_gemma is so big. It's also the reason it handles vision and (in Gemma 3n's case) audio. Text-focused alternatives like llama.cpp wrappers can ship at 30-60 MB on Android but with much more limited or no multimodal coverage today. So the choice is really: 152 MB with mature vision support, or 60 MB without. There's no free lunch where you get multimodal at the size of a text-only stack. Pick based on what your product actually needs.

4. Don't feed the 128K context window. Compact it.

Gemma 4 has a 128K context window. Sounds great in theory. In practice it's a footgun.

Every token in the prompt costs latency: once at prefill, and again on every decode step, because attention runs over a longer cache. Every token costs RAM. On a phone, both of those are tight. If you naively shove the whole chat history into context every turn, you'll find that turn 20 takes noticeably longer than turn 5, and turn 50 might OOM the app.

PocketClaw keeps a sliding window of the most recent 24 messages in their raw form. Anything older runs through a compaction pass:

Extract explicit facts the user has stated ("I am X", "Remember Y", "My name is Z").
Capture unresolved goals (keywords like "fix", "todo", "issue").
Compile both into a single lightweight summary paragraph that gets prepended to the prompt as memory.
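
The three steps above can be sketched roughly as follows. This is an illustration under assumptions, not PocketClaw's compaction engine: `ChatMessage`, `CompactedHistory`, `compact`, and both regular expressions are hypothetical, and real fact extraction needs more than keyword matching.

```dart
class ChatMessage {
  final String role; // 'user' or 'assistant'
  final String text;
  const ChatMessage(this.role, this.text);
}

class CompactedHistory {
  final String memory; // summary of older messages, may be empty
  final List<ChatMessage> recent; // kept verbatim
  const CompactedHistory(this.memory, this.recent);
}

// Keyword patterns taken from the examples in the post.
final _factPattern =
    RegExp(r'\b(i am|my name is|remember)\b', caseSensitive: false);
final _goalPattern = RegExp(r'\b(fix|todo|issue)\b', caseSensitive: false);

/// Keeps the last [window] messages raw and folds older user messages
/// into a short memory paragraph.
CompactedHistory compact(List<ChatMessage> history, {int window = 24}) {
  if (history.length <= window) {
    return CompactedHistory('', List.of(history));
  }
  final split = history.length - window;
  final older = history.sublist(0, split);
  final recent = history.sublist(split);

  final facts = <String>[];
  final goals = <String>[];
  for (final m in older) {
    if (m.role != 'user') continue; // only the user states facts and goals
    if (_factPattern.hasMatch(m.text)) {
      facts.add(m.text.trim());
    } else if (_goalPattern.hasMatch(m.text)) {
      goals.add(m.text.trim());
    }
  }

  final parts = <String>[
    if (facts.isNotEmpty) 'User facts: ${facts.join(' ')}',
    if (goals.isNotEmpty) 'Open goals: ${goals.join(' ')}',
  ];
  return CompactedHistory(parts.join(' '), recent);
}
```

The `memory` string is what gets prepended to the prompt. Older messages that match neither pattern are simply dropped, which is the trade-off this approach makes.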

That's the chat part. The aggressive part is image handling.

A typical user-uploaded photo is roughly 1 MB. As base64 in a prompt, that's around 30,000 tokens. That's a quarter of the entire 128K context window for one image. If the user uploads three photos across a conversation, your context budget is in trouble.

So PocketClaw does this: when an image message slides past the 24-message boundary, the raw image bytes get deleted from memory. What's preserved is the assistant's prior textual description of that image:

String _imageMemoryFromAssistant({
  required String? imageName,
  required String assistantText,
}) {
  final label = imageName ?? 'uploaded image';
  return 'Assistant previously described $label as: '
         '${_shorten(assistantText, 1000)}';
}

So Claw still "remembers" what it saw, but only the description goes through the prompt. The 30,000-token base64 blob becomes a 100-token summary. That's roughly a 300x compression of image memory.
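
`_shorten` isn't shown above. A plausible stand-in, assuming it simply caps the description by character count, could look like this. The name `shorten` is hypothetical, and the whitespace collapsing is an extra assumption of this sketch.

```dart
/// Truncates [text] to at most [maxChars] characters, adding "..." when
/// something was cut. Hypothetical stand-in for the post's `_shorten`.
/// Lengths are UTF-16 code units, so a cut can land inside an emoji.
String shorten(String text, int maxChars) {
  // Collapse runs of whitespace so newlines don't eat the budget.
  final collapsed = text.replaceAll(RegExp(r'\s+'), ' ').trim();
  if (collapsed.length <= maxChars) return collapsed;
  // Too small for an ellipsis: hard cut.
  if (maxChars <= 3) return collapsed.substring(0, maxChars);
  return '${collapsed.substring(0, maxChars - 3).trimRight()}...';
}
```

The cap bounds the worst case: however long the assistant's original description was, the memory line never grows past the limit you pass in.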

The kicker: this works really well for the kind of follow-up questions users actually ask. "What was in that photo I sent earlier?" is answerable from the description alone. The model rarely needs the pixels back. If it does, the user can re-upload.

The general pattern here is: don't think of the context window as "free space up to the limit." Think of it as a budget. Spend it on things the model needs for the current turn. Everything else gets compacted to a textual summary.

5. Whether audio works in `flutter_gemma` depends on your model file, not the Gemma version.

This one I want you to know so you don't spend a day chasing the wrong thing like I did.

Gemma 4 E2B's model card lists native audio as a supported modality. So I thought: cool, I'll skip the speech-to-text plugin entirely, feed raw audio bytes straight to Gemma, get a single multimodal call instead of an STT-then-LLM pipeline. Cleaner.

I dug into flutter_gemma v0.15.1 source. The plugin's documentation consistently frames audio as a Gemma 3n E4B feature:

/// [supportAudio] — whether the model supports audio (Gemma 3n E4B only).
bool supportAudio = false,

That phrasing shows up in eight different files in the plugin. The interface, the API docs, the example app, the native Android side — they all treat audio as 3n territory.

But here's what's interesting once you read further. There is no hardcoded model-version check anywhere in the plugin. The actual gate is just if (config.supportAudio == true). So what's really limiting audio isn't the Dart code rejecting Gemma 4 — it's whether the model file you downloaded actually contains the audio encoder.

The example app's model.dart has the clearest hint:

supportAudio: true,  // .litertlm files have TF_LITE_AUDIO_ENCODER
supportAudio: false, // .task files don't have audio encoder

So the real question for any model you want to use with audio isn't "is it Gemma 3n?" — it's "does my .litertlm file include the audio encoder for this model?" The plugin's docs assume the answer is yes only for Gemma 3n E4B because that's what's been tested and shipped that way. For Gemma 4 E2B, the model card says audio is supported by the model itself, but I haven't found a .litertlm build of E2B that bundles the encoder. If one ships, the plugin should handle it — there's no version-gate to stop it.

For PocketClaw I went with Android's system STT (the speech_to_text package). Practical reasons. I get live transcription as the user speaks (text appears word by word while they're holding the mic). That's a noticeably better UX than the "hold, speak, release, wait" pattern you'd get from on-model audio. And it side-steps the question of whether my specific E2B .litertlm file has the encoder.

The takeaway: read your plugin's source before you trust its capability flags. And read it carefully enough to separate documentation framing from actual gating logic. The plugin's docs say "Gemma 3n E4B only" eight times — but the code itself doesn't enforce that. If you have an E2B build with the audio encoder, it's worth testing.

Closing

Five patterns. None of them are in the README. None of them are in Google's docs. I learned all of them by shipping something real and watching it fail in interesting ways.

If you're building on Gemma 4 for Android, these will save you time. If you want to see all five running together in a real app, PocketClaw is fully open source, MIT licensed.

The thing I keep coming back to, after 17 days with Gemma 4 E2B on a mid-range Android phone, is how capable a 2B model can be when it's running fast on the user's device. The latency feels different from cloud. There's no perceived "AI thinking" delay because there's no network. It just answers at the speed of your phone.

That's worth optimizing for.

Editorial assistance from Claude. All code, decisions, bugs, and engineering choices in this post are mine.

How I built a fully offline AI assistant on Android with Gemma 4 E2B

Manoj Shetty — Sun, 24 May 2026 04:43:53 +0000

This is a submission for the Gemma 4 Challenge: Build with Gemma 4.

What I built

PocketClaw is an Android assistant that runs entirely on your phone. You can chat with it, talk to it (press and hold the mic, it transcribes live), show it photos, hand it a PDF and ask questions about it, or tell it to turn on the flashlight, set an alarm, open the dialer, send an SMS, drop something on the calendar, open the location settings, search the web, or fire a notification. All of that runs on a 1.5 GB model that lives on the device.

Once the model is downloaded the first time, you can switch on airplane mode. Nothing breaks.

I built it solo, 17 days, for this challenge.

Repo: github.com/ManojRakshu/pocketclaw (MIT)

Why on-device and why E2B

I've been building agents on cloud LLMs for about a year and a half. Claude mostly, GPT-4 for a few things. Every agent I've put in front of users has had the same set of problems sitting behind it. Latency adds up when you're chaining calls. The cost per call gets real once you have real traffic. And the whole thing stops the second the network goes down.

Phones are interesting because they fix all three of those at once. Model lives on the device, so there's no per-call cost. No network in the loop, so latency is just silicon. And the network can disappear without anything breaking.

The thing that constrains you is RAM. A mid-range Android phone gives an app something like 1.5 to 2 GB of usable working memory before the OS starts pushing back. That's enough to rule out most of the Gemma 4 family:

E2B at about 1.5 GB INT4. Fits. Has vision built in. This is what I shipped.
E4B at about 2.5 GB. Tight on high-end phones, OOMs on lower-end ones.
26B MoE. Workstation.
31B dense. Server.

I built and tested on a OnePlus Nord CE 4 (Snapdragon 7s Gen 3, Adreno GPU, 8 GB RAM, Android 14). First-token latency is around 1 to 3 seconds for chat, around 5 for vision. Slower than cloud. But there's no network in the way.

The 1.5 GB number is worth being precise about. Full precision E2B (fp32) is roughly 20 GB. fp16 is 10. INT8 is 5. INT4 with Google's litert-lm packaging gets you down to 1.5 GB. Same precision class as what Google ships in Pixel's Gemini Nano. For phones, INT4 is the only answer that makes sense.
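
The halving is plain arithmetic: size is parameter count times bits per weight. The sketch below assumes about 5 billion stored parameters, which is inferred from the 20 GB fp32 figure above (4 bytes per weight), not taken from a model card. `weightSizeGb` is a hypothetical helper.

```dart
/// Approximate weight size in GB for [params] parameters stored at
/// [bitsPerWeight] bits each. Ignores metadata and packaging overhead.
double weightSizeGb(double params, int bitsPerWeight) {
  final bytes = params * bitsPerWeight / 8;
  return bytes / 1e9;
}
```

With `params = 5e9` this gives 20 GB at 32 bits, 10 at 16, 5 at 8, and 2.5 at 4. So the 1.5 GB package is smaller than bit width alone would predict, and this sketch doesn't model whatever the litert-lm packaging does to get there.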

How it's put together

Three layers.

The model. I'm using flutter_gemma, which wraps Google's MediaPipe LLM API and LiteRT-LM on Android. I picked it because among the Flutter plugins I evaluated, it had the most mature support for Gemma 4's vision input. There's a cost to this. The MediaPipe stack adds about 80 MB of native libraries to the APK. Lighter, text-focused alternatives (llama.cpp wrappers, etc.) can ship at 30-60 MB. My release APK is 152 MB, down from 185 after I trimmed image-generation and WebGPU runtimes via Gradle (the plugin bundles them, we never use them):

// android/app/build.gradle.kts
packaging {
    jniLibs {
        excludes.addAll(listOf(
            // We never generate images, only consume them.
            "**/libimagegenerator_gpu.so",
            "**/libmediapipe_tasks_vision_image_generator_jni.so",
            // WebGPU is for browsers, useless on Android.
            "**/libLiteRtWebGpuAccelerator.so",
            "**/libLiteRtTopKWebGpuSampler.so"
        ))
    }
}

If you don't need vision you can probably get below 80 MB by switching to a text-focused stack. I needed vision so I'm at 152.

RAG on-device. A second model handles embeddings. Gecko 110M, around 110 MB on disk. I went with Gecko over EmbeddingGemma 300M because Gecko is roughly 3x smaller and the retrieval quality on PDFs of a hundred pages or less was comparable. Could be different at larger corpora. The pipeline is Syncfusion for PDF extraction, my own chunker (paragraph split, merge tinies, sentence-aware sub-split for anything still over the threshold), Gecko for embedding, sqlite-vec with HNSW for the vector store. All on device.
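
The chunking strategy (paragraph split, merge tinies, sentence-aware sub-split) can be sketched like this. It is a simplified illustration, not the PocketClaw chunker: `chunkText`, `minChars`, and `maxChars` are hypothetical names, and the thresholds are placeholders.

```dart
/// Splits [text] into chunks in three passes: paragraph split, merge
/// tiny paragraphs, then sentence-aware sub-split of oversized blocks.
List<String> chunkText(String text,
    {int minChars = 200, int maxChars = 1000}) {
  // 1. Paragraph split on blank lines.
  final paragraphs = text
      .split(RegExp(r'\n\s*\n'))
      .map((p) => p.trim())
      .where((p) => p.isNotEmpty);

  // 2. Merge tinies: while the current block is under minChars,
  //    the next paragraph is appended to it.
  final merged = <String>[];
  for (final p in paragraphs) {
    if (merged.isNotEmpty && merged.last.length < minChars) {
      merged[merged.length - 1] = '${merged.last}\n$p';
    } else {
      merged.add(p);
    }
  }

  // 3. Sub-split blocks over maxChars at sentence boundaries.
  //    Naive: also splits on decimals like "3.5"; a single sentence
  //    longer than maxChars is kept whole.
  final sentencePattern = RegExp(r'[^.!?]+[.!?]*');
  final chunks = <String>[];
  for (final block in merged) {
    if (block.length <= maxChars) {
      chunks.add(block);
      continue;
    }
    final sentences = sentencePattern
        .allMatches(block)
        .map((m) => m.group(0)!.trim())
        .where((s) => s.isNotEmpty);
    var current = '';
    for (final s in sentences) {
      if (current.isNotEmpty && current.length + 1 + s.length > maxChars) {
        chunks.add(current);
        current = s;
      } else {
        current = current.isEmpty ? s : '$current $s';
      }
    }
    if (current.isNotEmpty) chunks.add(current);
  }
  return chunks;
}
```

Each returned string is what would be handed to the embedder. Tuning `minChars` and `maxChars` against your own documents matters more than the exact splitting rules.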

Device actions. Gemma's job here is intent classification. The user types or says "set an alarm for 7:30 AM". Gemma emits a structured JSON object that identifies the tool and parameters. Dart parses it. A native Kotlin MethodChannel (pocketclaw/device) fires the right Android intent. Eight categories work this way. Flashlight via CameraManager.setTorchMode. Alarms via AlarmClock.ACTION_SET_ALARM. Dialer via ACTION_DIAL. SMS via ACTION_SENDTO. Calendar via ACTION_INSERT. Location settings panel. Web search by handing the query to the default browser. Local notifications via NotificationManager. Nothing in the loop touches the network. The LLM doesn't even know the network exists.
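
The "Dart parses it" step is where a small model's output needs the most defensive handling. Below is a minimal sketch of that step only. The JSON shape (`tool` and `params` keys), the tool identifiers, and the `ToolCall` / `parseToolCall` names are all assumptions for illustration; the post doesn't show PocketClaw's schema. The MethodChannel dispatch is left out so the block runs in plain Dart.

```dart
import 'dart:convert';

/// Tool names this sketch accepts. Hypothetical identifiers; the post
/// lists eight categories but not what they are called in code.
const allowedTools = <String>{
  'flashlight', 'alarm', 'dial', 'sms',
  'calendar', 'location_settings', 'web_search', 'notification',
};

class ToolCall {
  final String tool;
  final Map<String, dynamic> params;
  const ToolCall(this.tool, this.params);
}

/// Parses model output into a [ToolCall]. Returns null when the output
/// is not a valid call, so the caller can treat it as a chat reply.
ToolCall? parseToolCall(String modelOutput) {
  // The model may wrap the JSON in prose; keep the outermost braces.
  final start = modelOutput.indexOf('{');
  final end = modelOutput.lastIndexOf('}');
  if (start < 0 || end <= start) return null;

  Object? decoded;
  try {
    decoded = jsonDecode(modelOutput.substring(start, end + 1));
  } on FormatException {
    return null; // malformed JSON
  }
  if (decoded is! Map<String, dynamic>) return null;

  // Reject anything outside the allow-list before it reaches native code.
  final tool = decoded['tool'];
  if (tool is! String || !allowedTools.contains(tool)) return null;

  final params = decoded['params'];
  return ToolCall(
    tool,
    params is Map<String, dynamic> ? params : const <String, dynamic>{},
  );
}
```

The allow-list check is the important design choice here: the model only ever selects among actions the app already knows how to perform, and anything else falls through to a normal text answer.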

Things that broke

RAG dies on generic queries

Vanilla RAG works fine for specific questions. Someone uploads a PDF about PocketClaw, asks "who built PocketClaw", retrieval picks up a chunk that contains my name, Gemma summarises, done.

It falls over on the queries people actually type. I caught this Friday afternoon. I'd shipped what I thought was a working build. I uploaded llmaiedge.pdf (a PDF about edge LLMs I had lying around), typed "summarise the document", hit send. Claw answered with "Summarize the document." That's it. I tried twice more with different phrasing. Same answer. Eventually I typed "summarise llmaiedge.pdf" and got a real response. The filename was doing the work, not my retrieval.

The problem is that "summarise the document" has no semantic overlap with the actual document text. The doc doesn't contain the words "summarise" or "the document." Cosine similarity returns nothing useful, the prompt goes to Gemma with no real context, and Gemma fills in with whatever its training data feels like saying about generic documents.

The fix runs two heuristics:

final isGenericIntent = hits.length <= 1 && (
    lower.contains('summari') ||
    lower.contains('tldr') ||
    lower.contains('explain') ||
    lower.contains('describe') ||
    lower.contains('the document') ||
    lower.contains('the pdf')
    // ... a few more
);

if (isGenericIntent) {
    // Fall back to searching with each indexed doc's filename as the query.
    hits = await RagService.instance.getDocStarts(
        conversationId: _conversation.id,
    );
}

getDocStarts runs searchSimilar once per indexed doc, using the filename itself as the query. Filenames are rare distinctive tokens. Every chunk in the vector store has the filename in metadata. So this pulls back real chunks regardless of how the user phrased their question. Two lines of conditional logic, and the difference between a broken demo and a working one.

Small models drop facts buried mid-prompt

I wanted Claw to remember the user's name. Onboarding asks for it, prefs stores it, the system prompt includes it. User asks "what's my name?" Claw says "I do not know your name."

The first version of my system prompt looked like this:

You are Claw. You run locally and offline. You are talking to Manoj Shetty.
Match your answer length to the question. Prefer plain answers over preambles.
Never restate the question. If unsure, say so briefly.

The third sentence has the name. Gemma 4 E2B dropped it completely. I burned about an hour staring at this before I figured out what was happening. My theory is that "never restate the question" was acting as a dominant instruction that generalized to "don't reference user context at all." Small models do that. Cloud LLMs don't.

The fix was to move the fact to the top of the prompt:

final namePart = (userName != null && userName.trim().isNotEmpty)
    ? "The name of the user is ${userName.trim()}.\n\n"
    : '';
final systemPreamble = '${namePart}You are Claw, ...';

Same information. First line of the prompt, on its own, in a flat declarative sentence. Worked first try.

Lesson I've now learned twice. With small models, what you want the model to know goes at the front, in simple sentences, without competing instructions next to it. Cloud LLMs respect the whole prompt. 2B models don't.

Audio in flutter_gemma is framed for Gemma 3n, not Gemma 4

I wanted to skip the speech-to-text plugin entirely. Just feed audio bytes directly to Gemma 4 E2B's audio modality. The model card says E2B supports audio. Cleaner architecture, one fewer dependency.

I dug into the flutter_gemma v0.15.1 source. Eight different files in the plugin frame audio as a Gemma 3n E4B feature, including this in the interface:

/// [supportAudio] — whether the model supports audio (Gemma 3n E4B only).
bool supportAudio = false,

The example app makes the real constraint clearer: .litertlm files that bundle TF_LITE_AUDIO_ENCODER work with supportAudio: true, .task files don't. So the actual limit isn't the Gemma version — it's whether your specific model file contains the audio encoder. I haven't found a Gemma 4 E2B .litertlm build that bundles the audio encoder, and the plugin's docs treat audio as 3n-territory.

So PocketClaw uses Android system STT via the speech_to_text package, which wraps Android's SpeechRecognizer API. Side benefit: I get live transcription as the user speaks, which is genuinely a better UX than the "hold, release, wait three seconds, see both the transcription and the response" pattern you'd get from on-model audio. When a Gemma 4 E2B .litertlm build with audio encoder ships, the voice path collapses into a single multimodal call. Not blocking on v1.

Long chats and memory

Gemma 4 has a 128K context window. That's plenty in theory. In practice, every token costs latency and RAM, so I'd rather not feed the whole history every turn.

PocketClaw keeps the most recent 24 messages in full text. Anything older runs through a compaction pass:

Extract facts the user has stated explicitly ("I am X", "My name is Y", "Remember Z").
Capture unresolved goals (keywords like "fix", "todo", "issue").
Compile the whole thing into a single lightweight summary paragraph that gets prepended to the prompt as memory.

The more aggressive part: when an image message slides past the 24-message boundary, the raw image bytes get deleted. What stays is the assistant's prior textual description of that image:

String _imageMemoryFromAssistant({
  required String? imageName,
  required String assistantText,
}) {
  final label = imageName ?? 'uploaded image';
  return 'Assistant previously described $label as: ${_shorten(assistantText, 1000)}';
}

Claw still "remembers" what it saw, but without the bytes weighing on the prompt. This matters more than it sounds like it should. A 1 MB photo as base64 is around 30K tokens. The textual description of the same image is around 100. So this is roughly a 300x compression of image memory, with surprisingly little loss for the kinds of follow-up questions users actually ask.

Stack

Flutter 3.41 / Dart 3.11. Chat model is Gemma 4 E2B INT4 at 1.5 GB, loaded through flutter_gemma on top of MediaPipe LLM and LiteRT-LM. Embedder is Gecko 110M (110 MB) for RAG. Vector store is sqlite-vec with HNSW, on device. Speech-to-text is the system STT via speech_to_text. Eight device actions go through a Kotlin MethodChannel. Hive for state (separate boxes for conversations, documents, prefs). syncfusion_flutter_pdf for PDF extraction. The theme is custom neobrutalistic dark with hard borders, monospace type, and cyan/purple/mint accents. I wanted it to feel like a piece of hardware, not a Material 3 chat app.

What didn't make v1

A floating overlay bubble (Android 13+ accessibility-style overlay) was in the original design. I cut it because the overlay permission UX has a long tail I didn't want to ship rough.

Direct Gemma 4 audio input. Waiting on a Gemma 4 E2B .litertlm build with the audio encoder. v2 when that's available.

Per-message RAG toggle. Right now if you have a document attached, every message in that conversation retrieves against it. Sometimes users want to ask an unrelated follow-up without doc context. v2.

Chunk citations in responses. Claw answers from retrieved chunks but doesn't surface which chunk. The retrieval data is sitting there. It's a UI add. v2.

What I learned

Most "run an LLM on your phone" tutorials I've read this year stop at model loading. flutter_gemma handles loading in about 10 lines. That part is easy.

The work is in everything around the model. The compaction so long chats don't OOM. The RAG fallbacks for queries that don't fit the textbook similarity-search assumption. The native channels for actual device actions. The Gradle excludes for plugin libs you don't need. The prompt structure that gets a 2B model to follow instructions reliably. The error handling for when the model file is half downloaded and the user opens the app anyway.

What surprised me about Gemma 4 E2B, coming from cloud models, is how capable a 2B model can be when it's running fast on your own device. Vision captioning is genuinely useful. Intent classification across 8 tool categories works well enough to ship. There's no perceived "AI thinking" delay because there's no network. The model speaks at the speed of your phone.

For a 1.5 GB download, that's a good deal.

Repo: github.com/ManojRakshu/pocketclaw

License: MIT

Tested on: OnePlus Nord CE 4 (Snapdragon 7s Gen 3, 8 GB RAM, Android 14)

Demo recorded: May 24, 2026

If you want to look at the design system, RAG service, or compaction engine, the source is fully open. PRs welcome.

Editorial assistance from Claude. All code, decisions, bugs, and engineering choices in this post are mine.
