<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Manoj Shetty</title>
    <description>The latest articles on DEV Community by Manoj Shetty (@manoj_shetty).</description>
    <link>https://dev.to/manoj_shetty</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3913708%2Ff4a6e22b-1dea-4250-9abf-42859d132a7c.jpg</url>
      <title>DEV Community: Manoj Shetty</title>
      <link>https://dev.to/manoj_shetty</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/manoj_shetty"/>
    <language>en</language>
    <item>
      <title>5 things `flutter_gemma` doesn't tell you about shipping Gemma 4 on Android</title>
      <dc:creator>Manoj Shetty</dc:creator>
      <pubDate>Sun, 24 May 2026 04:59:41 +0000</pubDate>
      <link>https://dev.to/manoj_shetty/5-things-fluttergemma-doesnt-tell-you-about-shipping-gemma-4-on-android-2koj</link>
      <guid>https://dev.to/manoj_shetty/5-things-fluttergemma-doesnt-tell-you-about-shipping-gemma-4-on-android-2koj</guid>
      <description>&lt;p&gt;&lt;em&gt;This is a submission for the &lt;a href="https://dev.to/challenges/google-gemma-2026-05-06"&gt;Gemma 4 Challenge: Write About Gemma 4&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I shipped a Gemma 4 assistant on Android in 17 days. Voice, vision, RAG, eight device actions, everything offline once the model is on device. The project is called PocketClaw, and you can read about it &lt;a href="https://dev.to/manoj_shetty/how-i-built-a-fully-offline-ai-assistant-on-android-with-gemma-4-e2b-5a19"&gt;here&lt;/a&gt; if that's what you came for.&lt;/p&gt;

&lt;p&gt;This post is about the 5 things I had to figure out the hard way. Not in the &lt;code&gt;flutter_gemma&lt;/code&gt; README. Not in Google's MediaPipe docs. Not in any of the half-dozen "run Gemma on Android" tutorials I read this week.&lt;/p&gt;

&lt;p&gt;If you're about to ship Gemma 4 on Android, I hope this saves you a weekend.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;📱 &lt;strong&gt;Companion post:&lt;/strong&gt; &lt;a href="https://dev.to/manoj_shetty/how-i-built-a-fully-offline-ai-assistant-on-android-with-gemma-4-e2b-5a19"&gt;How I built PocketClaw — a fully offline AI assistant on Android with Gemma 4 E2B&lt;/a&gt;. Demo video, architecture deep-dive, full source code.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  1. Small models drop facts buried mid-prompt. Put what matters at the top.
&lt;/h2&gt;

&lt;p&gt;I've been building agents on Claude and GPT-4 for about 18 months. Both of them handle a long system prompt fine. You can mix instructions and facts in any order, and the model figures out what's a fact versus what's a behavior rule.&lt;/p&gt;

&lt;p&gt;Gemma 4 E2B doesn't.&lt;/p&gt;

&lt;p&gt;My first system prompt for PocketClaw looked like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You are Claw. You run locally and offline. You are talking to Manoj Shetty.
Match your answer length to the question. Prefer plain answers over preambles.
Never restate the question. If unsure, say so briefly.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;User asks "what's my name?" Claw answers "I do not know your name." Every time. The name is sitting right there in sentence three.&lt;/p&gt;

&lt;p&gt;I spent about an hour staring at this. My theory after the fact is that "never restate the question" was acting as a dominant instruction, and the model generalized it to "don't reference any user context." That's the kind of overgeneralization a 2B-effective model does. Cloud LLMs don't.&lt;/p&gt;

&lt;p&gt;The fix was structural, not lexical:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight dart"&gt;&lt;code&gt;&lt;span class="kd"&gt;final&lt;/span&gt; &lt;span class="n"&gt;namePart&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;userName&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;userName&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;trim&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;isNotEmpty&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;?&lt;/span&gt; &lt;span class="s"&gt;"The name of the user is &lt;/span&gt;&lt;span class="si"&gt;${userName.trim()}&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;
    &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;''&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kd"&gt;final&lt;/span&gt; &lt;span class="n"&gt;systemPreamble&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;'&lt;/span&gt;&lt;span class="si"&gt;${namePart}&lt;/span&gt;&lt;span class="s"&gt;You are Claw, ...'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same fact. Moved to the first line of the prompt. On its own. Flat declarative sentence. Worked the first time I tried it.&lt;/p&gt;

&lt;p&gt;The lesson generalizes. When you're working with a 2B-class model, anything you actually want the model to remember goes at the front of the prompt, in simple sentences, with no competing instructions in the same paragraph. The model is paying way more attention to the opening than to the middle. Treat the system prompt like a slot-filling template, not a paragraph.&lt;/p&gt;
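
&lt;p&gt;A minimal sketch of that template idea, generalised from the single name fact to any set of facts. The function name &lt;code&gt;buildSystemPrompt&lt;/code&gt; and its &lt;code&gt;facts&lt;/code&gt; and &lt;code&gt;rules&lt;/code&gt; parameters are hypothetical, not PocketClaw's actual code:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight dart"&gt;&lt;code&gt;// Hypothetical helper: facts go first, one flat sentence per line,
// then a blank line, then the behaviour rules.
String buildSystemPrompt({
  required Map&amp;lt;String, String&amp;gt; facts,
  required List&amp;lt;String&amp;gt; rules,
}) {
  final factLines = [
    for (final e in facts.entries)
      if (e.value.trim().isNotEmpty)
        'The ${e.key} of the user is ${e.value.trim()}.',
  ];
  // No facts means no leading block at all, only the rules.
  final head = factLines.isEmpty ? '' : '${factLines.join("\n")}\n\n';
  return '$head${rules.join(" ")}';
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Called with &lt;code&gt;facts: {'name': 'Manoj Shetty'}&lt;/code&gt; and the behaviour rules from the first prompt, this puts "The name of the user is Manoj Shetty." alone at the top, followed by a blank line and then the rules.&lt;/p&gt;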

&lt;h2&gt;
  
  
  2. Vanilla RAG breaks on the queries users actually type.
&lt;/h2&gt;

&lt;p&gt;If you've worked with RAG before, you know the textbook setup. Chunk the document, embed the chunks, store the vectors, embed the query at retrieval time, find the closest matches. Works great on benchmarks where the queries are specific.&lt;/p&gt;

&lt;p&gt;It doesn't work on "summarise the document."&lt;/p&gt;

&lt;p&gt;I caught this on Friday. I'd been heads-down for two weeks. I thought I had a working build. I uploaded a PDF I had lying around (a paper on edge LLMs), typed "summarise the document", hit send. Claw said "Summarize the document." back to me. Just that. Like an echo. I tried twice more with different phrasing. Same answer.&lt;/p&gt;

&lt;p&gt;Eventually I typed "summarise llmaiedge.pdf" with the actual filename. Got a real summary.&lt;/p&gt;

&lt;p&gt;The problem is obvious once you see it. "Summarise the document" has zero semantic overlap with the document text. The PDF doesn't contain the words "summarise" or "the document." Cosine similarity returns nothing useful, so the retrieved chunk list comes back empty or close to it. Gemma gets the user's question with no actual document context attached, and it does what any LLM does when starved for context: it produces a generic answer or, as here, echoes the prompt back.&lt;/p&gt;

&lt;p&gt;The fix I shipped is two heuristics deep:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight dart"&gt;&lt;code&gt;&lt;span class="kd"&gt;final&lt;/span&gt; &lt;span class="n"&gt;isGenericIntent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hits&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;length&lt;/span&gt; &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;lower&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;contains&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'summari'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt;
    &lt;span class="n"&gt;lower&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;contains&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'tldr'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt;
    &lt;span class="n"&gt;lower&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;contains&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'explain'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt;
    &lt;span class="n"&gt;lower&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;contains&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'describe'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt;
    &lt;span class="n"&gt;lower&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;contains&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'the document'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt;
    &lt;span class="n"&gt;lower&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;contains&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'the pdf'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;isGenericIntent&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;hits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;RagService&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;instance&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getDocStarts&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="nl"&gt;conversationId:&lt;/span&gt; &lt;span class="n"&gt;_conversation&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;getDocStarts&lt;/code&gt; is a small fallback method. It runs &lt;code&gt;searchSimilar&lt;/code&gt; once per indexed doc, using each doc's filename as the query. Filenames are rare, distinctive tokens, and every chunk in the vector store carries its filename in metadata, so this pulls back real chunks regardless of how the user phrased their question.&lt;/p&gt;

&lt;p&gt;One condition and one fallback call. That's the difference between "Claw can summarise PDFs" and "Claw echoes your question back at you."&lt;/p&gt;

&lt;p&gt;If you're building RAG on Gemma 4 (or any small on-device model really), test it with generic queries before you ship. Your textbook similarity search will fail in ways that look like the model is broken.&lt;/p&gt;
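
&lt;p&gt;For readers who want the shape of the fallback without the app around it, here is a rough sketch. It assumes a hypothetical &lt;code&gt;SearchFn&lt;/code&gt; callback standing in for the real &lt;code&gt;searchSimilar&lt;/code&gt; call (whose actual signature isn't shown in this post) and a plain list of indexed filenames. The names &lt;code&gt;looksGeneric&lt;/code&gt;, &lt;code&gt;docStarts&lt;/code&gt; and &lt;code&gt;perDoc&lt;/code&gt; are illustrative only:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight dart"&gt;&lt;code&gt;// Hypothetical stand-in for the vector-store query; the real
// searchSimilar signature may differ.
typedef SearchFn = Future&amp;lt;List&amp;lt;String&amp;gt;&amp;gt; Function(String query, int topK);

// True when retrieval came back nearly empty AND the wording is generic.
bool looksGeneric(String query, int hitCount) {
  const markers = [
    'summari', 'tldr', 'explain', 'describe', 'the document', 'the pdf',
  ];
  final lower = query.toLowerCase();
  return hitCount &amp;lt;= 1 &amp;amp;&amp;amp; markers.any(lower.contains);
}

// Fallback: query once per indexed file, using the filename as the query.
Future&amp;lt;List&amp;lt;String&amp;gt;&amp;gt; docStarts(
  List&amp;lt;String&amp;gt; filenames,
  SearchFn search, {
  int perDoc = 3, // assumed number of chunks to pull per document
}) async {
  final chunks = &amp;lt;String&amp;gt;[];
  for (final name in filenames) {
    chunks.addAll(await search(name, perDoc));
  }
  return chunks;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Keeping the check as a pure function makes it easy to unit-test with the exact phrases users type, which is the test that would have caught this bug early.&lt;/p&gt;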

&lt;h2&gt;
  
  
  3. The flutter_gemma plugin bundles ~33 MB of native libraries you probably don't use.
&lt;/h2&gt;

&lt;p&gt;Stock APK for PocketClaw came out at 185 MB. That felt heavy.&lt;/p&gt;

&lt;p&gt;When I unzipped it and looked at the native libs (arm64-v8a only, I was already shipping a single-arch build), this is what I saw:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;26 MB  libllm_inference_engine_jni.so                    (needed)
24 MB  libLiteRtLm.so                                    (needed)
17 MB  libgemma_embedding_model_jni.so                   (not used; using Gecko)
17 MB  libgecko_embedding_model_jni.so                   (needed)
14 MB  libmediapipe_tasks_vision_jni.so                  (needed; vision input)
14 MB  libmediapipe_tasks_vision_image_generator_jni.so  (NOT USED)
10 MB  libimagegenerator_gpu.so                          (NOT USED)
 9 MB  libtext_chunker_jni.so                            (needed)
 8 MB  libLiteRtGpuAccelerator.so                        (needed)
 8 MB  libLiteRtWebGpuAccelerator.so                     (NOT USED; Android has OpenCL)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The image-generation libs back MediaPipe's Image Generator task, which produces images with a diffusion model rather than with Gemma. PocketClaw only &lt;em&gt;consumes&lt;/em&gt; images (vision input to multimodal Gemma) and never &lt;em&gt;generates&lt;/em&gt; them. The WebGPU accelerator is for browsers; on Android the GPU path is OpenCL. None of it does anything on my target platform.&lt;/p&gt;

&lt;p&gt;Four lines in &lt;code&gt;android/app/build.gradle.kts&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;&lt;span class="nf"&gt;packaging&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nf"&gt;jniLibs&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;excludes&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addAll&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;listOf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="s"&gt;"**/libimagegenerator_gpu.so"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="s"&gt;"**/libmediapipe_tasks_vision_image_generator_jni.so"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="s"&gt;"**/libLiteRtWebGpuAccelerator.so"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="s"&gt;"**/libLiteRtTopKWebGpuSampler.so"&lt;/span&gt;
        &lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;APK dropped from 185 MB to 152 MB. 33 MB cut. Vision input still works, embedder still works, inference still works.&lt;/p&gt;

&lt;p&gt;If your use case is similar (chat + vision input + RAG, no image generation), copy these excludes. If your use case is different (say you actually want on-device image generation), leave the image-gen libs in. The point is, audit what your plugin pulls in and exclude what you don't use. &lt;code&gt;flutter_gemma&lt;/code&gt; is built for general capability surface, not minimum-bytes-on-device.&lt;/p&gt;

&lt;p&gt;There's a second-order point here that matters more. MediaPipe is the reason &lt;code&gt;flutter_gemma&lt;/code&gt; is so big. It's also the reason it does vision and audio at all. The llama.cpp-based alternatives ship at 30-60 MB on Android but skip multimodal entirely. So the choice is really: 152 MB with vision, or 60 MB without. There's no free lunch where you get multimodal for the size of a text-only stack. Pick based on what your product actually needs.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Don't feed the 128K context window. Compact it.
&lt;/h2&gt;

&lt;p&gt;Gemma 4 has a 128K context window. Sounds great in theory. In practice it's a footgun.&lt;/p&gt;

&lt;p&gt;Every token in the context costs latency, first when the prompt is processed and again at decode time, because each new token attends over everything already there. Every token costs RAM too. On a phone, both of those are tight. If you naively shove the whole chat history into context every turn, you'll find that turn 20 takes noticeably longer than turn 5, and turn 50 might OOM the app.&lt;/p&gt;

&lt;p&gt;PocketClaw keeps a sliding window of the most recent 24 messages in their raw form. Anything older runs through a compaction pass:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Extract explicit facts the user has stated ("I am X", "Remember Y", "My name is Z").&lt;/li&gt;
&lt;li&gt;Capture unresolved goals (keywords like "fix", "todo", "issue").&lt;/li&gt;
&lt;li&gt;Compile both into a single lightweight summary paragraph that gets prepended to the prompt as memory.&lt;/li&gt;
&lt;/ol&gt;
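
&lt;p&gt;The three steps above can be sketched roughly as follows. This is an illustration under assumptions, not PocketClaw's code: the &lt;code&gt;Msg&lt;/code&gt; class, the two regular expressions, the &lt;code&gt;compactOlder&lt;/code&gt; name and the choice to scan only user turns are all hypothetical:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight dart"&gt;&lt;code&gt;// Assumed message shape; the app's real model class is not shown here.
class Msg {
  final bool fromUser;
  final String text;
  const Msg(this.fromUser, this.text);
}

const windowSize = 24; // most recent messages kept in raw form

// Assumed patterns for "explicit facts" and "unresolved goals".
final factPattern =
    RegExp(r'\b(i am|my name is|remember)\b', caseSensitive: false);
final goalPattern = RegExp(r'\b(fix|todo|issue)\b', caseSensitive: false);

// Returns a short memory paragraph for everything older than the window,
// or an empty string when there is nothing to compact.
String compactOlder(List&amp;lt;Msg&amp;gt; history) {
  if (history.length &amp;lt;= windowSize) return '';
  final older = history.sublist(0, history.length - windowSize);
  final facts = &amp;lt;String&amp;gt;[];
  final goals = &amp;lt;String&amp;gt;[];
  for (final m in older) {
    if (!m.fromUser) continue; // assumption: only user turns are scanned
    final text = m.text.trim();
    if (factPattern.hasMatch(text)) facts.add(text);
    if (goalPattern.hasMatch(text)) goals.add(text);
  }
  final parts = [
    if (facts.isNotEmpty) 'Known facts: ${facts.join("; ")}.',
    if (goals.isNotEmpty) 'Open goals: ${goals.join("; ")}.',
  ];
  return parts.join(' ');
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The returned string is what would be prepended to the prompt as memory, while the last 24 messages go through untouched.&lt;/p&gt;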

&lt;p&gt;That's the chat part. The aggressive part is image handling.&lt;/p&gt;

&lt;p&gt;A typical user-uploaded photo is roughly 1 MB. As base64 in a prompt, that's around 30,000 tokens. That's a quarter of the entire 128K context window for one image. If the user uploads three photos across a conversation, your context budget is in trouble.&lt;/p&gt;

&lt;p&gt;So PocketClaw does this: when an image message slides past the 24-message boundary, the raw image bytes get deleted from memory. What's preserved is the assistant's prior textual description of that image:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight dart"&gt;&lt;code&gt;&lt;span class="kt"&gt;String&lt;/span&gt; &lt;span class="nf"&gt;_imageMemoryFromAssistant&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="kd"&gt;required&lt;/span&gt; &lt;span class="kt"&gt;String&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt; &lt;span class="n"&gt;imageName&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="kd"&gt;required&lt;/span&gt; &lt;span class="kt"&gt;String&lt;/span&gt; &lt;span class="n"&gt;assistantText&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;final&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;imageName&lt;/span&gt; &lt;span class="o"&gt;??&lt;/span&gt; &lt;span class="s"&gt;'uploaded image'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="s"&gt;'Assistant previously described &lt;/span&gt;&lt;span class="si"&gt;$label&lt;/span&gt;&lt;span class="s"&gt; as: '&lt;/span&gt;
         &lt;span class="s"&gt;'&lt;/span&gt;&lt;span class="si"&gt;${_shorten(assistantText, 1000)}&lt;/span&gt;&lt;span class="s"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So Claw still "remembers" what it saw, but only the description goes through the prompt. The 30,000-token base64 blob becomes a 100-token summary. That's roughly a 300x compression of image memory.&lt;/p&gt;

&lt;p&gt;The kicker: this works really well for the kind of follow-up questions users actually ask. "What was in that photo I sent earlier?" is answerable from the description alone. The model rarely needs the pixels back. If it does, the user can re-upload.&lt;/p&gt;

&lt;p&gt;The general pattern here is: don't think of the context window as "free space up to the limit." Think of it as a budget. Spend it on things the model needs &lt;em&gt;for the current turn&lt;/em&gt;. Everything else gets compacted to a textual summary.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Native audio in &lt;code&gt;flutter_gemma&lt;/code&gt; is gated on Gemma 3n, not Gemma 4.
&lt;/h2&gt;

&lt;p&gt;This one I want you to know so you don't waste a day like I did.&lt;/p&gt;

&lt;p&gt;Gemma 4 E2B's model card lists native audio as a supported modality. So I thought: cool, I'll skip the speech-to-text plugin entirely, feed raw audio bytes straight to Gemma, get a single multimodal call instead of an STT-then-LLM pipeline. Cleaner.&lt;/p&gt;

&lt;p&gt;I dug into &lt;code&gt;flutter_gemma&lt;/code&gt; v0.15.1 source to find the audio API. Saw this comment in the interface:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight dart"&gt;&lt;code&gt;&lt;span class="c1"&gt;/// [supportAudio] — whether the model supports audio (Gemma 3n E4B only).&lt;/span&gt;
&lt;span class="kt"&gt;bool&lt;/span&gt; &lt;span class="n"&gt;supportAudio&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The plugin's audio code path is gated on Gemma 3n. If you set &lt;code&gt;supportAudio: true&lt;/code&gt; while loading Gemma 4 E2B, you'll either get a load error or a silent failure at inference time. The native side does support audio (the C++ engine handles it fine), but the Dart-side check rejects it for non-3n models.&lt;/p&gt;

&lt;p&gt;So PocketClaw uses Android's system STT (the &lt;code&gt;speech_to_text&lt;/code&gt; package, which is a wrapper around Android's &lt;code&gt;SpeechRecognizer&lt;/code&gt;). Side benefit: I get live transcription as the user is speaking. The text appears in the input field word by word while they're holding the mic. That's a noticeably better UX than the "hold the button, speak, release, wait three seconds while audio uploads and processes, see both your words and the AI response" pattern you'd get from on-model audio.&lt;/p&gt;

&lt;p&gt;When (if?) &lt;code&gt;flutter_gemma&lt;/code&gt; exposes audio for E2B, the path collapses. Until then, system STT plus text-mode Gemma is the right architecture.&lt;/p&gt;

&lt;p&gt;The takeaway isn't "audio is broken." It's: &lt;strong&gt;read your plugin's source before you trust its capability flags.&lt;/strong&gt; Especially for multimodal features that span Gemma versions. Just because the model can do something doesn't mean the plugin has wrapped it for your model.&lt;/p&gt;

&lt;h2&gt;
  
  
  Closing
&lt;/h2&gt;

&lt;p&gt;Five patterns. None of them are in the README. None of them are in Google's docs. I learned all of them by shipping something real and watching it fail in interesting ways.&lt;/p&gt;

&lt;p&gt;If you're building on Gemma 4 for Android, these will save you time. If you want to see all five running together in a real app, &lt;a href="https://dev.to/manoj_shetty/how-i-built-a-fully-offline-ai-assistant-on-android-with-gemma-4-e2b-5a19"&gt;PocketClaw&lt;/a&gt; is fully open source, MIT licensed.&lt;/p&gt;

&lt;p&gt;The thing I keep coming back to, after 17 days with Gemma 4 E2B on a mid-range Android phone, is how capable a 2B model can be when it's running fast on the user's device. The latency feels different from cloud. There's no perceived "AI thinking" delay because there's no network. It just answers at the speed of your phone.&lt;/p&gt;

&lt;p&gt;That's worth optimizing for.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Editorial assistance from Claude. All code, decisions, bugs, and engineering choices in this post are mine.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devchallenge</category>
      <category>gemmachallenge</category>
      <category>gemma</category>
      <category>flutter</category>
    </item>
    <item>
      <title>How I built a fully offline AI assistant on Android with Gemma 4 E2B</title>
      <dc:creator>Manoj Shetty</dc:creator>
      <pubDate>Sun, 24 May 2026 04:43:53 +0000</pubDate>
      <link>https://dev.to/manoj_shetty/how-i-built-a-fully-offline-ai-assistant-on-android-with-gemma-4-e2b-5a19</link>
      <guid>https://dev.to/manoj_shetty/how-i-built-a-fully-offline-ai-assistant-on-android-with-gemma-4-e2b-5a19</guid>
      <description>&lt;h1&gt;
  
  
  How I built a fully offline AI assistant on Android with Gemma 4 E2B
&lt;/h1&gt;

&lt;p&gt;&lt;em&gt;This is a submission for the &lt;a href="https://dev.to/challenges/google-gemma-2026-05-06"&gt;Gemma 4 Challenge: Build with Gemma 4&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What I built
&lt;/h2&gt;

&lt;p&gt;PocketClaw is an Android assistant that runs entirely on your phone. You can chat with it, talk to it (press and hold the mic, it transcribes live), show it photos, hand it a PDF and ask questions about it, or tell it to turn on the flashlight, set an alarm, open the dialer, send an SMS, drop something on the calendar, search the web, or fire a notification. All of that runs on a 1.5 GB model that lives on the device.&lt;/p&gt;

&lt;p&gt;Once the model is downloaded the first time, you can switch on airplane mode. Nothing breaks.&lt;/p&gt;

&lt;p&gt;I built it solo, 17 days, for this challenge.&lt;/p&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/CknufqBTQtQ"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Repo:&lt;/strong&gt; &lt;a href="https://github.com/ManojRakshu/pocketclaw" rel="noopener noreferrer"&gt;github.com/ManojRakshu/pocketclaw&lt;/a&gt; (MIT)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fed2afz0izxhlk9asbbqh.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fed2afz0izxhlk9asbbqh.jpg" alt="Onboarding" width="800" height="1787"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why on-device and why E2B
&lt;/h2&gt;

&lt;p&gt;I've been building agents on cloud LLMs for about a year and a half. Claude mostly, GPT-4 for a few things. Every agent I've put in front of users has had the same set of problems sitting behind it. Latency adds up when you're chaining calls. The cost per call gets real once you have real traffic. And the whole thing stops the second the network goes down.&lt;/p&gt;

&lt;p&gt;Phones are interesting because they fix all three of those at once. Model lives on the device, so there's no per-call cost. No network in the loop, so latency is just silicon. And the network can disappear without anything breaking.&lt;/p&gt;

&lt;p&gt;The thing that constrains you is RAM. A mid-range Android phone gives an app something like 1.5 to 2 GB of usable working memory before the OS starts pushing back. That's enough to rule out most of the Gemma 4 family:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;E2B&lt;/strong&gt; at about 1.5 GB INT4. Fits. Has vision built in. This is what I shipped.&lt;/li&gt;
&lt;li&gt;E4B at about 2.5 GB. Tight on high-end phones, OOMs on lower-end ones.&lt;/li&gt;
&lt;li&gt;26B MoE. Workstation.&lt;/li&gt;
&lt;li&gt;31B dense. Server.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I built and tested on a OnePlus Nord CE 4 (Snapdragon 7 Gen 3, Adreno GPU, 8 GB RAM, Android 14). First-token latency is around 1 to 3 seconds for chat, around 5 seconds for vision. Slower than cloud. But there's no network in the way.&lt;/p&gt;

&lt;p&gt;The 1.5 GB number is worth being precise about. Full-precision E2B (fp32) is roughly 20 GB, fp16 about 10 GB, and INT8 about 5 GB. INT4 with Google's litert-lm packaging gets you down to 1.5 GB, the same precision class as what Google ships in Pixel's Gemini Nano. For phones, INT4 is the only answer that makes sense.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzwi1uwe12th6rt88qjks.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzwi1uwe12th6rt88qjks.jpg" alt="Setup" width="800" height="1787"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How it's put together
&lt;/h2&gt;

&lt;p&gt;Three layers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The model.&lt;/strong&gt; I'm using &lt;code&gt;flutter_gemma&lt;/code&gt;, which wraps Google's MediaPipe LLM API and LiteRT-LM on Android. I picked it because it's the only Flutter plugin I could find that handles vision input on Gemma 4 natively. Most alternatives are llama.cpp ports and skip multimodal completely. There's a cost to this. The MediaPipe stack adds about 80 MB of native libraries to the APK. Llama.cpp ships at 30-60 MB but with no images. My release APK is 152 MB, down from 185 after I trimmed image-generation and WebGPU runtimes via Gradle (the plugin bundles them, we never use them):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;&lt;span class="c1"&gt;// android/app/build.gradle.kts&lt;/span&gt;
&lt;span class="nf"&gt;packaging&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nf"&gt;jniLibs&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;excludes&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addAll&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;listOf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="c1"&gt;// We never generate images, only consume them.&lt;/span&gt;
            &lt;span class="s"&gt;"**/libimagegenerator_gpu.so"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="s"&gt;"**/libmediapipe_tasks_vision_image_generator_jni.so"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="c1"&gt;// WebGPU is for browsers, useless on Android.&lt;/span&gt;
            &lt;span class="s"&gt;"**/libLiteRtWebGpuAccelerator.so"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="s"&gt;"**/libLiteRtTopKWebGpuSampler.so"&lt;/span&gt;
        &lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you don't need vision you can probably get below 80 MB by switching off MediaPipe completely. I needed vision so I'm at 152.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RAG on-device.&lt;/strong&gt; A second model handles embeddings. Gecko 110M, around 110 MB on disk. I went with Gecko over EmbeddingGemma 300M because Gecko is roughly 3x smaller and the retrieval quality on PDFs of a hundred pages or less was comparable. Could be different at larger corpora. The pipeline is Syncfusion for PDF extraction, my own chunker (paragraph split, merge tinies, sentence-aware sub-split for anything still over the threshold), Gecko for embedding, sqlite-vec with HNSW for the vector store. All on device.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Device actions.&lt;/strong&gt; Gemma's job here is intent classification. The user types or says "set an alarm for 7:30 AM". Gemma emits a structured JSON object that identifies the tool and parameters. Dart parses it. A native Kotlin MethodChannel (&lt;code&gt;pocketclaw/device&lt;/code&gt;) fires the right Android intent. Eight categories work this way. Flashlight via &lt;code&gt;CameraManager.setTorchMode&lt;/code&gt;. Alarms via &lt;code&gt;AlarmClock.ACTION_SET_ALARM&lt;/code&gt;. Dialer via &lt;code&gt;ACTION_DIAL&lt;/code&gt;. SMS via &lt;code&gt;ACTION_SENDTO&lt;/code&gt;. Calendar via &lt;code&gt;ACTION_INSERT&lt;/code&gt;. Location settings panel. Web search by handing the query to the default browser. Local notifications via &lt;code&gt;NotificationManager&lt;/code&gt;. Nothing in the loop touches the network. The LLM doesn't even know the network exists.&lt;/p&gt;
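
&lt;p&gt;A minimal sketch of the Dart parsing step, under stated assumptions: the wire format (&lt;code&gt;tool&lt;/code&gt; plus &lt;code&gt;params&lt;/code&gt;), the tool names in &lt;code&gt;knownTools&lt;/code&gt;, and the &lt;code&gt;ToolCall&lt;/code&gt; and &lt;code&gt;parseToolCall&lt;/code&gt; names are all made up for illustration, since the real schema isn't shown here:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight dart"&gt;&lt;code&gt;import 'dart:convert';

// Assumed wire format:
//   {"tool": "set_alarm", "params": {"hour": 7, "minute": 30}}
// The tool names are illustrative, not the app's real identifiers.
const knownTools = {
  'flashlight', 'set_alarm', 'dial', 'sms',
  'calendar', 'location', 'web_search', 'notify',
};

class ToolCall {
  final String tool;
  final Map&amp;lt;String, dynamic&amp;gt; params;
  const ToolCall(this.tool, this.params);
}

// Returns null for anything that is not a well-formed call to a known
// tool, so the caller can treat the output as ordinary chat instead.
ToolCall? parseToolCall(String modelOutput) {
  try {
    final decoded = jsonDecode(modelOutput.trim());
    if (decoded is! Map) return null;
    final tool = decoded['tool'];
    final params = decoded['params'];
    if (tool is! String || !knownTools.contains(tool)) return null;
    if (params is! Map) return null;
    return ToolCall(tool, Map&amp;lt;String, dynamic&amp;gt;.from(params));
  } on FormatException {
    return null; // the model answered in prose, not JSON
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A caller could then pass &lt;code&gt;call.tool&lt;/code&gt; and &lt;code&gt;call.params&lt;/code&gt; to &lt;code&gt;invokeMethod&lt;/code&gt; on the &lt;code&gt;pocketclaw/device&lt;/code&gt; channel. Validating against a fixed set of tools before dispatch matters with a small model, because malformed or invented tool names are a realistic failure mode.&lt;/p&gt;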

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fszz2qcic6hb0q4z3wmfc.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fszz2qcic6hb0q4z3wmfc.jpg" alt="Loading" width="800" height="1787"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Things that broke
&lt;/h2&gt;

&lt;h3&gt;
  
  
  RAG dies on generic queries
&lt;/h3&gt;

&lt;p&gt;Vanilla RAG works fine for specific questions. Someone uploads a PDF about PocketClaw, asks "who built PocketClaw", retrieval picks up a chunk that contains my name, Gemma summarises, done.&lt;/p&gt;

&lt;p&gt;It falls over on the queries people actually type. I caught this Friday afternoon. I'd shipped what I thought was a working build. I uploaded &lt;code&gt;llmaiedge.pdf&lt;/code&gt; (a PDF about edge LLMs I had lying around), typed "summarise the document", hit send. Claw answered with "Summarize the document." That's it. I tried twice more with different phrasing. Same answer. Eventually I typed "summarise llmaiedge.pdf" and got a real response. The filename was doing the work, not my retrieval.&lt;/p&gt;

&lt;p&gt;The problem is that "summarise this doc" has no semantic overlap with the actual document text. The doc doesn't contain the words "summarise" or "this doc." Cosine similarity returns nothing useful, the prompt goes to Gemma with no real context, and Gemma fills in with whatever its training data feels like saying about generic documents.&lt;/p&gt;

&lt;p&gt;The fix has two parts: a keyword heuristic that detects a generic intent, and a fallback that searches by filename instead:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight dart"&gt;&lt;code&gt;&lt;span class="kd"&gt;final&lt;/span&gt; &lt;span class="n"&gt;isGenericIntent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hits&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;length&lt;/span&gt; &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;lower&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;contains&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'summari'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt;
    &lt;span class="n"&gt;lower&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;contains&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'tldr'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt;
    &lt;span class="n"&gt;lower&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;contains&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'explain'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt;
    &lt;span class="n"&gt;lower&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;contains&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'describe'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt;
    &lt;span class="n"&gt;lower&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;contains&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'the document'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt;
    &lt;span class="n"&gt;lower&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;contains&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'the pdf'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;// ... a few more&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;isGenericIntent&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Fall back to searching with each indexed doc's filename as the query.&lt;/span&gt;
    &lt;span class="n"&gt;hits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;RagService&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;instance&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getDocStarts&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="nl"&gt;conversationId:&lt;/span&gt; &lt;span class="n"&gt;_conversation&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;getDocStarts&lt;/code&gt; runs &lt;code&gt;searchSimilar&lt;/code&gt; once per indexed doc, using the filename itself as the query. Filenames are rare, distinctive tokens, and every chunk in the vector store carries its filename in metadata, so this pulls back real chunks regardless of how the user phrased the question. A few lines of conditional logic, and the difference between a broken demo and a working one.&lt;/p&gt;
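
&lt;p&gt;For readers who want the shape of that fallback, here is a rough sketch. It is not the real &lt;code&gt;getDocStarts&lt;/code&gt;: the &lt;code&gt;Chunk&lt;/code&gt; class, the &lt;code&gt;Search&lt;/code&gt; signature, and the &lt;code&gt;perDoc&lt;/code&gt; limit are assumptions, and the vector store is replaced by a stub so the example runs on its own.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight dart"&gt;&lt;code&gt;/// Hypothetical chunk record; the real store's schema may differ.
class Chunk {
  final String docName;
  final String text;
  const Chunk(this.docName, this.text);
}

/// Assumed signature for a similarity search: query in, ranked chunks out.
typedef Search = Future&amp;lt;List&amp;lt;Chunk&amp;gt;&amp;gt; Function(String query, int topK);

/// Runs one search per indexed document, using the filename as the query,
/// and keeps only the chunks that belong to that document.
Future&amp;lt;List&amp;lt;Chunk&amp;gt;&amp;gt; docStarts(
    List&amp;lt;String&amp;gt; docNames, Search searchSimilar,
    {int perDoc = 3}) async {
  final out = &amp;lt;Chunk&amp;gt;[];
  for (final name in docNames) {
    final hits = await searchSimilar(name, perDoc);
    out.addAll(hits.where((c) =&amp;gt; c.docName == name));
  }
  return out;
}

Future&amp;lt;void&amp;gt; main() async {
  // Stub search standing in for the on-device vector store.
  const store = [
    Chunk('llmaiedge.pdf', 'Edge LLMs trade accuracy for latency.'),
    Chunk('notes.pdf', 'Unrelated notes.'),
  ];
  Future&amp;lt;List&amp;lt;Chunk&amp;gt;&amp;gt; stub(String q, int k) async =&amp;gt;
      store.where((c) =&amp;gt; c.docName == q).take(k).toList();

  final hits = await docStarts(['llmaiedge.pdf'], stub);
  print(hits.map((c) =&amp;gt; c.text).toList());
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Filtering by &lt;code&gt;docName&lt;/code&gt; after each search is a choice made for this sketch. It keeps a filename query from dragging in chunks from a different document that happens to score well.&lt;/p&gt;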

&lt;h3&gt;
  
  
  Small models drop facts buried mid-prompt
&lt;/h3&gt;

&lt;p&gt;I wanted Claw to remember the user's name. Onboarding asks for it, prefs stores it, the system prompt includes it. User asks "what's my name?" Claw says "I do not know your name."&lt;/p&gt;

&lt;p&gt;The first version of my system prompt looked like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You are Claw. You run locally and offline. You are talking to Manoj Shetty.
Match your answer length to the question. Prefer plain answers over preambles.
Never restate the question. If unsure, say so briefly.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The third sentence has the name. Gemma 4 E2B dropped it completely. I burned about an hour staring at this before I figured out what was happening. My theory is that "never restate the question" was acting as a dominant instruction that generalized to "don't reference user context at all." Small models do that. Cloud LLMs don't.&lt;/p&gt;

&lt;p&gt;The fix was to move the fact to the top of the prompt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight dart"&gt;&lt;code&gt;&lt;span class="kd"&gt;final&lt;/span&gt; &lt;span class="n"&gt;namePart&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;userName&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;userName&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;trim&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;isNotEmpty&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;?&lt;/span&gt; &lt;span class="s"&gt;"The name of the user is &lt;/span&gt;&lt;span class="si"&gt;${userName.trim()}&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;
    &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;''&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kd"&gt;final&lt;/span&gt; &lt;span class="n"&gt;systemPreamble&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;'&lt;/span&gt;&lt;span class="si"&gt;${namePart}&lt;/span&gt;&lt;span class="s"&gt;You are Claw, ...'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same information. First line of the prompt, on its own, in a flat declarative sentence. Worked first try.&lt;/p&gt;

&lt;p&gt;Lesson I've now learned twice. With small models, what you want the model to know goes at the front, in simple sentences, without competing instructions next to it. Cloud LLMs respect the whole prompt. 2B models don't.&lt;/p&gt;

&lt;h3&gt;
  
  
  Audio in flutter_gemma is gated on Gemma 3n, not Gemma 4
&lt;/h3&gt;

&lt;p&gt;I wanted to skip the speech-to-text plugin entirely. Just feed audio bytes directly to Gemma 4 E2B's audio modality. The model card says E2B supports audio. Cleaner architecture, one fewer dependency.&lt;/p&gt;

&lt;p&gt;I dug into the &lt;code&gt;flutter_gemma&lt;/code&gt; v0.15.1 source to figure out how the audio API works, and found this comment in the interface:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight dart"&gt;&lt;code&gt;&lt;span class="c1"&gt;/// [supportAudio] — whether the model supports audio (Gemma 3n E4B only).&lt;/span&gt;
&lt;span class="kt"&gt;bool&lt;/span&gt; &lt;span class="n"&gt;supportAudio&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The plugin's audio path is gated on Gemma 3n. If you send audio bytes to E2B you'll get an error or a silent failure. So PocketClaw uses Android system STT via the &lt;code&gt;speech_to_text&lt;/code&gt; package, which wraps Android's &lt;code&gt;SpeechRecognizer&lt;/code&gt; API (configured through a &lt;code&gt;RecognizerIntent&lt;/code&gt;). Side benefit: I get live transcription as the user speaks, which is genuinely a better UX than the "hold, release, wait three seconds, see both the transcription and the response" pattern you'd get from on-model audio. When &lt;code&gt;flutter_gemma&lt;/code&gt; exposes E2B audio, the voice path collapses into a single multimodal call. It isn't a blocker for v1.&lt;/p&gt;

&lt;h2&gt;
  
  
  Long chats and memory
&lt;/h2&gt;

&lt;p&gt;Gemma 4 has a 128K context window. That's plenty in theory. In practice, every token costs latency and RAM, so I'd rather not feed the whole history every turn.&lt;/p&gt;

&lt;p&gt;PocketClaw keeps the most recent 24 messages in full text. Anything older runs through a compaction pass:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Extract facts the user has stated explicitly ("I am X", "My name is Y", "Remember Z").&lt;/li&gt;
&lt;li&gt;Capture unresolved goals (keywords like "fix", "todo", "issue").&lt;/li&gt;
&lt;li&gt;Compile the whole thing into a single lightweight summary paragraph that gets prepended to the prompt as memory.&lt;/li&gt;
&lt;/ol&gt;
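
&lt;p&gt;The three steps above can be sketched as follows. This is an illustration under assumptions, not the app's compaction engine: the &lt;code&gt;Msg&lt;/code&gt; class, the regular expressions, and the summary wording are made up for the example, and only the 24-message window and the keyword examples come from the description above.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight dart"&gt;&lt;code&gt;/// Hypothetical message record.
class Msg {
  final String role; // 'user' or 'assistant'
  final String text;
  const Msg(this.role, this.text);
}

// Illustrative patterns built from the examples in the list above.
final factPattern =
    RegExp(r'\b(i am|my name is|remember)\b', caseSensitive: false);
final goalPattern = RegExp(r'\b(fix|todo|issue)\b', caseSensitive: false);

/// Splits history into a memory summary and the messages kept verbatim.
({String memory, List&amp;lt;Msg&amp;gt; recent}) compact(List&amp;lt;Msg&amp;gt; history,
    {int keep = 24}) {
  if (history.length &amp;lt;= keep) return (memory: '', recent: history);
  final cut = history.length - keep;
  final older = history.sublist(0, cut);
  // Only user messages are mined; assistant text is not treated as fact.
  final userTexts =
      older.where((m) =&amp;gt; m.role == 'user').map((m) =&amp;gt; m.text);
  final facts = userTexts.where(factPattern.hasMatch).toList();
  final goals = userTexts.where(goalPattern.hasMatch).toList();
  final memory = [
    if (facts.isNotEmpty) 'Known facts: ${facts.join("; ")}',
    if (goals.isNotEmpty) 'Open goals: ${goals.join("; ")}',
  ].join(' ');
  return (memory: memory, recent: history.sublist(cut));
}

void main() {
  final history = [
    const Msg('user', 'My name is Manoj.'),
    const Msg('assistant', 'Noted.'),
    for (var i = 0; i &amp;lt; 24; i++) Msg('user', 'message $i'),
  ];
  final result = compact(history);
  print(result.memory);
  print(result.recent.length);
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Keyword matching like this is cheap and runs without a model call, which is presumably the point on a phone. The trade-off is that it misses facts phrased in ways the patterns don't anticipate.&lt;/p&gt;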

&lt;p&gt;The more aggressive part: when an image message slides past the 24-message boundary, the raw image bytes get deleted. What stays is the assistant's prior textual description of that image:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight dart"&gt;&lt;code&gt;&lt;span class="kt"&gt;String&lt;/span&gt; &lt;span class="nf"&gt;_imageMemoryFromAssistant&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="kd"&gt;required&lt;/span&gt; &lt;span class="kt"&gt;String&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt; &lt;span class="n"&gt;imageName&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="kd"&gt;required&lt;/span&gt; &lt;span class="kt"&gt;String&lt;/span&gt; &lt;span class="n"&gt;assistantText&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;final&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;imageName&lt;/span&gt; &lt;span class="o"&gt;??&lt;/span&gt; &lt;span class="s"&gt;'uploaded image'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="s"&gt;'Assistant previously described &lt;/span&gt;&lt;span class="si"&gt;$label&lt;/span&gt;&lt;span class="s"&gt; as: &lt;/span&gt;&lt;span class="si"&gt;${_shorten(assistantText, 1000)}&lt;/span&gt;&lt;span class="s"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Claw still "remembers" what it saw, but without the bytes weighing on the prompt. This matters more than it sounds like it should. A 1 MB photo as base64 is around 30K tokens. The textual description of the same image is around 100. So this is roughly a 300x compression of image memory, with surprisingly little loss for the kinds of follow-up questions users actually ask.&lt;/p&gt;
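
&lt;p&gt;The snippet above calls a &lt;code&gt;_shorten&lt;/code&gt; helper that isn't shown. A plausible stand-in, written here as a guess at its behaviour rather than a copy of the real one, collapses whitespace and truncates to a character budget:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight dart"&gt;&lt;code&gt;/// Hypothetical stand-in for the _shorten helper; the real one may differ.
/// Assumes maxChars is at least 3 so the ellipsis fits.
String shorten(String text, int maxChars) {
  // Collapse runs of whitespace so the budget is spent on content.
  final flat = text.replaceAll(RegExp(r'\s+'), ' ').trim();
  if (flat.length &amp;lt;= maxChars) return flat;
  // Truncates by UTF-16 code unit, so it can split an emoji at the cut.
  return '${flat.substring(0, maxChars - 3).trimRight()}...';
}

void main() {
  print(shorten('A  red   bicycle leaning on a wall.', 20));
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;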

&lt;h2&gt;
  
  
  Stack
&lt;/h2&gt;

&lt;p&gt;Flutter 3.41 / Dart 3.11. Chat model is Gemma 4 E2B INT4 at 1.5 GB, loaded through &lt;code&gt;flutter_gemma&lt;/code&gt; on top of MediaPipe LLM and LiteRT-LM. Embedder is Gecko 110M (110 MB) for RAG. Vector store is sqlite-vec with HNSW, on device. Speech-to-text is the system STT via &lt;code&gt;speech_to_text&lt;/code&gt;. Eight device actions go through a Kotlin MethodChannel. Hive for state (separate boxes for conversations, documents, prefs). &lt;code&gt;syncfusion_flutter_pdf&lt;/code&gt; for PDF extraction. The theme is a custom neo-brutalist dark design with hard borders, monospace type, and cyan/purple/mint accents. I wanted it to feel like a piece of hardware, not a Material 3 chat app.&lt;/p&gt;

&lt;h2&gt;
  
  
  What didn't make v1
&lt;/h2&gt;

&lt;p&gt;A &lt;strong&gt;floating overlay bubble&lt;/strong&gt; (Android 13+ accessibility-style overlay) was in the original design. I cut it because the overlay permission UX has a long tail I didn't want to ship rough.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Direct Gemma 4 audio input&lt;/strong&gt; is blocked on the plugin. v2 when the plugin supports E2B audio.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Per-message RAG toggle.&lt;/strong&gt; Right now if you have a document attached, every message in that conversation retrieves against it. Sometimes users want to ask an unrelated follow-up without doc context. v2.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Chunk citations in responses.&lt;/strong&gt; Claw answers from retrieved chunks but doesn't surface which chunk. The retrieval data is sitting there. It's a UI add. v2.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I learned
&lt;/h2&gt;

&lt;p&gt;Most "run an LLM on your phone" tutorials I've read this year stop at model loading. &lt;code&gt;flutter_gemma&lt;/code&gt; handles loading in about 10 lines. That part is easy.&lt;/p&gt;

&lt;p&gt;The work is in everything around the model. The compaction so long chats don't OOM. The RAG fallbacks for queries that don't fit the textbook similarity-search assumption. The native channels for actual device actions. The Gradle excludes for plugin libs you don't need. The prompt structure that gets a 2B model to follow instructions reliably. The error handling for when the model file is half downloaded and the user opens the app anyway.&lt;/p&gt;

&lt;p&gt;What surprised me about Gemma 4 E2B, coming from cloud models, is how capable a 2B model can be when it's running fast on your own device. Vision captioning is genuinely useful. Intent classification across 8 tool categories works well enough to ship. There's no perceived "AI thinking" delay because there's no network. The model speaks at the speed of your phone.&lt;/p&gt;

&lt;p&gt;For a 1.5 GB download, that's a good deal.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Repo:&lt;/strong&gt; &lt;a href="https://github.com/ManojRakshu/pocketclaw" rel="noopener noreferrer"&gt;github.com/ManojRakshu/pocketclaw&lt;/a&gt;&lt;br&gt;&lt;br&gt;
&lt;strong&gt;License:&lt;/strong&gt; MIT&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Tested on:&lt;/strong&gt; OnePlus Nord CE 4 (Snapdragon 7s Gen 3, 8 GB RAM, Android 14)&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Demo recorded:&lt;/strong&gt; May 24, 2026&lt;/p&gt;

&lt;p&gt;If you want to look at the design system, RAG service, or compaction engine, the source is fully open. PRs welcome.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Editorial assistance from Claude. All code, decisions, bugs, and engineering choices in this post are mine.&lt;/em&gt;&lt;/p&gt;




</description>
      <category>gemmachallenge</category>
      <category>gemma</category>
      <category>flutter</category>
      <category>devchallenge</category>
    </item>
  </channel>
</rss>
