<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Santhoshkumar. P</title>
    <description>The latest articles on DEV Community by Santhoshkumar. P (@sann3).</description>
    <link>https://dev.to/sann3</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F304924%2F867f1d04-167f-4e9e-9c94-e94bd405d79c.jpg</url>
      <title>DEV Community: Santhoshkumar. P</title>
      <link>https://dev.to/sann3</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/sann3"/>
    <language>en</language>
    <item>
      <title>Shipping on Gemma 4: chain-of-thought leakage, MoE-vs-Dense, and on-device pragmatism</title>
      <dc:creator>Santhoshkumar. P</dc:creator>
      <pubDate>Wed, 20 May 2026 12:42:41 +0000</pubDate>
      <link>https://dev.to/sann3/shipping-on-gemma-4-chain-of-thought-leakage-moe-vs-dense-and-on-device-pragmatism-2i1f</link>
      <guid>https://dev.to/sann3/shipping-on-gemma-4-chain-of-thought-leakage-moe-vs-dense-and-on-device-pragmatism-2i1f</guid>
      <description>&lt;p&gt;&lt;em&gt;This is a submission for the &lt;a href="https://dev.to/challenges/google-gemma-2026-05-06"&gt;Gemma 4 Challenge: Write About Gemma 4&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Shipping on Gemma 4: chain-of-thought leakage, MoE-vs-Dense, and on-device pragmatism
&lt;/h1&gt;

&lt;p&gt;I built and shipped &lt;a href="https://github.com/sann3/curio-kid" rel="noopener noreferrer"&gt;Curio Kid&lt;/a&gt;, a kid-safe multimodal Android app where my 6-year-old asks Luna (a Gemma-4-powered tutor) anything by text, voice, or camera. The product story is in my other submission. This post is the &lt;strong&gt;engineering writeup&lt;/strong&gt; — three things about Gemma 4 that I had to actually work around in production, with the code and reasoning behind each fix.&lt;/p&gt;

&lt;p&gt;If you're about to ship a Gemma 4 app, these are the three traps I'd want to know about on day one.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. Chain-of-thought leakage is real, and it hits the user
&lt;/h2&gt;

&lt;p&gt;Gemma 4 is good at following structured system prompts. &lt;em&gt;Too good&lt;/em&gt;, sometimes. Give it a strict persona spec and it will occasionally &lt;strong&gt;show you the rubric while answering&lt;/strong&gt;. In my testing, a meaningful slice of responses came back like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;**Intent:** child wants to know why the sky is blue
**Tone check:** warm, age-5 vocabulary, no jargon
**Final Polish:**

Great question! The sky is blue because…
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The "&lt;strong&gt;Final Polish:&lt;/strong&gt;" line is the give-away — Gemma is narrating its own polishing step before giving the answer. For a chatbot aimed at a 6-year-old, that's not a quirk; it's a UX bug.&lt;/p&gt;

&lt;h3&gt;
  
  
  The naïve fix doesn't work
&lt;/h3&gt;

&lt;p&gt;The obvious instinct is "tell the model not to do this in the system prompt." I tried. My prompt now contains a half-page of &lt;em&gt;"never write section labels like 'Final Polish', 'Self-Correction', 'Reasoning', 'Plan'… your very first word must be part of the actual answer"&lt;/em&gt;. It helps, but it doesn't eliminate the leaks.&lt;/p&gt;

&lt;p&gt;The reason it can't eliminate the issue: instruction-following is a soft constraint. The same model that's smart enough to follow a 50-line persona spec is also smart enough to &lt;em&gt;think&lt;/em&gt; about how to follow it — and sometimes the thinking ends up on the wire.&lt;/p&gt;

&lt;h3&gt;
  
  
  The real fix: a three-stage cleaner that knows the difference between meta and content
&lt;/h3&gt;

&lt;p&gt;I ended up with a 100-line response sanitiser (&lt;a href="https://github.com/sann3/curio-kid/blob/main/app/src/main/java/com/curiokid/app/ai/LunaAI.kt" rel="noopener noreferrer"&gt;&lt;code&gt;LunaAI.kt&lt;/code&gt;&lt;/a&gt;) that does three things in order:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stage 1 — Anchor detection.&lt;/strong&gt; Look for "final answer" / "polished response" / "answer:" anchors on their own line. If present, throw away everything &lt;em&gt;before the last one&lt;/em&gt; and keep only what follows. This handles the dominant failure mode (model planned out loud, then gave the real answer at the bottom).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;&lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;finalAnchorLine&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Regex&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s"&gt;"(?im)^\\s*\\*{0,2}\\s*"&lt;/span&gt; &lt;span class="p"&gt;+&lt;/span&gt;
        &lt;span class="s"&gt;"(?:final(?:\\s+(?:polish|answer|response|reply|draft|version))?"&lt;/span&gt; &lt;span class="p"&gt;+&lt;/span&gt;
        &lt;span class="s"&gt;"|polished(?:\\s+(?:answer|reply|response))?"&lt;/span&gt; &lt;span class="p"&gt;+&lt;/span&gt;
        &lt;span class="s"&gt;"|the\\s+answer|answer|response|reply)"&lt;/span&gt; &lt;span class="p"&gt;+&lt;/span&gt;
        &lt;span class="s"&gt;"\\s*\\*{0,2}\\s*:\\s*$"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;afterAnchor&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;finalAnchorLine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;findAll&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;lastOrNull&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="o"&gt;?.&lt;/span&gt;&lt;span class="nf"&gt;let&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;substring&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;it&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;range&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;last&lt;/span&gt; &lt;span class="p"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;?:&lt;/span&gt; &lt;span class="n"&gt;raw&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Stage 2 — Paragraph-level meta filter.&lt;/strong&gt; Split by blank lines, drop any paragraph that contains chain-of-thought &lt;em&gt;prose cues&lt;/em&gt; — phrases like &lt;em&gt;"the prompt says…"&lt;/em&gt;, &lt;em&gt;"I'll treat the question as…"&lt;/em&gt;, &lt;em&gt;"drafting…"&lt;/em&gt;, &lt;em&gt;"let me revise…"&lt;/em&gt;. This catches the case where the model narrates its process in prose instead of with labels.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stage 3 — Line-level scrub.&lt;/strong&gt; A safety net for leaks embedded inside an otherwise-good paragraph: bullet/label lines like &lt;code&gt;* Intent: …&lt;/code&gt; or &lt;code&gt;**Tone check:** …&lt;/code&gt;.&lt;/p&gt;
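
&lt;p&gt;A minimal sketch of how stages 2 and 3 could look. The names &lt;code&gt;metaCue&lt;/code&gt;, &lt;code&gt;labelLine&lt;/code&gt;, &lt;code&gt;dropMetaParagraphs&lt;/code&gt;, and &lt;code&gt;scrubLabelLines&lt;/code&gt; are hypothetical, and the patterns are illustrative ones built from the phrases quoted in this post, not the ones in &lt;code&gt;LunaAI.kt&lt;/code&gt;.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;// Hypothetical cue list for planning prose; illustrative, not the app's real patterns.
val metaCue = Regex(
    "(?i)\\b(?:the prompt says|i'll treat the question as|drafting|let me revise)\\b"
)

// Hypothetical label list: optional bullet, optional bold markers, a known label, a colon.
val labelLine = Regex(
    "(?i)^\\s*(?:[*-]\\s+)?\\*{0,2}" +
        "(?:intent|tone check|final polish|self-correction|reasoning|plan)" +
        "\\s*:\\*{0,2}.*$"
)

// Stage 2: split on blank lines and drop paragraphs that contain a planning cue.
fun dropMetaParagraphs(text: String): String =
    text.split(Regex("\\n\\s*\\n"))
        .filterNot { metaCue.containsMatchIn(it) }
        .joinToString("\n\n")

// Stage 3: drop single label lines left inside an otherwise good paragraph.
fun scrubLabelLines(text: String): String =
    text.lines()
        .filterNot { labelLine.matches(it) }
        .joinToString("\n")
        .trim()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Running &lt;code&gt;scrubLabelLines(dropMetaParagraphs(afterAnchor))&lt;/code&gt; keeps the order described above: whole planning paragraphs go first, then stray label lines. Matching a fixed label list rather than any "Word:" prefix is a deliberate choice in this sketch, so that a line such as "Fun fact: pandas eat bamboo" survives.&lt;/p&gt;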

&lt;h3&gt;
  
  
  The non-obvious part: what &lt;em&gt;not&lt;/em&gt; to filter
&lt;/h3&gt;

&lt;p&gt;The interesting design problem isn't writing the regex; it's making sure you don't kill &lt;em&gt;legitimate&lt;/em&gt; content. Three rules I learned the hard way:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Never filter on the word &lt;em&gt;"think"&lt;/em&gt; alone.&lt;/strong&gt; Phrases like &lt;em&gt;"Let me think of a fun example!"&lt;/em&gt; are exactly the warm tone you &lt;em&gt;want&lt;/em&gt; from a kid's tutor. My meta-regex matches &lt;em&gt;"let me / I'll / I will / I should"&lt;/em&gt; only when followed by a planning verb (&lt;em&gt;plan, draft, rewrite, revise, polish, reconsider, interpret&lt;/em&gt;). &lt;em&gt;"Let me think of"&lt;/em&gt; slips through. Good.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Only apply bullet-stripping when a leak is detected nearby.&lt;/strong&gt; Gemma sometimes &lt;em&gt;does&lt;/em&gt; legitimately produce a bulleted list when the kid asks "give me three facts about pandas". You don't want to scrub bullets unconditionally; you want to scrub them only when other meta-leakage is already visible on the page.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Have a fallback for "scrubbed to nothing".&lt;/strong&gt; If filtering empties the response, return &lt;em&gt;"Hmm, let me think about that another way — could you ask me again?"&lt;/em&gt; — not a blank bubble.&lt;/li&gt;
&lt;/ul&gt;
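
&lt;p&gt;The three rules can be sketched in a few lines. This is an assumption-laden illustration, not the code from the repository: &lt;code&gt;planningCue&lt;/code&gt;, &lt;code&gt;bulletLine&lt;/code&gt;, &lt;code&gt;finish&lt;/code&gt;, the &lt;code&gt;leakDetected&lt;/code&gt; flag, and the shortened &lt;code&gt;FALLBACK&lt;/code&gt; text are all hypothetical names and values.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;// Rule 1: match "let me / I'll / I will / I should" only when a planning verb follows.
// The verb list is the one quoted above; the exact pattern is an assumption.
val planningCue = Regex(
    "(?i)\\b(?:let me|i'll|i will|i should)\\s+" +
        "(?:plan|draft|rewrite|revise|polish|reconsider|interpret)\\b"
)

// A plain bullet line such as "* Pandas eat bamboo".
val bulletLine = Regex("^\\s*[*-]\\s+.*$")

// Placeholder fallback text (hypothetical wording).
const val FALLBACK = "Hmm, let me think about that another way. Could you ask me again?"

// Hypothetical final pass; `leakDetected` would be set by the earlier stages.
fun finish(text: String, leakDetected: Boolean): String {
    val kept = text.lines().filterNot {
        val isPlanning = planningCue.containsMatchIn(it)
        // Rule 2: bullets are scrubbed only when other meta-leakage was already seen.
        val isLeakedBullet = if (leakDetected) bulletLine.matches(it) else false
        isPlanning || isLeakedBullet
    }
    // Rule 3: never return a blank bubble.
    return kept.joinToString("\n").trim().ifEmpty { FALLBACK }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With this sketch, &lt;code&gt;finish("Let me think of a fun example!", false)&lt;/code&gt; returns the sentence unchanged because "think" is not in the planning-verb list, while a line starting with "Let me revise" is dropped.&lt;/p&gt;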

&lt;h3&gt;
  
  
  Takeaway
&lt;/h3&gt;

&lt;p&gt;If you're building a user-facing app on Gemma 4 — especially with kids, customer support, or anywhere "the model thinking out loud is bad UX" — &lt;strong&gt;assume CoT leakage will happen and ship a sanitiser&lt;/strong&gt;. A sanitiser is also dramatically cheaper than fine-tuning, and it composes with whichever model variant you swap in next month.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. MoE vs Dense: how I actually chose between &lt;code&gt;gemma-4-26b-a4b-it&lt;/code&gt; and &lt;code&gt;gemma-4-31b-it&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;The Gemma 4 family ships three architectures for very different deployment targets:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Variant&lt;/th&gt;
&lt;th&gt;Effective params&lt;/th&gt;
&lt;th&gt;Architecture&lt;/th&gt;
&lt;th&gt;Where it shines&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;E2B / E4B&lt;/td&gt;
&lt;td&gt;2B / 4B&lt;/td&gt;
&lt;td&gt;Small dense&lt;/td&gt;
&lt;td&gt;Ultra-mobile, edge, browser&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;gemma-4-26b-a4b-it&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;26B (~4B active)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Mixture-of-Experts&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Server-grade chat, multimodal, low-latency&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;gemma-4-31b-it&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;31B&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Dense&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Hardest reasoning, multi-step problems&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For a kid-facing multimodal chat app I shipped the &lt;strong&gt;26B MoE as the default&lt;/strong&gt; and the &lt;strong&gt;31B Dense as an opt-in "thinker" mode&lt;/strong&gt;. Both have the 256K context window and Apache-2.0 licence, so the choice is purely about latency vs. depth.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why MoE wins the default slot
&lt;/h3&gt;

&lt;p&gt;The MoE's superpower isn't raw size — it's that &lt;strong&gt;only a slice of experts is activated per token&lt;/strong&gt;. You pay ~4B of compute per token while keeping 26B of &lt;em&gt;capacity&lt;/em&gt; available across the network. For my workload that translated into three concrete wins:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;First-token latency that feels like chat, not like batch inference.&lt;/strong&gt; Streaming starts in well under a second on Google AI Studio. A 6-year-old's patience is shorter than the inverse of his curiosity rate, so this matters more than benchmark scores.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multimodal in the same model.&lt;/strong&gt; No separate vision pass, no second API call for the image. &lt;em&gt;"What kind of bug is this?"&lt;/em&gt; with a photo attached is one request.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;256K context lets the Curiosity Digest be a one-shot.&lt;/strong&gt; At the end of the day, I concatenate the whole transcript into a single prompt and ask Gemma to produce a structured digest. No RAG, no map-reduce summarisation. The whole "parent dashboard" feature is ~30 lines of glue because of this.&lt;/li&gt;
&lt;/ol&gt;
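
&lt;p&gt;To make the one-shot digest concrete, here is a rough sketch of the glue. The &lt;code&gt;ChatTurn&lt;/code&gt; fields, the &lt;code&gt;buildDigestPrompt&lt;/code&gt; name, and the instruction wording are assumptions for illustration; the real prompt is in the repository.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;// Hypothetical shape; the real ChatTurn fields are not shown in this post.
data class ChatTurn(val speaker: String, val text: String)

// Flatten the day's turns into one transcript and wrap it in a single digest request.
// No chunking or retrieval step: the long context window is what makes this viable.
fun buildDigestPrompt(turns: List&amp;lt;ChatTurn&amp;gt;): String {
    val transcript = turns.joinToString("\n") { "${it.speaker}: ${it.text}" }
    return "Summarise today's questions as themes, highlights, " +
        "conversation starters, and flags.\n\n" +
        "Transcript:\n" + transcript
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The returned string would then go out as the user text of a single model call.&lt;/p&gt;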

&lt;h3&gt;
  
  
  Why Dense earns its own button
&lt;/h3&gt;

&lt;p&gt;For the questions that are genuinely &lt;em&gt;hard&lt;/em&gt; — "why do mirrors flip left-and-right but not up-and-down?" is the canonical kid-stumper — the 31B Dense produces noticeably better multi-step reasoning. It's slower and pricier per call, so it's not the right default for "explain photosynthesis in three sentences", but it's the right tool when the kid trips into something philosophical.&lt;/p&gt;

&lt;h3&gt;
  
  
  The mental model I'd suggest for picking
&lt;/h3&gt;

&lt;p&gt;Forget the parameter count for a second and ask three questions about &lt;em&gt;your&lt;/em&gt; workload:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Is latency-to-first-token a UX requirement?&lt;/strong&gt; → MoE.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Are you doing multimodal in the same call?&lt;/strong&gt; → MoE (with image input).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Do you measurably gain on the hardest 10% of your prompts when you swap to Dense?&lt;/strong&gt; → ship both, give users a toggle.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Don't pick on price; pick on &lt;em&gt;what your prompts actually need&lt;/em&gt;. The MoE is the right answer for most chat workloads. The Dense is the right answer when you can articulate the reasoning gap.&lt;/p&gt;




&lt;h2&gt;
  
  
  3. On-device pragmatism: cloud-first isn't a cop-out
&lt;/h2&gt;

&lt;p&gt;The most photogenic Gemma 4 demos run E2B on a Pixel or a Raspberry Pi. They're amazing. They're also &lt;strong&gt;not the right default for a consumer Android app you want real families to use.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Two realities pushed me cloud-first:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Not every phone can run Gemma 4 locally.&lt;/strong&gt; A multi-gigabyte model needs the RAM, the storage, and the NPU/GPU to be worth the wait. Older flagships, mid-range phones, and the hand-me-down tablet a kid actually gets to use aren't there yet. Gating an app on "must own a current flagship" defeats the point of an accessible kids' app.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Quality matters more than offline-ness for a 6-year-old.&lt;/strong&gt; A child being confidently told &lt;em&gt;"the moon is made of cheese"&lt;/em&gt; by an under-cooked tiny model is a worse experience than a 2-second wait over Wi-Fi for the 26B MoE.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  The architecture trick that lets you defer the choice
&lt;/h3&gt;

&lt;p&gt;What I'd recommend for anyone shipping today: don't pick &lt;em&gt;between&lt;/em&gt; cloud and on-device. &lt;strong&gt;Pick a backend interface and write three implementations.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;&lt;span class="kd"&gt;interface&lt;/span&gt; &lt;span class="nc"&gt;LlmBackend&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;suspend&lt;/span&gt; &lt;span class="k"&gt;fun&lt;/span&gt; &lt;span class="nf"&gt;ask&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;systemPrompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;ChatTurn&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;,&lt;/span&gt;
        &lt;span class="n"&gt;userText&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;Bitmap&lt;/span&gt;&lt;span class="p"&gt;?,&lt;/span&gt;
        &lt;span class="n"&gt;modelName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;

    &lt;span class="k"&gt;suspend&lt;/span&gt; &lt;span class="k"&gt;fun&lt;/span&gt; &lt;span class="nf"&gt;summarise&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;systemPrompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;rawHistoryText&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;modelName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In Curio Kid that interface has three implementations: &lt;code&gt;GoogleAiStudioBackend&lt;/code&gt;, &lt;code&gt;OpenRouterBackend&lt;/code&gt;, and a &lt;code&gt;LocalGemmaBackend&lt;/code&gt; stub that throws a friendly &lt;em&gt;"on-device Gemma 4 isn't installed on this phone yet"&lt;/em&gt; until a MediaPipe &lt;code&gt;.task&lt;/code&gt; file is wired in. Same system prompt, same response cleaner, same UI for all three. The provider is a single enum in &lt;code&gt;EncryptedSharedPreferences&lt;/code&gt; and a one-tap toggle in Settings.&lt;/p&gt;

&lt;p&gt;The pay-off: when E2B becomes the right default — when phones catch up, when multimodal lands in MediaPipe, when battery cost makes sense — I change &lt;em&gt;one factory method&lt;/em&gt;. The persona, the safety prompt, the digest pipeline, the cleaner all carry over. &lt;strong&gt;The same kid talking to the same Luna; the model just moved into the phone.&lt;/strong&gt;&lt;/p&gt;
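
&lt;p&gt;A sketch of what that single factory method could look like. To stay self-contained it uses a simplified, non-suspending &lt;code&gt;Backend&lt;/code&gt; stand-in rather than the real &lt;code&gt;LlmBackend&lt;/code&gt;; the &lt;code&gt;Provider&lt;/code&gt; enum, &lt;code&gt;backendFor&lt;/code&gt;, and all method bodies are hypothetical.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;// Simplified stand-in for LlmBackend: one method, no Bitmap, not suspend.
interface Backend {
    fun ask(systemPrompt: String, userText: String, modelName: String): String
}

// Hypothetical provider enum: the value a Settings toggle would persist.
enum class Provider { GOOGLE_AI_STUDIO, OPEN_ROUTER, ON_DEVICE }

class GoogleAiStudioBackend : Backend {
    override fun ask(systemPrompt: String, userText: String, modelName: String): String =
        TODO("call Google AI Studio here")
}

class OpenRouterBackend : Backend {
    override fun ask(systemPrompt: String, userText: String, modelName: String): String =
        TODO("call OpenRouter here")
}

// Stub that fails with a friendly message until a local model is wired in.
class LocalGemmaBackend : Backend {
    override fun ask(systemPrompt: String, userText: String, modelName: String): String =
        throw IllegalStateException("On-device Gemma 4 isn't installed on this phone yet")
}

// The one place that changes when the default provider moves on-device.
fun backendFor(provider: Provider): Backend = when (provider) {
    Provider.GOOGLE_AI_STUDIO -&amp;gt; GoogleAiStudioBackend()
    Provider.OPEN_ROUTER -&amp;gt; OpenRouterBackend()
    Provider.ON_DEVICE -&amp;gt; LocalGemmaBackend()
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Because the &lt;code&gt;when&lt;/code&gt; is exhaustive over the enum, adding a fourth provider fails to compile until the factory handles it.&lt;/p&gt;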

&lt;p&gt;That's the on-device pragmatism: don't bet on offline-first when your users can't run it, but don't lock yourself out of it either. Bet on the abstraction.&lt;/p&gt;




&lt;h2&gt;
  
  
  Three small SDK gotchas I'd want to have known on day one
&lt;/h2&gt;

&lt;p&gt;While I'm here: three concrete Gemini-SDK-on-Android landmines that cost me an evening each.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The 80-second socket timeout is hard-coded.&lt;/strong&gt; &lt;code&gt;RequestOptions&lt;/code&gt; doesn't expose a knob to change it. If a one-shot &lt;code&gt;generateContent&lt;/code&gt; call goes more than 80 seconds without returning anything, which is easy when the whole answer has to be generated before the first byte is sent, you'll get a &lt;code&gt;SocketTimeoutException&lt;/code&gt; even though the model is fine. &lt;strong&gt;Fix: use &lt;code&gt;generateContentStream&lt;/code&gt; instead of &lt;code&gt;generateContent&lt;/code&gt;.&lt;/strong&gt; The read timer resets on each chunk, so as long as tokens are flowing you never trip the cap.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;MAX_TOKENS&lt;/code&gt; throws, it doesn't return the partial text.&lt;/strong&gt; The Kotlin SDK raises &lt;code&gt;ResponseStoppedException&lt;/code&gt; from the &lt;code&gt;.text&lt;/code&gt; convenience getter when finish reason ≠ STOP. You have to catch it and walk &lt;code&gt;candidates[0].content.parts&lt;/code&gt; for &lt;code&gt;TextPart&lt;/code&gt;s yourself to recover the 90%-complete answer the user nearly got.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A 500 from the upstream model often surfaces as &lt;code&gt;MissingFieldException&lt;/code&gt; from kotlinx-serialization.&lt;/strong&gt; When the Gemini backend has a hiccup it returns JSON that the SDK's strict deserialiser doesn't recognise, and the exception you see is the &lt;em&gt;serialisation&lt;/em&gt; failure, not the underlying 500. Worth normalising every error class through a single &lt;code&gt;friendlyError()&lt;/code&gt; mapper that walks the cause chain — the &lt;em&gt;real&lt;/em&gt; problem is usually two layers down.&lt;/li&gt;
&lt;/ul&gt;
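
&lt;p&gt;The cause-chain walk from the last bullet is easy to sketch in plain Kotlin. This is only an illustration: &lt;code&gt;causeChain&lt;/code&gt; is a hypothetical helper, and the matched conditions and messages in this &lt;code&gt;friendlyError&lt;/code&gt; are placeholders, not the mapper from the repository.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;import java.net.SocketTimeoutException

// Walk from the outermost exception down to the root cause.
// take(10) guards against a pathological, self-referencing cause chain.
fun causeChain(t: Throwable): List&amp;lt;Throwable&amp;gt; =
    generateSequence(t) { it.cause }.take(10).toList()

// Hypothetical mapper: inspects every layer, not just the top-level exception.
fun friendlyError(t: Throwable): String {
    val chain = causeChain(t)
    return when {
        chain.any { it is SocketTimeoutException } -&amp;gt;
            "That took too long. Let's try again!"
        // Crude placeholder check for an upstream 500 mentioned in any message.
        chain.any { it.message?.contains("500") == true } -&amp;gt;
            "Luna needs a short rest. Let's try again in a minute."
        else -&amp;gt; "Something went wrong. Could you ask me again?"
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A serialisation failure that wraps a timeout two layers down still maps to the timeout message, which is the point of walking the chain instead of switching on the outermost class.&lt;/p&gt;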




&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;Three lessons from shipping on Gemma 4:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;CoT leakage is a UX problem, not a prompt problem.&lt;/strong&gt; Ship a sanitiser. Be careful what you scrub.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MoE is the right default for chat; Dense is the right tool for hard reasoning.&lt;/strong&gt; Give users the toggle, and choose the default by weighing latency and multimodal needs against reasoning quality on the hardest 10% of prompts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cloud-first isn't a cop-out, but architecting for on-device later is non-negotiable.&lt;/strong&gt; A &lt;code&gt;LlmBackend&lt;/code&gt; interface with three implementations buys you the option.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The Gemma 4 family is the first open-model release where I genuinely had to think about &lt;em&gt;which member to ship for which job&lt;/em&gt; — that's a great problem to have. If you're building on it, I hope these save you a weekend each.&lt;/p&gt;

&lt;p&gt;Code is at &lt;strong&gt;&lt;a href="https://github.com/sann3/curio-kid" rel="noopener noreferrer"&gt;github.com/sann3/curio-kid&lt;/a&gt;&lt;/strong&gt; if you want to read the cleaner, the backend interface, or the friendly-error mapper in full. Happy to answer questions in the comments.&lt;/p&gt;

&lt;p&gt;Thanks to the DEV team and Google for the challenge!&lt;/p&gt;

</description>
      <category>devchallenge</category>
      <category>gemmachallenge</category>
      <category>gemma</category>
    </item>
    <item>
      <title>My 6-year-old asks 400 questions a day. So I built him a Gemma 4 AI tutor.</title>
      <dc:creator>Santhoshkumar. P</dc:creator>
      <pubDate>Wed, 20 May 2026 12:26:19 +0000</pubDate>
      <link>https://dev.to/sann3/my-6-year-old-asks-400-questions-a-day-so-i-built-him-a-gemma-4-ai-tutor-1e13</link>
      <guid>https://dev.to/sann3/my-6-year-old-asks-400-questions-a-day-so-i-built-him-a-gemma-4-ai-tutor-1e13</guid>
      <description>&lt;p&gt;&lt;em&gt;This is a submission for the &lt;a href="https://dev.to/challenges/google-gemma-2026-05-06"&gt;Gemma 4 Challenge: Build with Gemma 4&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;My 6-year-old asks me four hundred questions a day — about clouds, his shadow, whether ants have birthdays. I love it, but I can't always stop what I'm doing, and the usual fallbacks (Google, YouTube, a generic chatbot) are either too dense, too distracting, or too unsafe to hand a small child. &lt;strong&gt;Curio Kid is the app I built so my son can keep asking — and actually get warm, kid-friendly answers — without me worrying about what he sees next.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  What I Built
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Curio Kid&lt;/strong&gt; is a kid-safe Android app where a child asks &lt;strong&gt;anything&lt;/strong&gt; — by typing, snapping a photo, attaching an image, or just talking — and gets a warm, age-appropriate answer from &lt;strong&gt;Luna&lt;/strong&gt;, an AI tutor powered by &lt;strong&gt;Gemma 4&lt;/strong&gt;. Answers are short on purpose: 2–5 sentences, an everyday analogy (Lego, swings, fruit), and a follow-up question to keep the curiosity loop running.&lt;/p&gt;

&lt;p&gt;Designing it for my own kid forced some opinionated choices:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;He can't reliably read or type yet, but he can talk and point a camera.&lt;/strong&gt; Voice and camera are first-class inputs, not afterthoughts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;He will absolutely test the safety rails.&lt;/strong&gt; Kids ask wild things (&lt;em&gt;"what happens if I drink poison?"&lt;/em&gt;, &lt;em&gt;"why do people fight in wars?"&lt;/em&gt;) — Luna has to handle them gracefully &lt;em&gt;every single time&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;I want to know what he's curious about, not spy on him.&lt;/strong&gt; Hence the &lt;strong&gt;Curiosity Digest&lt;/strong&gt; — a daily themed summary, not a chat log.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What makes it more than "yet another chatbot wrapper":&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Multimodal input&lt;/strong&gt; — text, gallery image, live camera, on-device speech-to-text.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Safety as a hard requirement&lt;/strong&gt; — locked-down system prompt + Gemini safety thresholds pinned to &lt;code&gt;LOW_AND_ABOVE&lt;/code&gt; across harassment, hate, sexually explicit, and dangerous content; unsafe topics get a fixed redirect to "a trusted adult."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Parent Dashboard&lt;/strong&gt; — PIN-gated, with a one-tap &lt;strong&gt;Curiosity Digest&lt;/strong&gt;: themes, highlights with quotes, dinner-table conversation starters, and an "anything to flag?" section.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Privacy-first&lt;/strong&gt; — API key + PIN in &lt;code&gt;EncryptedSharedPreferences&lt;/code&gt; (AES-256); question history in a local Room DB, excluded from cloud backup; the only network call is to the model endpoint with the user's own key.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Three interchangeable Gemma 4 back-ends&lt;/strong&gt; — not every family phone can host a multi-gigabyte model on-device, so &lt;strong&gt;Google AI Studio&lt;/strong&gt; (default, free tier, multimodal), &lt;strong&gt;OpenRouter&lt;/strong&gt;, and a scaffolded &lt;strong&gt;on-device&lt;/strong&gt; path are all swappable from Settings.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Output cleaning&lt;/strong&gt; — Gemma 4 sometimes thinks out loud (&lt;em&gt;"Final Polish:"&lt;/em&gt;, &lt;em&gt;"Let me revise…"&lt;/em&gt;); a post-processor strips those leaks so the child only sees the final answer.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Demo
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://raw.githubusercontent.com/sann3/curio-kid/main/demo/Home.png" rel="noopener noreferrer"&gt;https://raw.githubusercontent.com/sann3/curio-kid/main/demo/Home.png&lt;/a&gt; &lt;br&gt;
&lt;a href="https://raw.githubusercontent.com/sann3/curio-kid/main/demo/1i.png" rel="noopener noreferrer"&gt;https://raw.githubusercontent.com/sann3/curio-kid/main/demo/1i.png&lt;/a&gt;&lt;br&gt;
&lt;a href="https://raw.githubusercontent.com/sann3/curio-kid/main/demo/2i.png" rel="noopener noreferrer"&gt;https://raw.githubusercontent.com/sann3/curio-kid/main/demo/2i.png&lt;/a&gt;&lt;br&gt;
&lt;a href="https://raw.githubusercontent.com/sann3/curio-kid/main/demo/3i.png" rel="noopener noreferrer"&gt;https://raw.githubusercontent.com/sann3/curio-kid/main/demo/3i.png&lt;/a&gt;&lt;br&gt;
&lt;a href="https://raw.githubusercontent.com/sann3/curio-kid/main/demo/4i.png" rel="noopener noreferrer"&gt;https://raw.githubusercontent.com/sann3/curio-kid/main/demo/4i.png&lt;/a&gt;&lt;br&gt;
&lt;a href="https://raw.githubusercontent.com/sann3/curio-kid/main/demo/5i.png" rel="noopener noreferrer"&gt;https://raw.githubusercontent.com/sann3/curio-kid/main/demo/5i.png&lt;/a&gt;&lt;br&gt;
&lt;a href="https://raw.githubusercontent.com/sann3/curio-kid/main/demo/6i.png" rel="noopener noreferrer"&gt;https://raw.githubusercontent.com/sann3/curio-kid/main/demo/6i.png&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://raw.githubusercontent.com/sann3/curio-kid/main/demo/final.mp4" rel="noopener noreferrer"&gt;https://raw.githubusercontent.com/sann3/curio-kid/main/demo/final.mp4&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Code
&lt;/h2&gt;

&lt;p&gt;GitHub: &lt;strong&gt;&lt;a href="https://github.com/sann3/curio-kid" rel="noopener noreferrer"&gt;github.com/sann3/curio-kid&lt;/a&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  How I Used Gemma 4
&lt;/h2&gt;

&lt;p&gt;Curio Kid exposes &lt;strong&gt;two Gemma 4 variants&lt;/strong&gt; in the model picker, and the choice is intentional.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;gemma-4-26b-a4b-it&lt;/code&gt; — 26B Mixture-of-Experts (default)
&lt;/h3&gt;

&lt;p&gt;The daily driver. A kid-facing chat app needs three things at once: &lt;strong&gt;multimodal&lt;/strong&gt;, &lt;strong&gt;fast first-token latency&lt;/strong&gt;, and &lt;strong&gt;smart enough to teach&lt;/strong&gt;. MoE hits all three — only a slice of experts fires per token, so latency feels ~4B-class while depth stays 26B-class. In practice:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A child holding up a beetle to the camera gets an answer in a couple of seconds, not ten.&lt;/li&gt;
&lt;li&gt;Streaming starts almost instantly, so chat bubbles fill in live (and incidentally dodge the Gemini SDK's hard-coded 80s socket timeout — Curio Kid uses &lt;code&gt;generateContentStream&lt;/code&gt; for exactly this reason).&lt;/li&gt;
&lt;li&gt;The 256K context window means the whole day's history fits into a single &lt;strong&gt;Curiosity Digest&lt;/strong&gt; call — no RAG, no summarisation tricks.&lt;/li&gt;
&lt;li&gt;Same model handles &lt;em&gt;"Why is the sky blue?"&lt;/em&gt; &lt;strong&gt;and&lt;/strong&gt; a photo of a moth.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Dense is overkill for "explain photosynthesis in three sentences"; E2B/E4B don't yet match 31B-class reasoning on the harder "why" questions kids love. MoE is the right middle ground.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;gemma-4-31b-it&lt;/code&gt; — 31B Dense (optional "thinker" mode)
&lt;/h3&gt;

&lt;p&gt;For genuinely hard questions (&lt;em&gt;"Why do mirrors flip left-and-right but not up-and-down?"&lt;/em&gt;). Slower and pricier per call, but noticeably better on multi-step or counterintuitive reasoning. Same persona, same safety, same UI — just a heavier brain when the curiosity warrants it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why not E2B / E4B by default?
&lt;/h3&gt;

&lt;p&gt;On-device is fully wired up via MediaPipe LLM Inference — Settings → &lt;strong&gt;On-device&lt;/strong&gt; downloads a vision-capable Gemma 4 &lt;code&gt;.task&lt;/code&gt; (resumable, sha256-checked, metered-network aware) and runs it through a process-wide &lt;code&gt;LlmInference&lt;/code&gt; singleton with &lt;code&gt;addImage&lt;/code&gt; for the camera path. But cloud stays the &lt;strong&gt;default&lt;/strong&gt; because:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Not every phone can run Gemma 4 locally.&lt;/strong&gt; Multi-GB models need RAM and storage the hand-me-down tablet a kid actually uses doesn't have. Gating first launch behind "Pixel 8 Pro + 1.6 GB cellular download" defeats the point.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Quality &amp;gt; offline for a six-year-old.&lt;/strong&gt; Being told &lt;em&gt;"the moon is made of cheese"&lt;/em&gt; by an under-cooked tiny model is worse than waiting two seconds over Wi-Fi.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;So Google AI Studio is the zero-friction default, OpenRouter is the alt-cloud, and on-device is one Settings tap away for capable phones — same &lt;code&gt;LlmBackend&lt;/code&gt; interface, same prompts, same cleaner.&lt;/p&gt;

&lt;h3&gt;
  
  
  Where Gemma 4 actually does the work
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;The chat.&lt;/strong&gt; Multimodal &lt;code&gt;(image + history + question) → kid-friendly paragraph&lt;/code&gt;. The system prompt is strict (2–5 sentences, analogies, ≤2 emojis, one follow-up, no markdown) and Gemma 4 follows it remarkably well.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Safety reasoning.&lt;/strong&gt; Instead of a blocklist, Luna &lt;em&gt;reasons&lt;/em&gt; about whether a topic is age-appropriate and produces a fixed redirect line — Gemma 4 is instruction-faithful enough to honour a "ONLY reply with this exact sentence" clause while still engaging naturally with the 99% of fine questions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Curiosity Digest.&lt;/strong&gt; Day's transcript → structured markdown summary (themes / highlights / conversation starters / flags) in one shot — long-context + structured-output, no orchestration framework.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Bits I had to engineer around Gemma 4's quirks
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Chain-of-thought leakage.&lt;/strong&gt; Gemma 4 occasionally emits &lt;em&gt;"Final Polish:"&lt;/em&gt; / &lt;em&gt;"Self-Correction:"&lt;/em&gt; / &lt;em&gt;"Let me rewrite…"&lt;/em&gt; before its real answer. &lt;code&gt;cleanLunaReply&lt;/code&gt; (&lt;code&gt;LunaAI.kt&lt;/code&gt;) detects anchors, drops planning paragraphs, and strips markdown emphasis — without nuking legit phrases like &lt;em&gt;"Let me think of a fun example!"&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;MAX_TOKENS&lt;/code&gt; stops.&lt;/strong&gt; The Gemini SDK throws &lt;code&gt;ResponseStoppedException&lt;/code&gt; instead of returning partial text; I catch it on both one-shot and streaming paths and surface what already arrived.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;80s socket timeout.&lt;/strong&gt; Hard-coded in the Kotlin SDK with no &lt;code&gt;RequestOptions&lt;/code&gt; override. Streaming resets the read timer per chunk, so slow first-byte doesn't kill the request.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Friendly errors.&lt;/strong&gt; One &lt;code&gt;friendlyError()&lt;/code&gt; mapper turns every 4xx/5xx/safety/quota/network failure into one short, kid-readable sentence (&lt;em&gt;"Wow, so many questions today! Let's wait a minute and try again."&lt;/em&gt;), while logging the raw exception to a debug ring buffer.&lt;/li&gt;
&lt;/ul&gt;
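
&lt;p&gt;The chain-of-thought cleanup in the first bullet can be sketched roughly as follows. This is a simplified illustration, not the real &lt;code&gt;cleanLunaReply&lt;/code&gt;: the anchor phrases are the ones quoted above, but the names &lt;code&gt;planningAnchors&lt;/code&gt;, &lt;code&gt;isPlanning&lt;/code&gt;, and &lt;code&gt;cleanReplySketch&lt;/code&gt;, the paragraph-splitting rule, and the fallback to the raw text are assumptions.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;
// Anchor phrases quoted in the post; how they are matched here is an assumption.
val planningAnchors = listOf("final polish:", "self-correction:", "let me rewrite")

// A paragraph is treated as planning when it starts with one of the anchors.
fun isPlanning(paragraph: String): Boolean {
    val start = paragraph.trimStart().lowercase()
    return planningAnchors.any { start.startsWith(it) }
}

// Strip asterisk emphasis, split on blank lines, drop planning paragraphs.
// If nothing survives, fall back to the trimmed input rather than an empty reply.
fun cleanReplySketch(raw: String): String =
    raw.replace("*", "")
        .split(Regex("\\n\\s*\\n"))
        .map { it.trim() }
        .filter { it.isNotEmpty() }
        .filterNot { isPlanning(it) }
        .joinToString("\n\n")
        .ifEmpty { raw.trim() }

fun main() {
    val raw = "**Self-Correction:** make it shorter.\n\n" +
        "Let me think of a fun example! A black hole is like a *super strong* vacuum cleaner."
    // prints "Let me think of a fun example! A black hole is like a super strong vacuum cleaner."
    println(cleanReplySketch(raw))
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Matching anchors only at the start of a paragraph is what keeps a legitimate opener such as "Let me think of a fun example!" intact, since it shares a prefix with "Let me rewrite" but never matches it in full.&lt;/p&gt;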




&lt;p&gt;Gemma 4 unlocked something I couldn't have shipped a year ago: a &lt;strong&gt;multimodal, instruction-faithful, locally-routable&lt;/strong&gt; model smart enough to teach a six-year-old about black holes, safe enough to hand to that six-year-old, and efficient enough to be the default tier of a free app.&lt;/p&gt;

&lt;p&gt;Thanks to the DEV team and Google for the challenge!&lt;/p&gt;

</description>
      <category>devchallenge</category>
      <category>gemma</category>
      <category>gemmachallenge</category>
    </item>
    <item>
      <title>BigQuery dynamic SQL and managing temp tables</title>
      <dc:creator>Santhoshkumar. P</dc:creator>
      <pubDate>Fri, 23 Apr 2021 14:13:24 +0000</pubDate>
      <link>https://dev.to/sann3/bigquery-dynamic-sql-and-managing-temp-tables-5c43</link>
      <guid>https://dev.to/sann3/bigquery-dynamic-sql-and-managing-temp-tables-5c43</guid>
      <description>&lt;p&gt;Google has introduced support for dynamic SQL in BigQuery. Developers coming from Oracle in particular will be fond of EXECUTE IMMEDIATE, the statement used to run dynamic SQL queries. BigQuery lacked such a feature for a long time, and now that it is here, I can't wait to use it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Choosing a problem statement
&lt;/h3&gt;

&lt;p&gt;Let's choose a problem that resonates with every developer working with Google BigQuery. Who hasn't noticed the large volume of temporary tables churned out by client drivers working with large datasets? This is particularly true where downstream products implement their own version of a BigQuery driver and fail to leverage nice features like automatic table expiration. The downside is dataset hygiene: these tables stay forever until explicitly cleared. What matters for this blog is that it gives us a problem statement to demonstrate the utility of dynamic SQL.&lt;/p&gt;

&lt;h3&gt;
  
  
  Let's address it using dynamic SQL
&lt;/h3&gt;

&lt;p&gt;Temporary tables do offer the convenience of caching large result sets. With data in a BigQuery dataset changing rapidly, let us target the old temporary tables and remove them from the datasets.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Our primary goal is to clear all temporary tables older than 24 hours.&lt;/li&gt;
&lt;li&gt;Achieving this goal needs some more information: we need to identify when each table was created. This is where BigQuery's INFORMATION_SCHEMA is helpful.&lt;/li&gt;
&lt;li&gt;The last step is that I want this to run every day, without my intervention. Yes, you can schedule SQL statements using the BigQuery scheduled query feature.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To clear temporary tables across all datasets, let's write code that uses dynamic SQL to iterate over all the datasets via INFORMATION_SCHEMA and delete every table whose name starts with temp_table_ and whose creation timestamp is more than a day old. Then schedule that SQL code using the BigQuery scheduled query option. With this, all the temp tables that are older than 1 day should be cleared automatically at a daily cadence.&lt;/p&gt;




</description>
    </item>
    <item>
      <title>Data Platform zones</title>
      <dc:creator>Santhoshkumar. P</dc:creator>
      <pubDate>Wed, 01 Jan 2020 02:43:53 +0000</pubDate>
      <link>https://dev.to/sann3/data-platform-zone-names-a8e</link>
      <guid>https://dev.to/sann3/data-platform-zone-names-a8e</guid>
      <description>&lt;p&gt;I was searching for suitable names for zones in a data platform, and this is what I have so far.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Access zone&lt;/li&gt;
&lt;li&gt;Additional zone&lt;/li&gt;
&lt;li&gt;Analytics zone&lt;/li&gt;
&lt;li&gt;Archive data zone&lt;/li&gt;
&lt;li&gt;Canonical data zone&lt;/li&gt;
&lt;li&gt;Certified zone&lt;/li&gt;
&lt;li&gt;Clean zone&lt;/li&gt;
&lt;li&gt;Cleansing zone&lt;/li&gt;
&lt;li&gt;Consumer zone&lt;/li&gt;
&lt;li&gt;Consumption zone&lt;/li&gt;
&lt;li&gt;Curated zone&lt;/li&gt;
&lt;li&gt;Dev zone&lt;/li&gt;
&lt;li&gt;Exploration zone&lt;/li&gt;
&lt;li&gt;Gold zone&lt;/li&gt;
&lt;li&gt;Insights zone&lt;/li&gt;
&lt;li&gt;Landing zone&lt;/li&gt;
&lt;li&gt;Master data zone&lt;/li&gt;
&lt;li&gt;Operationalization zone&lt;/li&gt;
&lt;li&gt;Persisted zone&lt;/li&gt;
&lt;li&gt;Process zone&lt;/li&gt;
&lt;li&gt;Production zone&lt;/li&gt;
&lt;li&gt;Published zone&lt;/li&gt;
&lt;li&gt;Raw zone&lt;/li&gt;
&lt;li&gt;Refined zone&lt;/li&gt;
&lt;li&gt;Refinery zone&lt;/li&gt;
&lt;li&gt;Reporting zone&lt;/li&gt;
&lt;li&gt;Sandbox zone&lt;/li&gt;
&lt;li&gt;Sensitive zone&lt;/li&gt;
&lt;li&gt;Silver zone&lt;/li&gt;
&lt;li&gt;Staging zone&lt;/li&gt;
&lt;li&gt;Standard zone&lt;/li&gt;
&lt;li&gt;Structured zone&lt;/li&gt;
&lt;li&gt;Temporal zone&lt;/li&gt;
&lt;li&gt;Transformed zone&lt;/li&gt;
&lt;li&gt;Transient zone&lt;/li&gt;
&lt;li&gt;Trusted zone&lt;/li&gt;
&lt;li&gt;User drop zone&lt;/li&gt;
&lt;li&gt;Work zone&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Credits:&lt;/strong&gt;&lt;br&gt;
Public blogs and images. &lt;/p&gt;

</description>
      <category>dataplatform</category>
      <category>zone</category>
      <category>data</category>
      <category>platform</category>
    </item>
  </channel>
</rss>
