DEV Community: Santhoshkumar. P

Shipping on Gemma 4: chain-of-thought leakage, MoE-vs-Dense, and on-device pragmatism

Santhoshkumar. P — Wed, 20 May 2026 12:42:41 +0000

This is a submission for the Gemma 4 Challenge: Write About Gemma 4

I built and shipped Curio Kid, a kid-safe multimodal Android app where my 6-year-old asks Luna (a Gemma-4-powered tutor) anything by text, voice, or camera. The product story is in my other submission. This post is the engineering writeup — three things about Gemma 4 that I had to actually work around in production, with the code and reasoning behind each fix.

If you're about to ship a Gemma 4 app, these are the three traps I'd want to know about on day one.

1. Chain-of-thought leakage is real, and it hits the user

Gemma 4 is good at following structured system prompts. Too good, sometimes. Give it a strict persona spec and it will occasionally show you the rubric while answering. In my testing, a meaningful slice of responses came back like this:

**Intent:** child wants to know why the sky is blue
**Tone check:** warm, age-5 vocabulary, no jargon
**Final Polish:**

Great question! The sky is blue because…

The "Final Polish:" line is the give-away — Gemma is narrating its own polishing step before giving the answer. For a chatbot aimed at a 6-year-old, that's not a quirk; it's a UX bug.

The naïve fix doesn't work

The obvious instinct is "tell the model not to do this in the system prompt." I tried. My prompt now contains a half-page of "never write section labels like 'Final Polish', 'Self-Correction', 'Reasoning', 'Plan'… your very first word must be part of the actual answer". It helps. It doesn't eliminate.

The reason it can't eliminate the issue: instruction-following is a soft constraint. The same model that's smart enough to follow a 50-line persona spec is also smart enough to think about how to follow it — and sometimes the thinking ends up on the wire.

The real fix: a three-stage cleaner that knows the difference between meta and content

I ended up with a 100-line response sanitiser (LunaAI.kt) that does three things in order:

Stage 1 — Anchor detection. Look for "final answer" / "polished response" / "answer:" anchors on their own line. If present, throw away everything before the last one and keep only what follows. This handles the dominant failure mode (model planned out loud, then gave the real answer at the bottom).

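// Matches a label on its own line, e.g. "Final Polish:", "**Final answer:**", "Answer:".
val finalAnchorLine = Regex(
    "(?im)^\\s*\\*{0,2}\\s*" +
        "(?:final(?:\\s+(?:polish|answer|response|reply|draft|version))?" +
        "|polished(?:\\s+(?:answer|reply|response))?" +
        "|the\\s+answer|answer|response|reply)" +
        // Accept the closing ** on either side of the colon, so "**Final Polish:**" matches too.
        "\\s*\\*{0,2}\\s*:\\s*\\*{0,2}\\s*$",
)
// Keep only what follows the last anchor; if no anchor is found, keep the raw text.
val afterAnchor = finalAnchorLine.findAll(raw).lastOrNull()
    ?.let { raw.substring(it.range.last + 1) } ?: raw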

Stage 2 — Paragraph-level meta filter. Split by blank lines, drop any paragraph that contains chain-of-thought prose cues — phrases like "the prompt says…", "I'll treat the question as…", "drafting…", "let me revise…". This catches the case where the model narrates its process in prose instead of with labels.

Stage 3 — Line-level scrub. A safety net for leaks embedded inside an otherwise-good paragraph: bullet/label lines like * Intent: … or **Tone check:** ….
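
Stages 2 and 3 could be sketched roughly as follows. This is a minimal illustration, not the contents of LunaAI.kt: the cue phrases are the ones quoted above, while the label list (`intent`, `tone check`, `plan`, `reasoning`) and the names `metaCue`, `labelLine`, `dropMetaParagraphs`, and `scrubLabelLines` are assumptions made for the example.

```kotlin
// Illustrative cue phrases only; the real list in the sanitiser may differ.
val metaCue = Regex(
    "(?i)\\b(?:the prompt says|i'll treat the question as|drafting|let me revise)\\b"
)

// Label lines such as "* Intent: ..." or "**Tone check:** ...".
val labelLine = Regex(
    "(?i)^\\s*(?:[*-]\\s+)?\\*{0,2}\\s*(?:intent|tone check|plan|reasoning)\\s*\\*{0,2}\\s*:.*$"
)

// Stage 2: split on blank lines and drop paragraphs that contain a meta cue.
fun dropMetaParagraphs(text: String): String =
    text.split(Regex("\\n\\s*\\n"))
        .filterNot { metaCue.containsMatchIn(it) }
        .joinToString("\n\n")
        .trim()

// Stage 3: drop individual label lines left inside otherwise-good paragraphs.
fun scrubLabelLines(text: String): String =
    text.lines()
        .filterNot { labelLine.matches(it) }
        .joinToString("\n")
        .trim()

fun main() {
    val raw = "The prompt says to keep it short.\n\n" +
        "* Intent: explain rain\nRain falls when clouds get heavy.\n\n" +
        "Let me think of a fun example!"
    println(scrubLabelLines(dropMetaParagraphs(raw)))
}
```

Running the stages in this order means a paragraph that is mostly good but carries one stray label line survives Stage 2 and only loses that line in Stage 3.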

The non-obvious part: what not to filter

The interesting design problem isn't writing the regex; it's making sure you don't kill legitimate content. Three rules I learned the hard way:

Never filter on the word "think" alone. Phrases like "Let me think of a fun example!" are exactly the warm tone you want from a kid's tutor. My meta-regex matches "let me / I'll / I will / I should" only when followed by a planning verb (plan, draft, rewrite, revise, polish, reconsider, interpret). "Let me think of" slips through. Good.
Only apply bullet-stripping when a leak is detected nearby. Gemma sometimes does legitimately produce a bulleted list when the kid asks "give me three facts about pandas". You don't want to scrub bullets unconditionally; you want to scrub them only when other meta-leakage is already visible on the page.
Have a fallback for "scrubbed to nothing". If filtering empties the response, return "Hmm, let me think about that another way — could you ask me again?" — not a blank bubble.
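
The three rules can be expressed in a few lines. The sketch below is an assumption-laden simplification: it works line by line rather than paragraph by paragraph, treats a planning-verb match as the only "leak detected" signal, and uses hypothetical names (`planningCue`, `bulletLine`, `applyGuards`). The fallback text is passed in by the caller.

```kotlin
// "let me / I'll / I will / I should" counts as meta only before a planning verb.
val planningCue = Regex(
    "(?i)\\b(?:let me|i'll|i will|i should)\\s+" +
        "(?:plan|draft|rewrite|revise|polish|reconsider|interpret)\\b"
)

// Any markdown-style bullet line.
val bulletLine = Regex("^\\s*[*-]\\s+.*$")

fun applyGuards(raw: String, fallback: String): String {
    // Rule 2: bullets are scrubbed only if a leak is visible somewhere in the reply.
    val leakDetected = planningCue.containsMatchIn(raw)
    val kept = raw.lines()
        .filterNot { planningCue.containsMatchIn(it) }        // Rule 1
        .filterNot { leakDetected && bulletLine.matches(it) } // Rule 2
    // Rule 3: never return an empty bubble.
    return kept.joinToString("\n").trim().ifEmpty { fallback }
}

fun main() {
    val fallback = "Could you ask me again?" // stand-in for the app's own fallback line
    println(applyGuards("Let me think of a fun example!\n* Pandas eat bamboo.", fallback))
    println(applyGuards("Let me draft a reply.\n* Intent: pandas", fallback))
}
```

The trailing `\b` after the verb list matters: without it, "Let me plant a seed" would be treated as "Let me plan".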

Takeaway

If you're building a user-facing app on Gemma 4 — especially with kids, customer support, or anywhere "the model thinking out loud is bad UX" — assume CoT leakage will happen and ship a sanitiser. A sanitiser is also dramatically cheaper than fine-tuning, and it composes with whichever model variant you swap in next month.

2. MoE vs Dense: how I actually chose between `gemma-4-26b-a4b-it` and `gemma-4-31b-it`

The Gemma 4 family ships three architectures for very different deployment targets:

| Variant | Effective params | Architecture | Where it shines |
| --- | --- | --- | --- |
| E2B / E4B | 2B / 4B | Small dense | Ultra-mobile, edge, browser |
| `gemma-4-26b-a4b-it` | 26B (~4B active) | Mixture-of-Experts | Server-grade chat, multimodal, low-latency |
| `gemma-4-31b-it` | 31B | Dense | Hardest reasoning, multi-step problems |

For a kid-facing multimodal chat app I shipped the 26B MoE as the default and the 31B Dense as an opt-in "thinker" mode. Both have the 256K context window and Apache-2.0 licence, so the choice is purely about latency vs. depth.

Why MoE wins the default slot

The MoE's superpower isn't raw size — it's that only a slice of experts is activated per token. You pay ~4B of compute per token while keeping 26B of capacity available across the network. For my workload that translated into three concrete wins:

First-token latency that feels like chat, not like batch inference. Streaming starts in well under a second on Google AI Studio. A 6-year-old's patience is shorter than the inverse of his curiosity rate, so this matters more than benchmark scores.
Multimodal in the same model. No separate vision pass, no second API call for the image. "What kind of bug is this?" with a photo attached is one request.
256K context lets the Curiosity Digest be a one-shot. End of the day, I cat the whole transcript into a single prompt and ask Gemma to produce a structured digest. No RAG, no map-reduce summarisation. The whole "parent dashboard" feature is ~30 lines of glue because of this.
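
The "one-shot digest" glue can be as small as the sketch below. It only shows the prompt assembly; the `Turn` type, the `buildDigestPrompt` name, and the instruction wording are assumptions for illustration, and the resulting string would be handed to whichever backend call the app uses for summaries.

```kotlin
// Hypothetical transcript entry; the app's real history type is not shown in this post.
data class Turn(val role: String, val text: String)

// Concatenate the whole day's transcript into a single long-context prompt.
fun buildDigestPrompt(turns: List<Turn>): String {
    val transcript = turns.joinToString("\n") { "${it.role}: ${it.text}" }
    return buildString {
        appendLine("Summarise today's questions for a parent.")
        appendLine("Sections: Themes, Highlights, Conversation starters, Anything to flag.")
        appendLine()
        appendLine("Transcript:")
        append(transcript)
    }
}

fun main() {
    val day = listOf(
        Turn("child", "Why is the sky blue?"),
        Turn("luna", "Sunlight scatters in the air."),
    )
    println(buildDigestPrompt(day))
}
```

Because the whole transcript travels in one request, there is no chunking or retrieval step to maintain; the trade-off is that this only works while a day's history stays inside the context window.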

Why Dense earns its own button

For the questions that are genuinely hard — "why do mirrors flip left-and-right but not up-and-down?" is the canonical kid-stumper — the 31B Dense produces noticeably better multi-step reasoning. It's slower and pricier per call, so it's not the right default for "explain photosynthesis in three sentences", but it's the right tool when the kid trips into something philosophical.

The mental model I'd suggest for picking

Forget the parameter count for a second and ask three questions about your workload:

Is latency-to-first-token a UX requirement? → MoE.
Are you doing multimodal in the same call? → MoE (with image input).
Do you measurably gain on the hardest 10% of your prompts when you swap to Dense? → ship both, give users a toggle.

Don't pick on price; pick on what your prompts actually need. The MoE is the right answer for most chat workloads. The Dense is the right answer when you can articulate the reasoning gap.
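
Written down as code, the three questions look something like this. It is a sketch of the decision rule only: the `Workload` and `ModelPlan` types are hypothetical, and the middle branch (Dense as the sole default when neither latency nor multimodal input matters) is an extrapolation that the questions above do not state explicitly.

```kotlin
const val MOE = "gemma-4-26b-a4b-it"
const val DENSE = "gemma-4-31b-it"

// Answers to the three workload questions.
data class Workload(
    val needsFastFirstToken: Boolean,
    val sendsImages: Boolean,
    val denseWinsOnHardPrompts: Boolean, // measured on your own hardest prompts
)

data class ModelPlan(val default: String, val optIn: String?)

fun planModels(w: Workload): ModelPlan = when {
    // Latency or multimodal matters and Dense measurably helps: ship both.
    w.denseWinsOnHardPrompts && (w.needsFastFirstToken || w.sendsImages) -> ModelPlan(MOE, DENSE)
    // Assumption: nothing pulls toward MoE, so Dense can be the default.
    w.denseWinsOnHardPrompts -> ModelPlan(DENSE, null)
    // No articulated reasoning gap: MoE alone.
    else -> ModelPlan(MOE, null)
}

fun main() {
    println(planModels(Workload(needsFastFirstToken = true, sendsImages = true, denseWinsOnHardPrompts = true)))
}
```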

3. On-device pragmatism: cloud-first isn't a cop-out

The most photogenic Gemma 4 demos run E2B on a Pixel or a Raspberry Pi. They're amazing. They're also not the right default for a consumer Android app you want real families to use.

Two realities pushed me cloud-first:

Not every phone can run Gemma 4 locally. A multi-gigabyte model needs the RAM, the storage, and the NPU/GPU to be worth the wait. Older flagships, mid-range phones, and the hand-me-down tablet a kid actually gets to use aren't there yet. Gating an app on "must own a current flagship" defeats the point of an accessible kids' app.
Quality matters more than offline-ness for a 6-year-old. A child being confidently told "the moon is made of cheese" by an under-cooked tiny model is a worse experience than a 2-second wait over Wi-Fi for the 26B MoE.

The architecture trick that lets you defer the choice

What I'd recommend for anyone shipping today: don't pick between cloud and on-device. Pick a backend interface and write three implementations.

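import android.graphics.Bitmap

// ChatTurn is the app's own chat-history type (definition not shown here).
interface LlmBackend {
    // One chat request: persona prompt, prior turns, the new question, and an optional photo.
    suspend fun ask(
        systemPrompt: String,
        history: List<ChatTurn>,
        userText: String,
        image: Bitmap?,
        modelName: String,
    ): String

    // One-shot summary over a raw transcript.
    suspend fun summarise(
        systemPrompt: String,
        rawHistoryText: String,
        modelName: String,
    ): String
}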

In Curio Kid that interface has three implementations: GoogleAiStudioBackend, OpenRouterBackend, and a LocalGemmaBackend stub that throws a friendly "on-device Gemma 4 isn't installed on this phone yet" until a MediaPipe .task file is wired in. Same system prompt, same response cleaner, same UI for all three. The provider is a single enum in EncryptedSharedPreferences and a one-tap toggle in Settings.

The pay-off: when E2B becomes the right default — when phones catch up, when multimodal lands in MediaPipe, when battery cost makes sense — I change one factory method. The persona, the safety prompt, the digest pipeline, the cleaner all carry over. The same kid talking to the same Luna; the model just moved into the phone.
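
The "one factory method" idea can be sketched as below. To keep the example runnable without Android or coroutines, it uses a cut-down, non-suspending `Backend` interface and placeholder implementations; the `Provider`, `EchoBackend`, `LocalStubBackend`, and `backendFor` names are all hypothetical and stand in for the real classes.

```kotlin
// Simplified stand-in for the real backend interface.
interface Backend {
    fun ask(userText: String): String
}

// The single setting that selects a provider.
enum class Provider { GOOGLE_AI_STUDIO, OPEN_ROUTER, LOCAL_GEMMA }

// Placeholder for a cloud backend; a real one would call the provider's API.
class EchoBackend(private val label: String) : Backend {
    override fun ask(userText: String): String = "[$label] $userText"
}

// Stub that fails with a readable message until a local model is wired in.
class LocalStubBackend : Backend {
    override fun ask(userText: String): String =
        throw UnsupportedOperationException("on-device Gemma 4 isn't installed on this phone yet")
}

// The one place that changes when the default provider changes.
fun backendFor(provider: Provider): Backend = when (provider) {
    Provider.GOOGLE_AI_STUDIO -> EchoBackend("ai-studio")
    Provider.OPEN_ROUTER -> EchoBackend("openrouter")
    Provider.LOCAL_GEMMA -> LocalStubBackend()
}

fun main() {
    println(backendFor(Provider.OPEN_ROUTER).ask("Why do ants march in lines?"))
}
```

Everything upstream of `backendFor` (prompt, cleaner, UI) only ever sees the interface, which is what lets the provider move without touching the rest of the app.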

That's the on-device pragmatism: don't bet on offline-first when your users can't run it, but don't lock yourself out of it either. Bet on the abstraction.

Three small SDK gotchas I'd want to have known on day one

While I'm here: three concrete Gemini-SDK-on-Android landmines that cost me an evening each.

The 80-second socket timeout is hard-coded. RequestOptions doesn't expose a knob to change it. If Gemma 4 takes longer than 80 seconds to start emitting tokens, you'll get a SocketTimeoutException even though the model is fine. Fix: use generateContentStream instead of generateContent. The read timer resets on each chunk, so as long as tokens are flowing you never trip the cap.
MAX_TOKENS throws, it doesn't return the partial text. The Kotlin SDK raises ResponseStoppedException from the .text convenience getter when finish reason ≠ STOP. You have to catch it and walk candidates[0].content.parts for TextParts yourself to recover the 90%-complete answer the user nearly got.
A 500 from the upstream model often surfaces as MissingFieldException from kotlinx-serialization. When the Gemini backend has a hiccup it returns JSON that the SDK's strict deserialiser doesn't recognise, and the exception you see is the serialisation failure, not the underlying 500. Worth normalising every error class through a single friendlyError() mapper that walks the cause chain — the real problem is usually two layers down.
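
A cause-chain-walking mapper might look like the sketch below. It is not the app's friendlyError(): the messages are invented, matching on the text "500" is a crude stand-in for however the real mapper recognises an upstream failure, and `causeChain` is a hypothetical helper.

```kotlin
import java.net.SocketTimeoutException

// Collect the exception and its causes, capped so a long chain cannot run away.
fun causeChain(t: Throwable): List<Throwable> =
    generateSequence(t) { it.cause }.take(10).toList()

// Map any failure to one short sentence by inspecting every layer, not just the top one.
fun friendlyError(t: Throwable): String {
    val chain = causeChain(t)
    return when {
        chain.any { it is SocketTimeoutException } ->
            "That took too long. Let's try again!"
        // Assumption: the upstream status code shows up in some layer's message.
        chain.any { it.message?.contains("500") == true } ->
            "Luna is having a little trouble. Let's try again soon!"
        else -> "Oops, something went wrong. Could you ask me again?"
    }
}

fun main() {
    // A serialisation-style wrapper hiding the real upstream failure one layer down.
    val wrapped = RuntimeException("decode failed", IllegalStateException("HTTP 500"))
    println(friendlyError(wrapped))
}
```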

TL;DR

Three lessons from shipping on Gemma 4:

CoT leakage is a UX problem, not a prompt problem. Ship a sanitiser. Be careful what you scrub.
MoE is the right default for chat; Dense is the right tool for hard reasoning. Give users the toggle, pick by latency-and-multimodal vs. reasoning-on-the-hardest-10%.
Cloud-first isn't a cop-out, but architecting for on-device later is non-negotiable. An LlmBackend interface with three implementations buys you the option.

The Gemma 4 family is the first open-model release where I genuinely had to think about which member to ship for which job — that's a great problem to have. If you're building on it, I hope these save you a weekend each.

Code is at github.com/sann3/curio-kid if you want to read the cleaner, the backend interface, or the friendly-error mapper in full. Happy to answer questions in the comments.

Thanks to the DEV team and Google for the challenge!

My 6-year-old asks 400 questions a day. So I built him a Gemma 4 AI tutor.

Santhoshkumar. P — Wed, 20 May 2026 12:26:19 +0000

This is a submission for the Gemma 4 Challenge: Build with Gemma 4

My 6-year-old asks me four hundred questions a day — about clouds, his shadow, whether ants have birthdays. I love it, but I can't always stop what I'm doing, and the usual fallbacks (Google, YouTube, a generic chatbot) are either too dense, too distracting, or too unsafe to hand a small child. Curio Kid is the app I built so my son can keep asking — and actually get warm, kid-friendly answers — without me worrying about what he sees next.

What I Built

Curio Kid is a kid-safe Android app where a child asks anything — by typing, snapping a photo, attaching an image, or just talking — and gets a warm, age-appropriate answer from Luna, an AI tutor powered by Gemma 4. Answers are short on purpose: 2–5 sentences, an everyday analogy (Lego, swings, fruit), and a follow-up question to keep the curiosity loop running.

Designing it for my own kid forced some opinionated choices:

He can't reliably read or type yet, but he can talk and point a camera. Voice and camera are first-class inputs, not afterthoughts.
He will absolutely test the safety rails. Kids ask wild things ("what happens if I drink poison?", "why do people fight in wars?") — Luna has to handle them gracefully every single time.
I want to know what he's curious about, not spy on him. Hence the Curiosity Digest — a daily themed summary, not a chat log.

What makes it more than "yet another chatbot wrapper":

Multimodal input — text, gallery image, live camera, on-device speech-to-text.
Safety as a hard requirement — locked-down system prompt + Gemini safety thresholds pinned to LOW_AND_ABOVE across harassment, hate, sexually explicit, and dangerous content; unsafe topics get a fixed redirect to "a trusted adult."
Parent Dashboard — PIN-gated, with a one-tap Curiosity Digest: themes, highlights with quotes, dinner-table conversation starters, and an "anything to flag?" section.
Privacy-first — API key + PIN in EncryptedSharedPreferences (AES-256); question history in a local Room DB, excluded from cloud backup; the only network call is to the model endpoint with the user's own key.
Three interchangeable Gemma 4 back-ends — not every family phone can host a multi-gigabyte model on-device, so Google AI Studio (default, free tier, multimodal), OpenRouter, and a scaffolded on-device path are all swappable from Settings.
Output cleaning — Gemma 4 sometimes thinks out loud ("Final Polish:", "Let me revise…"); a post-processor strips those leaks so the child only sees the final answer.

Demo

https://raw.githubusercontent.com/sann3/curio-kid/main/demo/Home.png
https://raw.githubusercontent.com/sann3/curio-kid/main/demo/1i.png
https://raw.githubusercontent.com/sann3/curio-kid/main/demo/2i.png
https://raw.githubusercontent.com/sann3/curio-kid/main/demo/3i.png
https://raw.githubusercontent.com/sann3/curio-kid/main/demo/4i.png
https://raw.githubusercontent.com/sann3/curio-kid/main/demo/5i.png
https://raw.githubusercontent.com/sann3/curio-kid/main/demo/6i.png

https://raw.githubusercontent.com/sann3/curio-kid/main/demo/final.mp4

Code

GitHub: github.com/sann3/curio-kid.

How I Used Gemma 4

Curio Kid exposes two Gemma 4 variants in the model picker, and the choice is intentional.

`gemma-4-26b-a4b-it` — 26B Mixture-of-Experts (default)

The daily driver. A kid-facing chat app needs three things at once: multimodal, fast first-token latency, and smart enough to teach. MoE hits all three — only a slice of experts fires per token, so latency feels ~4B-class while depth stays 26B-class. In practice:

A child holding up a beetle to the camera gets an answer in a couple of seconds, not ten.
Streaming starts almost instantly, so chat bubbles fill in live (and incidentally dodge the Gemini SDK's hard-coded 80s socket timeout — Curio Kid uses generateContentStream for exactly this reason).
The 256K context window means the whole day's history fits into a single Curiosity Digest call — no RAG, no summarisation tricks.
Same model handles "Why is the sky blue?" and a photo of a moth.

Dense is overkill for "explain photosynthesis in three sentences"; E2B/E4B don't yet match 31B-class reasoning on the harder "why" questions kids love. MoE is the right middle.

`gemma-4-31b-it` — 31B Dense (optional "thinker" mode)

For genuinely hard questions ("Why do mirrors flip left-and-right but not up-and-down?"). Slower and pricier per call, but noticeably better on multi-step or counterintuitive reasoning. Same persona, same safety, same UI — just a heavier brain when the curiosity warrants it.

Why not E2B / E4B by default?

On-device is fully wired up via MediaPipe LLM Inference — Settings → On-device downloads a vision-capable Gemma 4 .task (resumable, sha256-checked, metered-network aware) and runs it through a process-wide LlmInference singleton with addImage for the camera path. But cloud stays the default because:

Not every phone can run Gemma 4 locally. Multi-GB models need RAM and storage the hand-me-down tablet a kid actually uses doesn't have. Gating first launch behind "Pixel 8 Pro + 1.6 GB cellular download" defeats the point.
Quality > offline for a six-year-old. Being told "the moon is made of cheese" by an under-cooked tiny model is worse than waiting two seconds over Wi-Fi.

So Google AI Studio is the zero-friction default, OpenRouter is the alt-cloud, and on-device is one Settings tap away for capable phones — same LlmBackend interface, same prompts, same cleaner.

Where Gemma 4 actually does the work

The chat. Multimodal (image + history + question) → kid-friendly paragraph. The system prompt is strict (2–5 sentences, analogies, ≤2 emojis, one follow-up, no markdown) and Gemma 4 follows it remarkably well.
Safety reasoning. Instead of a blocklist, Luna reasons about whether a topic is age-appropriate and produces a fixed redirect line — Gemma 4 is instruction-faithful enough to honour a "ONLY reply with this exact sentence" clause while still engaging naturally with the 99% of fine questions.
The Curiosity Digest. Day's transcript → structured markdown summary (themes / highlights / conversation starters / flags) in one shot — long-context + structured-output, no orchestration framework.

Bits I had to engineer around Gemma 4's quirks

Chain-of-thought leakage. Gemma 4 occasionally emits "Final Polish:" / "Self-Correction:" / "Let me rewrite…" before its real answer. cleanLunaReply (LunaAI.kt) detects anchors, drops planning paragraphs, and strips markdown emphasis — without nuking legit phrases like "Let me think of a fun example!".
MAX_TOKENS stops. The Gemini SDK throws ResponseStoppedException instead of returning partial text; I catch it on both one-shot and streaming paths and surface what already arrived.
80s socket timeout. Hard-coded in the Kotlin SDK with no RequestOptions override. Streaming resets the read timer per chunk, so slow first-byte doesn't kill the request.
Friendly errors. One friendlyError() mapper turns every 4xx/5xx/safety/quota/network failure into one short, kid-readable sentence ("Wow, so many questions today! Let's wait a minute and try again."), while logging the raw exception to a debug ring buffer.

Gemma 4 unlocked something I couldn't have shipped a year ago: a multimodal, instruction-faithful, locally-routable model smart enough to teach a six-year-old about black holes, safe enough to hand to that six-year-old, and efficient enough to be the default tier of a free app.

Thanks to the DEV team and Google for the challenge!

BigQuery dynamic SQL and managing temp tables

Santhoshkumar. P — Fri, 23 Apr 2021 14:13:24 +0000

Google introduced support for dynamic SQL in BigQuery. Developers coming from Oracle in particular will be fond of EXECUTE IMMEDIATE, the statement used to run dynamic SQL queries. BigQuery lacked such a feature for a long time, and now that it is here, I can't wait to use it.

Choosing a problem statement

Let's choose a problem that resonates with every developer working with Google BigQuery. Who hasn't noticed the large volume of temporary tables churned out by client drivers in large datasets? This is particularly true where downstream products implement their own version of a BigQuery driver and fail to use nice features like automatic table expiration. The not-so-good part is dataset hygiene: these tables stay forever until explicitly cleared. What matters for this blog is having a problem statement that demonstrates the utility of dynamic SQL.

Let's address it using dynamic SQL

Temporary tables do offer the convenience of caching large result sets. But with data changing rapidly in a BigQuery dataset, let us target the old temporary tables and remove them from the datasets.

Our primary goal is to clear all temporary tables older than 24 hours.
Achieving this goal needs some more information: we need to identify when a table was created. This is where BigQuery's INFORMATION_SCHEMA is helpful.
The last step is scheduling: I want this to run every day without my intervention. Yes, you can schedule SQL statements using the BigQuery scheduled query feature.

To clear temporary tables across all datasets, let's write code employing dynamic SQL: iterate over all the datasets using INFORMATION_SCHEMA and delete each temp table based on its creation timestamp and a name starting with temp_table_. Then schedule the SQL code using the BigQuery scheduled query option. With this, all temp tables older than 1 day should be cleared automatically at a daily cadence.

Data Platform zones

Santhoshkumar. P — Wed, 01 Jan 2020 02:43:53 +0000

I was in search of suitable names for zones in a data platform, and this is what I have until now.

Access zone
Additional zone
Analytics zone
Archive data zone
Canonical data zone
Certified zone
Clean zone
Cleansing zone
Consumer zone
Consumption zone
Curated zone
Dev zone
Exploration zone
Gold zone
Insights zone
Landing zone
Master data zone
Operationalization zone
Persisted zone
Process zone
Production zone
Published zone
Raw zone
Refined zone
Refinery zone
Reporting zone
Sandbox zone
Sensitive zone
Silver zone
Staging zone
Standard zone
Structured zone
Temporal zone
Transformed zone
Transient zone
Trusted zone
User Drop zone
Work zone

Credits:
Public blogs and images.
