This is a submission for the Gemma 4 Challenge: Build with Gemma 4
What I Built
Gemma-San is an offline AI tutor for children aged 5–12, built for Nigerian kids first but speaking whatever language the child speaks. It runs Google's Gemma 4 E2B model fully on-device on a sub-$200 Android phone — no cloud, no data plan after the one-time setup.
A child taps the mic, talks (English, Pidgin, Yoruba, Hausa, Igbo — even Japanese), and Gemma-San responds with voice, illustrations, and patience. It knows when to ask a Socratic question and when to just explain. It remembers the child across sessions. It quizzes them on past lessons. It draws simple SVG diagrams when the topic isn't already in its illustration library.
Built with Flutter + flutter_gemma 0.15.1 + Whisper.cpp + sqflite. Targets 4–6 GB RAM Android phones like the Tecno Spark 10 and Infinix Hot 30 — the phones African kids actually share with their families.
The problem it solves: millions of kids in low-bandwidth regions have brilliant questions and no one to ask. Cloud tutors fail when the network does. Gemma-San lives inside the phone.
Core features:
- Voice in (Whisper tiny on-device) → Gemma 4 → voice out (Android TTS)
- Six native function-calling tools: socratic_teach, direct_teach, encourage, remember, try_drawing, show_illustration
- 22 hand-built SVG illustrations + an on-the-fly SVG drawing fallback
- Five-question quizzes and phonics practice with a spaced-repetition scheduler
- A three-tier memory system (working window + cross-session compaction + long-term facts) so the tutor knows the child
- Multilingual mirroring — the model replies in whatever language the child used, never imposes one
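The post does not say which spaced-repetition algorithm the quiz scheduler uses. As one plausible shape, here is a minimal Leitner-box sketch in plain Dart. `ReviewCard`, `boxIntervalDays`, `reschedule`, and `pickQuiz` are hypothetical names, and the day intervals are assumed values, not the app's.

```dart
import 'dart:math';

/// Hypothetical review card; field names are illustrative, not from the app.
class ReviewCard {
  final String topic;
  final int box; // 0 = newest, higher = better known
  final DateTime dueAt;

  const ReviewCard(this.topic, this.box, this.dueAt);
}

/// Assumed review gap per Leitner box, in days.
const List<int> boxIntervalDays = [1, 2, 4, 7, 14];

/// Returns the card's next state after one quiz answer.
/// A correct answer promotes the card one box (capped at the last box);
/// a wrong answer sends it back to box 0 so it comes up again soon.
ReviewCard reschedule(
  ReviewCard card, {
  required bool correct,
  required DateTime now,
}) {
  final nextBox = correct ? min(card.box + 1, boxIntervalDays.length - 1) : 0;
  final dueAt = now.add(Duration(days: boxIntervalDays[nextBox]));
  return ReviewCard(card.topic, nextBox, dueAt);
}

/// Picks up to [count] cards that are already due, oldest due date first.
List<ReviewCard> pickQuiz(
  List<ReviewCard> cards,
  DateTime now, {
  int count = 5,
}) {
  final due = cards.where((c) => !c.dueAt.isAfter(now)).toList()
    ..sort((a, b) => a.dueAt.compareTo(b.dueAt));
  return due.take(count).toList();
}
```

The default `count` of 5 mirrors the five-question quizzes mentioned above. Rows like these could be stored in sqflite, though the schema is not described in the post.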
Demo
Code
Gemma-San
Offline AI tutor for Nigerian children, powered by Gemma 4 on-device.
Hackathon submission — deadline May 18, 2026.
What It Is
Gemma-San is a native Android app that acts as a patient, voice-first tutor for Nigerian children aged 5–12. It runs Google's Gemma 4 E2B model entirely on-device using flutter_gemma, so it works in classrooms with no internet after the initial model download. The tutor speaks in Nigerian Pidgin or English, uses the phone camera to annotate physical objects, and adapts to each child via a local memory system.
Why
Millions of Nigerian children lack access to quality, personalized tutoring. Gemma-San brings Socratic and direct-teach pedagogy to a $100 Android phone, offline, in the child's own language.
Architecture
┌─────────────────────────────────────────────┐
│                 Flutter UI                  │
│   (Riverpod state, feature-first widgets)   │
└────────────────────┬────────────────────────┘
                     │
        ┌────────────▼────────────┐
        │   Domain / Use Cases    │
        └────────────┬────────────┘
                     │
   ┌─────────────────┼──────────────────┐
   │                 │                  │
┌──▼──┐         ┌────▼────┐        ┌────▼────┐
│sqflite│       │flutter_…
How I Used Gemma 4
Model: Gemma 4 E2B (litert-community/gemma-4-E2B-it.litertlm).
Why E2B and not E4B or 31B Dense: E4B added roughly 1.5 GB to disk and tripped the OOM killer on my 8 GB target device once Whisper.cpp loaded alongside it. 31B Dense was never an option: these are phones, not workstations. E2B's 2 billion effective parameters fit in about 2.4 GB on disk, run at ~25–35 tok/s on the GPU backend via LiteRT-LM, and leave headroom for STT/TTS/UI. The quality-vs-fit tradeoff was the entire bet.
Native function calling — six tools, no plain text
Every model reply is a function call. There is no free-text path. The system prompt has a Khanmigo-style decision tree the model walks top-down on every turn (visual request → illustration if exact match else SVG → direct fact → Socratic ladder → escalate after 2 IDKs). Side effect: the UI gets a structured TutorResponse with a mode field, so I render a different colored mode pill for each teaching style.
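To make the "structured TutorResponse with a mode field" concrete, here is a minimal sketch of turning one tool call into a response object. The six tool names and `language_code` come from the text above. The `TutorMode` enum, the `text` argument name, and the `en-NG` default are assumptions for illustration, not the app's actual schema.

```dart
/// Teaching modes the UI can colour differently. The mapping from tool name
/// to mode is an assumption for this sketch.
enum TutorMode { socratic, direct, encourage, memory, drawing, illustration }

const Map<String, TutorMode> modeByTool = {
  'socratic_teach': TutorMode.socratic,
  'direct_teach': TutorMode.direct,
  'encourage': TutorMode.encourage,
  'remember': TutorMode.memory,
  'try_drawing': TutorMode.drawing,
  'show_illustration': TutorMode.illustration,
};

class TutorResponse {
  final TutorMode mode;
  final String text;
  final String languageCode;

  const TutorResponse(this.mode, this.text, this.languageCode);
}

/// Converts a parsed tool call into a [TutorResponse].
/// Returns null when the tool name is unknown (for example a hallucinated
/// tool) or when the hypothetical `text` argument is missing or blank.
TutorResponse? toTutorResponse(String toolName, Map<String, Object?> args) {
  final mode = modeByTool[toolName];
  final text = args['text'];
  if (mode == null || text is! String || text.trim().isEmpty) return null;
  final lang = args['language_code'];
  // Assumed default when the model omits the language tag.
  return TutorResponse(mode, text, lang is String ? lang : 'en-NG');
}
```

A null result is the signal a caller would treat as a parse failure, so a hallucinated tool name never reaches the UI as a mode pill.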
Selective thinking mode (the biggest discovery)
Gemma 4 supports thinking traces via enableThinking: true. I assumed "more thinking = better answers" and switched it on everywhere. That backfired.
Where thinking helps: routing across 6 tools. Without it (and with default topK=1), once Gemma's first JSON token commits, it can't recover.
Where thinking hurts: structured-output composition. My lesson-summary path enabled thinking and the model started spending its whole token budget on <|channel>thought|> traces, never reaching the actual lesson_summary function call. Raw thinking JSON leaked into the cached summary and rendered to kids as {"role":"assistant","channels":{"thought":"Thinking"}}.... Beautiful.
I now apply thinking per-path:
| Path | Thinking | Reason |
|---|---|---|
| generate() chat | ON | 6-tool routing benefits from reasoning |
| generateWithImage() | ON | Vision + routing |
| generateLessonSummary() | OFF | Single tool, pure composition |
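One way to keep that per-path choice in a single place is a small lookup, sketched below. `TutorPath`, `thinkingByPath`, and `enableThinkingFor` are hypothetical names. Only the ON/OFF values come from the table.

```dart
/// The three generation paths from the table above.
enum TutorPath { chat, chatWithImage, lessonSummary }

/// Thinking is on only where the model has to choose between tools.
const Map<TutorPath, bool> thinkingByPath = {
  TutorPath.chat: true, // generate(): 6-tool routing
  TutorPath.chatWithImage: true, // generateWithImage(): vision + routing
  TutorPath.lessonSummary: false, // generateLessonSummary(): single tool
};

/// Falls back to "off" if a path is ever missing from the map, since a
/// stray thinking trace is the more visible failure for a child.
bool enableThinkingFor(TutorPath path) => thinkingByPath[path] ?? false;
```

Each call site would then pass `enableThinkingFor(path)` instead of a literal, so a new path cannot silently inherit thinking.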
And I always pass explicit sampling — the defaults are a trap for small models:
```dart
final session = await model.createSession(
  tools: tools,
  systemInstruction: sysPrompt,
  enableThinking: true,
  temperature: 0.4, // not 0.8 default — too noisy for tool selection
  topK: 40, // not 1 default — pure greedy locks in wrong tools
  topP: 0.9,
  randomSeed: 1,
);
```
Compressing the prompt for a 2B model
My first system prompt was 1,450 tokens — a beautifully written teaching ladder. Gemma E2B's attention horizon couldn't hold it together with 6 tool schemas and conversation history. It misrouted, hallucinated tool names, and sometimes emitted plain text.
Rewriting it as a flat ~440-token decision tree with two worked examples (one Socratic, one direct-teach with the child speaking Pidgin and the model mirroring it back) took tool-call accuracy from "frustrating" to "actually usable."
Multilingual mirroring
Hard rule in the prompt: whatever language the child uses, the model uses back. English in, English out. Pidgin in, Pidgin out. Yoruba, Hausa, Igbo, Japanese — match it. The language_code BCP-47 field on every tool call propagates to Android TTS so the voice matches too (Google TTS preferred when installed; en-NG > en-GB > en-US fallback).
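The voice-selection step can be sketched as a pure function over the set of installed voice tags. This is an illustration only: `pickVoice` and `englishFallback` are hypothetical names, the installed tags are assumed to come from whatever TTS plugin the app uses, and the rule of falling back to an English voice for a non-English reply is an assumption about how the en-NG > en-GB > en-US order is applied.

```dart
/// Preferred order for English voices, taken from the text above.
const List<String> englishFallback = ['en-NG', 'en-GB', 'en-US'];

/// Normalises tags like `en_NG` and `EN-ng` to `en-ng` for comparison.
String _norm(String tag) => tag.replaceAll('_', '-').toLowerCase();

/// Picks an installed voice tag for a BCP-47 [languageCode].
/// Returns the tag exactly as it appears in [installed], or null if
/// nothing usable is installed.
String? pickVoice(String languageCode, Set<String> installed) {
  final byNorm = {for (final tag in installed) _norm(tag): tag};
  final wanted = _norm(languageCode);
  final primary = wanted.split('-').first;

  // 1. Exact tag, then (for English requests) the preferred English order.
  final preferred = [
    wanted,
    if (primary == 'en') ...englishFallback.map(_norm),
  ];
  for (final tag in preferred) {
    final hit = byNorm[tag];
    if (hit != null) return hit;
  }

  // 2. Any installed voice that shares the primary language subtag.
  for (final entry in byNorm.entries) {
    if (entry.key.split('-').first == primary) return entry.value;
  }

  // 3. Last resort: an English voice in the preferred order.
  for (final tag in englishFallback) {
    final hit = byNorm[_norm(tag)];
    if (hit != null) return hit;
  }
  return null;
}
```

For example, `pickVoice('en', {'en-US', 'en-GB'})` picks `en-GB`, because it ranks above `en-US` in the fallback order.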
Recovery loop, never an error
When parsing fails (rare, but real on a 2B model), Gemma-San retries once with a nudge appended — "Respond by calling exactly ONE function. No plain text." — and falls back to a graceful direct_teach if the retry also fails. The child never sees an error message.
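The retry-then-fallback flow can be sketched independently of the model API. In the sketch below, `Generate` is a stand-in for whatever function returns one parsed tool call (or null when parsing fails). The map keys and the fallback sentence are hypothetical, while the nudge string is quoted from the text above.

```dart
/// Stand-in for the model call: returns a parsed tool call, or null when
/// the output could not be parsed. The real signature is an assumption.
typedef Generate = Future<Map<String, Object?>?> Function(String prompt);

const String nudge = 'Respond by calling exactly ONE function. No plain text.';

/// Tries once, retries once with the nudge appended, then returns a safe
/// direct_teach call so the caller always has something to render.
Future<Map<String, Object?>> generateWithRecovery(
  String prompt,
  Generate generate,
) async {
  Future<Map<String, Object?>?> attempt(String p) async {
    try {
      return await generate(p);
    } catch (_) {
      // Treat any thrown error the same as unparseable output.
      return null;
    }
  }

  final first = await attempt(prompt);
  if (first != null) return first;

  final second = await attempt('$prompt\n\n$nudge');
  if (second != null) return second;

  // Hypothetical fallback payload; the app's real wording is not given.
  return {
    'tool': 'direct_teach',
    'text': "Let's look at that together.",
    'language_code': 'en',
  };
}
```

Because the function never throws and never returns null, the UI layer has no error branch to show to the child.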
Lessons Learned
- Thinking mode is a knife, not a hammer. ON for routing. OFF for composition. Mixing them in the same session causes JSON leaks and budget exhaustion.
- topK=1 + thinking is the worst sampling combo on a small model. Pure greedy decoding can't recover from a bad first token even when the thinking trace has already considered alternatives. Bump topK to ~40 and temperature to ~0.4 for tool calling.
- Compress your prompt brutally. A 2B model with 6 tool schemas + working memory + a 1,450-token system prompt has nothing left for attention. A flat Khanmigo-style decision tree with one or two worked examples beats verbose pedagogy every time.
- Validate, don't reject. My SVG validator auto-fixes the model's malformed viewBox, injects missing fill attributes, and only rejects what truly can't render. Treat the model as a fragile partner, not a black box.
- Mirror, don't impose. Telling the model "respond in Pidgin" forced bad output when kids spoke English. Telling it "mirror the child's language" produced fluent code-switching across five Nigerian languages out of the box.
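The "validate, don't reject" idea can be illustrated with a small regex-based repair pass. This is a sketch under stated assumptions, not the app's validator: `repairSvg` is a hypothetical name, the default viewBox and fill values are invented for the example, and only a handful of shape elements are covered.

```dart
/// Assumed defaults for this sketch; the app's real values are not given.
const String defaultViewBox = '0 0 100 100';
const String defaultFill = '#333333';

final RegExp _svgOpen = RegExp(r'<svg\b[^>]*>', caseSensitive: false);
final RegExp _viewBox = RegExp(r'''viewBox\s*=\s*(["'])(.*?)\1''');
final RegExp _fourNumbers =
    RegExp(r'^\s*-?\d+(\.\d+)?([\s,]+-?\d+(\.\d+)?){3}\s*$');
final RegExp _shape =
    RegExp(r'<(path|circle|rect|ellipse|polygon)\b([^>]*?)(/?)>');
final RegExp _hasFill = RegExp(r'\bfill\s*=');

/// Returns a repaired SVG string, or null when the input has no usable
/// <svg> ... </svg> wrapper and so cannot render at all.
String? repairSvg(String raw) {
  final open = _svgOpen.firstMatch(raw);
  if (open == null || !raw.contains('</svg>')) return null;

  // Fix the viewBox on the opening tag: add it if missing, replace it if
  // it is not four numbers.
  var tag = open.group(0)!;
  final vb = _viewBox.firstMatch(tag);
  if (vb == null) {
    tag = tag.replaceRange(0, 4, '<svg viewBox="$defaultViewBox"');
  } else if (!_fourNumbers.hasMatch(vb.group(2)!)) {
    tag = tag.replaceRange(vb.start, vb.end, 'viewBox="$defaultViewBox"');
  }
  final svg = raw.replaceRange(open.start, open.end, tag);

  // Inject a fill attribute on shapes that do not declare one.
  return svg.replaceAllMapped(_shape, (m) {
    final attrs = m.group(2)!;
    if (_hasFill.hasMatch(attrs)) return m.group(0)!;
    return '<${m.group(1)}$attrs fill="$defaultFill"${m.group(3)}>';
  });
}
```

Regexes are enough for a sketch like this, but a production validator would more likely parse the markup with an XML library so that nested or unusual attributes are handled correctly.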
Built for the Gemma 4 Challenge. Feedback welcome — especially from anyone shipping on-device LLMs for low-resource contexts.