This is a submission for the Gemma 4 Challenge: Write About Gemma 4
I was building Maatru, a Telugu literacy app for kids whose parents can't comfortably teach the script. The original design had photo-feedback as the core interaction: a kid writes a letter on paper, takes a picture, and the app compares the writing to the target and gives feedback. The mechanic felt right. Paper-and-pencil is how kids actually learn handwriting. And Gemma 4's multimodal capability was one of the strongest things in the announcement.
Before I built anything, I tested whether the capability held up for my use case. It didn't. Gemma 4 confidently misread typed Telugu characters as completely different characters. These were clean Unicode glyphs on white backgrounds, the easy case. Across 20 test samples spanning four difficulty tiers, the cloud variant (Gemma 4 31B Dense) got 4 right. The local variant (Gemma 4 E4B) got 1.
The interesting part of this story isn't that the capability failed. It's that I knew within a day, not three weeks into the build. This essay is about the evaluation discipline that produced that knowledge, and why I think anyone planning to build a non-trivial product on Gemma 4 (or any model) should run capability gates before architectural commitment.
To be clear, this isn't a Gemma 4 criticism. The model does many things well. The multilingual generation and function calling ended up powering Maatru's curriculum content and agentic planner. What I learned is that model spec sheets are claims to be tested. The cost of testing those claims early is much smaller than the cost of building on a capability that doesn't deliver.
How I structured the evaluation
A capability gate is an evaluation that answers one question. Does this model capability hold up for my use case, at a quality bar that makes my product viable? For Maatru's photo-feedback, the question was concrete. Can Gemma 4 reliably identify Telugu characters from images that look like what a kid actually produces?
I built the eval around four difficulty tiers. The typed tier was the easy case: five Telugu characters in a clean Unicode font, black on white, no noise. The adult-handwriting tier was five samples I wrote by hand. The child-handwriting tier was samples produced by a kid I knew, with shaky strokes and uneven spacing. The similar-looking tier was characters that share visual structure, the Telugu confusables set that kids commonly mix up. These test whether the model actually distinguishes between similar shapes or just pattern-matches loosely. Five samples per tier, twenty total. Small enough to run in an afternoon, large enough to surface real failure patterns.
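The tier layout above can be pinned down as a small manifest before any model is called. This is a minimal sketch, not the runner used for Maatru: the `Sample` fields, the `samples/` file naming, and the placeholder characters are all assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass

TIERS = ("typed", "adult", "child", "similar")
SAMPLES_PER_TIER = 5


@dataclass(frozen=True)
class Sample:
    tier: str        # one of TIERS
    image_path: str  # hypothetical path to the test image
    expected: str    # ground-truth Telugu character


def build_manifest(labels_by_tier: dict[str, list[str]]) -> list[Sample]:
    """Turn {tier: [expected characters]} into a flat, validated sample list."""
    manifest = []
    for tier in TIERS:
        labels = labels_by_tier.get(tier, [])
        if len(labels) != SAMPLES_PER_TIER:
            raise ValueError(
                f"tier {tier!r} needs {SAMPLES_PER_TIER} samples, got {len(labels)}"
            )
        for i, expected in enumerate(labels, start=1):
            manifest.append(Sample(tier, f"samples/{tier}_{i:02d}.png", expected))
    return manifest


# Placeholder labels for illustration only, not the set used in the essay.
placeholder = ["అ", "ఆ", "ఇ", "ఈ", "ఉ"]
manifest = build_manifest({tier: placeholder for tier in TIERS})
print(len(manifest), Counter(sample.tier for sample in manifest))
```

Validating the per-tier count up front keeps the denominators fixed, so a missing image shows up as an error instead of a quietly inflated accuracy.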
I ran the eval against both Gemma 4 variants I had access to: E4B running locally through Ollama on an M4 MacBook Air with 16GB of RAM, and 31B Dense through OpenRouter's free tier. Testing both variants matters because Gemma 4 spec claims apply to the family, while specific variants may perform differently due to parameter count, distillation, or fine-tuning. My acceptance threshold, decided before running, was at least 80% on the typed and adult tiers combined, and at least 60% on the child tier. If the model couldn't reliably read clean reference glyphs and adult handwriting, the kid-handwriting case was hopeless. Defining the threshold before I saw results was important. Otherwise it's too easy to convince yourself that bad numbers are "actually pretty good."
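One way to make "decided before running" concrete is to write the thresholds down as an immutable value before the first model call. The sketch below assumes per-tier correct and total counts as plain dictionaries; the `Gate` class and `passes` function are hypothetical names, and the counts in the usage lines are invented to exercise the check, not results from the eval.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gate:
    typed_adult_min: float = 0.80  # combined typed + adult accuracy
    child_min: float = 0.60        # child-handwriting accuracy


def passes(gate: Gate, correct: dict[str, int], total: dict[str, int]) -> bool:
    """Return True only if both pre-registered thresholds are met."""
    typed_adult = (correct["typed"] + correct["adult"]) / (
        total["typed"] + total["adult"]
    )
    child = correct["child"] / total["child"]
    return typed_adult >= gate.typed_adult_min and child >= gate.child_min


GATE = Gate()  # fixed before any model call; frozen, so it cannot be edited later

total = {"typed": 5, "adult": 5, "child": 5}
# Invented counts, only to show the check in both directions.
print(passes(GATE, {"typed": 5, "adult": 4, "child": 4}, total))
print(passes(GATE, {"typed": 5, "adult": 4, "child": 2}, total))
```

Freezing the dataclass is a small guard against the temptation described above: the bar cannot be moved after the numbers come in without visibly changing the code.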
What the data showed
The scorecard came back well below threshold. Cloud variant: 1 of 5 typed, 1 of 5 adult handwriting, 0 of 5 child handwriting, 2 of 5 similar-looking. Total 4 of 20, or 20%. Local variant: 1 of 5 typed, 0 of 5 adult, 0 of 5 child, 0 of 5 similar. Total 1 of 20, or 5%. The threshold required 80% on typed+adult combined. The cloud variant gave me 2 of 10 there, or 20%, and the local variant 1 of 10, or 10%.
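The arithmetic behind those percentages is short enough to check in a few lines. This sketch uses only the per-tier counts reported above; the `summarize` helper and the dictionary layout are assumptions for illustration.

```python
TIERS = ("typed", "adult", "child", "similar")
PER_TIER = 5

# Correct counts per tier, as reported in the scorecard above.
reported = {
    "cloud (31B Dense)": {"typed": 1, "adult": 1, "child": 0, "similar": 2},
    "local (E4B)": {"typed": 1, "adult": 0, "child": 0, "similar": 0},
}


def summarize(correct: dict[str, int]) -> dict[str, float]:
    """Overall accuracy plus the two quantities the thresholds refer to."""
    total_correct = sum(correct[tier] for tier in TIERS)
    return {
        "overall": total_correct / (PER_TIER * len(TIERS)),
        "typed_adult": (correct["typed"] + correct["adult"]) / (2 * PER_TIER),
        "child": correct["child"] / PER_TIER,
    }


for name, correct in reported.items():
    s = summarize(correct)
    print(
        f"{name}: overall {s['overall']:.0%}, "
        f"typed+adult {s['typed_adult']:.0%}, child {s['child']:.0%}"
    )
# cloud (31B Dense): overall 20%, typed+adult 20%, child 0%
# local (E4B): overall 5%, typed+adult 10%, child 0%
```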
The failure mode mattered more than the aggregate number. I'd expected child-handwriting to be hard. Kids' strokes are messy, spacing is unpredictable, letter shapes drift from the canonical form. What I didn't expect was that the typed reference tier would also fail. The clearest example: Gemma 4 confidently misread అ, a basic Telugu vowel rendered cleanly on a white background, as ౦ (a Telugu numeral) on the cloud variant, and as ని (a consonant-vowel syllable, not a standalone vowel) on the local variant. Same input image, two completely different wrong outputs from two independent inference paths. And neither model hedged. Both committed to wrong answers with no caveats.
That confidence without hedging is what made the failure mode dangerous for the product I was planning. A model that says "I think this might be అ but I'm uncertain" is a model you can build around. You handle the uncertainty in the UX. A model that says "this is ని" when it's actually అ ships a bug straight into a literacy tool for kids. Different wrong answers to the same image from two independent inference paths point to a model knowledge gap in a specific subdomain, Telugu script recognition in Gemma 4's vision training distribution, rather than a fixable engineering bug.
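The distinction between "both paths wrong in the same way" and "both wrong in different ways" can be recorded per sample. The sketch below is one possible way to label it, assuming each variant's reply has already been reduced to a single string; the `classify` function and its label names are hypothetical. NFC normalization is included because the same visible text can be encoded with different code point sequences.

```python
import unicodedata


def norm(text: str) -> str:
    """NFC-normalize and trim so canonically equivalent strings compare equal."""
    return unicodedata.normalize("NFC", text.strip())


def classify(expected: str, cloud: str, local: str) -> str:
    """Label how two independent predictions relate to the ground truth."""
    e, c, l = norm(expected), norm(cloud), norm(local)
    if c == e and l == e:
        return "both_correct"
    if c == e or l == e:
        return "one_correct"
    return "same_wrong" if c == l else "different_wrong"


# The example from the essay: అ read as ౦ by cloud and as ని by local.
print(classify("అ", "౦", "ని"))  # different_wrong
```

A run dominated by `same_wrong` would suggest a shared cause worth debugging in the pipeline, while `different_wrong` on clean inputs fits the knowledge-gap reading described above.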
Ruling out the alternatives
Before accepting "the model can't reliably do this" as the explanation, I worked through the alternatives. The first candidate was image encoding. Maybe the test images weren't reaching the model correctly. I ruled this out because the cloud variant got 4 of 20 right, including 2 of the similar-looking pairs. That means the model was receiving and processing visual input correctly. An encoding bug would produce uniform failure across all 20 samples, not the spread of failures I actually saw.
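The encoding hypothesis can also be checked directly rather than only inferred from the spread of results. The essay does not say how the images were encoded, so this sketch assumes PNG files sent as base64 data URLs, and simply confirms that the payload decodes back to the original bytes. The function names are hypothetical and the image bytes in the usage line are a stand-in, not a real test image.

```python
import base64

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # first eight bytes of every PNG file


def to_data_url(image_bytes: bytes) -> str:
    """Encode raw PNG bytes as a base64 data URL (assumed transport format)."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:image/png;base64,{encoded}"


def round_trips(image_bytes: bytes) -> bool:
    """True if the data URL decodes back to the same bytes and looks like a PNG."""
    header, _, payload = to_data_url(image_bytes).partition(",")
    decoded = base64.b64decode(payload)
    return (
        header == "data:image/png;base64"
        and decoded == image_bytes
        and decoded.startswith(PNG_SIGNATURE)
    )


# Stand-in bytes for illustration; a real check would read the sample file.
fake_png = PNG_SIGNATURE + b"stand-in image body"
print(round_trips(fake_png))
```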
The second candidate was API throttling or silent quality degradation. Maybe OpenRouter's free tier downgrades model output under load. I considered this unlikely, since a throttled provider would normally return rate-limit errors rather than silently worse answers, but I couldn't verify that from the outside. The stronger evidence was the local run. The E4B variant runs entirely on-device with no external dependency, and it produced even worse results than cloud. If an unthrottled local run fails in the same way as a potentially-throttled cloud run, throttling is very unlikely to be the cause.
Cold-start effects were the third candidate. Maybe the model needed warm-up calls before producing reliable output. I'd added warm-up calls to the runner before the timed loop. Latencies stabilized after warm-up. But the accuracy problem persisted across all 20 samples. Cold-start affects latency, not accuracy.
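The warm-up-then-measure structure described above might look like the following. It is a sketch under assumptions: `predict` stands for whatever function sends an image to the model and returns its answer, and the stub used here always returns the same character so the example runs without any model.

```python
import time
from typing import Callable


def run_eval(
    predict: Callable[[str], str],
    samples: list[tuple[str, str]],
    warmup: int = 2,
) -> list[dict]:
    """Call predict a few times untimed, then time and score every sample."""
    for _ in range(warmup):
        predict(samples[0][0])  # warm-up result is discarded

    rows = []
    for image_path, expected in samples:
        start = time.perf_counter()
        predicted = predict(image_path)
        latency = time.perf_counter() - start
        rows.append(
            {"image": image_path, "correct": predicted == expected, "latency_s": latency}
        )
    return rows


# Stub model for illustration: always answers "అ" and records each call.
calls = []


def stub_predict(image_path: str) -> str:
    calls.append(image_path)
    return "అ"


rows = run_eval(stub_predict, [("a.png", "అ"), ("b.png", "ఆ")])
print(len(calls), [row["correct"] for row in rows])
```

Keeping latency and correctness in the same row makes it easy to see that the two move independently: latency can settle after warm-up while the accuracy column stays unchanged.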
What remained was the conclusion the data supported. Gemma 4's vision capability does not currently read Telugu script reliably enough to be the foundation of a literacy product. The most plausible explanation is a training-data limitation rather than a fixable engineering bug: Indic scripts are likely underrepresented in the visual training corpus relative to Latin scripts. That probably isn't specific to Gemma 4. Many large models show a similar gap in visual recognition of non-Latin scripts. But it does mean any product built on Gemma 4 vision for Indic scripts (or other underrepresented visual writing systems) inherits this limitation.
What I'd tell another developer
The discipline that made this story end in a one-day pivot instead of a three-week wasted build is what I think of as capability-gate-first engineering. Before you commit to an architecture that depends on a specific model capability working at a specific quality bar, build the smallest possible evaluation that tests that exact capability under conditions that approximate your actual use case. Set the threshold before you see results. Run it on day one, not week three. If the model passes, you've validated a key assumption. If it fails, you've saved yourself weeks of building on something that won't hold up.
For developers planning to build with Gemma 4 on non-English use cases: don't trust the spec sheet's "multimodal" or "multilingual" claims as proxies for your specific language or script. Test the exact capability you need on the exact data shape you'll see in production. For my use case that meant 20 Telugu samples across 4 difficulty tiers. For yours it might be 50 Arabic OCR samples, or 30 medical-image inferences, or 100 Hindi-handwriting captures. The shape of the eval depends on the use case. The discipline of running it before architectural commitment doesn't. Maatru shipped because the day-one capability gate told me to redesign around what Gemma 4 actually does well, multilingual text generation and agentic reasoning, rather than what the announcement promised. Honestly, that one day of testing saved the project.