This is a submission for the Gemma 4 Challenge: Write About Gemma 4
I built an ASL hand sign interpreter using Gemma 4 locally. Not because I had a perfect plan; I just wanted to see whether a general-purpose vision model could recognise hand signs with nothing but a good prompt. No fine-tuning, no special dataset.
In short, it took way more iteration than I expected.
You have to keep going back and fixing the prompt
My first prompt was simple: just "identify the ASL letter in this image." The model kept confusing similar-looking signs. A and S look nothing alike if you know what to look for, but Gemma 4 didn't know what to look for.
So I had to go back, look up what actually makes each letter different, and write it out explicitly. Things like:
"A vs S: In A the thumb is BESIDE the index finger. In S the thumb is folded OVER the knuckles."
Every time I noticed a pattern of wrong answers, I'd add another rule. It became this back-and-forth loop: run the test, see what fails, go back to the prompt, try again. That's just the reality of working with a model on a specific task it wasn't trained for.
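If it helps to picture that loop, one pass looked roughly like this. I'm sketching it against the ollama Python client; the model tag and the folder-per-letter layout are assumptions, so swap in whatever your local setup uses:

```python
import pathlib

import ollama  # assumes a local Ollama server; any local runtime works similarly

MODEL = "gemma"  # placeholder tag - use whatever your runtime calls the model

def classify(image_path: str, prompt: str) -> str:
    # One multimodal call: the prompt plus a single image; keep the first
    # letter of the reply as the prediction.
    resp = ollama.chat(
        model=MODEL,
        messages=[{"role": "user", "content": prompt, "images": [image_path]}],
    )
    return resp["message"]["content"].strip().upper()[:1]

def run_pass(dataset_dir: str, prompt: str) -> list[tuple[str, str, str]]:
    # Assumes images live at <dataset_dir>/<true letter>/<file>.jpg
    failures = []
    for img in sorted(pathlib.Path(dataset_dir).glob("*/*.jpg")):
        truth = img.parent.name.upper()
        guess = classify(str(img), prompt)
        if guess != truth:
            failures.append((img.name, truth, guess))
    return failures
```

Each pass spits out a failure list; a cluster like repeated A -> S misses becomes the next rule in the prompt.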
The D -> 1 thing
At some point I ran my auto-sorter and it created a folder called 1 inside my test dataset. Turns out Gemma 4 looked at a D sign and decided it was the number 1.
Not entirely wrong. D does have one straight finger pointing up. But it was a good reminder that the model doesn't know the context unless you tell it. I had to add "you are identifying ASL alphabet letters only, A through Z, not numbers" to stop it wandering off.
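I also made the sorter defensive, so one stray answer can't mint a new class folder again. A minimal sketch - the "unsorted" bucket and the folder layout are my own convention, nothing Gemma-specific:

```python
import shutil
import string
from pathlib import Path

VALID_LABELS = set(string.ascii_uppercase)  # ASL alphabet only - no digit folders

def sort_image(image: Path, label: str, out_dir: Path) -> None:
    # Anything outside A-Z (like that surprise "1") goes to a review bucket
    # instead of silently becoming a new class folder.
    bucket = label if label in VALID_LABELS else "unsorted"
    dest = out_dir / bucket
    dest.mkdir(parents=True, exist_ok=True)
    shutil.move(str(image), str(dest / image.name))
```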
Which model size and why
I thought through E2B vs E4B vs 31B Dense before picking.
E2B was too confident in wrong answers - it wouldn't flag uncertainty, just return the wrong letter with high confidence. Not useful.
31B Dense is probably better at this but it won't run on a normal laptop. Defeats the local-first point.
E4B was the right call. Small enough to run locally, but it actually understands "I'm not sure about this one" - it'll return medium or low confidence instead of just guessing. That matters more than people think.
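To actually use that, I ask for the confidence in the reply and route on it. A rough sketch - the "LETTER confidence" reply format is something I request in the prompt, not anything the model emits natively:

```python
def parse_reply(text: str) -> tuple[str, str]:
    # Expects replies like "K high" or "V low" - a format requested in the
    # prompt, not something the model produces on its own.
    parts = text.strip().split()
    letter = parts[0].upper()[:1] if parts else "?"
    confidence = parts[1].lower() if len(parts) > 1 else "unknown"
    return letter, confidence

def accept(letter: str, confidence: str) -> bool:
    # Auto-accept only confident answers; everything else goes to manual review.
    return confidence == "high"
```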
Honest results
After all the iteration, I batch-tested on 66 images:
Overall accuracy: 59.1% (39 of 66)
Letters that hit 100%: B, D, E, F, G, H, K, R, T, U, X, Y, Z
Worst: I, O, P at 0%, V at 22%
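For what it's worth, the per-letter breakdown falls out of a simple tally over (true, predicted) pairs from the batch run. A sketch:

```python
from collections import Counter

def per_letter_accuracy(results: list[tuple[str, str]]) -> dict[str, float]:
    # results: (true letter, predicted letter) pairs from one batch run
    totals, hits = Counter(), Counter()
    for truth, guess in results:
        totals[truth] += 1
        hits[truth] += truth == guess  # bool adds as 0 or 1
    return {letter: hits[letter] / totals[letter] for letter in sorted(totals)}
```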
Not perfect. But this is a general model, no training, just prompting. And I know exactly which letters to focus on next time I run the loop.
What I took away
You don't need a specialised model for everything. Sometimes a good prompt and a few iterations get you further than you expect. But "a few iterations" is doing a lot of work in that sentence. It's genuinely trial and error, and you have to be okay with that.
Also, always check what folders your auto-sorter is creating - haha.