This is a submission for the Gemma 4 Challenge: Write About Gemma 4
I built an ASL hand sign interpreter using Gemma 4 locally. Not because I had a perfect plan; I just wanted to see whether a general-purpose vision model could recognise hand signs with nothing but a good prompt. No fine-tuning, no special dataset.
In short, it took way more iteration than I expected.
You have to keep going back and fixing the prompt
My first prompt was simple: just "identify the ASL letter in this image." The model kept confusing similar-looking signs. A and S look nothing alike if you know what to look for, but Gemma 4 didn't know what to look for.
So I had to go back, look up what actually makes each letter different, and write it out explicitly. Things like:
"A vs S: In A the thumb is BESIDE the index finger. In S the thumb is folded OVER the knuckles."
Every time I noticed a pattern of wrong answers, I'd add another rule. It became this back-and-forth loop: run the test, see what fails, go back to the prompt, try again. That's just the reality of working with a model on a specific task it wasn't trained for.
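If it helps to picture that loop, one pass looked roughly like this. I'm sketching it against the ollama Python client; the model tag and the folder-per-letter layout are assumptions, so swap in whatever your local setup uses:

```python
import pathlib

import ollama  # assumes a local Ollama server; any local runtime works similarly

MODEL = "gemma"  # placeholder tag - use whatever your runtime calls the model

def classify(image_path: str, prompt: str) -> str:
    # One multimodal call: the prompt plus a single image; keep the first
    # letter of the reply as the prediction.
    resp = ollama.chat(
        model=MODEL,
        messages=[{"role": "user", "content": prompt, "images": [image_path]}],
    )
    return resp["message"]["content"].strip().upper()[:1]

def run_pass(dataset_dir: str, prompt: str) -> list[tuple[str, str, str]]:
    # Assumes images live at <dataset_dir>/<true letter>/<file>.jpg
    failures = []
    for img in sorted(pathlib.Path(dataset_dir).glob("*/*.jpg")):
        truth = img.parent.name.upper()
        guess = classify(str(img), prompt)
        if guess != truth:
            failures.append((img.name, truth, guess))
    return failures
```

Each pass spits out a failure list; a cluster like repeated A -> S misses becomes the next rule in the prompt.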
The D -> 1 thing
At some point I ran my auto-sorter and it created a folder called 1 inside my test dataset. Turns out Gemma 4 looked at a D sign and decided it was the number 1.
Not entirely wrong. D does have one straight finger pointing up. But it was a good reminder that the model doesn't know the context unless you tell it. I had to add "you are identifying ASL alphabet letters only, A through Z, not numbers" to stop it wandering off.
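I also made the sorter defensive, so one stray answer can't mint a new class folder again. A minimal sketch - the "unsorted" bucket and the folder layout are my own convention, nothing Gemma-specific:

```python
import shutil
import string
from pathlib import Path

VALID_LABELS = set(string.ascii_uppercase)  # ASL alphabet only - no digit folders

def sort_image(image: Path, label: str, out_dir: Path) -> None:
    # Anything outside A-Z (like that surprise "1") goes to a review bucket
    # instead of silently becoming a new class folder.
    bucket = label if label in VALID_LABELS else "unsorted"
    dest = out_dir / bucket
    dest.mkdir(parents=True, exist_ok=True)
    shutil.move(str(image), str(dest / image.name))
```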
Which model size and why
I thought through E2B vs E4B vs 31B Dense before picking.
E2B was too confident in wrong answers - it wouldn't flag uncertainty, just return the wrong letter with high confidence. Not useful.
31B Dense is probably better at this but it won't run on a normal laptop. Defeats the local-first point.
E4B was the right call. Small enough to run locally, but it actually understands "I'm not sure about this one" - it'll return medium or low confidence instead of just guessing. That matters more than people think.
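To actually use that, I ask for the confidence in the reply and route on it. A rough sketch - the "LETTER confidence" reply format is something I request in the prompt, not anything the model emits natively:

```python
def parse_reply(text: str) -> tuple[str, str]:
    # Expects replies like "K high" or "V low" - a format requested in the
    # prompt, not something the model produces on its own.
    parts = text.strip().split()
    letter = parts[0].upper()[:1] if parts else "?"
    confidence = parts[1].lower() if len(parts) > 1 else "unknown"
    return letter, confidence

def accept(letter: str, confidence: str) -> bool:
    # Auto-accept only confident answers; everything else goes to manual review.
    return confidence == "high"
```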
Honest results
After all the iteration, I batch-tested on 66 images:
Overall accuracy: 59.1% (39 of 66)
Letters that hit 100%: B, D, E, F, G, H, K, R, T, U, X, Y, Z
Worst: I, O, P at 0%, V at 22%
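For what it's worth, the per-letter breakdown falls out of a simple tally over (true, predicted) pairs from the batch run. A sketch:

```python
from collections import Counter

def per_letter_accuracy(results: list[tuple[str, str]]) -> dict[str, float]:
    # results: (true letter, predicted letter) pairs from one batch run
    totals, hits = Counter(), Counter()
    for truth, guess in results:
        totals[truth] += 1
        hits[truth] += truth == guess  # bool adds as 0 or 1
    return {letter: hits[letter] / totals[letter] for letter in sorted(totals)}
```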
Not perfect. But this is a general model, no training, just prompting. And I know exactly which letters to focus on next time I run the loop.
What I took away
You don't need a specialised model for everything. Sometimes a good prompt and a few iterations get you further than you expect. But "a few iterations" is doing a lot of work in that sentence. It's genuinely trial and error, and you have to be okay with that.
Also, always check what folders your auto-sorter is creating - haha.