Ngawang Tenzin

I built a real-time ASL interpreter for the Gemma 4 challenge, no cloud needed

This is a submission for the Gemma 4 Challenge: Build with Gemma 4

What I Built

A real-time American Sign Language (ASL) alphabet interpreter that runs 100% on your own machine - no API key, no cloud, no subscription. Hold up your hand signing a letter in front of your webcam, hit capture, and Gemma 4 tells you which letter it thinks it sees, how confident it is, and what it notices about your hand position.

The pipeline design is:
Webcam -> MediaPipe (hand detection + cropping only) -> Gemma 4 (the hard part: ASL recognition) -> Result

Repo + setup: https://github.com/cbms26/asl-interpreter
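
To make the "Gemma 4 does the hard work" step concrete, here is a minimal sketch of the recognition call against a local Ollama server. It uses Ollama's standard /api/generate endpoint; the model tag is the one I use in this post, and ASL_PROMPT is a stand-in for the full prompt described further down.

    import base64
    import requests

    OLLAMA_URL = "http://localhost:11434/api/generate"  # default local Ollama endpoint
    MODEL = "gemma4:e4b"                                 # the model tag used in this project

    ASL_PROMPT = (
        "The image shows a single hand signing one ASL alphabet letter. "
        "Reply with the letter, a confidence level, and what you notice about the hand."
    )

    def recognize(image_bytes: bytes) -> str:
        """Send one cropped hand image to the local Gemma model and return its answer."""
        payload = {
            "model": MODEL,
            "prompt": ASL_PROMPT,
            "images": [base64.b64encode(image_bytes).decode("ascii")],
            "stream": False,
        }
        resp = requests.post(OLLAMA_URL, json=payload, timeout=120)
        resp.raise_for_status()
        return resp.json()["response"]

Everything talks to localhost, so the captured frames never leave your machine.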

Why this, and why gemma4:e4b

Most ASL recognition tools out there either need a cloud API or rely on a dedicated sign-language model pre-trained specifically on ASL datasets. Both approaches have real trade-offs: cloud APIs mean your hand gestures are going to someone else's server, and a purpose-built model means you're stuck with whatever letters it was trained on.
I wanted to see how well a general-purpose vision model like Gemma 4 could
handle ASL recognition just by being given a precise description of what each letter looks like. No fine-tuning, no dataset, just programmatic prompting. And everything had to run locally, because privacy matters (especially when your webcam is involved).

I went with gemma4:e4b specifically. E2B was too small to handle the
subtle differences between similar letters. The 31B Dense model is great
but won't run on most people's laptops. E4B hit the sweet spot — capable
enough to reason about fine hand shapes, light enough to run locally.

The one design decision worth explaining

The obvious approach is sending the raw webcam frame to Gemma 4 and asking
it to identify the sign. That works, but accuracy is noticeably worse
because the model splits its attention between the hand, the background,
the face, the lighting, and everything else in frame.

The fix was adding MediaPipe as a pre-processing step. It runs in the
browser at ~15fps, detects the hand, and crops a tight (512×512) square
around it. That cropped image - hand only, nothing else - is what Gemma 4
actually sees.
MediaPipe handles where the hand is. Gemma 4 handles what letter it is. Separating those two jobs made the biggest accuracy difference of anything I tried.
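
If you want to reproduce the cropping step, here is a rough Python sketch of the same idea. The app itself runs the browser (JavaScript) build of MediaPipe Hands, so treat the landmark-to-crop logic below as an illustration rather than the exact code; the padding factor is made up, and only the 512×512 output size matches what Gemma 4 receives.

    import cv2
    import mediapipe as mp

    hands = mp.solutions.hands.Hands(static_image_mode=True, max_num_hands=1)

    def crop_hand(frame_bgr, pad=0.25, size=512):
        """Detect one hand and return a tight square crop around it, or None."""
        h, w = frame_bgr.shape[:2]
        result = hands.process(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
        if not result.multi_hand_landmarks:
            return None
        lm = result.multi_hand_landmarks[0].landmark
        xs = [p.x * w for p in lm]
        ys = [p.y * h for p in lm]
        # Square box around all 21 landmarks, padded so fingertips aren't clipped.
        cx, cy = (min(xs) + max(xs)) / 2, (min(ys) + max(ys)) / 2
        half = max(max(xs) - min(xs), max(ys) - min(ys)) * (0.5 + pad)
        x0, y0 = int(max(cx - half, 0)), int(max(cy - half, 0))
        x1, y1 = int(min(cx + half, w)), int(min(cy + half, h))
        return cv2.resize(frame_bgr[y0:y1, x0:x1], (size, size))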

Prompt engineering did real work here

Gemma 4 had no ASL training. To make up for that, I wrote precise descriptions of all 26 letters and explicit disambiguation rules for the pairs the model kept confusing:

"A vs S: In A the thumb is BESIDE the index finger. In S the thumb is
folded OVER the knuckles."

"M vs N vs T: All involve fingers folded over the thumb.
M = 3 fingers, N = 2 fingers, T = thumb between index and middle."

Writing those out forced me to actually understand the signs properly.
And it worked - the confused-pair accuracy improved significantly once
the model had explicit rules to fall back on. No fine-tuning needed.
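
Mechanically, the prompt is just string assembly. Here is a simplified sketch; the letter descriptions are abbreviated stand-ins for the full 26-letter guide in the repo, and the output-format line is illustrative.

    # Abbreviated examples; the real prompt describes all 26 letters.
    LETTER_GUIDE = {
        "A": "Fist, thumb resting BESIDE the index finger.",
        "S": "Fist, thumb folded OVER the front of the knuckles.",
        "M": "Three fingers folded down over the thumb.",
        "N": "Two fingers folded down over the thumb.",
        "T": "Thumb tucked between the index and middle fingers.",
    }

    DISAMBIGUATION_RULES = [
        "A vs S: in A the thumb is BESIDE the index finger; in S it is folded OVER the knuckles.",
        "M vs N vs T: M = 3 fingers over the thumb, N = 2 fingers, T = thumb between index and middle.",
    ]

    def build_prompt():
        lines = ["Identify the ASL alphabet letter shown by the single hand in this image.",
                 "Letter reference:"]
        lines += [f"- {letter}: {desc}" for letter, desc in LETTER_GUIDE.items()]
        lines.append("If two letters look similar, apply these rules:")
        lines += [f"- {rule}" for rule in DISAMBIGUATION_RULES]
        lines.append("Answer with: the letter, a confidence (high/medium/low), and a one-line reason.")
        return "\n".join(lines)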

How I measured it (not just vibes)

I didn't just demo the app and call it done. I built a proper evaluation
pipeline:

  1. Extract frames from an ASL tutorial video
  2. Let Gemma 4 auto-sort them into test_signs/<LETTER>/ folders
  3. Run batch_tester.py to measure per-letter accuracy against ground truth (a stripped-down version is sketched below)
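
The accuracy measurement itself is nothing fancy. Below is a stripped-down version of what batch_tester.py computes, assuming the folder layout above; recognize() is the Ollama call sketched earlier, and the answer parsing is deliberately simplified.

    from pathlib import Path
    from collections import defaultdict

    def batch_accuracy(root="test_signs"):
        """Walk test_signs/<LETTER>/ folders and report per-letter accuracy."""
        correct, total = defaultdict(int), defaultdict(int)
        for img_path in sorted(Path(root).glob("*/*.jpg")):
            truth = img_path.parent.name.upper()       # folder name is the ground truth
            answer = recognize(img_path.read_bytes())  # Ollama call from the earlier sketch
            predicted = answer.strip().upper()[:1]     # simplified: take the first character
            total[truth] += 1
            correct[truth] += int(predicted == truth)
        for letter in sorted(total):
            print(f"{letter}: {correct[letter]}/{total[letter]}")
        print(f"Overall: {sum(correct.values()) / sum(total.values()):.1%}")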

Every interaction — webcam captures, batch runs, model sorts — gets
appended to a timestamped CSV log. The webcam UI also has a thumbs-up/thumbs-down feedback button. If the model is wrong, you pick the correct letter from an A-Z picker so the log captures what it got wrong, not just that it got it wrong.
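
The logging is equally plain: one row per interaction, appended to a CSV. The column names here are illustrative rather than the exact schema of my feedback_log.csv.

    import csv
    from datetime import datetime
    from pathlib import Path

    LOG_PATH = Path("feedback_log.csv")
    FIELDS = ["timestamp", "source", "predicted", "actual", "correct"]

    def log_result(source, predicted, actual):
        """Append one row; 'actual' comes from the A-Z picker on a thumbs-down."""
        new_file = not LOG_PATH.exists()
        with LOG_PATH.open("a", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=FIELDS)
            if new_file:
                writer.writeheader()
            writer.writerow({
                "timestamp": datetime.now().isoformat(timespec="seconds"),
                "source": source,  # "webcam" or "batch"
                "predicted": predicted,
                "actual": actual,
                "correct": predicted == actual,
            })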

Results so far:
A folder named "1" even got created because the model read the letter D sign as the digit 1 - interesting.

Baseline accuracy report from batch_tester.py
Worst-performing letters

Summary report:

  • Total images: 66
  • Overall: 59.1%
  • Best letters: B, D, E, F, G, H, K, R, T, U, X, Y, Z -> 100%
  • Worst letters: I (0%), O (0%), P (0%), V (22%)

During live webcam testing, the built-in feedback captured 15 sessions - 3 correct (thumbs up) and 12 incorrect (thumbs down with the actual letter logged). Common webcam misidentifications included O predicted instead of A, C, or E, and blank/low-confidence responses on letters like D and F.
Live webcam feedback_log.csv

What surprised me

The hardest letters aren't the ones I expected. J and Z both involve
motion in real ASL - J traces an arc, Z traces a Z shape in the air.
In a still image they look almost identical to I and A. Gemma 4 actually
handles this well by flagging them as low confidence rather than
guessing confidently wrong. That's the honest answer.
Clean input also mattered more than prompt length. The biggest accuracy jump came from cropping with MediaPipe and clear background, not from making the prompt more detailed.

Demo

Short demo walkthrough (goes to YouTube)

It was really fun to take on this challenge. Thank you DEV community!
