SignSpeak
Real-Time Sign Language Translation with MediaPipe, Flutter, and Gemini Nano
Build with AI - 2026 | Noor

SignSpeak is a mobile application designed to bridge the communication gap for the Deaf and Hard-of-Hearing community. By combining on-device computer vision with an on-device large language model, it translates American Sign Language (ASL) gestures and custom hand signs into natural, complete English sentences in real time - entirely offline, with no data sent to any server. Its defining feature is a fully customizable vocabulary: new signs, names, or even full phrase shortcuts can be added by any user without retraining from scratch.
1. MediaPipe: The Digital Skeleton of Gesture Recognition
What is MediaPipe?
MediaPipe is an open-source framework developed by Google for building multimodal, on-device machine learning pipelines. It provides production-ready solutions for tasks such as face detection, pose estimation, and - most relevant here - hand landmark detection. MediaPipe is designed to run efficiently on mobile hardware without requiring a network connection, making it a natural fit for a privacy-first application.
In SignSpeak, the Hand Landmarker solution is used. It detects and tracks the 3D coordinates of 21 individual points (landmarks) on a human hand from a single camera frame. These points cover every joint of every finger plus the wrist, providing a rich geometric representation of any hand shape or pose.
The 21 Hand Landmarks
Each landmark is expressed as a normalised (x, y, z) coordinate. The x and y values are normalised to the range [0, 1] by the image width and height respectively, and z represents depth relative to the wrist (more negative values are closer to the camera). The 21 points are distributed as follows:
| Index | Landmark | Description |
|---|---|---|
| 0 | WRIST | Base anchor point for all normalisation |
| 1–4 | THUMB | CMC → MCP → IP → TIP |
| 5–8 | INDEX FINGER | MCP → PIP → DIP → TIP |
| 9–12 | MIDDLE FINGER | MCP → PIP → DIP → TIP |
| 13–16 | RING FINGER | MCP → PIP → DIP → TIP |
| 17–20 | PINKY FINGER | MCP → PIP → DIP → TIP |
Landmark 0 (WRIST) is the anchor for all normalisation. Landmark 9 (MIDDLE MCP) defines the span.
How MediaPipe is Used in SignSpeak
Since sign language frequently involves both hands, SignSpeak configures MediaPipe to track up to two hands simultaneously. For each detected hand, the (x, y, z) values of all 21 landmarks are extracted, producing a raw vector of 63 values per hand. With two hands, this gives 126 raw values per frame.
To make the model robust to differences in hand size, camera distance, and position within the frame, these coordinates are normalised before being used as model input:
- Wrist-relative: Each coordinate is shifted by subtracting the wrist position (landmark 0), so the wrist is always at the origin (0, 0, 0).
- Scale-normalised: Each shifted coordinate is divided by the Euclidean distance between the wrist and the middle finger MCP joint (landmark 9). This "span" acts as a scale factor, so the same sign made close to or far from the camera produces an identical feature vector.
The formula in Python is span = np.linalg.norm(mid_mcp - wrist). In Dart (Flutter), the equivalent must use sqrt(dx*dx + dy*dy + dz*dz) - not the squared distance - to match the training data exactly.
⚠️ Critical: Feature Extraction Must Match Exactly
The single most common cause of poor accuracy when moving from Python to Flutter is a mismatch in normalisation. The Python training pipeline uses np.linalg.norm (Euclidean distance, i.e. with the square root). The Flutter inference code must use the identical formula. Using squared distance instead will produce a completely different scale and render the model unreliable.
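To make this concrete, here is a minimal NumPy sketch of the normalisation described above. The function name extract_features and the (21, 3) input shape are illustrative rather than the exact code in collect.py; the 1e-6 epsilon guards against division by zero, as in the Dart version shown later:

```python
import numpy as np

def extract_features(hands: list) -> np.ndarray:
    """hands: up to two (21, 3) arrays of raw MediaPipe landmarks.
    Returns a flat 126-value vector (assumption: the second half
    stays zero when only one hand is visible)."""
    out = np.zeros(126, dtype=np.float32)
    for h, lm in enumerate(hands[:2]):
        wrist = lm[0]                            # landmark 0: anchor point
        span = np.linalg.norm(lm[9] - wrist)     # landmark 9: middle MCP
        out[h * 63:(h + 1) * 63] = ((lm - wrist) / (span + 1e-6)).ravel()
    return out
```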
2. Gemini Nano: Bringing Intelligence to the Edge
What is Gemini Nano?
Gemini Nano is the smallest and most efficient model in Google's Gemini family, purpose-built for on-device inference. Unlike cloud-based LLMs, Gemini Nano runs entirely on the device's hardware using Android's AICore runtime, requiring no internet connection and sending no user data off-device. It is available on supported devices including the Pixel 8 series and Samsung Galaxy S24 series.
Why an LLM for Sign Language?
ASL has a grammatically different structure from spoken English. Signers routinely omit articles (a, the), auxiliary verbs (is, are, was), and tense markers. A direct transcription of detected signs would produce telegraphic output such as "WATER NEED" or "NAME MY". Gemini Nano's role in SignSpeak is to act as an interpreter: it receives the raw sign tokens and a context-aware prompt, and returns a fluent English sentence (because LLMs are just next-word predictors, am I right? :D). The prompt is designed with few-shot examples:
You are a sign language interpreter. Convert ASL sign tokens into
natural fluent English sentences.
WATER NEED → I need some water please.
NAME MY NOOR → My name is Noor.
HELP ME PLEASE → Could you please help me?
Output ONLY the final sentence. No explanation. Under 15 words.
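For illustration, a few-shot prompt like this can be assembled with plain string formatting before being handed to whichever backend is available. The names below (FEW_SHOT_PROMPT, build_prompt) are a hypothetical sketch, not the exact code in SignSpeak:

```python
FEW_SHOT_PROMPT = """You are a sign language interpreter. Convert ASL sign tokens into
natural fluent English sentences.
WATER NEED → I need some water please.
NAME MY NOOR → My name is Noor.
HELP ME PLEASE → Could you please help me?
Output ONLY the final sentence. No explanation. Under 15 words."""

def build_prompt(tokens: list) -> str:
    """tokens: detected sign labels in order, e.g. ['hello', 'my', 'name', 'noor']."""
    return f"{FEW_SHOT_PROMPT}\n{' '.join(t.upper() for t in tokens)} →"

print(build_prompt(['water', 'need']))  # ends with "WATER NEED →" for the model to complete
```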
How to Install and Enable Gemini Nano on Android
Gemini Nano is accessed via the AICore system service. Enabling it requires the following steps on a supported device:
- Update AICore: Open the Google Play Store and search for "AICore". Ensure it is updated to the latest version.
- Enable Developer Options: Go to Settings → About Phone and tap Build Number seven times until Developer Mode is enabled.
- Enable Gemini Nano: In Settings → System → Developer Options, scroll to find "Gemini Nano" or search for "AICore Settings". Toggle on "Enable Gemini Nano" and "Enable On-Device Model".
- Wait for model download: The on-device model downloads silently in the background. This can take 10–15 minutes on Wi-Fi. The device must be charging and connected to Wi-Fi. The model is not available until this download completes.
- Verify availability: Connect the device via USB and run:
adb shell cmd aicore status
The output should confirm the model is downloaded and AICore is running. In SignSpeak, Gemini Nano is accessed via a Flutter MethodChannel that bridges to the Android-native AICore SDK. If the on-device model is unavailable, the app transparently falls back to the Gemini Flash API (cloud) or a rule-based sentence assembler.
3. The Python Pipeline: From Camera to Trained Model
The gesture recognition model is built using a three-stage Python pipeline located in the custom_model/ directory. All three scripts share the same MediaPipe feature extraction logic to ensure consistency between training data and inference.
collect.py - The Data Gatherer
Purpose
collect.py uses your PC webcam and MediaPipe to capture hand landmark data for each sign you want to recognise. Instead of recording raw video or images, it records only the 126 normalised landmark coordinates per frame - this is both compact and already in the exact format the model needs.
How to Use
Install dependencies:
pip install mediapipe opencv-python numpy
Run the script from the custom_model/ directory:
python collect.py
The webcam window opens showing a live hand skeleton overlay. For each sign in the SIGNS list:
- Get into position with your hand(s) in the correct gesture.
- Press SPACE. A 5-second countdown begins - use this time to settle into the pose.
- The script automatically captures 10 frames at 0.5-second intervals. A white border flashes on each capture. The terminal prints each shot as OK or MISSED. (A simplified sketch of this capture loop appears after the note below.)
- When you have enough samples for a sign, press N to advance to the next one.
- Press Q at any time to quit and save all collected data so far.
ℹ️ Two-handed signs: Signs marked as TWO_HANDED (e.g. WELCOME, HELP, MORE) display a purple [TWO-HANDED] label. The "Hands:" counter in the top-right turns green only when the required number of hands is detected. For two-handed signs, ensure both hands are fully visible in the frame before pressing SPACE.
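For orientation, here is a heavily simplified sketch of what such a capture loop can look like. It uses the legacy mp.solutions.hands Python API and borrows extract_features from the normalisation sketch earlier; the real collect.py also draws the skeleton overlay, the countdown, and the status bar:

```python
import csv, time
import cv2
import mediapipe as mp
import numpy as np

hands = mp.solutions.hands.Hands(max_num_hands=2)
cap = cv2.VideoCapture(0)
sign_idx, rows = 0, []

while True:
    ok, frame = cap.read()
    if not ok:
        break
    cv2.imshow('collect', frame)
    key = cv2.waitKey(1) & 0xFF
    if key == ord('q'):
        break
    if key == ord('n'):
        sign_idx += 1                      # advance to the next sign in SIGNS
    if key == ord(' '):
        time.sleep(5)                      # settle-in countdown
        for _ in range(10):                # burst: 10 shots, 0.5 s apart
            ok, frame = cap.read()
            res = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            if res.multi_hand_landmarks:
                lms = [np.array([[p.x, p.y, p.z] for p in h.landmark])
                       for h in res.multi_hand_landmarks]
                rows.append([sign_idx, *extract_features(lms)])
                print('OK')
            else:
                print('MISSED')
            time.sleep(0.5)

with open('data/landmarks.csv', 'a', newline='') as f:
    csv.writer(f).writerows(rows)          # append; existing rows are preserved
```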
Customising the Sign Vocabulary
The list of signs to collect is defined at the top of the script:
```python
SIGNS = [
    'none',     # always required as the negative class
    'hello',
    'my',
    'name',
    'noor',     # a custom sign - any gesture you define
    'welcome',
    # add more here...
]
```
'none' must always be first. It is the negative class - the gesture the model predicts when no recognisable sign is being made. Every other sign can be any word, name, or phrase you choose. The gesture itself is entirely up to you: pick any distinct, static hand shape for each token.
Adding New Signs Without Recollecting Everything
The full pipeline is on GitHub if you want to try the gesture recognition yourself: https://github.com/Y-Noor/bwai-2026-custom-model-mediapipe
One of the most practical features of this pipeline is the ability to add new signs incrementally. Because the training data is stored as a plain CSV file, you can append rows for new signs without discarding the data you have already collected.
To add a new sign to an existing dataset:
- Add the new sign to the SIGNS list in collect.py. Place it at the end, after all existing signs. Do not reorder or remove any existing entries - the label index (position in the list) must stay the same.
- Set the starting index: at the bottom of the script, set sign_idx to the index of your new sign so collection jumps straight to it without re-running existing signs.
- Run collect.py: new rows will be appended to the existing data/landmarks.csv. All previous sign data is preserved.
- Update labels.txt: the script automatically rewrites data/labels.txt with the full updated list.
- Retrain: run python train.py. It reads the updated CSV, detects all present classes, and trains a new model with the expanded vocabulary.
ℹ️ Why you must retrain even for one new sign: The output layer of the MLP is sized to the number of classes. Adding a new sign changes that count, so a new model must be exported. However, all the previously collected data remains valid and is automatically included - you only spend time collecting samples for the new sign.
Files Generated by collect.py
- data/landmarks.csv - one row per captured frame, containing a label index followed by 126 normalised landmark coordinates.
- data/labels.txt - the ordered list of sign names, one per line. The line number (zero-indexed) is the label ID used in the CSV.
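A quick way to sanity-check the collected data is to load both files and count the samples per sign. This sketch assumes the CSV has no header row, matching the format described above:

```python
import pandas as pd

df = pd.read_csv('data/landmarks.csv', header=None)   # col 0: label, cols 1-126: features
labels = open('data/labels.txt').read().splitlines()

assert df.shape[1] == 1 + 126, 'expected a label plus 126 features per row'
print(df[0].map(lambda i: labels[int(i)]).value_counts())  # sample count per sign
```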
train.py - The Brain Builder
Purpose
train.py reads the collected CSV, trains a Multi-Layer Perceptron (MLP) neural network using TensorFlow, and exports the result as a quantised TFLite model ready for mobile deployment.
How to Use
Install dependencies:
pip install tensorflow scikit-learn pandas matplotlib
Run:
python train.py
Training typically completes in 1–3 minutes. EarlyStopping monitors validation accuracy and halts training once the model stops improving, so you do not need to manually tune the number of epochs.
Model Architecture
The classifier is a four-layer MLP. The input is the 126-value normalised feature vector; the output is a softmax probability distribution over the number of sign classes:
Input: 126 features
↓
Dense 128 + BatchNorm + Dropout(0.3)
↓
Dense 64 + BatchNorm + Dropout(0.2)
↓
Dense 32
↓
Dense N + Softmax (N = number of sign classes)
BatchNorm stabilises training between layers. Dropout prevents the network from over-relying on any single feature, which matters because many signs share similar finger configurations. The final Softmax layer outputs a confidence percentage for each class.
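In Keras, the architecture above maps to a few lines. The sketch below is consistent with the description but not necessarily the exact train.py; in particular the EarlyStopping patience, the train/test split, and the dynamic-range quantisation settings are assumptions:

```python
import pandas as pd
import tensorflow as tf
from sklearn.model_selection import train_test_split

# Load the collected data: column 0 is the label, columns 1-126 are the features
df = pd.read_csv('data/landmarks.csv', header=None)
X = df.iloc[:, 1:].values.astype('float32')
y = df.iloc[:, 0].values.astype('int32')
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
num_classes = int(y.max()) + 1

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(126,)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(num_classes, activation='softmax'),
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(X_train, y_train, validation_split=0.2, epochs=200,
          callbacks=[tf.keras.callbacks.EarlyStopping(
              monitor='val_accuracy', patience=15, restore_best_weights=True)])

# Export with dynamic-range quantisation (weights stored as INT8, float I/O)
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
open('exported_model/gesture_classifier.tflite', 'wb').write(converter.convert())
```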
Files Generated by train.py
- exported_model/gesture_classifier.tflite - the trained model, quantised to INT8 (~45 KB). This is the file loaded by Flutter.
- exported_model/gesture_labels.txt - the label list corresponding to the model's output indices. Must be copied to Flutter alongside the .tflite file.
- exported_model/training_curves.png - accuracy and loss curves across epochs for both training and validation sets.
- exported_model/confusion_matrix.png - a grid showing true vs. predicted labels across the test set. Signs on the diagonal are correctly classified; off-diagonal entries reveal confusions between visually similar signs.
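For reference, a confusion matrix like this one can be produced with scikit-learn and matplotlib; model, X_test, and y_test carry over from the training sketch above, so treat this as illustrative rather than the exact plotting code:

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import ConfusionMatrixDisplay

labels = open('exported_model/gesture_labels.txt').read().splitlines()
y_pred = np.argmax(model.predict(X_test), axis=1)   # most probable class per sample
ConfusionMatrixDisplay.from_predictions(y_test, y_pred, display_labels=labels)
plt.savefig('exported_model/confusion_matrix.png')
```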
test_model.py - The Real-World Check
Purpose
test_model.py lets you verify the trained model against your own live webcam before deploying to a phone. It is the fastest way to catch issues with specific signs before going through the Flutter build cycle.
How to Use
Install dependencies:
pip install tflite-runtime opencv-python
Run:
python test_model.py
The script opens your webcam, runs hand landmark detection, and performs inference on every frame. The display shows:
- The predicted sign label in large text at the bottom of the frame, greyed out when confidence is below the 75% threshold.
- A confidence percentage next to the label.
- A ranked bar chart of the top 3 predictions on the right side - useful for diagnosing which signs are being confused with each other.
- A live FPS counter and hand count in the top bar.
Target: Aim for consistent confident predictions (≥85%) on all signs before moving the model to Flutter. If two signs are frequently confused, check the confusion matrix from train.py and collect additional samples for those specific classes.
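The per-frame inference step itself is small. Below is a minimal sketch with tflite-runtime, assuming the model keeps float32 input/output (dynamic-range quantisation does) and that features is the 126-value vector from the normalisation sketch earlier:

```python
import numpy as np
from tflite_runtime.interpreter import Interpreter

interpreter = Interpreter(model_path='exported_model/gesture_classifier.tflite')
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]
labels = open('exported_model/gesture_labels.txt').read().splitlines()

def predict_top3(features: np.ndarray) -> list:
    """features: the 126-value normalised vector for one frame."""
    interpreter.set_tensor(inp['index'], features.reshape(1, 126).astype(np.float32))
    interpreter.invoke()
    probs = interpreter.get_tensor(out['index'])[0]
    top3 = np.argsort(probs)[::-1][:3]                 # highest probability first
    return [(labels[i], float(probs[i])) for i in top3]
```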
4. Importing the Trained Model into Flutter
Once test_model.py confirms the model performs well, it can be integrated into the Flutter app in five steps.
Step 1: Copy Model Files into the Project
copy exported_model\gesture_classifier.tflite signspeak\assets\models\
copy exported_model\gesture_labels.txt signspeak\assets\models\
Step 2: Register Assets in pubspec.yaml
Both files must be declared in pubspec.yaml so the Flutter build system bundles them into the app:
```yaml
flutter:
  uses-material-design: true
  assets:
    - assets/models/
```
Step 3: Add the tflite_flutter Dependency
```yaml
dependencies:
  tflite_flutter: ^0.12.1
```
Then run flutter pub get.
Step 4: Load and Run the Model
The TfliteGestureClassifier service handles loading, feature extraction, and inference:
```dart
// Requires: import 'dart:math'; import 'package:flutter/services.dart';
// import 'package:tflite_flutter/tflite_flutter.dart';

// Initialise once at app startup
_interpreter = await Interpreter.fromAsset(
  'assets/models/gesture_classifier.tflite',
  options: InterpreterOptions()..threads = 2,
);
final labelData = await rootBundle.loadString(
  'assets/models/gesture_labels.txt',
);
_labels = labelData.trim().split('\n');
```
At inference time, the raw MediaPipe landmarks are normalised and passed to the interpreter:
```dart
// Build the 126-feature input vector
final input = [extractFeatures(allHands).toList()]; // [[f0..f125]]
final output = List.generate(1, (_) => List.filled(_labels.length, 0.0));
_interpreter.run(input, output);

// Map the highest-probability index to its label
final probs = output[0];
final conf = probs.reduce(max); // max from dart:math
final maxIdx = probs.indexOf(conf);
final label = _labels[maxIdx];
```
Step 5: Normalise Identically to collect.py
The feature extraction in Dart must be byte-for-byte equivalent to the Python normalisation. The critical rule is to use the square root (Euclidean norm), not the squared distance:
```dart
// Dart - must match collect.py exactly
final dx = lm[9].x - wristX;
final dy = lm[9].y - wristY;
final dz = lm[9].z - wristZ;
final span = sqrt(dx * dx + dy * dy + dz * dz) + 1e-6; // dart:math sqrt
for (int i = 0; i < 21; i++) {
  // Wrist-relative, span-normalised coordinates for all 21 landmarks
  out[offset + i * 3 + 0] = (lm[i].x - wristX) / span;
  out[offset + i * 3 + 1] = (lm[i].y - wristY) / span;
  out[offset + i * 3 + 2] = (lm[i].z - wristZ) / span;
}
```
✅ The app automatically detects the model. If gesture_classifier.tflite is present in assets/models/, TfliteGestureClassifier loads it on startup and the app switches from the built-in rule-based classifier to the trained model. The debug console prints TFLite classifier loaded: N classes to confirm. If the file is absent, the app silently falls back to geometry rules.
SignSpeak - Build with AI 2026
MediaPipe · Gemini Nano · TensorFlow Lite · Flutter

