In Part 1 of this series, we introduced the architecture of the ASL-to-voice translation system—a five-stage pipeline designed to turn real-time webcam video into spoken English. But a machine learning model is only as good as the data it learns from, and in the world of computer vision, raw video is often too noisy, heavy, and unstructured to be useful directly.
In this article, we dive into the data layer: how we extract meaningful signals from raw video, normalize them for robust inference, and train our temporal sequence model.
The Data Foundation: WLASL and Beyond
To teach a neural network to understand sign language, we need massive amounts of annotated video. The project supports several public datasets:
- WLASL (Word-Level American Sign Language): Contains more than 2,000 signs performed by over 100 signers. We use this as our primary baseline, often starting with a top-50 sign subset for rapid iteration.
- RWTH-PHOENIX-2014T: A dataset of continuous German Sign Language with rich gloss annotations.
- How2Sign: A large-scale, continuous ASL dataset.
We built custom scripts (like scripts/download_wlasl.py) to scrape, organize, and format these datasets automatically, preparing them for the extraction phase.
Stage 1: Keypoint Extraction with MediaPipe
Passing raw RGB frames directly into a temporal model (like a 3D CNN or Vision Transformer) requires massive computational power—usually a high-end GPU. Because our goal is real-time inference on consumer hardware, we take a different approach: Skeletonization.
Using Google's MediaPipe Holistic framework, we process the video frame-by-frame, extracting the 3D coordinates (x, y, z) of specific landmarks on the human body.
In models/keypoint_extractor.py, we construct a dense feature vector for every frame:
- Hands: 21 landmarks per hand × 3 dimensions = 126 dims.
- Pose (Body): 33 landmarks × 4 dimensions (including visibility) = 132 dims.
- Face: The full face mesh is 468 points (1,404 dims), which is often overkill. We provide a configuration toggle to extract just the mouth subset (~20 landmarks = ~60 dims). Mouth shapes are critical for non-manual markers in ASL.
By default, we compress millions of pixels into a highly informative 1,662-dimensional vector per frame (including the full face mesh).
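Concretely, the per-frame assembly looks something like the sketch below. This is not the project's exact code: the `flatten_frame` helper and the ordering of the four parts are our assumptions, but the landmark counts follow MediaPipe Holistic's output, and missing detections (say, a hand out of frame) are zero-filled so every frame has the same length.

```python
import numpy as np

POSE, FACE, HAND = 33, 468, 21
FRAME_DIMS = POSE * 4 + FACE * 3 + 2 * HAND * 3  # 132 + 1404 + 126 = 1662

def flatten_frame(results):
    """Flatten one MediaPipe Holistic result into a fixed 1,662-dim vector."""
    def coords(landmarks, count, dims=3):
        # Zero-fill any part MediaPipe failed to detect.
        if landmarks is None:
            return np.zeros(count * dims, dtype=np.float32)
        fields = ("x", "y", "z", "visibility")[:dims]
        return np.array(
            [getattr(p, f) for p in landmarks.landmark for f in fields],
            dtype=np.float32,
        )

    return np.concatenate([
        coords(results.pose_landmarks, POSE, dims=4),  # body, incl. visibility
        coords(results.face_landmarks, FACE),          # full face mesh
        coords(results.left_hand_landmarks, HAND),
        coords(results.right_hand_landmarks, HAND),
    ])
```

Because the vector length is fixed, downstream code never has to special-case a frame where a hand or the face went undetected.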
The Secret Sauce: Normalization
If the model trains on a person standing in the center of the frame, it will fail if the user stands in the bottom left corner. To solve this, we implemented shoulder-based normalization.
Before the keypoints are saved, we calculate the midpoint between the left and right shoulder landmarks (Pose points 11 and 12). We then translate all other keypoints so that this shoulder midpoint becomes the origin (0,0,0). This makes our data translation-invariant—the model only cares about how the hands and face move relative to the body, not where the body is in the camera frame.
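A minimal sketch of that normalization, assuming the keypoints for a frame are stored as an (N, 3) array whose first 33 rows are the pose landmarks (so rows 11 and 12 are the shoulders):

```python
import numpy as np

L_SHOULDER, R_SHOULDER = 11, 12  # MediaPipe pose landmark indices

def normalize_to_shoulders(points):
    """Translate all keypoints so the shoulder midpoint becomes (0, 0, 0).

    `points` is an (N, 3) array of x/y/z coordinates whose first 33 rows
    are the pose landmarks. The result is translation-invariant.
    """
    midpoint = (points[L_SHOULDER] + points[R_SHOULDER]) / 2.0
    return points - midpoint  # broadcast subtraction over every landmark
```

After this step, the shoulder midpoint maps to the origin no matter where the signer stands in the frame.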
Stage 2: The Temporal Sequence Model
With our videos converted into sequences of normalized 1,662-dimensional vectors, we are ready to train. The core of this system is the Transformer Encoder (defined in models/sequence_model.py).
Why a Transformer? While recurrent networks (like our BiLSTM baseline) handle sequential data well, Transformers excel at modeling long-range dependencies and parallelize beautifully on modern hardware.
Our default architecture (configured via config.yaml):
- Input Projection: A linear layer scales the 1,662-dim input up to the model's hidden dimension (e.g., 256).
- Positional Encoding: Standard sinusoidal encodings are injected so the self-attention mechanism understands the order of time.
- Encoder Blocks: 6 layers of multi-head self-attention (8 heads) allow the model to look at the entire sequence of keypoints and understand the context of the sign.
- CTC Head: A final linear layer projects the hidden state to our vocabulary size, followed by a log-softmax activation.
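Put together, the relevant section of config.yaml might look like this (a sketch: the exact key names are our assumptions, but the values match the defaults described above):

```yaml
sequence_model:
  input_dim: 1662              # keypoint features per frame
  hidden_dim: 256              # input projection target
  num_layers: 6                # Transformer encoder blocks
  num_heads: 8                 # multi-head self-attention
  positional_encoding: sinusoidal
  head: ctc                    # linear projection + log-softmax
```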
Training with CTC Loss
In continuous sign language, we don't know exactly when a sign starts and stops in the video. We just know the video contains the glosses ["HELLO", "WORLD"].
To solve this alignment problem, we train the network using Connectionist Temporal Classification (CTC) loss. CTC allows the model to predict a sequence of tokens from an unsegmented input stream by introducing a special <BLANK> token. The model learns to predict blanks during the transitions between signs, and spikes the probability of a specific sign when it recognizes it.
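The blank-and-collapse rule is easiest to see in code. A minimal sketch of how CTC turns per-frame predictions into glosses (pure Python; we assume blank is index 0): merge consecutive repeats, then drop blanks, so a blank between two identical tokens is what keeps "HELLO HELLO" from collapsing into one "HELLO".

```python
BLANK = 0  # index of the <BLANK> token in the vocabulary

def greedy_ctc_decode(frame_predictions):
    """Collapse a per-frame argmax sequence into a gloss sequence.

    CTC's rule: merge consecutive repeats first, then remove blanks,
    so [H, H, <B>, W] -> [H, W], while [H, <B>, H] stays [H, H].
    """
    decoded, previous = [], None
    for token in frame_predictions:
        if token != previous and token != BLANK:
            decoded.append(token)
        previous = token
    return decoded
```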
Our training script (training/train_sequence.py) uses PyTorch's native nn.CTCLoss(zero_infinity=True), paired with an Adam optimizer, a ReduceLROnPlateau learning rate scheduler, and gradient clipping to stabilize the notoriously unstable CTC training process.
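The loss call itself is compact. Here is a self-contained sketch with random tensors standing in for a real batch (the shape conventions follow nn.CTCLoss: log-probs are time-first, and index 0 is reserved for blank):

```python
import torch
import torch.nn as nn

T, N, C = 60, 4, 51  # timesteps, batch size, vocab (50 glosses + blank)
S = 8                # target gloss sequence length

# Encoder output: per-frame log-probabilities, shaped (T, N, C).
log_probs = torch.randn(T, N, C).log_softmax(dim=-1)

# Targets are gloss indices; 0 is reserved for <BLANK>, so we sample from 1..C-1.
targets = torch.randint(1, C, (N, S))
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), S, dtype=torch.long)

ctc = nn.CTCLoss(blank=0, zero_infinity=True)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
```

The zero_infinity=True flag zeroes out infinite losses (which occur when a target is longer than the input allows) instead of letting them blow up the gradients.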
Measuring Success
During training, standard loss metrics aren't enough. We evaluate our models using Word Error Rate (WER) via the jiwer library. WER measures how many insertions, deletions, and substitutions are required to turn the predicted gloss sequence into the ground truth sequence. The lower the WER, the better the model.
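jiwer handles this for us, but the underlying computation is just word-level edit distance. A minimal sketch (a hypothetical wer helper, not the project's code) on gloss token lists:

```python
def wer(reference, hypothesis):
    """Word Error Rate: (substitutions + insertions + deletions) / len(reference).

    Both arguments are lists of gloss tokens; the edit distance is the
    standard Levenshtein dynamic program over words.
    """
    r, h = reference, hypothesis
    # dp[i][j] = edits to turn the first i reference words into the first j hypothesis words
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[len(r)][len(h)] / len(r)
```

A perfect prediction gives 0.0, and note that WER can exceed 1.0 when the model inserts more glosses than the reference contains.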
Next Steps
Now we have a trained Transformer model capable of taking a sequence of keypoints and spitting out a sequence of gloss probabilities. But how do we do this live, on a webcam, without knowing when the user starts or stops signing?
In Part 3, we will explore the real-time inference loop, the magic of sliding windows, and how we translate robotic glosses into beautiful, spoken English.
Uploaded through Distroblog - a platform I created specifically to post to multiple blog sites at once 😅
