Bright Etornam Sunu

Bringing it to Life: The Real-Time Inference Engine (Part 3)

In Part 2, we successfully trained a Transformer model to map sequences of body keypoints to sign language glosses using CTC loss. However, training on pre-segmented videos is one thing; making it work in the real world—where a webcam stream is infinite and boundaries are unknown—is an entirely different beast.

In this article, we tear down inference/realtime.py, the beating heart of the asl-to-voice project. We will explore how we handle infinite video streams, decode raw probabilities into words, and use Large Language Models (LLMs) to generate beautiful, spoken English on the fly.

Stage 3: The Sliding Window and CTC Decoding

When a user turns on their webcam, we don't know when a sentence begins or ends. To solve this, we implemented a Sliding Window architecture.

As the camera captures frames, MediaPipe extracts the keypoints and appends them to a collections.deque (an efficient double-ended queue with a fixed maximum length, so the oldest frames fall off automatically). We maintain a window of W frames (e.g., 64 frames, representing about 2 seconds of video at 30 FPS).

Every S frames (the "stride", e.g., 16 frames), we take the current window, convert it to a PyTorch tensor, and push it through our Transformer model. This means the model is constantly analyzing overlapping chunks of time, ensuring we never "cut off" a sign in the middle of an inference step.
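The buffer-and-stride logic above can be sketched in a few lines. This is an illustrative version (the window and stride values, class name, and return type are assumptions; the real realtime.py wires this into MediaPipe and converts the window to a PyTorch tensor):

```python
from collections import deque

W = 64  # window size in frames (~2 seconds at 30 FPS) -- illustrative value
S = 16  # stride: run inference once every S new frames -- illustrative value

class SlidingWindow:
    def __init__(self, window=W, stride=S):
        # maxlen makes the deque drop the oldest frame automatically
        self.buffer = deque(maxlen=window)
        self.stride = stride
        self._since_last = 0

    def push(self, keypoints):
        """Add one frame of keypoints; return a full window every `stride` frames."""
        self.buffer.append(keypoints)
        self._since_last += 1
        if len(self.buffer) == self.buffer.maxlen and self._since_last >= self.stride:
            self._since_last = 0
            # in the real pipeline this would become torch.tensor(...)
            return list(self.buffer)
        return None
```

Because the stride is smaller than the window, consecutive windows overlap by W − S frames, which is exactly what keeps a sign from being split across inference steps.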

Making Sense of the Output

The Transformer outputs a probability distribution across our entire vocabulary for every frame in that 64-frame window. How do we turn that into words?

In models/gloss_decoder.py, we implement CTC decoding. We offer two strategies:

  1. Greedy Search (Default): For every time step, simply pick the word with the highest probability. We then collapse consecutive duplicate words and remove the <BLANK> tokens. It's incredibly fast and works well for clear, distinct signs.
  2. Beam Search: Instead of just looking at the top choice, Beam Search keeps track of the top K (the beam width) most likely paths through the probabilities. It's computationally heavier but significantly more accurate, especially when the model is slightly unsure.
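Greedy decoding is simple enough to show in full. Here is a hedged sketch of the collapse-duplicates-then-drop-blanks rule (the actual implementation in models/gloss_decoder.py may differ in detail, e.g., operating on log-probability tensors):

```python
def ctc_greedy_decode(frame_probs, vocab, blank=0):
    """Greedy CTC decoding: argmax per frame, collapse repeats, drop blanks.

    frame_probs: list of per-frame probability lists over the vocabulary.
    vocab: index -> gloss string, with vocab[blank] == "<BLANK>".
    """
    # 1. Pick the most likely vocabulary index at every time step
    best = [max(range(len(p)), key=p.__getitem__) for p in frame_probs]
    # 2. Collapse consecutive duplicates, then remove blanks
    decoded, prev = [], None
    for idx in best:
        if idx != prev and idx != blank:
            decoded.append(vocab[idx])
        prev = idx
    return decoded
```

Note the order matters: duplicates are collapsed before blanks are removed, so a blank between two identical glosses ("GO", blank, "GO") correctly yields the gloss twice.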

Stage 4: The LLM Translation Layer

At this point, our decoder might output a sequence of glosses like: ["STORE", "I", "GO"].

To a hearing person, this sounds broken. Sign languages have their own distinct grammar and syntax. To make the system truly accessible and natural, we must translate these literal glosses into fluent English: "I am going to the store."

This is where models/gloss_to_text.py comes in. We treat the gloss-to-English translation as a standard NLP translation task, leveraging modern Large Language Models (LLMs).

The Fallback Chain

Relying on a single cloud API in a real-time system is dangerous. If the API rate-limits you or goes down, the application breaks. To guarantee reliability, we built an intelligent fallback chain.

  1. Primary: Google Gemini (gemini-3.1-flash-lite-preview). It is blazingly fast, highly accurate, and extremely cost-effective for this type of few-shot translation.
  2. Fallback 1: OpenAI (gpt-5.4-mini). If Gemini times out or throws a server error, the system automatically routes the exact same prompt to OpenAI.
  3. Fallback 2: Anthropic (claude-haiku-4-5-20251001). Our final safety net.
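The chain itself is just ordered error handling. A minimal sketch, with placeholder functions standing in for the real Gemini, OpenAI, and Anthropic SDK calls (the function and provider names here are illustrative, not the project's actual API):

```python
def translate_with_fallback(glosses, providers):
    """Try each provider in order; return the first successful translation.

    providers: list of (name, fn) pairs, where fn takes a gloss list and
    returns an English sentence, raising on timeout / rate limit / error.
    """
    last_error = None
    for name, fn in providers:
        try:
            return fn(glosses)
        except Exception as err:  # rate limit, timeout, server error, ...
            last_error = err      # remember why this provider failed
    raise RuntimeError(f"all providers failed: {last_error}")
```

Every provider receives the exact same prompt, so a fallback hop is invisible to the user apart from a few extra milliseconds of latency.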

We use a carefully crafted system prompt:

"You are a sign language interpreter. Convert the following sign language gloss sequence into a natural, fluent English sentence. Output only the sentence, nothing else. Preserve the original meaning exactly."

By using these ultra-fast, lightweight LLMs, the translation usually takes less than 500 milliseconds.

Stage 5: Text-to-Speech (Without Freezing)

The final step is to read the translated sentence aloud. If we simply called a Text-to-Speech (TTS) function in our main while True webcam loop, the entire video feed would freeze while the computer spoke.

To solve this, inference/tts.py implements a multi-threaded, non-blocking audio engine.

When the LLM returns a sentence, the main thread pushes that string into a thread-safe queue.Queue. A dedicated background worker thread constantly watches this queue. When it sees new text, it synthesizes the audio and plays it. The main webcam loop never waits, meaning the video feed stays at a buttery smooth 30 FPS.
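The queue-plus-worker pattern looks roughly like this (a sketch under the assumptions above; the real inference/tts.py adds backend selection and error handling around the `speak` call):

```python
import queue
import threading

def tts_worker(speak, q):
    """Background worker: drain sentences from the queue and speak them."""
    while True:
        text = q.get()
        if text is None:   # sentinel value: shut the worker down
            q.task_done()
            break
        speak(text)        # blocking synthesis + playback happens off the main thread
        q.task_done()

def start_tts(speak):
    """Start the worker thread; returns the queue the main loop pushes into."""
    q = queue.Queue()
    threading.Thread(target=tts_worker, args=(speak, q), daemon=True).start()
    return q
```

From the main loop's perspective, "speaking" is just `q.put(sentence)`, which returns immediately, so the OpenCV render loop never stalls.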

We unified three different TTS backends behind a single interface:

  • Edge TTS (Primary): This utilizes Microsoft Edge's internal API to access incredibly high-quality, neural text-to-speech voices for free, without requiring an API key.
  • pyttsx3: A fully offline fallback that uses the host OS's native voices.
  • ElevenLabs: For users who want ultra-realistic, premium voices (requires an API key).
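Putting three backends behind one interface mostly means agreeing on a `speak`-like entry point and probing availability in priority order. A hypothetical picker (the class and function names here are illustrative, not the project's actual ones):

```python
class TTSUnavailable(Exception):
    """Raised when a backend cannot be used (offline, missing API key, ...)."""

def pick_backend(backends):
    """Return the first backend whose availability check succeeds.

    backends: list of (name, check, speak) triples; `check` raises
    TTSUnavailable when the backend cannot run in this environment.
    """
    for name, check, speak in backends:
        try:
            check()
            return name, speak
        except TTSUnavailable:
            continue  # try the next backend in priority order
    raise RuntimeError("no TTS backend available")
```

With this shape, running offline simply means the Edge TTS check fails and pyttsx3 is picked up transparently.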

The User Experience

We wrap all of this in a sleek, real-time OpenCV window (utils/visualize.py). As the user signs, the MediaPipe skeleton is drawn on their body. A clean HUD overlays the screen, showing the current raw gloss predictions in gray, and the final translated English sentence in bright green just before the computer speaks it aloud.

With the core pipeline running live, what happens if you want to run this in a remote village with no internet? Or what if you want to teach it a sign language it's never seen before?

In the final installment, Part 4, we will explore the advanced features of the codebase: offline translation models, custom sign recording tools, and exporting to ONNX for massive performance gains.

Uploaded through Distroblog - a platform I created specifically to post to multiple blog sites at once 😅
