<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Bright Etornam Sunu</title>
    <description>The latest articles on DEV Community by Bright Etornam Sunu (@_iametornam).</description>
    <link>https://dev.to/_iametornam</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F153888%2F7705f736-a847-4045-8624-d166b354e4d0.jpg</url>
      <title>DEV Community: Bright Etornam Sunu</title>
      <link>https://dev.to/_iametornam</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/_iametornam"/>
    <language>en</language>
    <item>
      <title>Bringing it to Life: The Real-Time Inference Engine (Part 3)</title>
      <dc:creator>Bright Etornam Sunu</dc:creator>
      <pubDate>Fri, 24 Apr 2026 20:05:35 +0000</pubDate>
      <link>https://dev.to/_iametornam/bringing-it-to-life-the-real-time-inference-engine-part-3-29fi</link>
      <guid>https://dev.to/_iametornam/bringing-it-to-life-the-real-time-inference-engine-part-3-29fi</guid>
      <description>&lt;p&gt;In &lt;a href="https://dev.to/_iametornam/from-pixels-to-predictions-data-pipelines-and-training-the-sequence-model-part-2-217d"&gt;Part 2&lt;/a&gt;, we successfully trained a Transformer model to map sequences of body keypoints to sign language glosses using CTC loss. However, training on pre-segmented videos is one thing; making it work in the real world—where a webcam stream is infinite and boundaries are unknown—is an entirely different beast.&lt;/p&gt;

&lt;p&gt;In this article, we tear down &lt;code&gt;inference/realtime.py&lt;/code&gt;, the beating heart of the &lt;strong&gt;asl-to-voice&lt;/strong&gt; project. We will explore how we handle infinite video streams, decode raw probabilities into words, and use Large Language Models (LLMs) to generate beautiful, spoken English on the fly.&lt;/p&gt;

&lt;h3&gt;
  
  
  Stage 3: The Sliding Window and CTC Decoding
&lt;/h3&gt;

&lt;p&gt;When a user turns on their webcam, we don't know when a sentence begins or ends. To solve this, we implemented a &lt;strong&gt;Sliding Window&lt;/strong&gt; architecture.&lt;/p&gt;

&lt;p&gt;As the camera captures frames, &lt;code&gt;MediaPipe&lt;/code&gt; extracts the keypoints and appends them to a &lt;code&gt;collections.deque&lt;/code&gt; (a highly efficient queue). We maintain a window of &lt;code&gt;W&lt;/code&gt; frames (e.g., 64 frames, representing about 2 seconds of video). &lt;/p&gt;

&lt;p&gt;Every &lt;code&gt;S&lt;/code&gt; frames (the "stride", e.g., 16 frames), we take the current window, convert it to a PyTorch tensor, and push it through our Transformer model. This means the model is constantly analyzing overlapping chunks of time, ensuring we never "cut off" a sign in the middle of an inference step.&lt;/p&gt;
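
&lt;p&gt;Here is a minimal sketch of that loop, assuming a &lt;code&gt;model&lt;/code&gt; that returns per-frame log-probabilities. The names (&lt;code&gt;process_frame&lt;/code&gt;, &lt;code&gt;W&lt;/code&gt;, &lt;code&gt;S&lt;/code&gt;) are illustrative, not the exact API of &lt;code&gt;inference/realtime.py&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from collections import deque

import numpy as np
import torch

W, S = 64, 16              # window size and stride, in frames
window = deque(maxlen=W)   # oldest keypoint vectors fall off automatically
frame_count = 0

def process_frame(keypoints, model, device="cpu"):
    """Append one frame's keypoint vector; run the model every S frames."""
    global frame_count
    window.append(keypoints)
    frame_count += 1
    if len(window) == W and frame_count % S == 0:
        x = torch.from_numpy(np.stack(window)).float().unsqueeze(0)  # (1, W, feat_dim)
        with torch.no_grad():
            return model(x.to(device))  # per-frame log-probabilities, (1, W, vocab)
    return None
&lt;/code&gt;&lt;/pre&gt;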

&lt;h4&gt;
  
  
  Making Sense of the Output
&lt;/h4&gt;

&lt;p&gt;The Transformer outputs a probability distribution across our entire vocabulary for every frame in that 64-frame window. How do we turn that into words? &lt;/p&gt;

&lt;p&gt;In &lt;code&gt;models/gloss_decoder.py&lt;/code&gt;, we implement CTC decoding. We offer two strategies (a greedy-decoding sketch follows the list):&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Greedy Search (Default):&lt;/strong&gt; For every time step, simply pick the word with the highest probability. We then collapse consecutive duplicate words and remove the &lt;code&gt;&amp;lt;BLANK&amp;gt;&lt;/code&gt; tokens. It's incredibly fast and works well for clear, distinct signs.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Beam Search:&lt;/strong&gt; Instead of just looking at the top choice, Beam Search keeps track of the top &lt;code&gt;K&lt;/code&gt; (the beam width) most likely &lt;em&gt;paths&lt;/em&gt; through the probabilities. It's computationally heavier but significantly more accurate, especially when the model is slightly unsure.&lt;/li&gt;
&lt;/ol&gt;
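
&lt;p&gt;To make the greedy strategy concrete, here is a minimal decoding sketch; the helper name and blank index are assumptions, and the real &lt;code&gt;models/gloss_decoder.py&lt;/code&gt; may differ in the details:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import torch

def greedy_ctc_decode(log_probs, id_to_gloss, blank_id=0):
    """log_probs: (T, vocab_size) tensor of per-frame log-probabilities."""
    best_ids = log_probs.argmax(dim=-1).tolist()  # top class for every time step
    decoded, prev = [], None
    for idx in best_ids:
        if idx != prev and idx != blank_id:       # collapse repeats, drop blanks
            decoded.append(id_to_gloss[idx])
        prev = idx
    return decoded
&lt;/code&gt;&lt;/pre&gt;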

&lt;h3&gt;
  
  
  Stage 4: The LLM Translation Layer
&lt;/h3&gt;

&lt;p&gt;At this point, our decoder might output a sequence of glosses like: &lt;code&gt;["STORE", "I", "GO"]&lt;/code&gt;. &lt;/p&gt;

&lt;p&gt;To a hearing person, this sounds broken. Sign languages have their own distinct grammar and syntax. To make the system truly accessible and natural, we must translate these literal glosses into fluent English: &lt;em&gt;"I am going to the store."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This is where &lt;code&gt;models/gloss_to_text.py&lt;/code&gt; comes in. We treat the gloss-to-English translation as a standard NLP translation task, leveraging modern Large Language Models (LLMs).&lt;/p&gt;

&lt;h4&gt;
  
  
  The Fallback Chain
&lt;/h4&gt;

&lt;p&gt;Relying on a single cloud API in a real-time system is dangerous. If the API rate-limits you or goes down, the application breaks. To guarantee reliability, we built an intelligent &lt;strong&gt;fallback chain&lt;/strong&gt;.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Primary:&lt;/strong&gt; Google Gemini (&lt;code&gt;gemini-3.1-flash-lite-preview&lt;/code&gt;). It is blazingly fast, highly accurate, and extremely cost-effective for this type of few-shot translation.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Fallback 1:&lt;/strong&gt; OpenAI (&lt;code&gt;gpt-5.4-mini&lt;/code&gt;). If Gemini times out or throws a server error, the system automatically routes the exact same prompt to OpenAI.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Fallback 2:&lt;/strong&gt; Anthropic (&lt;code&gt;claude-haiku-4-5-20251001&lt;/code&gt;). Our final safety net.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We use a carefully crafted system prompt:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"You are a sign language interpreter. Convert the following sign language gloss sequence into a natural, fluent English sentence. Output only the sentence, nothing else. Preserve the original meaning exactly."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;By using these ultra-fast, lightweight LLMs, the translation usually takes less than 500 milliseconds.&lt;/p&gt;
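
&lt;p&gt;Structurally, the chain is just an ordered list of providers tried in sequence. The sketch below shows the idea with generic callables instead of the actual SDK calls, so treat the function shapes as assumptions rather than the exact &lt;code&gt;models/gloss_to_text.py&lt;/code&gt; implementation:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;SYSTEM_PROMPT = (
    "You are a sign language interpreter. Convert the following sign language "
    "gloss sequence into a natural, fluent English sentence. Output only the "
    "sentence, nothing else. Preserve the original meaning exactly."
)

def translate_glosses(glosses, providers):
    """providers: ordered (name, callable) pairs, e.g. Gemini, OpenAI, Anthropic.
    Each callable takes (system_prompt, user_text) and returns a sentence."""
    user_text = " ".join(glosses)
    for name, call in providers:
        try:
            return call(SYSTEM_PROMPT, user_text)
        except Exception as exc:  # rate limit, timeout, server error...
            print(f"{name} failed ({exc}); trying the next provider")
    return user_text              # last resort: speak the raw glosses
&lt;/code&gt;&lt;/pre&gt;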

&lt;h3&gt;
  
  
  Stage 5: Text-to-Speech (Without Freezing)
&lt;/h3&gt;

&lt;p&gt;The final step is to read the translated sentence aloud. If we simply called a Text-to-Speech (TTS) function in our main &lt;code&gt;while True&lt;/code&gt; webcam loop, the entire video feed would freeze while the computer spoke.&lt;/p&gt;

&lt;p&gt;To solve this, &lt;code&gt;inference/tts.py&lt;/code&gt; implements a multi-threaded, non-blocking audio engine. &lt;/p&gt;

&lt;p&gt;When the LLM returns a sentence, the main thread pushes that string into a thread-safe &lt;code&gt;queue.Queue&lt;/code&gt;. A dedicated background worker thread constantly watches this queue. When it sees new text, it synthesizes the audio and plays it. The main webcam loop never waits, meaning the video feed stays at a buttery smooth 30 FPS.&lt;/p&gt;
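
&lt;p&gt;The pattern looks roughly like this (a simplified sketch; the real &lt;code&gt;inference/tts.py&lt;/code&gt; wraps the backends described below behind one interface):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import queue
import threading

speech_queue = queue.Queue()

def tts_worker(speak_fn):
    """Background worker: pull sentences off the queue and synthesize them."""
    while True:
        sentence = speech_queue.get()  # blocks until new text arrives
        if sentence is None:           # sentinel to shut down cleanly
            break
        speak_fn(sentence)             # e.g. an Edge TTS or pyttsx3 call
        speech_queue.task_done()

worker = threading.Thread(target=tts_worker, args=(print,), daemon=True)
worker.start()

# In the main webcam loop, this call returns immediately:
speech_queue.put("I am going to the store.")
&lt;/code&gt;&lt;/pre&gt;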

&lt;p&gt;We unified three different TTS backends behind a single interface:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Edge TTS (Primary):&lt;/strong&gt; This utilizes Microsoft Edge's internal API to access incredibly high-quality, neural text-to-speech voices for free, without requiring an API key.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;pyttsx3:&lt;/strong&gt; A fully offline fallback that uses the host OS's native voices.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;ElevenLabs:&lt;/strong&gt; For users who want ultra-realistic, premium voices (requires an API key).&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The User Experience
&lt;/h3&gt;

&lt;p&gt;We wrap all of this in a sleek, real-time OpenCV window (&lt;code&gt;utils/visualize.py&lt;/code&gt;). As the user signs, the MediaPipe skeleton is drawn on their body. A clean HUD overlays the screen, showing the current raw gloss predictions in gray, and the final translated English sentence in bright green just before the computer speaks it aloud.&lt;/p&gt;

&lt;p&gt;With the core pipeline running live, what happens if you want to run this in a remote village with no internet? Or what if you want to teach it a sign language it's never seen before? &lt;/p&gt;

&lt;p&gt;In the final installment, &lt;strong&gt;Part 4&lt;/strong&gt;, we will explore the advanced features of the codebase: offline translation models, custom sign recording tools, and exporting to ONNX for massive performance gains.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;uploaded through &lt;a href="https://distroblog.etornam.dev" rel="noopener noreferrer"&gt;Distroblog&lt;/a&gt; - a platform I created specifically to post to multiple blog sites at once😅&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>transformer</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>From Pixels to Predictions: Data Pipelines and Training the Sequence Model (Part 2)</title>
      <dc:creator>Bright Etornam Sunu</dc:creator>
      <pubDate>Fri, 17 Apr 2026 23:21:40 +0000</pubDate>
      <link>https://dev.to/_iametornam/from-pixels-to-predictions-data-pipelines-and-training-the-sequence-model-part-2-217d</link>
      <guid>https://dev.to/_iametornam/from-pixels-to-predictions-data-pipelines-and-training-the-sequence-model-part-2-217d</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqs3b4hne62ykrpkt1mh9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqs3b4hne62ykrpkt1mh9.png" alt=" " width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In &lt;a href="https://dev.to/_iametornam/bridging-the-silence-building-a-real-time-sign-language-translator-part-1-1b8l"&gt;Part 1&lt;/a&gt; of this series, we introduced the architecture of the &lt;strong&gt;asl-to-voice&lt;/strong&gt; translation system—a five-stage pipeline designed to turn real-time webcam video into spoken English. But a machine learning model is only as good as the data it learns from, and in the world of computer vision, raw video is often too noisy, heavy, and unstructured to be useful directly.&lt;/p&gt;

&lt;p&gt;In this article, we dive into the data layer: how we extract meaningful signals from raw video, normalize them for robust inference, and train our temporal sequence model.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Data Foundation: WLASL and Beyond
&lt;/h3&gt;

&lt;p&gt;To teach a neural network to understand sign language, we need massive amounts of annotated video. The project supports several public datasets:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;WLASL (Word-Level American Sign Language):&lt;/strong&gt; Contains over 2,000 signs performed by over 100 signers. We use this as our primary baseline, often starting with a top-50 sign subset for rapid iteration.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;RWTH-PHOENIX-2014T:&lt;/strong&gt; A dataset of continuous German Sign Language with rich gloss annotations.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;How2Sign:&lt;/strong&gt; A large-scale, continuous ASL dataset.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We built custom scripts (like &lt;code&gt;scripts/download_wlasl.py&lt;/code&gt;) to scrape, organize, and format these datasets automatically, preparing them for the extraction phase.&lt;/p&gt;

&lt;h3&gt;
  
  
  Stage 1: Keypoint Extraction with MediaPipe
&lt;/h3&gt;

&lt;p&gt;Passing raw RGB frames directly into a temporal model (like a 3D CNN or Vision Transformer) requires massive computational power—usually a high-end GPU. Because our goal is &lt;em&gt;real-time&lt;/em&gt; inference on consumer hardware, we take a different approach: &lt;strong&gt;Skeletonization&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Using Google's &lt;code&gt;MediaPipe Holistic&lt;/code&gt; framework, we process the video frame-by-frame, extracting the 3D coordinates (x, y, z) of specific landmarks on the human body.&lt;/p&gt;

&lt;p&gt;In &lt;code&gt;models/keypoint_extractor.py&lt;/code&gt;, we construct a dense feature vector for every frame:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Hands:&lt;/strong&gt; 21 landmarks per hand × 3 dimensions × 2 hands = 126 dims.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Pose (Body):&lt;/strong&gt; 33 landmarks × 4 dimensions (including visibility) = 132 dims.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Face:&lt;/strong&gt; The full face mesh is 468 points (1,404 dims), which is often overkill. We provide a configuration toggle to extract just the &lt;strong&gt;mouth subset&lt;/strong&gt; (~20 landmarks = ~60 dims). Mouth shapes are critical for non-manual markers in ASL.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;By default, we compress millions of pixels into a highly informative &lt;strong&gt;1,662-dimensional vector&lt;/strong&gt; per frame (including the full face mesh).&lt;/p&gt;
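
&lt;p&gt;A condensed sketch of that per-frame assembly is shown below. The real &lt;code&gt;models/keypoint_extractor.py&lt;/code&gt; adds configuration toggles (such as the mouth-only subset), so take this as illustrative:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import numpy as np

def frame_features(results):
    """Flatten MediaPipe Holistic landmarks into one 1,662-dim vector."""
    def flat(landmarks, n_points, dims=3):
        if landmarks is None:                       # missing part: zero-fill
            return np.zeros(n_points * dims, dtype=np.float32)
        vals = []
        for lm in landmarks.landmark:
            vals.extend([lm.x, lm.y, lm.z] if dims == 3
                        else [lm.x, lm.y, lm.z, lm.visibility])
        return np.asarray(vals, dtype=np.float32)

    return np.concatenate([
        flat(results.left_hand_landmarks, 21),      # 63 dims
        flat(results.right_hand_landmarks, 21),     # 63 dims
        flat(results.pose_landmarks, 33, dims=4),   # 132 dims
        flat(results.face_landmarks, 468),          # 1,404 dims
    ])                                              # 1,662 dims total
&lt;/code&gt;&lt;/pre&gt;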

&lt;h4&gt;
  
  
  The Secret Sauce: Normalization
&lt;/h4&gt;

&lt;p&gt;If the model trains on a person standing in the center of the frame, it will fail if the user stands in the bottom left corner. To solve this, we implemented &lt;strong&gt;shoulder-based normalization&lt;/strong&gt;. &lt;/p&gt;

&lt;p&gt;Before the keypoints are saved, we calculate the midpoint between the left and right shoulder landmarks (Pose points 11 and 12). We then translate all other keypoints so that this shoulder midpoint becomes the origin &lt;code&gt;(0,0,0)&lt;/code&gt;. This makes our data translation-invariant—the model only cares about how the hands and face move &lt;em&gt;relative to the body&lt;/em&gt;, not where the body is in the camera frame.&lt;/p&gt;
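
&lt;p&gt;In code, the idea boils down to a couple of lines; the array layout here is an assumption for illustration:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import numpy as np

LEFT_SHOULDER, RIGHT_SHOULDER = 11, 12   # MediaPipe pose landmark indices

def normalize_keypoints(pose_xyz, other_xyz):
    """pose_xyz: (33, 3) pose coordinates; other_xyz: (N, 3) hand/face points."""
    origin = (pose_xyz[LEFT_SHOULDER] + pose_xyz[RIGHT_SHOULDER]) / 2.0
    # The shoulder midpoint becomes (0, 0, 0) for every keypoint in the frame.
    return pose_xyz - origin, other_xyz - origin
&lt;/code&gt;&lt;/pre&gt;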

&lt;h3&gt;
  
  
  Stage 2: The Temporal Sequence Model
&lt;/h3&gt;

&lt;p&gt;With our videos converted into sequences of normalized 1,662-dimensional vectors, we are ready to train. The core of this system is the &lt;strong&gt;Transformer Encoder&lt;/strong&gt; (defined in &lt;code&gt;models/sequence_model.py&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;Why a Transformer? While Recurrent Neural Networks (like our BiLSTM baseline) handle sequential data well, Transformers excel at modeling long-range dependencies and parallelize beautifully on modern hardware.&lt;/p&gt;

&lt;p&gt;Our default architecture (configured via &lt;code&gt;config.yaml&lt;/code&gt;), with a minimal PyTorch sketch after the list:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Input Projection:&lt;/strong&gt; A linear layer scales the 1,662-dim input up to the model's hidden dimension (e.g., 256).&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Positional Encoding:&lt;/strong&gt; Standard sinusoidal encodings are injected so the self-attention mechanism knows the temporal order of the frames.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Encoder Blocks:&lt;/strong&gt; 6 layers of multi-head self-attention (8 heads) allow the model to look at the entire sequence of keypoints and understand the context of the sign.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;CTC Head:&lt;/strong&gt; A final linear layer projects the hidden state to our vocabulary size, followed by a log-softmax activation.&lt;/li&gt;
&lt;/ul&gt;
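
&lt;p&gt;A minimal PyTorch sketch of that stack is shown below. The hyperparameters mirror the text, the vocabulary size of 51 assumes a top-50 subset plus the blank token, and &lt;code&gt;models/sequence_model.py&lt;/code&gt; holds the full implementation:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import torch
import torch.nn as nn

class SignTransformer(nn.Module):
    def __init__(self, in_dim=1662, d_model=256, n_heads=8, n_layers=6, vocab_size=51):
        super().__init__()
        self.proj = nn.Linear(in_dim, d_model)                # input projection
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab_size)            # CTC head (includes blank)

    def positional_encoding(self, T, d_model, device):
        pos = torch.arange(T, device=device).unsqueeze(1)
        i = torch.arange(0, d_model, 2, device=device)
        angles = pos / torch.pow(10000.0, i / d_model)
        pe = torch.zeros(T, d_model, device=device)
        pe[:, 0::2], pe[:, 1::2] = torch.sin(angles), torch.cos(angles)
        return pe

    def forward(self, x):                                     # x: (B, T, 1662)
        h = self.proj(x)
        h = h + self.positional_encoding(x.size(1), h.size(-1), x.device)
        h = self.encoder(h)
        return self.head(h).log_softmax(dim=-1)               # (B, T, vocab_size)
&lt;/code&gt;&lt;/pre&gt;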

&lt;h4&gt;
  
  
  Training with CTC Loss
&lt;/h4&gt;

&lt;p&gt;In continuous sign language, we don't know exactly &lt;em&gt;when&lt;/em&gt; a sign starts and stops in the video. We just know the video contains the glosses &lt;code&gt;["HELLO", "WORLD"]&lt;/code&gt;. &lt;/p&gt;

&lt;p&gt;To solve this alignment problem, we train the network using &lt;strong&gt;Connectionist Temporal Classification (CTC) loss&lt;/strong&gt;. CTC allows the model to predict a sequence of tokens from an unsegmented input stream by introducing a special &lt;code&gt;&amp;lt;BLANK&amp;gt;&lt;/code&gt; token. The model learns to predict blanks during the transitions between signs, and spikes the probability of a specific sign when it recognizes it.&lt;/p&gt;

&lt;p&gt;Our training script (&lt;code&gt;training/train_sequence.py&lt;/code&gt;) utilizes PyTorch's native &lt;code&gt;nn.CTCLoss(zero_infinity=True)&lt;/code&gt;, paired with an Adam optimizer, a learning rate scheduler (&lt;code&gt;ReduceLROnPlateau&lt;/code&gt;), and gradient clipping to stabilize the notoriously unstable CTC training process.&lt;/p&gt;
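
&lt;p&gt;A condensed training step, reusing the &lt;code&gt;SignTransformer&lt;/code&gt; sketch above, might look like this (the real script adds batching, validation, logging, and checkpointing):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import torch
import torch.nn as nn

model = SignTransformer()
criterion = nn.CTCLoss(blank=0, zero_infinity=True)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.5, patience=3)

def train_step(batch):
    feats, targets, feat_lens, target_lens = batch   # padded keypoints + gloss labels
    log_probs = model(feats)                         # (B, T, vocab), log-softmaxed
    # nn.CTCLoss expects (T, B, vocab) inputs
    loss = criterion(log_probs.permute(1, 0, 2), targets, feat_lens, target_lens)
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)  # gradient clipping
    optimizer.step()
    return loss.item()

# After each validation pass: scheduler.step(validation_loss)
&lt;/code&gt;&lt;/pre&gt;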

&lt;h3&gt;
  
  
  Measuring Success
&lt;/h3&gt;

&lt;p&gt;During training, standard loss metrics aren't enough. We evaluate our models using &lt;strong&gt;Word Error Rate (WER)&lt;/strong&gt; via the &lt;code&gt;jiwer&lt;/code&gt; library. WER measures how many insertions, deletions, and substitutions are required to turn the predicted gloss sequence into the ground truth sequence. The lower the WER, the better the model.&lt;/p&gt;
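
&lt;p&gt;For example, computing the metric with &lt;code&gt;jiwer&lt;/code&gt; is a one-liner (the gloss sequences are joined into strings first):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import jiwer

reference  = "STORE I GO"
hypothesis = "STORE GO"
print(jiwer.wer(reference, hypothesis))  # 0.333...: one deletion out of three reference words
&lt;/code&gt;&lt;/pre&gt;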

&lt;h3&gt;
  
  
  Next Steps
&lt;/h3&gt;

&lt;p&gt;Now we have a trained Transformer model capable of taking a sequence of keypoints and spitting out a sequence of gloss probabilities. But how do we do this live, on a webcam, without knowing when the user starts or stops signing? &lt;/p&gt;

&lt;p&gt;In &lt;strong&gt;Part 3&lt;/strong&gt;, we will explore the real-time inference loop, the magic of sliding windows, and how we translate robotic glosses into beautiful, spoken English.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;uploaded through &lt;a href="https://distroblog.etornam.dev" rel="noopener noreferrer"&gt;Distroblog&lt;/a&gt; - a platform I created specifically to post to multiple blog sites at once😅&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>asl</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Bridging the Silence: Building a Real-Time Sign Language Translator (Part 1)</title>
      <dc:creator>Bright Etornam Sunu</dc:creator>
      <pubDate>Fri, 10 Apr 2026 11:55:49 +0000</pubDate>
      <link>https://dev.to/_iametornam/bridging-the-silence-building-a-real-time-sign-language-translator-part-1-1b8l</link>
      <guid>https://dev.to/_iametornam/bridging-the-silence-building-a-real-time-sign-language-translator-part-1-1b8l</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqs3b4hne62ykrpkt1mh9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqs3b4hne62ykrpkt1mh9.png" alt=" " width="800" height="447"&gt;&lt;/a&gt;&lt;br&gt;
Communication is the cornerstone of human connection, yet for the deaf and hard-of-hearing communities, a significant barrier exists when interacting with those who don't understand sign language. While text-to-speech and speech-to-text technologies have advanced rapidly, translating visual, spatial languages like American Sign Language (ASL) into spoken word in real-time has remained a formidable challenge.&lt;/p&gt;

&lt;p&gt;Enter the &lt;strong&gt;asl-to-voice&lt;/strong&gt; project. &lt;/p&gt;

&lt;p&gt;In this four-part technical series, we will take a deep dive into how we built an end-to-end, continuous sign language recognition (CSLR) pipeline. This system doesn't just recognize isolated gestures; it watches a person signing via a standard webcam, understands the continuous flow of movements, translates those signs into fluent English, and speaks the translation aloud—all in real time.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Complexity of Sign Language
&lt;/h3&gt;

&lt;p&gt;Before diving into the architecture, it's crucial to understand why this is hard. Sign language isn't just about hand shapes. It involves complex grammar, facial expressions, body posture, and non-manual markers. Furthermore, in natural signing, there are no clear "spaces" between words like there are in written text. This is known as continuous signing.&lt;/p&gt;

&lt;p&gt;Traditional computer vision approaches often fail here because they treat signs as static images or isolated video clips. To truly translate sign language, a system must understand spatio-temporal data (space and time) simultaneously. Furthermore, sign language has its own syntax, often represented as "glosses" (capitalized literal representations of signs, like &lt;code&gt;YOU NAME WHAT&lt;/code&gt;), which don't map one-to-one to English grammar.&lt;/p&gt;

&lt;h3&gt;
  
  
  Our Solution: The 5-Stage Pipeline
&lt;/h3&gt;

&lt;p&gt;To tackle these challenges, the &lt;code&gt;asl-to-voice&lt;/code&gt; codebase is structured around a highly modular, five-stage pipeline. By breaking the problem down, we can optimize each component independently.&lt;/p&gt;

&lt;h4&gt;
  
  
  Stage 1: Keypoint Extraction
&lt;/h4&gt;

&lt;p&gt;Processing raw video frames through a heavy neural network is computationally expensive and slow. Instead, we use Google's &lt;strong&gt;MediaPipe Holistic&lt;/strong&gt; to extract 2D and 3D landmarks from the signer's hands, body (pose), and face. This dramatically reduces the dimensionality of our data from millions of pixels down to a dense feature vector (up to 1,662 dimensions) that describes exactly how the body is moving.&lt;/p&gt;

&lt;h4&gt;
  
  
  Stage 2: Temporal Sequence Modeling
&lt;/h4&gt;

&lt;p&gt;With our stream of keypoints, we need a model that understands time. We implemented a &lt;strong&gt;Transformer encoder&lt;/strong&gt; (with a BiLSTM available as a baseline). The Transformer looks at a sliding window of keypoint frames and learns the temporal relationships between them. Because we are dealing with continuous streams without explicit word boundaries, the model is trained using Connectionist Temporal Classification (CTC) loss.&lt;/p&gt;

&lt;h4&gt;
  
  
  Stage 3: Gloss Decoding
&lt;/h4&gt;

&lt;p&gt;The output of the Transformer is a probability distribution over our vocabulary of signs. The CTC decoder (using either a fast greedy search or a more accurate beam search) collapses these probabilities into a discrete sequence of glosses.&lt;/p&gt;

&lt;h4&gt;
  
  
  Stage 4: Gloss to Natural Language
&lt;/h4&gt;

&lt;p&gt;If the system outputs &lt;code&gt;["STORE", "I", "GO"]&lt;/code&gt;, the user experience is poor. Sign language glosses need to be translated into grammatically correct spoken language. To achieve this, we route the gloss sequence through a Large Language Model (LLM). Our system uses a resilient fallback chain: it tries Google Gemini first, falls back to OpenAI if it fails, and then to Anthropic. &lt;/p&gt;

&lt;h4&gt;
  
  
  Stage 5: Text-to-Speech (TTS)
&lt;/h4&gt;

&lt;p&gt;Finally, the natural English sentence (e.g., "I am going to the store.") is sent to a TTS engine. We rely primarily on Microsoft Edge TTS for high-quality, neural voice generation. Crucially, this runs in a background thread so the video feed and inference loop never freeze while the computer is speaking.&lt;/p&gt;

&lt;h3&gt;
  
  
  A Look at the Tech Stack
&lt;/h3&gt;

&lt;p&gt;The beauty of this project lies in how these diverse technologies are stitched together:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Computer Vision:&lt;/strong&gt; &lt;code&gt;mediapipe&lt;/code&gt;, &lt;code&gt;opencv-python&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Deep Learning:&lt;/strong&gt; &lt;code&gt;torch&lt;/code&gt;, &lt;code&gt;transformers&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Metrics:&lt;/strong&gt; &lt;code&gt;jiwer&lt;/code&gt; (for Word Error Rate), &lt;code&gt;sacrebleu&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;APIs:&lt;/strong&gt; &lt;code&gt;google-generativeai&lt;/code&gt;, &lt;code&gt;openai&lt;/code&gt;, &lt;code&gt;anthropic&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Audio:&lt;/strong&gt; &lt;code&gt;edge-tts&lt;/code&gt;, &lt;code&gt;pyttsx3&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The entire system is configuration-driven. A single &lt;code&gt;config.yaml&lt;/code&gt; file controls model architectures, feature subsets, sliding window sizes, and API fallback chains, making it easy to experiment.&lt;/p&gt;
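
&lt;p&gt;For illustration, loading such a config in Python could look like the snippet below; the key names here are hypothetical, not the project's exact schema:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import yaml

with open("config.yaml") as f:
    cfg = yaml.safe_load(f)

# Hypothetical keys, shown only to illustrate the config-driven design:
window_size = cfg["inference"]["window_size"]      # e.g. 64 frames
stride = cfg["inference"]["stride"]                 # e.g. 16 frames
llm_chain = cfg["translation"]["fallback_chain"]    # e.g. ["gemini", "openai", "anthropic"]
&lt;/code&gt;&lt;/pre&gt;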

&lt;h3&gt;
  
  
  What's Next?
&lt;/h3&gt;

&lt;p&gt;In &lt;strong&gt;Part 2&lt;/strong&gt;, we will roll up our sleeves and look at the data. We will explore how we process datasets like WLASL, how we normalize keypoints so the model works regardless of where you stand in the camera frame, and how the Transformer model is actually trained to understand sign language. &lt;/p&gt;

&lt;p&gt;Stay tuned as we move from concept to code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;uploaded through &lt;a href="https://distroblog.etornam.dev" rel="noopener noreferrer"&gt;Distroblog&lt;/a&gt; - a platform I created specifically to post to multiple blog sites at once😅&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>a11y</category>
      <category>deeplearning</category>
      <category>machinelearning</category>
      <category>ai</category>
    </item>
    <item>
      <title>Deploy your Nodejs + Auth0 REST API to Cyclic.sh under 4 minutes</title>
      <dc:creator>Bright Etornam Sunu</dc:creator>
      <pubDate>Thu, 16 Dec 2021 06:15:45 +0000</pubDate>
      <link>https://dev.to/_iametornam/deploy-your-nodejs-auth0-rest-api-to-cyclicsh-under-4-minutes-j8h</link>
      <guid>https://dev.to/_iametornam/deploy-your-nodejs-auth0-rest-api-to-cyclicsh-under-4-minutes-j8h</guid>
      <description>&lt;p&gt;Deploying APIs can sometimes be a pain in the butt when your service provider overcomplicates the deployment and setup process. This short article will demo how to deploy your Restful Nodejs application to Cyclic.sh in less than 4 minutes.&lt;/p&gt;

&lt;p&gt;Yes, you heard right: less than 4 minutes🔥😱😱.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fju588ozkufwtgfmrkkwc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fju588ozkufwtgfmrkkwc.png" alt="from cyclic.sh homepage" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Cyclic is a provider that helps you launch your API in seconds. Push your code to Github and let the CI/CD (continuous integration/continuous delivery) integration trigger and deploy your service onto a global infrastructure in seconds. No cryptic CloudFormation errors. No mysterious API Gateway errors. No YAML parse errors. No hunting for CloudWatch log groups. No wasted time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Important!&lt;/strong&gt;&lt;br&gt;
I already have my Nodejs Auth0 backend done.&lt;/p&gt;

&lt;p&gt;To follow along with this project, clone the repo from &lt;a href="https://github.com/RegNex/nodejs-auth0" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Deployment Demo&lt;/strong&gt;&lt;br&gt;
To deploy your codebase, follow the following steps:&lt;br&gt;
The first thing you must do is create a repository on github.com for your project and push your code.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7l6gxq0nzuffpztuh6vn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7l6gxq0nzuffpztuh6vn.png" alt="Github repo create page" width="800" height="746"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next, sign up to &lt;a href="https://cyclic.sh" rel="noopener noreferrer"&gt;Cyclic.sh&lt;/a&gt;. The signup process is seamless; all you need to do is sign up with your Github account.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg9no1gx3n3rxpblrvvjr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg9no1gx3n3rxpblrvvjr.png" alt="cyclic.sh signup page" width="800" height="945"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After a successful signup, you will see the dashboard, where all the magic happens. You can find the docs at the top right corner, just before the profile.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxfnh1tni5f6l2q58lcm9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxfnh1tni5f6l2q58lcm9.png" alt="locate docs at the top right corner" width="800" height="621"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;Now you need to deploy your code. Click the "deploy" button (green button) and select the "Link your own" tab.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fshygfa923tsbuioxb0gk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fshygfa923tsbuioxb0gk.png" alt="deploy button" width="800" height="621"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc8kbcfz87vlctfur7cow.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc8kbcfz87vlctfur7cow.png" alt="link your code from Github" width="800" height="453"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;Search for the repo you want to deploy, in this case "nodejs-auth0", select it, and connect it to your Github account.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fznur4kmxjgybngrlyi19.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fznur4kmxjgybngrlyi19.png" alt=" " width="800" height="369"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;At the prompt, confirm your Github access; after confirming, all you have to do is approve and install, and that's it 🎉&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpu5c2blnvkg5hbaznjlg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpu5c2blnvkg5hbaznjlg.png" alt=" " width="800" height="400"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;Once you approve and install, the deployment process will start. 2–3 minutes should do it 🎊🎉🎊🎉🎊🎉&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6mlj6q01uhd7u24g6o6r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6mlj6q01uhd7u24g6o6r.png" alt=" " width="800" height="491"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;The final step is to set your environment variables on the dashboard. The dashboard for your project looks like this.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0a23mtm2nt5unyp1gohk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0a23mtm2nt5unyp1gohk.png" alt=" " width="800" height="336"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;This is a ".env" file; you can also include those configurations on the dashboard by clicking on "Variables." After this configuration, everything should be up and running 🔥&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffpk47xpnf1rd6oflgsuc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffpk47xpnf1rd6oflgsuc.png" alt=" " width="800" height="499"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Deploying new changes&lt;/strong&gt;&lt;br&gt;
After all the setup and configuration, deploying new changes is simple: push your code to Github, and Github actions will do the rest 😀😉&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;br&gt;
Deploying a RESTful API shouldn't be hectic, and cyclic.sh has made sure deploying your backend code to the cloud is as simple as possible.&lt;/p&gt;

&lt;p&gt;If you find any difficulty in the deployment process, you can reach out to the &lt;a href="https://cyclic.sh" rel="noopener noreferrer"&gt;cyclic.sh&lt;/a&gt; team on &lt;a href="https://discord.com/invite/huhcqxXCbE" rel="noopener noreferrer"&gt;discord&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Do well to follow me on &lt;a href="https://twitter.com/_iamEtornam" rel="noopener noreferrer"&gt;Twitter&lt;/a&gt; and &lt;a href="https://www.linkedin.com/in/etornam-sunu/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; to connect.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;originally published on &lt;a href="https://etornam-sunu.medium.com" rel="noopener noreferrer"&gt;medium.com&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>node</category>
      <category>cyclic</category>
      <category>auth0</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>How to limit window resize in Flutter Desktop</title>
      <dc:creator>Bright Etornam Sunu</dc:creator>
      <pubDate>Mon, 26 Oct 2020 09:27:34 +0000</pubDate>
      <link>https://dev.to/_iametornam/how-to-limit-window-resize-in-flutter-desktop-40d8</link>
      <guid>https://dev.to/_iametornam/how-to-limit-window-resize-in-flutter-desktop-40d8</guid>
      <description>&lt;p&gt;Hi there!&lt;br&gt;
It has been such a looooong time since I wrote an article or even created a YouTube ☹️ tutorial, but hey, there’s good news 🎤: I’M BACK 🔥🔥🔥.&lt;/p&gt;

&lt;p&gt;In the past few weeks, I have been working on a desktop app using Flutter (you know, one of those side projects you start, then drop, then pick up again…? yeaaaah!), and I had problems with the window resizing.&lt;br&gt;
At some point, you can resize it so far that nothing else displays, which is undesirable. Hence I started doing some research on how to limit the resize of the window.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fos2bkx200u96ju9pktsb.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fos2bkx200u96ju9pktsb.gif" alt="Alt Text" width="600" height="400"&gt;&lt;/a&gt;&lt;br&gt;
After talking to a number of developers and reading some articles, it seemed this effect could “only” be done natively, which requires editing files in the macOS, Windows, and Linux folders. That can be a bit stressful, especially if the developer is not accustomed to those platforms, so I dug deeper, looked at the source code of existing Flutter desktop applications (FVM GUI), and discovered a Google package that does it all for you 🔥🔥🔥 (flutter desktop embedding). Here is a simple example of how to limit the window size of a desktop app:&lt;br&gt;
&lt;/p&gt;
&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;After applying the above code example, it was all fixed!!! Whew!!!!..&lt;br&gt;
And now, I have a functional and beautiful page.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F6pgaa6vijtzl2saf7l5v.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F6pgaa6vijtzl2saf7l5v.gif" alt="Alt Text" width="600" height="400"&gt;&lt;/a&gt;&lt;br&gt;
And that’s all for now. I will share the progress of the desktop application I’m developing in due time ⏰. And if you need any clarifications on this topic, do well to reach out to me on twitter &lt;a href="https://twitter.com/_iamEtornam" rel="noopener noreferrer"&gt;@_iamEtornam&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Akpẽ Kaka (Thank you in EʋE)&lt;/p&gt;

&lt;p&gt;&lt;em&gt;originally posted on &lt;a href="https://etornam-sunu.medium.com/how-to-limit-window-resize-in-flutter-desktop-47d6a495822e" rel="noopener noreferrer"&gt;medium.com&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>flutter</category>
      <category>desktop</category>
      <category>flutterdesktop</category>
    </item>
    <item>
      <title>UI Challenge 2020 (part 1)</title>
      <dc:creator>Bright Etornam Sunu</dc:creator>
      <pubDate>Thu, 02 Jan 2020 06:50:49 +0000</pubDate>
      <link>https://dev.to/_iametornam/ui-challenge-2020-part-1-371a</link>
      <guid>https://dev.to/_iametornam/ui-challenge-2020-part-1-371a</guid>
      <description>&lt;p&gt;As 2020 starts, i'm trying to sharpen my UI skills in flutter and also digging deep into Animation and Native Dart features. I started the year by implementing an Awesome mockup i saw on dribble.com by Mickael Guillaume. The first Page implemented and second page to go!&lt;br&gt;
source code here: &lt;a href="https://github.com/RegNex/InstaPic" rel="noopener noreferrer"&gt;https://github.com/RegNex/InstaPic&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Don't forget to give it a 🌟&lt;br&gt;
🎊Happy New Year 🎉&lt;/p&gt;

</description>
      <category>ui</category>
      <category>flutter</category>
      <category>design</category>
      <category>dart</category>
    </item>
  </channel>
</rss>
