Recently I had an idea for an app that helps with ASL sign communication, and I wanted to experiment and see if it's feasible. But first, I needed a model that could detect signs well enough. I'm not an ML scientist - more of an AI engineer - so ML training is something I'm still learning.
I started by researching existing models. Couldn't find a ready-to-use one that works for my use case:
- Google SignGemma - Not released to public yet
- SignLLM - Interesting approach but designed for a different workflow (sign-to-text translation vs real-time detection)
Most existing solutions focus on fingerspelling (alphabet) rather than word-level signs. So I explored training my own model with an ASL dataset.
Challenges
Three things I needed to figure out:
- No suitable existing model - Nothing I could just download and use for word-level ASL detection
- Dataset selection - Which dataset has enough samples per class to train reliably?
- Training approach - How to get effective results without massive compute?
Dataset Selection: Why WLASL-100
I evaluated several datasets:
- MS-ASL - Looks promising but requires downloading from YouTube. Many videos are now unavailable. Gave up.
- WLASL-2000 - 2000 classes but the class-to-sample ratio is terrible. Some signs have only 3-5 videos. Not enough for training.
- WLASL-100 - 100 classes with more samples per class. Better balance.
Went with WLASL-100 (Word-Level American Sign Language). The tradeoff: smaller vocabulary but more reliable training.
Model Selection: Why Pose LSTM (Not VideoMAE or VLMs)
I started with VideoMAE - it seemed like a good choice for video understanding. But:
- 86M parameters is overkill for ~750 training samples
- Fine-tuning took forever, results were mediocre (~40%)
- Inference was slow for real-time detection
Then I tried a simpler approach: extract hand landmarks with MediaPipe, feed into a lightweight LSTM. This made sense because:
- ASL signs are primarily about hand positions and movements
- MediaPipe gives us 21 landmarks per hand with x, y, z coords, so 2 hands x 21 x 3 = 126 features
- Much smaller input than raw video frames
- Can focus the model on what actually matters
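To make that concrete, here's a minimal sketch of turning one frame into the 126-dimensional feature vector with MediaPipe's Hands solution. The zero-padding for missing hands and the hand ordering are my simplifications, not necessarily what the real pipeline does.

```python
# Sketch: one RGB frame -> 126 features (2 hands x 21 landmarks x 3 coords).
# Missing hands are zero-padded; hand ordering is whatever MediaPipe returns.
import numpy as np
import mediapipe as mp

hands = mp.solutions.hands.Hands(static_image_mode=False, max_num_hands=2)

def frame_to_features(rgb_frame):
    result = hands.process(rgb_frame)                    # expects an RGB numpy array
    features = np.zeros((2, 21, 3), dtype=np.float32)
    if result.multi_hand_landmarks:
        for i, hand in enumerate(result.multi_hand_landmarks[:2]):
            features[i] = [(lm.x, lm.y, lm.z) for lm in hand.landmark]
    return features.reshape(-1)                          # shape: (126,)
```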
Model Experiments: The Failures Matter
Tried several approaches before finding what works:
| Model | Params | Result | What Happened |
|---|---|---|---|
| VideoMAE (fine-tuned) | 86M | ~40% | Too heavy, slow inference |
| Pose LSTM v1 | ~2M | 51.52% | Decent baseline |
| Pose LSTM v2 | 14M | 0.61% | Massive overfitting |
| Pose LSTM v2-lite | ~1M | 58.79% | Stripped down, worked |
| Pose LSTM v3 | 4.6M | 1.82% | FocalLoss + too many params |
| Pose LSTM v3-enhanced | ~2M | 65%+ | Final model |
Key learnings:
- Bigger model != better results (especially with ~750 training samples)
- 14M params on 750 samples = disaster
- MediaPipe hand landmarks + BiLSTM + attention pooling = sweet spot
- Label smoothing and mixup augmentation helped
AWS Infrastructure: EKS + Spot Instances
Here's what I set up:
Cluster Config:
- AWS EKS in us-east-1
- Node group: g6.12xlarge (4x L4 GPUs each, 24GB VRAM per GPU)
- Used spot instances - ~70% cost savings (~$1.72/hr vs $5.67/hr on-demand)
- Training time: ~1.5-2 hours per run
Why g6.12xlarge:
- L4 GPUs are newer and cheaper than older V100s
- 4 GPUs per instance = can run multi-GPU DDP training
- Spot availability is good for this instance type
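With 4 GPUs on the box, training can run as standard PyTorch DDP with one process per GPU. A minimal sketch of the setup, assuming the script is launched with torchrun; the actual training script may wire this up differently:

```python
# Sketch: one training process per L4 GPU via PyTorch DDP.
# Assumes launch with `torchrun --nproc_per_node=4 train.py`.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_ddp(model):
    dist.init_process_group(backend="nccl")          # reads rank/world size from torchrun env vars
    local_rank = int(os.environ["LOCAL_RANK"])       # set by torchrun
    torch.cuda.set_device(local_rank)
    model = model.cuda(local_rank)
    return DDP(model, device_ids=[local_rank]), local_rank
```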
Checkpoint Strategy for Spot Instances
Spot instances can be terminated anytime. My solution:
```python
# Save a checkpoint every epoch and upload it to S3 immediately
import os
import boto3

class CheckpointCallback:
    def __init__(self, s3_bucket, s3_prefix):
        self.s3_bucket = s3_bucket
        self.s3_prefix = s3_prefix
        self.s3 = boto3.client("s3")

    def on_epoch_end(self, trainer, epoch, metrics):
        # Save locally (model/optimizer state -> local file)
        checkpoint_path = self._save_checkpoint(trainer, epoch)
        # Upload to S3 immediately so a spot interruption costs at most one epoch
        self._upload_to_s3(checkpoint_path)

    def _upload_to_s3(self, path):
        key = f"{self.s3_prefix}/{os.path.basename(path)}"
        self.s3.upload_file(path, self.s3_bucket, key)
```
Key points:
- Checkpoint every epoch (~5-10 min intervals)
- Upload to S3 immediately after each checkpoint
- Max loss on spot interruption: one epoch of training
- Resume from S3 checkpoint on new instance
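The flip side of the checkpoint callback is the resume path: a replacement node pulls the newest checkpoint from S3 before training continues. A rough sketch; the boto3 usage and the `resume_from_s3` helper name are my own, not the actual code:

```python
# Sketch: on a fresh spot node, pull the newest checkpoint from S3 and resume.
import boto3
import torch

def resume_from_s3(bucket, prefix, local_path="resume.pt"):
    s3 = boto3.client("s3")
    objects = s3.list_objects_v2(Bucket=bucket, Prefix=prefix).get("Contents", [])
    if not objects:
        return None                                        # first run, nothing to resume
    latest = max(objects, key=lambda obj: obj["LastModified"])
    s3.download_file(bucket, latest["Key"], local_path)
    return torch.load(local_path, map_location="cpu")      # model/optimizer state dict
```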
S3 bucket structure:
```
s3://asl-model-checkpoints/
└── runs/
    └── 2025-12-26-v3-enhanced/
        ├── checkpoint-epoch-001/
        ├── checkpoint-epoch-002/
        └── best/
```
How I Train
The actual training flow:
Data preprocessing:
- Download WLASL videos
- Extract 32 frames uniformly sampled
- Run MediaPipe to get hand landmarks (21 points x 2 hands x 3 coords = 126 features)
- Compute velocity features (frame-to-frame differences)
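Roughly, the sampling and velocity steps look like this. Frame decoding is assumed to happen elsewhere (e.g. with OpenCV), and how pose and velocity are combined downstream is left to the model:

```python
# Sketch: uniformly sample 32 frames, extract landmarks, add velocity features.
# Reuses frame_to_features() from the earlier MediaPipe sketch.
import numpy as np

NUM_FRAMES = 32

def preprocess(frames):                                  # frames: list of RGB numpy arrays
    idx = np.linspace(0, len(frames) - 1, NUM_FRAMES).astype(int)
    pose = np.stack([frame_to_features(frames[i]) for i in idx])   # (32, 126)
    velocity = np.diff(pose, axis=0, prepend=pose[:1])             # frame-to-frame deltas, first row 0
    return pose, velocity
```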
Model architecture (v3-enhanced):
- Input: 126 features (hand landmarks)
- BiLSTM with 3 layers, hidden_dim=384
- Attention pooling (learns which frames matter)
- Classifier head with layer norm + dropout
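Here's a minimal PyTorch sketch of that architecture. The dimensions follow the bullets above, but the attention formulation and the dropout values are my reading of it, not the exact code:

```python
# Sketch of v3-enhanced: BiLSTM -> attention pooling -> classifier head.
import torch
import torch.nn as nn

class PoseLSTM(nn.Module):
    def __init__(self, input_dim=126, hidden_dim=384, num_layers=3, num_classes=100):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, num_layers,
                            batch_first=True, bidirectional=True, dropout=0.3)
        self.attn = nn.Linear(2 * hidden_dim, 1)          # one score per frame
        self.head = nn.Sequential(
            nn.LayerNorm(2 * hidden_dim),
            nn.Dropout(0.3),
            nn.Linear(2 * hidden_dim, num_classes),
        )

    def forward(self, x):                                 # x: (batch, 32, 126)
        out, _ = self.lstm(x)                             # (batch, 32, 768)
        weights = torch.softmax(self.attn(out), dim=1)    # learns which frames matter
        pooled = (weights * out).sum(dim=1)               # weighted sum over time
        return self.head(pooled)                          # logits over 100 glosses
```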
Training config:
```python
batch_size = 32
learning_rate = 1e-3  # with warmup
epochs = 50
early_stopping_patience = 10
label_smoothing = 0.1
```
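In PyTorch terms, that config maps onto pieces like these. The optimizer choice (AdamW) and the 5-epoch linear warmup are assumptions on my part:

```python
# Sketch: how the config above maps onto PyTorch pieces.
import torch
import torch.nn as nn

def build_training(model, base_lr=1e-3, warmup_epochs=5):
    criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
    optimizer = torch.optim.AdamW(model.parameters(), lr=base_lr)
    # Linear warmup over the first few epochs, then hold the base LR
    scheduler = torch.optim.lr_scheduler.LambdaLR(
        optimizer, lambda epoch: min(1.0, (epoch + 1) / warmup_epochs))
    return criterion, optimizer, scheduler

# usage: criterion, optimizer, scheduler = build_training(PoseLSTM())
```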
Augmentations:
- Random scaling (0.9-1.1)
- Random rotation (-15 to +15 degrees)
- Mixup (alpha=0.2)
- No horizontal flip (would change sign meaning)
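A rough sketch of these augmentations applied directly to the landmark sequences. Treating the rotation as an in-plane x/y rotation and the exact mixup formulation are my interpretation:

```python
# Sketch: pose-space augmentations on a landmark sequence of shape (frames, 42, 3).
import numpy as np

def augment(seq):
    seq = seq * np.random.uniform(0.9, 1.1)              # random scaling
    theta = np.radians(np.random.uniform(-15, 15))       # random in-plane rotation
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    seq[..., :2] = seq[..., :2] @ rot.T                   # rotate x/y, keep z as-is
    return seq                                            # note: no horizontal flip

def mixup(x1, y1, x2, y2, alpha=0.2):                     # y1/y2: one-hot (or smoothed) labels
    lam = np.random.beta(alpha, alpha)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2
```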
Results
Final model (v3-enhanced):
- Validation accuracy: ~40% (VideoMAE baseline) → 65%+
- Top-5 accuracy: ~90%
- Inference time: <50ms per video
- Model size: 86M → ~2M params (43x smaller)
Not state-of-the-art, but good enough for a demo app.
Cost Breakdown
For one successful training run:
- g6.12xlarge spot: ~$1.72/hr x 2 hours = ~$3.44
- S3 storage (checkpoints + data): ~$0.50/month
- Total per experiment: under $5
I ran maybe 15-20 experiments total during development. Spot instances saved me hundreds of dollars.
What I'd Do Differently
- Start with simpler models first (I wasted time on VideoMAE)
- More aggressive data augmentation earlier
- Consider synthetic data generation for underrepresented classes
- Set up proper MLflow experiment tracking from day one
What's Next
This post covered the ML training part. Next, I'll write about deploying the model and building the app:
User signs → Webcam capture → MediaPipe → Model inference → Predicted gloss → TTS → Audio output
Part 2 will cover the AI engineering side - FastAPI server, EKS deployment, and how it all connects.
