
[AI-Powered ASL Communication App] - Part 1: Cost-Effective Sign Detection Model Training on AWS EKS

Recently I had an idea for an app that helps with ASL sign communication, and I wanted to experiment and see if it's feasible. But first, I needed a model that can detect signs well enough. I'm not an ML scientist - more of an AI engineer - so ML training is something I'm still learning.

I started by researching existing models. Couldn't find a ready-to-use one that works for my use case:

  • Google SignGemma - Not released to public yet
  • SignLLM - Interesting approach but designed for a different workflow (sign-to-text translation vs real-time detection)

Most existing solutions focus on fingerspelling (alphabet) rather than word-level signs. So I explored training my own model with an ASL dataset.

Challenges

Three things I needed to figure out:

  • No suitable existing model - Nothing I could just download and use for word-level ASL detection
  • Dataset selection - Which dataset has enough samples per class to train reliably?
  • Training approach - How to get effective results without massive compute?

Dataset Selection: Why WLASL-100

I evaluated several datasets:

  • MS-ASL - Looks promising but requires downloading from YouTube. Many videos are now unavailable. Gave up.
  • WLASL-2000 - 2000 classes but the class-to-sample ratio is terrible. Some signs have only 3-5 videos. Not enough for training.
  • WLASL-100 - 100 classes with more samples per class. Better balance.

Went with WLASL-100 (Word-Level American Sign Language). The tradeoff: smaller vocabulary but more reliable training.

Model Selection: Why Pose LSTM (Not VideoMAE or VLMs)

I started with VideoMAE - it seemed like the obvious choice for video understanding. But:

  • 86M parameters is overkill for ~750 training samples
  • Fine-tuning took forever and the results were mediocre (~40% accuracy)
  • Inference was slow for real-time detection

Then I tried a simpler approach: extract hand landmarks with MediaPipe and feed them into a lightweight LSTM. This made sense because:

  • ASL signs are primarily about hand positions and movements
  • MediaPipe gives us 21 landmarks per hand (x, y, z coords) x 2 hands = 126 features (see the sketch after this list)
  • Much smaller input than raw video frames
  • Can focus the model on what actually matters
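
To make that concrete, here's a minimal sketch of turning a single video frame into the 126-feature vector using the classic MediaPipe Hands solution. This isn't my exact preprocessing code - the zero-padding for a missing hand and the function name are just illustrative:

import numpy as np
import mediapipe as mp

# One 126-dim vector per frame: 21 landmarks x 3 coords x up to 2 hands
# (missing hands are left as zeros - an assumption for this sketch)
mp_hands = mp.solutions.hands.Hands(static_image_mode=False, max_num_hands=2)

def frame_to_features(rgb_frame):
    features = np.zeros(126, dtype=np.float32)
    results = mp_hands.process(rgb_frame)  # expects an RGB numpy array
    if results.multi_hand_landmarks:
        for i, hand in enumerate(results.multi_hand_landmarks[:2]):
            coords = [(lm.x, lm.y, lm.z) for lm in hand.landmark]  # 21 x 3
            features[i * 63:(i + 1) * 63] = np.asarray(coords).ravel()
    return features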

Model Experiments: The Failures Matter

Tried several approaches before finding what works:

Model                   Params   Accuracy   What Happened
VideoMAE (fine-tuned)   86M      ~40%       Too heavy, slow inference
Pose LSTM v1            ~2M      51.52%     Decent baseline
Pose LSTM v2            14M      0.61%      Massive overfitting
Pose LSTM v2-lite       ~1M      58.79%     Stripped down, worked
Pose LSTM v3            4.6M     1.82%      FocalLoss + too many params
Pose LSTM v3-enhanced   ~2M      65%+       Final model

Key learnings:

  • Bigger model != better results (especially with ~750 training samples)
  • 14M params on 750 samples = disaster
  • MediaPipe hand landmarks + BiLSTM + attention pooling = sweet spot
  • Label smoothing and mixup augmentation helped (quick sketch below)
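
For reference, this is roughly what those two tricks look like in PyTorch - a hedged sketch rather than my actual training loop, and the helper names (mixup, mixup_loss) are made up:

import torch
import torch.nn.functional as F

# Mixup: blend random pairs of landmark sequences and train on both labels
def mixup(x, y, alpha=0.2):
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(x.size(0))
    return lam * x + (1 - lam) * x[perm], y, y[perm], lam

# Label smoothing comes for free via F.cross_entropy's label_smoothing arg
def mixup_loss(logits, y_a, y_b, lam, smoothing=0.1):
    return (lam * F.cross_entropy(logits, y_a, label_smoothing=smoothing)
            + (1 - lam) * F.cross_entropy(logits, y_b, label_smoothing=smoothing))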

AWS Infrastructure: EKS + Spot Instances

Here's what I set up:

Cluster Config:

  • AWS EKS in us-east-1
  • Node group: g6.12xlarge (4x L4 GPUs each, 24GB VRAM per GPU)
  • Used spot instances - ~70% cost savings (~$1.72/hr vs $5.67/hr on-demand)
  • Training time: ~1.5-2 hours per run

Why g6.12xlarge:

  • L4 GPUs are newer and cheaper than older V100s
  • 4 GPUs per instance = can run multi-GPU DDP training
  • Spot availability is good for this instance type (see the price-check sketch below)
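
If you want to sanity-check current spot pricing before launching a run, a quick boto3 call does it. This is just a convenience sketch, not part of the training code:

import boto3
from datetime import datetime, timedelta, timezone

# Recent spot prices for g6.12xlarge in us-east-1 (the setup used in this post)
ec2 = boto3.client("ec2", region_name="us-east-1")
history = ec2.describe_spot_price_history(
    InstanceTypes=["g6.12xlarge"],
    ProductDescriptions=["Linux/UNIX"],
    StartTime=datetime.now(timezone.utc) - timedelta(hours=6),
)
for price in history["SpotPriceHistory"][:5]:
    print(price["AvailabilityZone"], price["SpotPrice"])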

Checkpoint Strategy for Spot Instances

Spot instances can be terminated anytime. My solution:

import os
import boto3
import torch

# Save a checkpoint every epoch and push it to S3 right away
# (helper bodies below are illustrative; assumes trainer exposes .model)
class CheckpointCallback:
    def __init__(self, s3_bucket, s3_prefix):
        self.s3_bucket = s3_bucket
        self.s3_prefix = s3_prefix
        self.s3 = boto3.client("s3")

    def on_epoch_end(self, trainer, epoch, metrics):
        # Save locally, then upload to S3 immediately
        checkpoint_path = self._save_checkpoint(trainer, epoch)
        self._upload_to_s3(checkpoint_path)

    def _save_checkpoint(self, trainer, epoch):
        path = f"checkpoint-epoch-{epoch:03d}.pt"
        torch.save({"epoch": epoch, "model_state": trainer.model.state_dict()}, path)
        return path

    def _upload_to_s3(self, path):
        key = f"{self.s3_prefix}/{os.path.basename(path)}"
        self.s3.upload_file(path, self.s3_bucket, key)

Key points:

  • Checkpoint every epoch (~5-10 min intervals)
  • Upload to S3 immediately after each checkpoint
  • Max loss on spot interruption: one epoch of training
  • Resume from S3 checkpoint on new instance (sketched below)
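
Resuming is just "find the newest checkpoint under the run prefix and load it". A rough sketch - the checkpoint keys match the callback above, but treat the details as illustrative:

import boto3
import torch

s3 = boto3.client("s3")
bucket, prefix = "asl-model-checkpoints", "runs/2025-12-26-v3-enhanced/"

# Pick the most recently written checkpoint (assumes at least one exists)
objects = s3.list_objects_v2(Bucket=bucket, Prefix=prefix).get("Contents", [])
latest = max(objects, key=lambda o: o["LastModified"])
s3.download_file(bucket, latest["Key"], "resume.pt")

state = torch.load("resume.pt", map_location="cpu")
start_epoch = state["epoch"] + 1  # continue where the interrupted run stopped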

S3 bucket structure:

s3://asl-model-checkpoints/
  └── runs/
      └── 2025-12-26-v3-enhanced/
          ├── checkpoint-epoch-001/
          ├── checkpoint-epoch-002/
          └── best/

How I Train

The actual training flow:

  1. Data preprocessing

    • Download WLASL videos
    • Extract 32 uniformly sampled frames per video
    • Run MediaPipe to get hand landmarks (21 points x 2 hands x 3 coords = 126 features)
    • Compute velocity features (frame-to-frame differences)
  2. Model architecture (v3-enhanced) - sketched in code after this list

    • Input: 126 features (hand landmarks)
    • BiLSTM with 3 layers, hidden_dim=384
    • Attention pooling (learns which frames matter)
    • Classifier head with layer norm + dropout
  3. Training config

   batch_size = 32
   learning_rate = 1e-3  # with warmup
   epochs = 50
   early_stopping_patience = 10
   label_smoothing = 0.1
  4. Augmentations
    • Random scaling (0.9-1.1)
    • Random rotation (-15 to +15 degrees)
    • Mixup (alpha=0.2)
    • No horizontal flip (would change sign meaning)
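
To make item 2 concrete, here's a hypothetical PyTorch sketch of the v3-enhanced idea: a BiLSTM over the 126 landmark features, attention pooling over the 32 frames, and a LayerNorm + dropout classifier head. The exact layer sizes and dropout values here are guesses, not the real code:

import torch
import torch.nn as nn

class PoseBiLSTM(nn.Module):
    def __init__(self, in_dim=126, hidden_dim=384, num_layers=3, num_classes=100):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden_dim, num_layers=num_layers,
                            batch_first=True, bidirectional=True, dropout=0.3)
        self.attn = nn.Linear(2 * hidden_dim, 1)  # scores each frame
        self.head = nn.Sequential(
            nn.LayerNorm(2 * hidden_dim),
            nn.Dropout(0.3),
            nn.Linear(2 * hidden_dim, num_classes),
        )

    def forward(self, x):                 # x: (batch, 32 frames, 126 features)
        h, _ = self.lstm(x)               # (batch, 32, 2 * hidden_dim)
        weights = torch.softmax(self.attn(h), dim=1)  # which frames matter
        pooled = (weights * h).sum(dim=1)             # attention pooling
        return self.head(pooled)          # (batch, num_classes)

# Example: a batch of 8 clips, 32 frames each, 126 landmark features per frame
logits = PoseBiLSTM()(torch.randn(8, 32, 126))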

Results

Final model (v3-enhanced):

  • Validation accuracy: 65%+ (up from ~40% with the VideoMAE baseline)
  • Top-5 accuracy: ~90%
  • Inference time: <50ms per video
  • Model size: 86M → ~2M params (43x smaller)

Not state-of-the-art, but good enough for a demo app.

Cost Breakdown

For one successful training run:

  • g6.12xlarge spot: ~$1.72/hr x 2 hours = ~$3.44
  • S3 storage (checkpoints + data): ~$0.50/month
  • Total per experiment: under $5

I ran maybe 15-20 experiments total during development. Spot instances saved me hundreds of dollars.

What I'd Do Differently

  • Start with simpler models first (I wasted time on VideoMAE)
  • More aggressive data augmentation earlier
  • Consider synthetic data generation for underrepresented classes
  • Set up proper MLflow experiment tracking from day one

What's Next

This post covered the ML training part. Next, I'll write about deploying the model and building the app:

User signs → Webcam capture → MediaPipe → Model inference → Predicted gloss → TTS → Audio output

Part 2 will cover the AI engineering side - FastAPI server, EKS deployment, and how it all connects.
