Recently I had an idea for an app that helps with ASL sign communication, and I wanted to experiment and see if it's feasible. But first, I needed a model that could detect signs well enough. I'm not an ML scientist - more of an AI engineer - so ML training is something I'm still learning.
I started by researching existing models. Couldn't find a ready-to-use one that works for my use case:
- Google SignGemma - Not released to public yet
- SignLLM - Interesting approach but designed for a different workflow (sign-to-text translation vs real-time detection)
Most existing solutions focus on fingerspelling (alphabet) rather than word-level signs. So I explored training my own model with an ASL dataset.
Challenges
Three things I needed to figure out:
- No suitable existing model - Nothing I could just download and use for word-level ASL detection
- Dataset selection - Which dataset has enough samples per class to train reliably?
- Training approach - How to get effective results without massive compute?
Dataset Selection: Why WLASL-100
I evaluated several datasets:
- MS-ASL - Looks promising but requires downloading from YouTube. Many videos are now unavailable. Gave up.
- WLASL-2000 - 2000 classes but the class-to-sample ratio is terrible. Some signs have only 3-5 videos. Not enough for training.
- WLASL-100 - 100 classes with more samples per class. Better balance.
Went with WLASL-100 (Word-Level American Sign Language). The tradeoff: smaller vocabulary but more reliable training.
Model Selection: Why Pose LSTM (Not VideoMAE or VLMs)
I started with VideoMAE - it seemed like a good choice for video understanding. But:
- 86M parameters is overkill for ~750 training samples
- Fine-tuning took forever, results were mediocre (~40%)
- Inference was slow for real-time detection
Then I tried a simpler approach: extract hand landmarks with MediaPipe, feed into a lightweight LSTM. This made sense because:
- ASL signs are primarily about hand positions and movements
- MediaPipe gives us 21 landmarks per hand with x, y, z coords, so 2 hands x 21 x 3 = 126 features
- Much smaller input than raw video frames
- Can focus the model on what actually matters
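To make that concrete, here's a minimal sketch of turning one frame into the 126-dimensional feature vector with MediaPipe's Hands solution. The zero-padding for missing hands and the hand ordering are my simplifications, not necessarily what the real pipeline does.

```python
# Sketch: one RGB frame -> 126 features (2 hands x 21 landmarks x 3 coords).
# Missing hands are zero-padded; hand ordering is whatever MediaPipe returns.
import numpy as np
import mediapipe as mp

hands = mp.solutions.hands.Hands(static_image_mode=False, max_num_hands=2)

def frame_to_features(rgb_frame):
    result = hands.process(rgb_frame)                    # expects an RGB numpy array
    features = np.zeros((2, 21, 3), dtype=np.float32)
    if result.multi_hand_landmarks:
        for i, hand in enumerate(result.multi_hand_landmarks[:2]):
            features[i] = [(lm.x, lm.y, lm.z) for lm in hand.landmark]
    return features.reshape(-1)                          # shape: (126,)
```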
Model Experiments: The Failures Matter
Tried several approaches before finding what works:
| Model | Params | Result | What Happened |
|---|---|---|---|
| VideoMAE (fine-tuned) | 86M | ~40% | Too heavy, slow inference |
| Pose LSTM v1 | ~2M | 51.52% | Decent baseline |
| Pose LSTM v2 | 14M | 0.61% | Massive overfitting |
| Pose LSTM v2-lite | ~1M | 58.79% | Stripped down, worked |
| Pose LSTM v3 | 4.6M | 1.82% | FocalLoss + too many params |
| Pose LSTM v3-enhanced | ~2M | 65%+ | Final model |
Key learnings:
- Bigger model != better results (especially with ~750 training samples)
- 14M params on 750 samples = disaster
- MediaPipe hand landmarks + BiLSTM + attention pooling = sweet spot
- Label smoothing and mixup augmentation helped
AWS Infrastructure: EKS + Spot Instances
Here's what I set up:
Cluster Config:
- AWS EKS in us-east-1
- Node group: g6.12xlarge (4x L4 GPUs each, 24GB VRAM per GPU)
- Used spot instances - ~70% cost savings (~$1.72/hr vs $5.67/hr on-demand)
- Training time: ~1.5-2 hours per run
Why g6.12xlarge:
- L4 GPUs are newer and cheaper than older V100s
- 4 GPUs per instance = can run multi-GPU DDP training
- Spot availability is good for this instance type
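With 4 GPUs on the box, training can run as standard PyTorch DDP with one process per GPU. A minimal sketch of the setup, assuming the script is launched with torchrun; the actual training script may wire this up differently:

```python
# Sketch: one training process per L4 GPU via PyTorch DDP.
# Assumes launch with `torchrun --nproc_per_node=4 train.py`.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_ddp(model):
    dist.init_process_group(backend="nccl")          # reads rank/world size from torchrun env vars
    local_rank = int(os.environ["LOCAL_RANK"])       # set by torchrun
    torch.cuda.set_device(local_rank)
    model = model.cuda(local_rank)
    return DDP(model, device_ids=[local_rank]), local_rank
```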
Checkpoint Strategy for Spot Instances
Spot instances can be terminated anytime. My solution:
```python
# Save a checkpoint every epoch and upload it to S3 immediately
import os
import boto3

class CheckpointCallback:
    def __init__(self, s3_bucket, s3_prefix):
        self.s3_bucket = s3_bucket
        self.s3_prefix = s3_prefix
        self.s3 = boto3.client("s3")

    def on_epoch_end(self, trainer, epoch, metrics):
        # Save locally (model/optimizer state -> local file)
        checkpoint_path = self._save_checkpoint(trainer, epoch)
        # Upload to S3 immediately so a spot interruption costs at most one epoch
        self._upload_to_s3(checkpoint_path)

    def _upload_to_s3(self, path):
        key = f"{self.s3_prefix}/{os.path.basename(path)}"
        self.s3.upload_file(path, self.s3_bucket, key)
```
Key points:
- Checkpoint every epoch (~5-10 min intervals)
- Upload to S3 immediately after each checkpoint
- Max loss on spot interruption: one epoch of training
- Resume from S3 checkpoint on new instance
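The flip side of the checkpoint callback is the resume path: a replacement node pulls the newest checkpoint from S3 before training continues. A rough sketch; the boto3 usage and the `resume_from_s3` helper name are my own, not the actual code:

```python
# Sketch: on a fresh spot node, pull the newest checkpoint from S3 and resume.
import boto3
import torch

def resume_from_s3(bucket, prefix, local_path="resume.pt"):
    s3 = boto3.client("s3")
    objects = s3.list_objects_v2(Bucket=bucket, Prefix=prefix).get("Contents", [])
    if not objects:
        return None                                        # first run, nothing to resume
    latest = max(objects, key=lambda obj: obj["LastModified"])
    s3.download_file(bucket, latest["Key"], local_path)
    return torch.load(local_path, map_location="cpu")      # model/optimizer state dict
```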
S3 bucket structure:
```
s3://asl-model-checkpoints/
└── runs/
    └── 2025-12-26-v3-enhanced/
        ├── checkpoint-epoch-001/
        ├── checkpoint-epoch-002/
        └── best/
```
How I Train
The actual training flow:
Data preprocessing:
- Download WLASL videos
- Extract 32 frames uniformly sampled
- Run MediaPipe to get hand landmarks (21 points x 2 hands x 3 coords = 126 features)
- Compute velocity features (frame-to-frame differences)
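Roughly, the sampling and velocity steps look like this. Frame decoding is assumed to happen elsewhere (e.g. with OpenCV), and how pose and velocity are combined downstream is left to the model:

```python
# Sketch: uniformly sample 32 frames, extract landmarks, add velocity features.
# Reuses frame_to_features() from the earlier MediaPipe sketch.
import numpy as np

NUM_FRAMES = 32

def preprocess(frames):                                  # frames: list of RGB numpy arrays
    idx = np.linspace(0, len(frames) - 1, NUM_FRAMES).astype(int)
    pose = np.stack([frame_to_features(frames[i]) for i in idx])   # (32, 126)
    velocity = np.diff(pose, axis=0, prepend=pose[:1])             # frame-to-frame deltas, first row 0
    return pose, velocity
```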
Model architecture (v3-enhanced):
- Input: 126 features (hand landmarks)
- BiLSTM with 3 layers, hidden_dim=384
- Attention pooling (learns which frames matter)
- Classifier head with layer norm + dropout
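Here's a minimal PyTorch sketch of that architecture. The dimensions follow the bullets above, but the attention formulation and the dropout values are my reading of it, not the exact code:

```python
# Sketch of v3-enhanced: BiLSTM -> attention pooling -> classifier head.
import torch
import torch.nn as nn

class PoseLSTM(nn.Module):
    def __init__(self, input_dim=126, hidden_dim=384, num_layers=3, num_classes=100):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, num_layers,
                            batch_first=True, bidirectional=True, dropout=0.3)
        self.attn = nn.Linear(2 * hidden_dim, 1)          # one score per frame
        self.head = nn.Sequential(
            nn.LayerNorm(2 * hidden_dim),
            nn.Dropout(0.3),
            nn.Linear(2 * hidden_dim, num_classes),
        )

    def forward(self, x):                                 # x: (batch, 32, 126)
        out, _ = self.lstm(x)                             # (batch, 32, 768)
        weights = torch.softmax(self.attn(out), dim=1)    # learns which frames matter
        pooled = (weights * out).sum(dim=1)               # weighted sum over time
        return self.head(pooled)                          # logits over 100 glosses
```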
Training config:
```python
batch_size = 32
learning_rate = 1e-3  # with warmup
epochs = 50
early_stopping_patience = 10
label_smoothing = 0.1
```
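In PyTorch terms, that config maps onto pieces like these. The optimizer choice (AdamW) and the 5-epoch linear warmup are assumptions on my part:

```python
# Sketch: how the config above maps onto PyTorch pieces.
import torch
import torch.nn as nn

def build_training(model, base_lr=1e-3, warmup_epochs=5):
    criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
    optimizer = torch.optim.AdamW(model.parameters(), lr=base_lr)
    # Linear warmup over the first few epochs, then hold the base LR
    scheduler = torch.optim.lr_scheduler.LambdaLR(
        optimizer, lambda epoch: min(1.0, (epoch + 1) / warmup_epochs))
    return criterion, optimizer, scheduler

# usage: criterion, optimizer, scheduler = build_training(PoseLSTM())
```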
Augmentations:
- Random scaling (0.9-1.1)
- Random rotation (-15 to +15 degrees)
- Mixup (alpha=0.2)
- No horizontal flip (would change sign meaning)
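A rough sketch of these augmentations applied directly to the landmark sequences. Treating the rotation as an in-plane x/y rotation and the exact mixup formulation are my interpretation:

```python
# Sketch: pose-space augmentations on a landmark sequence of shape (frames, 42, 3).
import numpy as np

def augment(seq):
    seq = seq * np.random.uniform(0.9, 1.1)              # random scaling
    theta = np.radians(np.random.uniform(-15, 15))       # random in-plane rotation
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    seq[..., :2] = seq[..., :2] @ rot.T                   # rotate x/y, keep z as-is
    return seq                                            # note: no horizontal flip

def mixup(x1, y1, x2, y2, alpha=0.2):                     # y1/y2: one-hot (or smoothed) labels
    lam = np.random.beta(alpha, alpha)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2
```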
Results
Final model (v3-enhanced):
- Validation accuracy: ~40% (VideoMAE baseline) → 65%+
- Top-5 accuracy: ~90%
- Inference time: <50ms per video
- Model size: 86M → ~2M params (43x smaller)
Not state-of-the-art, but good enough for a demo app.
Cost Breakdown
For one successful training run:
- g6.12xlarge spot: ~$1.72/hr x 2 hours = ~$3.44
- S3 storage (checkpoints + data): ~$0.50/month
- Total per experiment: under $5
I ran maybe 15-20 experiments total during development. Spot instances saved me hundreds of dollars.
What I'd Do Differently
- Start with simpler models first (I wasted time on VideoMAE)
- More aggressive data augmentation earlier
- Consider synthetic data generation for underrepresented classes
- Set up proper MLflow experiment tracking from day one
What's Next
This post covered the ML training part. Next, I'll write about deploying the model and building the app:
User signs → Webcam capture → MediaPipe → Model inference → Predicted gloss → TTS → Audio output
Part 2 will cover the AI engineering side - FastAPI server, EKS deployment, and how it all connects.
