This research was published at IEEE MPSec ICETA 2025. Here's the full story behind it.
Why Gujarati Sign Language?
Most sign language research focuses on American Sign Language (ASL) or Indian Sign Language (ISL). Gujarati sign language — used by the deaf and hard of hearing community across Gujarat — has almost no published AI research.
That felt like a problem worth solving.
The goal was straightforward: build a system that could recognize 34 Gujarati consonant letters from hand gesture images in real time. But before writing a single line of model code, there was a more fundamental problem to solve — the dataset didn't exist.
Building the Dataset from Scratch
There were no publicly available Gujarati sign language image datasets. So I built one.
How the dataset was collected:
- 34 folders — one for each Gujarati consonant (ક, ખ, ગ... all 34)
- 110 images per consonant, captured manually
- Multiple individuals, varied backgrounds, different lighting conditions
- Multiple hand orientations and rotation angles
This diversity was intentional. A model trained only on images from one person in one lighting condition fails the moment someone else uses it. Real-world robustness had to be baked in from the start.
Data augmentation was applied to further expand the dataset — random rotations, flips, scaling, and noise addition — bringing the final count to:
- Training set: 3,798 images
- Testing set: 749 images
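As a rough illustration of the augmentation step, here is a minimal NumPy sketch of random rotation, horizontal flip, and noise addition. The rotation range, flip probability, and noise level are assumptions chosen for the example, not the values used in the project, and scaling is left out for brevity.

```python
import numpy as np

def rotate_nearest(image, angle_deg):
    """Rotate an H x W (x C) image about its centre with nearest-neighbour sampling."""
    h, w = image.shape[:2]
    theta = np.deg2rad(angle_deg)
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    ys, xs = np.mgrid[0:h, 0:w]
    # Inverse mapping: for each output pixel, find the source pixel it came from.
    src_x = np.cos(theta) * (xs - cx) + np.sin(theta) * (ys - cy) + cx
    src_y = -np.sin(theta) * (xs - cx) + np.cos(theta) * (ys - cy) + cy
    src_x = np.rint(src_x).astype(int)
    src_y = np.rint(src_y).astype(int)
    inside = (src_x >= 0) & (src_x < w) & (src_y >= 0) & (src_y < h)
    out = np.zeros_like(image)          # pixels with no source stay black
    out[inside] = image[src_y[inside], src_x[inside]]
    return out

def augment(image, rng):
    """Return one randomly augmented copy of a float image with values in [0, 1]."""
    out = rotate_nearest(image, rng.uniform(-15, 15))   # assumed rotation range
    if rng.random() < 0.5:                              # assumed flip probability
        out = out[:, ::-1]
    out = out + rng.normal(0.0, 0.02, size=out.shape)   # assumed noise level
    return np.clip(out, 0.0, 1.0)

rng = np.random.default_rng(0)
image = rng.random((64, 64, 3))     # random stand-in for a hand image
augmented = augment(image, rng)
print(augmented.shape)
```

In practice a library routine with interpolation would give smoother rotations; the hand-written version is used here only to keep the sketch dependency-free.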
The Preprocessing Pipeline
Raw images can't go directly into a model. Several preprocessing steps were applied:
Noise reduction using Gaussian blur and median filtering — phone cameras in varying lighting produce a lot of visual noise that confuses models.
Hand segmentation using contour detection and skin-tone masking — isolating just the hand region and removing background clutter.
Normalization — all pixel values scaled to [0, 1] and all images resized to 256×256 pixels for consistency across models.
PCA (Principal Component Analysis) — for traditional ML models (SVM, LR, Random Forest, ANN), image data is extremely high-dimensional. PCA reduced this dimensionality while retaining 95% of the data's variance. This made training faster and reduced overfitting risk significantly.
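To make the "retain 95% of the variance" idea concrete, here is a minimal NumPy sketch of PCA via singular value decomposition. The function names and the synthetic data are hypothetical and only stand in for flattened image vectors; this is not the project's actual code.

```python
import numpy as np

def pca_fit(X, variance_to_keep=0.95):
    """Fit PCA on the rows of X, keeping the fewest components that reach the target."""
    mean = X.mean(axis=0)
    centered = X - mean
    # Rows of vt are principal directions; squared singular values are
    # proportional to the variance explained by each direction.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    explained = singular_values ** 2
    cumulative = np.cumsum(explained) / explained.sum()
    k = int(np.searchsorted(cumulative, variance_to_keep) + 1)
    return mean, vt[:k]

def pca_transform(X, mean, components):
    """Project rows of X onto the retained principal directions."""
    return (X - mean) @ components.T

# Synthetic stand-in: 200 samples with 400 features driven by 5 hidden factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 5))
mixing = rng.normal(size=(5, 400))
X = latent @ mixing + 0.01 * rng.normal(size=(200, 400))

mean, components = pca_fit(X)
Z = pca_transform(X, mean, components)
print(X.shape, "->", Z.shape)
```

The mean and components should be fitted on the training set only and then reused for the test set, so no information from test images leaks into the projection.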
The 7 Models — What Each One Does and Why It Was Included
Rather than jumping straight to deep learning, I wanted to understand how traditional ML compares to modern architectures on this specific problem. So I tested all of them.
1. Support Vector Machine (SVM) with PCA
SVM finds the optimal boundary (hyperplane) that separates different classes with the maximum margin. For image data, the RBF (Radial Basis Function) kernel was used to capture non-linear relationships between classes. Hyperparameter tuning via grid search optimized the regularization and kernel parameters.
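A minimal scikit-learn sketch of this setup could look as follows. It uses the library's bundled digits dataset as a stand-in for the gesture images, and the grid values for `C` and `gamma` are assumptions; the actual search ranges are not stated here.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Stand-in data: 8x8 digit images, NOT the Gujarati gesture dataset.
X, y = load_digits(return_X_y=True)
X = X / 16.0  # digits pixels range over 0..16; scale them to [0, 1]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

pipeline = Pipeline([
    # A float n_components keeps enough components for 95% of the variance.
    ("pca", PCA(n_components=0.95, svd_solver="full")),
    ("svm", SVC(kernel="rbf")),
])

# Assumed grid: regularization strength C and RBF kernel width gamma.
param_grid = {
    "svm__C": [1, 10, 100],
    "svm__gamma": ["scale", 0.01, 0.1],
}

search = GridSearchCV(pipeline, param_grid, cv=3)
search.fit(X_train, y_train)

print("best parameters:", search.best_params_)
print("test accuracy:", round(search.score(X_test, y_test), 3))
```

Putting PCA inside the pipeline means it is refitted on each cross-validation fold, so the grid search never sees validation data during the dimensionality reduction.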
2. Logistic Regression (LR) with PCA
The simplest model in the comparison — a probabilistic linear classifier. L2 regularization was applied to prevent overfitting. Included as the baseline: if even logistic regression performs well, the problem might not need deep learning at all.
3. Random Forest
An ensemble method that trains multiple decision trees on random data subsets and aggregates their votes. 50–200 trees were tested, with max depth limited to 10 to prevent overfitting. Feature importance analysis was performed post-training to understand which image features mattered most.
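The tree-count sweep and the feature importance check can be sketched with scikit-learn as below. The bundled digits dataset is a stand-in for the gesture images, and the intermediate value of 100 trees is an assumption; only the 50 to 200 range and the depth limit of 10 come from the description above.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Stand-in data: 8x8 digit images, NOT the Gujarati gesture dataset.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

accuracy_by_trees = {}
for n_trees in (50, 100, 200):  # 100 is an assumed intermediate value
    forest = RandomForestClassifier(
        n_estimators=n_trees,
        max_depth=10,       # depth cap to limit overfitting
        random_state=0,
    )
    forest.fit(X_train, y_train)
    accuracy_by_trees[n_trees] = forest.score(X_test, y_test)

# Importances of the last fitted forest; here each feature is one pixel.
importances = forest.feature_importances_
top_features = np.argsort(importances)[::-1][:5]

print(accuracy_by_trees)
print("most important feature indices:", top_features)
```

On raw pixels the importances show which image regions drive the decision; on PCA-reduced inputs they would rank principal components instead.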
4. Artificial Neural Network (ANN)
A feedforward network with two hidden layers (128 and 64 neurons, ReLU activation) trained on PCA-reduced features. Adam optimizer, 50 epochs, batch size 32, dropout regularization at 0.3 to prevent overfitting.
5. Convolutional Neural Network (CNN)
Custom CNN built from scratch — three convolutional layers with 3×3 filters, max-pooling, and fully connected layers with softmax output. Unlike the models above, CNN works directly on raw image pixels and learns spatial features automatically.
6. ResNet50
A 50-layer deep network with residual (skip) connections that mitigate the vanishing gradient problem in very deep networks. Transfer learning was used: the network was pretrained on ImageNet, and its top layers were replaced for 34-class Gujarati classification. Lower layers were frozen initially, then gradually unfrozen during fine-tuning.
7. VGG19
A 19-layer network known for its uniform architecture of sequential 3×3 convolutional layers. The fully connected layers were replaced with a Global Average Pooling (GAP) layer to reduce parameters and improve generalization. L2 regularization (weight decay 0.0001) was applied throughout.
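To see why swapping the fully connected layers for Global Average Pooling reduces parameters so sharply, here is a small worked calculation. It assumes the standard VGG19 head (a flattened 7×7×512 feature map at 224×224 input, followed by two 4096-unit layers) and a GAP head that feeds the 34-way softmax directly; the exact head used in the project may differ.

```python
def dense_params(inputs, outputs):
    """Weights plus biases of one fully connected layer."""
    return inputs * outputs + outputs

NUM_CLASSES = 34           # Gujarati consonants
LAST_CONV_CHANNELS = 512   # channels produced by VGG19's final conv block

# Standard-style head: flatten 7x7x512, two 4096-unit layers, then the classifier.
flattened = 7 * 7 * LAST_CONV_CHANNELS
dense_head = (
    dense_params(flattened, 4096)
    + dense_params(4096, 4096)
    + dense_params(4096, NUM_CLASSES)
)

# GAP head (assumed): average each feature map to one number, then classify.
gap_head = dense_params(LAST_CONV_CHANNELS, NUM_CLASSES)

print(f"dense head: {dense_head:,} parameters")
print(f"GAP head:   {gap_head:,} parameters")
```

With a few thousand training images, a head with far fewer trainable parameters has much less capacity to memorize the training set, which is consistent with the generalization argument above.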
The Results — What Actually Happened
| Model | Training Accuracy | Testing Accuracy |
|---|---|---|
| SVM + PCA | 93.60% | 87.97% |
| Logistic Regression + PCA | 92.37% | 82.38% |
| Random Forest | 88.44% | 88.44% |
| ANN | 90.48% | 79.77% |
| VGG19 | 94.28% | 87.23% |
| CNN | 73.53% | 64.21% |
| ResNet50 | 54.29% | — |
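The train-test gaps can be derived directly from the table. The short sketch below only recomputes them from the reported figures; the variable and function names are illustrative.

```python
# (training accuracy %, testing accuracy %) as reported in the table above.
results = {
    "SVM + PCA": (93.60, 87.97),
    "Logistic Regression + PCA": (92.37, 82.38),
    "Random Forest": (88.44, 88.44),
    "ANN": (90.48, 79.77),
    "VGG19": (94.28, 87.23),
    "CNN": (73.53, 64.21),
    "ResNet50": (54.29, None),  # no testing accuracy reported
}

def train_test_gap(train, test):
    """Gap in percentage points, or None when no test score was reported."""
    return None if test is None else round(train - test, 2)

gaps = {name: train_test_gap(*scores) for name, scores in results.items()}

# Model with the highest reported testing accuracy.
best_on_test = max(
    (name for name, (_, test) in results.items() if test is not None),
    key=lambda name: results[name][1],
)

for name, gap in gaps.items():
    shown = "n/a" if gap is None else f"{gap:.2f}"
    print(f"{name:<26} gap = {shown}")
print("Highest testing accuracy:", best_on_test)
```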
Breaking Down the Surprises
VGG19 led on training accuracy, but not on testing
VGG19 achieved the highest training accuracy (94.28%) and strong testing accuracy (87.23%). Its deep, uniform architecture is well suited to extracting hierarchical image features (edges → shapes → gesture patterns), which is exactly what hand gesture recognition needs.
But its lead over SVM was small: 0.68 percentage points on training accuracy (94.28% vs. 93.60%), and on testing accuracy both Random Forest (88.44%) and SVM (87.97%) finished slightly ahead of it.
SVM was the real surprise
SVM with PCA achieved 87.97% testing accuracy, 0.74 percentage points above VGG19's 87.23%, while being dramatically simpler to train and computationally cheaper. For a traditional ML algorithm working on hand gesture images, this is impressive.
Why did it work so well? PCA removed noise from the high-dimensional image data, giving SVM a clean, reduced feature space where it could find clear decision boundaries. The RBF kernel then handled the non-linear separability of different hand shapes.
Lesson: don't underestimate classical ML when preprocessing is done well.
Random Forest was the most consistent
Random Forest showed identical training and testing accuracy (88.44%), which was also the highest testing accuracy in the comparison. A zero train-test gap points to excellent generalization, with no measurable overfitting. For a production system where stability matters as much as peak accuracy, Random Forest is a strong choice.
CNN underperformed — and here's why
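The custom CNN trained from scratch achieved only 64.21% testing accuracy, the lowest reported testing accuracy in the comparison (ResNet50 did worse on training and has no reported test figure). This is counterintuitive, since CNNs are supposed to be great at images.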
The reason: CNNs need large datasets to learn meaningful features. With only 3,798 training images across 34 classes (about 112 images per class), the CNN simply didn't have enough data to learn properly. Compare this to VGG19, which came pretrained on millions of ImageNet images and only needed to fine-tune the top layers for our specific gestures.
Lesson: for small datasets, transfer learning usually beats training from scratch, though the ResNet50 result below shows it is not guaranteed.
ResNet50 failed completely
54.29% training accuracy with testing accuracy not even reportable. Transfer learning didn't help here — the pre-trained ImageNet weights didn't transfer well to this specific dataset, and the gradual unfreezing strategy wasn't enough to compensate.
This suggests ResNet50 needs either a much larger dataset or significantly different fine-tuning strategy for this task.
ANN gap between training and testing
ANN achieved 90.48% training accuracy but only 79.77% testing accuracy, the largest gap in the comparison. The 10.71-point drop indicates overfitting despite dropout regularization. The model learned the training data well but didn't generalize as effectively as simpler models like SVM or Random Forest.
What I Learned
1. Preprocessing mattered at least as much as model choice. With PCA-reduced features, SVM came within a point of VGG19 on training accuracy and edged past it on testing accuracy. Time spent on preprocessing returned more value than time spent on model architecture.
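2. Dataset size determines which models you can use. With a small dataset, the pretrained VGG19 beat the CNN trained from scratch, but transfer learning is not automatic: the pretrained ResNet50 fared worst of all. Start with pretrained models when data is limited, and expect to experiment with the fine-tuning strategy.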
3. Consistency beats peak performance. Random Forest showed no train-test gap, while ANN showed a gap of nearly 11 percentage points. In production, consistent performance is more valuable than occasional high accuracy.
4. Test everything. I went in assuming deep learning would dominate. The results showed classical ML (SVM) matched VGG19 on testing accuracy at a fraction of the complexity. That would have been missed if I'd only tested neural networks.
The Real-World Application
Beyond the model comparison, the system was integrated into a web application that converts recognized gestures into spoken output using Text-to-Speech (TTS). A user signs a Gujarati consonant in front of a camera — the model recognizes it — and the system speaks the corresponding letter aloud.
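The step between the classifier and the speech output can be sketched as a simple decoding function. Everything here is hypothetical: the class order (assumed to follow the traditional consonant order), the confidence threshold, and the function name are illustrative, and the project's actual 34 class labels may differ. The returned letter would then be passed to whichever TTS engine the web application uses.

```python
# Assumed class order: the 34 basic Gujarati consonants in traditional order.
CONSONANTS = list("કખગઘઙચછજઝઞટઠડઢણતથદધનપફબભમયરલવશષસહળ")

def decode_prediction(probabilities, threshold=0.6):
    """Map softmax outputs to a letter, or None when the model is not confident."""
    if len(probabilities) != len(CONSONANTS):
        raise ValueError("expected one probability per consonant class")
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    if probabilities[best] < threshold:   # assumed cut-off to avoid speaking guesses
        return None
    return CONSONANTS[best]

# Example: a confident prediction for class 0 and an uncertain, uniform one.
confident = [0.9] + [0.1 / 33] * 33
uncertain = [1 / 34] * 34
print(decode_prediction(confident))
print(decode_prediction(uncertain))
```

Returning None for low-confidence frames lets the application stay silent instead of speaking a wrong letter while the hand is still moving into position.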
The goal is to help bridge communication between Gujarati sign language users and those who don't know sign language, making everyday interactions in education, healthcare, and public services more accessible.
What's Next
The current system handles 34 consonants. The natural next step is extending it to:
- Gujarati vowels
- Common words and phrases
- Dynamic gestures (involving motion, not just static hand positions)
- Real-time mobile deployment
Temporal models such as LSTMs, or Vision Transformers extended to sequences of frames, could handle the dynamic gesture challenge; that's the direction future work is heading.
The Paper
This research was published at IEEE MPSec ICETA 2025 (International Conference on Emerging Technologies and Applications), Gwalior, India.
DOI: 10.1109/MPSecICETA64837.2025.11118850
Kaggle notebooks with the model implementations: kaggle.com/khushihpandya
I'm Khushi Pandya, a software engineer working on AI/ML systems and backend development. Find me on Dev.to | GitHub | Kaggle