DEV Community: Khushi

Integrating Typesense for Sub-Millisecond Search — Lessons from a Job Platform

Khushi — Sun, 21 Jun 2026 13:55:38 +0000

Search isn't "find the matching rows." It's "find the best result, even if the input isn't perfect."

What is Typesense?

Typesense is an open-source search engine — a lighter, simpler alternative to Elasticsearch. It's built for one thing: fast, typo-tolerant, faceted search.

Think of any e-commerce search bar that shows results instantly as you type, even with a typo. That's the experience Typesense is designed for.

A collection works like a table. Each record inside it is a JSON document. You define a schema (which fields are searchable, sortable, etc.), Typesense indexes everything, and queries return in milliseconds.

Collection: books
Document: { "title": "Atomic Habits", "author": "James Clear", "year": 2018 }

search "atomik" in "title" → still returns: Atomic Habits

That typo tolerance, plus filtering (narrow by exact criteria), faceting (show counts like "Fiction (120)" next to filters), and sorting — that's the whole toolkit.

Putting It to Work on a Job Platform

I used this on a real job platform — 6,000 jobs, 5,000 cities, 1 lakh+ candidate profiles. Employers search candidates by skills, experience, education, even resume content. Job seekers search jobs by keywords, salary, location.

The backend runs Frappe + MariaDB. Typesense lives on a separate server, connected purely over its REST API (API key + host). Collections mirror the database tables — jobs, candidates, departments — each row becoming a JSON document.

Typo tolerance is set to 3 characters (so "Bangalor" still finds "Bangalore"), combined with filtering, faceting, and sorting for the full search experience.

What Nobody Tells You

Typesense gives "good enough," not "exact." A recruiter searching for exactly 5 years experience might still see 4.5 or 5.5 years ranked close by, thanks to fuzzy relevance scoring. Great as a fallback — but precision-critical searches still need a path back to the database.

Schema changes aren't free. Adding a new field isn't like an SQL ALTER TABLE. It needs explicit API calls and sometimes re-indexing existing data. Plan your schema upfront.

One collection per table, or one unified collection? Mirroring the database was the easy starting point. But a single denormalized collection — where one document holds job + department + location together — could make cross-entity search ("Senior Backend Developer, Engineering, Ahmedabad") a single query instead of stitched-together results. Still evaluating the trade-off, since denormalizing means more update complexity when source data changes.

Real-time sync is the actual hard part. Every MariaDB write needs to reflect in Typesense, or search goes stale. This is less a Typesense problem and more a "what changed, and where does it need to propagate" coding problem — and it gets more tangled as more entities get linked together.

The Takeaway

The performance gain is real — sub-millisecond, typo-tolerant, faceted search across 1 lakh+ profiles isn't something plain SQL handles gracefully at this scale. Typesense does the heavy lifting. The actual engineering work is in the data flow around it: schema design, sync logic, and knowing when to fall back to the database for precision.

Resources: Typesense Docs · API Reference

I'm Khushi Pandya, a software engineer working on AI-driven backends and search infrastructure. Find me on Dev.to | GitHub | Kaggle

I Trained 7 ML Models on Gujarati Sign Language — Here's What Actually Worked

Khushi — Sat, 13 Jun 2026 14:10:04 +0000

This research was published at IEEE MPSec ICETA 2025. Here's the full story behind it.

Why Gujarati Sign Language?

Most sign language research focuses on American Sign Language (ASL) or Indian Sign Language (ISL). Gujarati sign language — used by the deaf and hard of hearing community across Gujarat — has almost no published AI research.

That felt like a problem worth solving.

The goal was straightforward: build a system that could recognize 34 Gujarati consonant letters from hand gesture images in real time. But before writing a single line of model code, there was a more fundamental problem to solve — the dataset didn't exist.

Building the Dataset from Scratch

There were no publicly available Gujarati sign language image datasets. So I built one.

How the dataset was collected:

34 folders — one for each Gujarati consonant (ક, ખ, ગ... all 34)
110 images per consonant, captured manually
Multiple individuals, varied backgrounds, different lighting conditions
Multiple hand orientations and rotation angles

This diversity was intentional. A model trained only on images from one person in one lighting condition fails the moment someone else uses it. Real-world robustness had to be baked in from the start.

Data augmentation was applied to further expand the dataset — random rotations, flips, scaling, and noise addition — bringing the final count to:

Training set: 3,798 images
Testing set: 749 images

The Preprocessing Pipeline

Raw images can't go directly into a model. Several preprocessing steps were applied:

Noise reduction using Gaussian blur and median filtering — phone cameras in varying lighting produce a lot of visual noise that confuses models.

Hand segmentation using contour detection and skin-tone masking — isolating just the hand region and removing background clutter.

Normalization — all pixel values scaled to [0, 1] and all images resized to 256×256 pixels for consistency across models.

PCA (Principal Component Analysis) — for traditional ML models (SVM, LR, Random Forest, ANN), image data is extremely high-dimensional. PCA reduced this dimensionality while retaining 95% of the data's variance. This made training faster and reduced overfitting risk significantly.

The 7 Models — What Each One Does and Why It Was Included

Rather than jumping straight to deep learning, I wanted to understand how traditional ML compares to modern architectures on this specific problem. So I tested all of them.

1. Support Vector Machine (SVM) with PCA

SVM finds the optimal boundary (hyperplane) that separates different classes with the maximum margin. For image data, the RBF (Radial Basis Function) kernel was used to capture non-linear relationships between classes. Hyperparameter tuning via grid search optimized the regularization and kernel parameters.

2. Logistic Regression (LR) with PCA

The simplest model in the comparison — a probabilistic linear classifier. L2 regularization was applied to prevent overfitting. Included as the baseline: if even logistic regression performs well, the problem might not need deep learning at all.

3. Random Forest

An ensemble method that trains multiple decision trees on random data subsets and aggregates their votes. 50–200 trees were tested, with max depth limited to 10 to prevent overfitting. Feature importance analysis was performed post-training to understand which image features mattered most.

4. Artificial Neural Network (ANN)

A feedforward network with two hidden layers (128 and 64 neurons, ReLU activation) trained on PCA-reduced features. Adam optimizer, 50 epochs, batch size 32, dropout regularization at 0.3 to prevent overfitting.

5. Convolutional Neural Network (CNN)

Custom CNN built from scratch — three convolutional layers with 3×3 filters, max-pooling, and fully connected layers with softmax output. Unlike the models above, CNN works directly on raw image pixels and learns spatial features automatically.

6. ResNet50

A 50-layer deep network with residual (skip) connections that solve the vanishing gradient problem in very deep networks. Transfer learning was used — pretrained on ImageNet, with top layers replaced for 34-class Gujarati classification. Lower layers were frozen initially, then gradually unfrozen during fine-tuning.

7. VGG19

A 19-layer network known for its uniform architecture of sequential 3×3 convolutional layers. The fully connected layers were replaced with a Global Average Pooling (GAP) layer to reduce parameters and improve generalization. L2 regularization (weight decay 0.0001) was applied throughout.

The Results — What Actually Happened

Model	Training Accuracy	Testing Accuracy
SVM + PCA	93.60%	87.97%
Logistic Regression + PCA	92.37%	82.38%
Random Forest	88.44%	88.44%
ANN	90.48%	79.77%
VGG19	94.28%	87.23%
CNN	73.53%	64.21%
ResNet50	54.29%	—

Breaking Down the Surprises

VGG19 won — but not by as much as expected

VGG19 achieved the highest training accuracy (94.28%) and strong testing accuracy (87.23%). Its deep, uniform architecture excels at extracting hierarchical image features — edges → shapes → gesture patterns — which is exactly what hand gesture recognition needs.

But the margin over SVM was surprisingly small.

SVM was the real surprise

SVM with PCA achieved 87.97% testing accuracy — just 0.74% below VGG19, while being dramatically simpler to train and computationally cheaper. For a traditional ML algorithm working on hand gesture images, this is impressive.

Why did it work so well? PCA removed noise from the high-dimensional image data, giving SVM a clean, reduced feature space where it could find clear decision boundaries. The RBF kernel then handled the non-linear separability of different hand shapes.

Lesson: don't underestimate classical ML when preprocessing is done well.

Random Forest was the most consistent

Random Forest showed identical training and testing accuracy (88.44%) — a sign of excellent generalization. No overfitting at all. For a production system where stability matters more than peak accuracy, Random Forest is a strong choice.

CNN underperformed — and here's why

The custom CNN from scratch achieved only 64.21% testing accuracy, the lowest of all models. This is counterintuitive — CNNs are supposed to be great at images.

The reason: CNNs need large datasets to learn meaningful features. With only 3,798 training images across 34 classes (about 112 images per class), the CNN simply didn't have enough data to learn properly. Compare this to VGG19, which came pretrained on millions of ImageNet images and only needed to fine-tune the top layers for our specific gestures.

Lesson: for small datasets, transfer learning always beats training from scratch.

ResNet50 failed completely

54.29% training accuracy with testing accuracy not even reportable. Transfer learning didn't help here — the pre-trained ImageNet weights didn't transfer well to this specific dataset, and the gradual unfreezing strategy wasn't enough to compensate.

This suggests ResNet50 needs either a much larger dataset or significantly different fine-tuning strategy for this task.

ANN gap between training and testing

ANN achieved 90.48% training but only 79.77% testing — the largest gap in the comparison. The 10%+ drop indicates overfitting despite dropout regularization. The model learned the training data well but didn't generalize as effectively as simpler models like SVM or Random Forest.

What I Learned

1. Preprocessing is more important than model choice. The PCA step transformed SVM from a mediocre image classifier into a near-VGG19 performer. Time spent on preprocessing returned more value than time spent on model architecture.

2. Dataset size determines which models you can use. With a small dataset, transfer learning (VGG19) beats training from scratch (CNN, ResNet50). Always start with pretrained models when data is limited.

3. Consistency beats peak performance. Random Forest never overfit, while ANN showed a 10% train-test gap. In production, consistent performance is more valuable than occasional high accuracy.

4. Test everything. I went in assuming deep learning would dominate. The results showed classical ML (SVM) was nearly as good with 1/100th the complexity. That would have been missed if I'd only tested neural networks.

The Real-World Application

Beyond the model comparison, the system was integrated into a web application that converts recognized gestures into spoken output using Text-to-Speech (TTS). A user signs a Gujarati consonant in front of a camera — the model recognizes it — and the system speaks the corresponding letter aloud.

The goal is to help bridge communication between Gujarati sign language users and those who don't know sign language, making everyday interactions in education, healthcare, and public services more accessible.

What's Next

The current system handles 34 consonants. The natural next step is extending it to:

Gujarati vowels
Common words and phrases
Dynamic gestures (involving motion, not just static hand positions)
Real-time mobile deployment

Temporal models like LSTMs or Vision Transformers could handle the dynamic gesture challenge — that's the direction future work is heading.

The Paper

This research was published at IEEE MPSec ICETA 2025 (International Conference on Emerging Technologies and Applications), Gwalior, India.

DOI: 10.1109/MPSecICETA64837.2025.11118850

Kaggle notebooks with the model implementations: kaggle.com/khushihpandya

I'm Khushi Pandya, a software engineer working on AI/ML systems and backend development. Find me on Dev.to | GitHub | Kaggle

I Trained 7 ML Models on Gujarati Sign Language — Here's What Actually Worked

Khushi — Sat, 13 Jun 2026 14:10:04 +0000

This research was published at IEEE MPSec ICETA 2025. Here's the full story behind it.

Why Gujarati Sign Language?

That felt like a problem worth solving.

Building the Dataset from Scratch

There were no publicly available Gujarati sign language image datasets. So I built one.

How the dataset was collected:

34 folders — one for each Gujarati consonant (ક, ખ, ગ... all 34)
110 images per consonant, captured manually
Multiple individuals, varied backgrounds, different lighting conditions
Multiple hand orientations and rotation angles

Data augmentation was applied to further expand the dataset — random rotations, flips, scaling, and noise addition — bringing the final count to:

Training set: 3,798 images
Testing set: 749 images

The Preprocessing Pipeline

Raw images can't go directly into a model. Several preprocessing steps were applied:

Noise reduction using Gaussian blur and median filtering — phone cameras in varying lighting produce a lot of visual noise that confuses models.

Hand segmentation using contour detection and skin-tone masking — isolating just the hand region and removing background clutter.

Normalization — all pixel values scaled to [0, 1] and all images resized to 256×256 pixels for consistency across models.

The 7 Models — What Each One Does and Why It Was Included

Rather than jumping straight to deep learning, I wanted to understand how traditional ML compares to modern architectures on this specific problem. So I tested all of them.

1. Support Vector Machine (SVM) with PCA

2. Logistic Regression (LR) with PCA

3. Random Forest

4. Artificial Neural Network (ANN)

5. Convolutional Neural Network (CNN)

6. ResNet50

7. VGG19

The Results — What Actually Happened

Model	Training Accuracy	Testing Accuracy
SVM + PCA	93.60%	87.97%
Logistic Regression + PCA	92.37%	82.38%
Random Forest	88.44%	88.44%
ANN	90.48%	79.77%
VGG19	94.28%	87.23%
CNN	73.53%	64.21%
ResNet50	54.29%	—

Breaking Down the Surprises

VGG19 won — but not by as much as expected

But the margin over SVM was surprisingly small.

SVM was the real surprise

Lesson: don't underestimate classical ML when preprocessing is done well.

Random Forest was the most consistent

CNN underperformed — and here's why

The custom CNN from scratch achieved only 64.21% testing accuracy, the lowest of all models. This is counterintuitive — CNNs are supposed to be great at images.

Lesson: for small datasets, transfer learning always beats training from scratch.

ResNet50 failed completely

This suggests ResNet50 needs either a much larger dataset or significantly different fine-tuning strategy for this task.

ANN gap between training and testing

What I Learned

3. Consistency beats peak performance. Random Forest never overfit, while ANN showed a 10% train-test gap. In production, consistent performance is more valuable than occasional high accuracy.

The Real-World Application

What's Next

The current system handles 34 consonants. The natural next step is extending it to:

Gujarati vowels
Common words and phrases
Dynamic gestures (involving motion, not just static hand positions)
Real-time mobile deployment

Temporal models like LSTMs or Vision Transformers could handle the dynamic gesture challenge — that's the direction future work is heading.

The Paper

This research was published at IEEE MPSec ICETA 2025 (International Conference on Emerging Technologies and Applications), Gwalior, India.

DOI: 10.1109/MPSecICETA64837.2025.11118850

Kaggle notebooks with the model implementations: kaggle.com/khushihpandya

I'm Khushi Pandya, a software engineer working on AI/ML systems and backend development. Find me on Dev.to | GitHub | Kaggle

How I Built a Voice AI Agent System with Asterisk and LLMs

Khushi — Fri, 05 Jun 2026 18:16:56 +0000

When your support team is busy, AI picks up the phone.

The Problem We Were Solving

Imagine a user calls your company's technical support line at 7 PM. Everyone on the support team has logged off. The phone rings… and rings. The user hangs up frustrated.

That was the exact problem my company wanted to solve. The goal was simple to describe but complex to build: when no human agent is available, an AI agent should pick up the call, understand the user's problem, and answer it — just like a support executive would.

I was given this task as a solo developer. What followed was one of the most technically interesting projects I've worked on — combining telephony, AI, speech processing, and real-time dashboards into one working system.

Here's everything I built, how it works, and what I learned.

The Architecture at a Glance

Before diving deep, here's the high-level flow of a call through the system:

User calls → Asterisk PBX → Python AGI Script
                                    ↓
                          Speech-to-Text (Google STT)
                                    ↓
                          Noise filtering + silence detection
                                    ↓
                          Company Knowledge Base → OpenAI API
                                    ↓
                          Text-to-Speech (Google TTS)
                                    ↓
                        Voice response played back to user
                                    ↓
                    [If unresolved] → Transfer to human agent

Every step in that chain had its own set of challenges. Let's go through each one.

Part 1: Asterisk — The Phone System Brain

What is Asterisk?

Think of Asterisk as a software-based phone exchange — the same kind of system a large call center uses, but open-source and running on a Linux server. It's what connects incoming phone calls to your application logic.

In real life, when you call a bank and hear "Press 1 for account balance, press 2 for loans" — that's a PBX (Private Branch Exchange) system doing the routing. Asterisk is one of the most widely used open-source PBX systems in the world.

For our project, I used FreePBX — a GUI layer on top of Asterisk — which made managing extensions and call routing significantly easier without touching raw config files constantly.

How Asterisk Connects to Your Code

Asterisk has a feature called AGI (Asterisk Gateway Interface) — it lets you write scripts in Python (or any language) that Asterisk calls during a live phone session. Think of it like a webhook, but for phone calls.

When a user calls in, Asterisk hands control to your Python script mid-call. Your script can:

Play audio to the caller
Record the caller's voice
Send that recording somewhere for processing
Play back a response
Transfer the call to another extension

Key Protocols Asterisk Works With

Asterisk relies on two core protocols under the hood:

SIP (Session Initiation Protocol) — handles how calls are initiated, maintained, and ended over IP networks. Think of it as the process of dialing a number and getting the other person to pick up.

RTP (Real-time Transport Protocol) — carries the actual voice audio once a call is connected. If SIP is the handshake, RTP is the conversation itself.

Part 2: Speech Processing — The Hardest Part Nobody Talks About

Getting AI to "hear" properly is not just about sending audio to Google. Raw phone audio is messy — background noise, varying volumes, long silences, people who trail off mid-sentence.

Google STT (Speech-to-Text)

I used Google Cloud Speech-to-Text because it's free within generous limits and supports Indian English well (important for our user base).

Noise Filtering and Volume Normalization

Phone call audio comes in at 8000 Hz (much lower quality than a microphone recording). Background noise is a real problem — fans, traffic, other people talking. I used SoX (Sound eXchange), a command-line audio tool, to pre-process audio before sending it to STT — normalizing volume to a consistent level and stripping out leading/trailing silence automatically.

Silence Detection and Timeout

One subtle but critical thing: how do you know the user has finished speaking?

I configured Asterisk's record command with a silence threshold — if the user stops speaking for more than 2 seconds, the recording stops automatically and processing begins. Too short and it cuts people off mid-sentence. Too long and the call feels unresponsive.

Part 3: The AI Brain — OpenAI + Company Knowledge Base

How the AI Knew What to Answer

If you just send a user's question directly to OpenAI's GPT model, it answers from its general training data — which knows nothing about your company's specific products, policies, or support procedures.

To solve this, my company provided a pre-built knowledge base — a structured dataset of Q&A pairs covering both sales and technical support scenarios, sourced from their website and internal documentation. My job was to integrate this into the pipeline so the AI could reference it when answering.

When a user speaks, their transcribed question is matched against the knowledge base, the most relevant answer context is retrieved, and that gets passed to OpenAI along with a carefully crafted system prompt. One critical instruction in that prompt: keep answers short — this will be spoken out loud on a phone call. Nobody wants to listen to 3 paragraphs read aloud.

If the AI cannot find a confident answer, it tells the user it will connect them to a human agent — and the transfer kicks in automatically.

Part 4: Text-to-Speech — Giving the AI a Voice

I used Google Text-to-Speech (gTTS) — free, supports multiple voices, and sounds natural enough for support calls. The generated audio was converted to the 8000 Hz mono format that Asterisk expects, then played directly into the live call. I configured both male and female voice options and applied volume gain adjustments to ensure the AI voice was clear and comfortable to listen to.

Part 5: Call Routing — Round Robin and Human Escalation

Round Robin for Human Agents

If the AI couldn't resolve an issue (or the user asked to speak to a human), the call needed to be transferred. But what if the first agent didn't pick up?

I implemented round robin dialing — Asterisk tries the first agent's extension, and if there's no answer after 20 seconds, it moves to the next, then the next. If all agents are busy, the caller hears a message and the call is gracefully ended.

Each agent's phone (a mobile app or desk phone) was registered to Asterisk as a SIP extension — same as how your office desk phone connects to the company switchboard.

The Mobile App

Support agents used a SIP-based mobile app (like Zoiper or Linphone — both free) registered to our Asterisk server. This meant agents could receive transferred calls on their phones from anywhere, without needing to be physically at a desk.

Part 6: The Real-Time Dashboard

Beyond the AI agent itself, I built a dashboard for the support team so they could monitor what was happening on every call.

Call statuses tracked in real-time:

🟢 On Call
🟡 On Hold
🔵 Pending (AI handling)
⏭️ Skipped (no agent picked up)
🔁 Transferred (to human agent)

I used Asterisk's AMI (Asterisk Manager Interface) — a socket-based API that streams live call events — to feed real-time data into the dashboard. Every time a call changed state (answered, transferred, ended), AMI fired an event and the dashboard updated instantly.

Call Recording and Playback

All calls were automatically recorded using Asterisk's built-in MixMonitor feature and stored as audio files. The dashboard gave the support team a simple UI to browse, filter by date or status, and play back any recorded call.

What I Learned Building This Alone

1. Phone audio is not microphone audio. Everything you know about audio quality from web projects goes out the window. 8000 Hz, mono, heavy compression — you have to design around it.

2. AI response length matters more than quality for voice. A perfectly accurate 200-word answer is useless on a phone call. The AI prompt must explicitly constrain response length.

3. Silence is a feature, not a bug. Getting silence detection thresholds right — on both the STT input side and the TTS pause side — made the difference between the agent feeling natural and feeling like a broken IVR.

4. Data quality is everything. The knowledge base I received had inconsistent formatting and some duplicate entries. Cleaning that data was 30% of the total work — garbage in, garbage out applies even more strictly when AI is speaking to real users.

5. Test with real phone calls early. The system worked perfectly in local testing. On actual phone calls, background noise revealed three bugs I had never seen in a controlled environment.

Final System Overview

Component	Technology Used
PBX / Telephony	Asterisk + FreePBX
Call Scripting	Python AGI
Speech-to-Text	Google Cloud STT
Audio Processing	SoX
AI / LLM	OpenAI GPT-4o-mini
Knowledge Base	Company-provided Q&A dataset
Text-to-Speech	Google TTS (gTTS)
Agent Mobile App	SIP client (Zoiper/Linphone)
Dashboard	Python + real-time AMI events
Call Recording	Asterisk MixMonitor

Wrapping Up

Building this system as a solo developer taught me that production AI is far less about the AI model itself and far more about the plumbing around it — audio quality, latency, data preparation, and graceful fallback when the AI doesn't know the answer.

The most rewarding moment was watching the first real call go through — a user called, the AI answered, understood the question, found the right answer from our knowledge base, and spoke it back clearly. No human involved.

If you're thinking about building something similar, start small: get Asterisk running locally, write a basic AGI script that just records and plays back audio, then layer in the AI piece once the telephony is solid.

I'm Khushi Pandya, a software engineer working on AI-driven backends, voice systems, and developer tooling. Find me on Medium | GitHub | LinkedIn