DEV Community

Khushi
Khushi

Posted on

How I Built a Voice AI Agent System with Asterisk, LLMs, and RAG — As a Solo Developer

When your support team is busy, AI picks up the phone.


The Problem We Were Solving

Imagine a user calls your company's technical support line at 7 PM. Everyone on the support team has logged off. The phone rings… and rings. The user hangs up frustrated.

That was the exact problem my company wanted to solve. The goal was simple to describe but complex to build: when no human agent is available, an AI agent should pick up the call, understand the user's problem, and answer it — just like a support executive would.

I was given this task as a solo developer. What followed was one of the most technically interesting projects I've worked on — combining telephony, AI, speech processing, and real-time dashboards into one working system.

Here's everything I built, how it works, and what I learned.


The Architecture at a Glance

Before diving deep, here's the high-level flow of a call through the system:

User calls → Asterisk PBX → Python AGI Script
                                    ↓
                          Speech-to-Text (Google STT)
                                    ↓
                          Noise filtering + silence detection
                                    ↓
                          RAG pipeline → OpenAI API
                                    ↓
                          Text-to-Speech (Google TTS)
                                    ↓
                        Voice response played back to user
                                    ↓
                    [If unresolved] → Transfer to human agent
Enter fullscreen mode Exit fullscreen mode

Every step in that chain had its own set of challenges. Let's go through each one.


Part 1: Asterisk — The Phone System Brain

What is Asterisk?

Think of Asterisk as a software-based phone exchange — the same kind of system a large call center uses, but open-source and running on a Linux server. It's what connects incoming phone calls to your application logic.

In real life, when you call a bank and hear "Press 1 for account balance, press 2 for loans" — that's a PBX (Private Branch Exchange) system doing the routing. Asterisk is one of the most widely used open-source PBX systems in the world.

For our project, I used FreePBX — a GUI layer on top of Asterisk — which made managing extensions and call routing significantly easier without touching raw config files constantly.

How Asterisk Connects to Your Code

Asterisk has a feature called AGI (Asterisk Gateway Interface) — it lets you write scripts in Python (or any language) that Asterisk calls during a live phone session. Think of it like a webhook, but for phone calls.

When a user calls in, Asterisk hands control to your Python script mid-call. Your script can:

  • Play audio to the caller
  • Record the caller's voice
  • Send that recording somewhere for processing
  • Play back a response
  • Transfer the call to another extension

Key Protocols Asterisk Works With

Asterisk relies on two core protocols under the hood:

SIP (Session Initiation Protocol) — handles how calls are initiated, maintained, and ended over IP networks. Think of it as the process of dialing a number and getting the other person to pick up.

RTP (Real-time Transport Protocol) — carries the actual voice audio once a call is connected. If SIP is the handshake, RTP is the conversation itself.


Part 2: Speech Processing — The Hardest Part Nobody Talks About

Getting AI to "hear" properly is not just about sending audio to Google. Raw phone audio is messy — background noise, varying volumes, long silences, people who trail off mid-sentence.

Google STT (Speech-to-Text)

I used Google Cloud Speech-to-Text because it's free within generous limits and supports Indian English well (important for our user base).

Noise Filtering and Volume Normalization

Phone call audio comes in at 8000 Hz (much lower quality than a microphone recording). Background noise is a real problem — fans, traffic, other people talking. I used SoX (Sound eXchange), a command-line audio tool, to pre-process audio before sending it to STT — normalizing volume to a consistent level and stripping out leading/trailing silence automatically.

Silence Detection and Timeout

One subtle but critical thing: how do you know the user has finished speaking?

I configured Asterisk's record command with a silence threshold — if the user stops speaking for more than 2 seconds, the recording stops automatically and processing begins. Too short and it cuts people off mid-sentence. Too long and the call feels unresponsive.


Part 3: The AI Brain — RAG + OpenAI

Why RAG Instead of Just Asking ChatGPT?

If you just send a user's question directly to OpenAI's GPT model, it will answer from its general training data — which knows nothing about your company's specific products, policies, or support procedures.

RAG (Retrieval-Augmented Generation) solves this by first retrieving relevant information from your own knowledge base, then passing that to the LLM as context.

Real-life analogy: Instead of asking a new employee to answer a customer question from memory, you first hand them the relevant page from the manual, then ask them to answer.

Preparing the Knowledge Base

My company gave me a JSON file containing Q&A pairs scraped from their website — covering both sales and technical support scenarios. I cleaned, processed, and converted this data into vector embeddings stored in a vector database. Each entry was indexed so the system could instantly find the most relevant answer for any incoming question.

Retrieving Relevant Context + Querying OpenAI

When a user speaks, the system embeds their question, finds the top 3 most relevant entries from the knowledge base using cosine similarity, and passes that context to OpenAI along with a system prompt. One critical instruction in that prompt: keep answers short — this will be spoken out loud on a phone call. Nobody wants to listen to 3 paragraphs read aloud.

Complete flow


Part 4: Text-to-Speech — Giving the AI a Voice

I used Google Text-to-Speech (gTTS) — free, supports multiple voices, and sounds natural enough for support calls. The generated audio was converted to the 8000 Hz mono format that Asterisk expects, then played directly into the live call. I configured both male and female voice options and applied volume gain adjustments to ensure the AI voice was clear and comfortable to listen to.


Part 5: Call Routing — Round Robin and Human Escalation

Round Robin for Human Agents

If the AI couldn't resolve an issue (or the user asked to speak to a human), the call needed to be transferred. But what if the first agent didn't pick up?

I implemented round robin dialing — Asterisk tries the first agent's extension, and if there's no answer after 20 seconds, it moves to the next, then the next. If all agents are busy, the caller hears a message and the call is gracefully ended.

Each agent's phone (a mobile app or desk phone) was registered to Asterisk as a SIP extension — same as how your office desk phone connects to the company switchboard.

The Mobile App

Support agents used a SIP-based mobile app (like Zoiper or Linphone — both free) registered to our Asterisk server. This meant agents could receive transferred calls on their phones from anywhere, without needing to be physically at a desk.


Part 6: The Real-Time Dashboard

Beyond the AI agent itself, I built a dashboard for the support team so they could monitor what was happening on every call.

Call statuses tracked in real-time:

  • 🟢 On Call
  • 🟡 On Hold
  • 🔵 Pending (AI handling)
  • ⏭️ Skipped (no agent picked up)
  • 🔁 Transferred (to human agent)

I used Asterisk's AMI (Asterisk Manager Interface) — a socket-based API that streams live call events — to feed real-time data into the dashboard. Every time a call changed state (answered, transferred, ended), AMI fired an event and the dashboard updated instantly.

Call Recording and Playback

All calls were automatically recorded using Asterisk's built-in MixMonitor feature and stored as audio files. The dashboard gave the support team a simple UI to browse, filter by date or status, and play back any recorded call.


What I Learned Building This Alone

1. Phone audio is not microphone audio. Everything you know about audio quality from web projects goes out the window. 8000 Hz, mono, heavy compression — you have to design around it.

2. AI response length matters more than quality for voice. A perfectly accurate 200-word answer is useless on a phone call. The AI prompt must explicitly constrain response length.

3. Silence is a feature, not a bug. Getting silence detection thresholds right — on both the STT input side and the TTS pause side — made the difference between the agent feeling natural and feeling like a broken IVR.

4. RAG quality depends entirely on data quality. The JSON I received had inconsistent formatting and some duplicate entries. Cleaning that data was 30% of the total work.

5. Test with real phone calls early. The system worked perfectly in local testing. On actual phone calls, background noise revealed three bugs I had never seen in a controlled environment.


Final System Overview

Component Technology Used
PBX / Telephony Asterisk + FreePBX
Call Scripting Python AGI
Speech-to-Text Google Cloud STT
Audio Processing SoX
AI / LLM OpenAI GPT-4o-mini
Knowledge Base RAG with Sentence Transformers
Text-to-Speech Google TTS (gTTS)
Agent Mobile App SIP client (Zoiper/Linphone)
Dashboard Python + real-time AMI events
Call Recording Asterisk MixMonitor

Wrapping Up

Building this system as a solo developer taught me that production AI is far less about the AI model itself and far more about the plumbing around it — audio quality, latency, data preparation, and graceful fallback when the AI doesn't know the answer.

The most rewarding moment was watching the first real call go through — a user called, the AI answered, understood the question, found the right answer from our knowledge base, and spoke it back clearly. No human involved.

If you're thinking about building something similar, start small: get Asterisk running locally, write a basic AGI script that just records and plays back audio, then layer in the AI piece once the telephony is solid.


I'm Khushi Pandya, a software engineer working on AI-driven backends, voice systems, and developer tooling. Find me on Medium | GitHub | LinkedIn

Top comments (0)