Khushi

Posted on Jun 5

How I Built a Voice AI Agent System with Asterisk and LLMs

#ai #asterisk #voiceai #llm

When your support team is busy, AI picks up the phone.

The Problem We Were Solving

Imagine a user calls your company's technical support line at 7 PM. Everyone on the support team has logged off. The phone rings… and rings. The user hangs up frustrated.

That was the exact problem my company wanted to solve. The goal was simple to describe but complex to build: when no human agent is available, an AI agent should pick up the call, understand the user's problem, and answer it — just like a support executive would.

I was given this task as a solo developer. What followed was one of the most technically interesting projects I've worked on — combining telephony, AI, speech processing, and real-time dashboards into one working system.

Here's everything I built, how it works, and what I learned.

The Architecture at a Glance

Before diving deep, here's the high-level flow of a call through the system:

User calls → Asterisk PBX → Python AGI Script
                                    ↓
                          Speech-to-Text (Google STT)
                                    ↓
                          Noise filtering + silence detection
                                    ↓
                          Company Knowledge Base → OpenAI API
                                    ↓
                          Text-to-Speech (Google TTS)
                                    ↓
                        Voice response played back to user
                                    ↓
                    [If unresolved] → Transfer to human agent

Every step in that chain had its own set of challenges. Let's go through each one.

Part 1: Asterisk — The Phone System Brain

What is Asterisk?

Think of Asterisk as a software-based phone exchange — the same kind of system a large call center uses, but open-source and running on a Linux server. It's what connects incoming phone calls to your application logic.

In real life, when you call a bank and hear "Press 1 for account balance, press 2 for loans" — that's a PBX (Private Branch Exchange) system doing the routing. Asterisk is one of the most widely used open-source PBX systems in the world.

For our project, I used FreePBX — a GUI layer on top of Asterisk — which made managing extensions and call routing significantly easier without touching raw config files constantly.

How Asterisk Connects to Your Code

Asterisk has a feature called AGI (Asterisk Gateway Interface) — it lets you write scripts in Python (or any language) that Asterisk calls during a live phone session. Think of it like a webhook, but for phone calls.

When a user calls in, Asterisk hands control to your Python script mid-call. Your script can:

Play audio to the caller
Record the caller's voice
Send that recording somewhere for processing
Play back a response
Transfer the call to another extension

Key Protocols Asterisk Works With

Asterisk relies on two core protocols under the hood:

SIP (Session Initiation Protocol) — handles how calls are initiated, maintained, and ended over IP networks. Think of it as the process of dialing a number and getting the other person to pick up.

RTP (Real-time Transport Protocol) — carries the actual voice audio once a call is connected. If SIP is the handshake, RTP is the conversation itself.

Part 2: Speech Processing — The Hardest Part Nobody Talks About

Getting AI to "hear" properly is not just about sending audio to Google. Raw phone audio is messy — background noise, varying volumes, long silences, people who trail off mid-sentence.

Google STT (Speech-to-Text)

I used Google Cloud Speech-to-Text because it's free within generous limits and supports Indian English well (important for our user base).

Noise Filtering and Volume Normalization

Phone call audio comes in at 8000 Hz (much lower quality than a microphone recording). Background noise is a real problem — fans, traffic, other people talking. I used SoX (Sound eXchange), a command-line audio tool, to pre-process audio before sending it to STT — normalizing volume to a consistent level and stripping out leading/trailing silence automatically.

Silence Detection and Timeout

One subtle but critical thing: how do you know the user has finished speaking?

I configured Asterisk's record command with a silence threshold — if the user stops speaking for more than 2 seconds, the recording stops automatically and processing begins. Too short and it cuts people off mid-sentence. Too long and the call feels unresponsive.

Part 3: The AI Brain — OpenAI + Company Knowledge Base

How the AI Knew What to Answer

If you just send a user's question directly to OpenAI's GPT model, it answers from its general training data — which knows nothing about your company's specific products, policies, or support procedures.

To solve this, my company provided a pre-built knowledge base — a structured dataset of Q&A pairs covering both sales and technical support scenarios, sourced from their website and internal documentation. My job was to integrate this into the pipeline so the AI could reference it when answering.

When a user speaks, their transcribed question is matched against the knowledge base, the most relevant answer context is retrieved, and that gets passed to OpenAI along with a carefully crafted system prompt. One critical instruction in that prompt: keep answers short — this will be spoken out loud on a phone call. Nobody wants to listen to 3 paragraphs read aloud.

If the AI cannot find a confident answer, it tells the user it will connect them to a human agent — and the transfer kicks in automatically.

Part 4: Text-to-Speech — Giving the AI a Voice

I used Google Text-to-Speech (gTTS) — free, supports multiple voices, and sounds natural enough for support calls. The generated audio was converted to the 8000 Hz mono format that Asterisk expects, then played directly into the live call. I configured both male and female voice options and applied volume gain adjustments to ensure the AI voice was clear and comfortable to listen to.

Part 5: Call Routing — Round Robin and Human Escalation

Round Robin for Human Agents

If the AI couldn't resolve an issue (or the user asked to speak to a human), the call needed to be transferred. But what if the first agent didn't pick up?

I implemented round robin dialing — Asterisk tries the first agent's extension, and if there's no answer after 20 seconds, it moves to the next, then the next. If all agents are busy, the caller hears a message and the call is gracefully ended.

Each agent's phone (a mobile app or desk phone) was registered to Asterisk as a SIP extension — same as how your office desk phone connects to the company switchboard.

The Mobile App

Support agents used a SIP-based mobile app (like Zoiper or Linphone — both free) registered to our Asterisk server. This meant agents could receive transferred calls on their phones from anywhere, without needing to be physically at a desk.

Part 6: The Real-Time Dashboard

Beyond the AI agent itself, I built a dashboard for the support team so they could monitor what was happening on every call.

Call statuses tracked in real-time:

🟢 On Call
🟡 On Hold
🔵 Pending (AI handling)
⏭️ Skipped (no agent picked up)
🔁 Transferred (to human agent)

I used Asterisk's AMI (Asterisk Manager Interface) — a socket-based API that streams live call events — to feed real-time data into the dashboard. Every time a call changed state (answered, transferred, ended), AMI fired an event and the dashboard updated instantly.

Call Recording and Playback

All calls were automatically recorded using Asterisk's built-in MixMonitor feature and stored as audio files. The dashboard gave the support team a simple UI to browse, filter by date or status, and play back any recorded call.

What I Learned Building This Alone

1. Phone audio is not microphone audio. Everything you know about audio quality from web projects goes out the window. 8000 Hz, mono, heavy compression — you have to design around it.

2. AI response length matters more than quality for voice. A perfectly accurate 200-word answer is useless on a phone call. The AI prompt must explicitly constrain response length.

3. Silence is a feature, not a bug. Getting silence detection thresholds right — on both the STT input side and the TTS pause side — made the difference between the agent feeling natural and feeling like a broken IVR.

4. Data quality is everything. The knowledge base I received had inconsistent formatting and some duplicate entries. Cleaning that data was 30% of the total work — garbage in, garbage out applies even more strictly when AI is speaking to real users.

5. Test with real phone calls early. The system worked perfectly in local testing. On actual phone calls, background noise revealed three bugs I had never seen in a controlled environment.

Final System Overview

Component	Technology Used
PBX / Telephony	Asterisk + FreePBX
Call Scripting	Python AGI
Speech-to-Text	Google Cloud STT
Audio Processing	SoX
AI / LLM	OpenAI GPT-4o-mini
Knowledge Base	Company-provided Q&A dataset
Text-to-Speech	Google TTS (gTTS)
Agent Mobile App	SIP client (Zoiper/Linphone)
Dashboard	Python + real-time AMI events
Call Recording	Asterisk MixMonitor

Wrapping Up

Building this system as a solo developer taught me that production AI is far less about the AI model itself and far more about the plumbing around it — audio quality, latency, data preparation, and graceful fallback when the AI doesn't know the answer.

The most rewarding moment was watching the first real call go through — a user called, the AI answered, understood the question, found the right answer from our knowledge base, and spoke it back clearly. No human involved.

If you're thinking about building something similar, start small: get Asterisk running locally, write a basic AGI script that just records and plays back audio, then layer in the AI piece once the telephony is solid.

I'm Khushi Pandya, a software engineer working on AI-driven backends, voice systems, and developer tooling. Find me on Medium | GitHub | LinkedIn

DEV Community