Building real-time AI voice agents over actual phone calls is one of the hardest engineering problems you can take on. The latency requirements are brutal (humans notice any delay over ~500ms), the audio pipeline is full of edge cases, and coordinating three different external services — telephony, speech recognition, and LLM — in real time requires careful architectural thinking.
I built a production-ready, low-latency AI telephony agent from scratch. Here's the full technical breakdown — architecture, implementation details, and the lessons learned along the way.
## What This System Does
When someone calls your Twilio phone number, this system:
- Captures the incoming audio stream in real time via Twilio Media Streams (WebSockets)
- Streams the audio to Deepgram Nova-2 for sub-second speech-to-text transcription
- Sends the transcript to Groq's Llama-3.3-70b for contextually aware response generation
- Converts the LLM response to natural-sounding speech using Deepgram Aura TTS
- Streams the audio back to the caller in 20ms frames
- Monitors LLM output for emergency trigger phrases — and redirects the call instantly if detected
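Under the hood, each call boils down to one WebSocket loop. Here is a minimal sketch of that loop; `parse_twilio_event` and `handle_call` are illustrative names rather than the repo's actual API, and the STT/LLM/TTS forwarding is elided:

```python
import base64
import json


def parse_twilio_event(raw: str):
    """Split one Twilio Media Streams JSON message into (event, audio bytes).

    "media" events carry base64-encoded 8kHz mu-law audio in
    message["media"]["payload"]; other events carry no audio.
    """
    message = json.loads(raw)
    if message["event"] == "media":
        return message["event"], base64.b64decode(message["media"]["payload"])
    return message["event"], None


async def handle_call(ws):
    """Per-call loop; `ws` is an accepted FastAPI WebSocket (or similar)."""
    while True:
        event, audio = parse_twilio_event(await ws.receive_text())
        if event == "media":
            pass  # forward `audio` to the Deepgram STT socket here
        elif event == "stop":
            break  # caller hung up; tear down STT/LLM/TTS tasks
```

In the real system this loop runs concurrently with the STT, LLM, and TTS tasks; the sketch only shows the demultiplexing of Twilio's event stream.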
## System Architecture

### Tech Stack Breakdown
| Layer | Technology | Why |
|---|---|---|
| Telephony | Twilio Media Streams | Battle-tested, global infrastructure with WSS support |
| STT | Deepgram Nova-2 | Best-in-class accuracy + sub-second latency on 8kHz audio |
| LLM | Groq Llama-3.3-70b | Fastest inference available — critical for real-time voice |
| TTS | Deepgram Aura | Low-latency, natural-sounding speech synthesis |
| Server | FastAPI + WebSockets | Async-first, handles concurrent connections cleanly |
## Why Groq? The Latency Problem
This is the most important architectural decision in the whole system. In a voice conversation, you have maybe 300–400ms of budget for the entire round-trip from when speech ends to when the response starts playing. Between STT finalization, LLM time-to-first-token, TTS time-to-first-byte, and network hops, typical stage latencies already add up to roughly 310ms, leaving zero slack. Standard LLM APIs would blow this budget entirely. Groq's purpose-built LPU (Language Processing Unit) hardware is what makes real-time voice agents feasible: token generation is genuinely 10–20x faster than GPU-based inference.
Key insight: for voice AI, inference speed matters more than raw model size. A capable model served fast (Llama-3.3-70b on Groq) beats a larger model that takes seconds to respond, every time, in real-time telephony.
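To make the budget concrete, here is the arithmetic. The per-stage numbers below are illustrative assumptions chosen to show how quickly the budget disappears, not measurements from this system:

```python
# Illustrative latency budget for one conversational turn.
# Stage numbers are assumptions, not measured values.
budget_ms = 400

stages_ms = {
    "STT finalization (endpointing)": 150,
    "LLM time-to-first-token": 60,
    "TTS time-to-first-byte": 80,
    "network round trips": 20,
}

total = sum(stages_ms.values())
print(f"{total}ms used, {budget_ms - total}ms slack")  # 310ms used, 90ms slack
```

With numbers like these, even 200ms of extra LLM latency (typical for a GPU-served model under load) pushes the turn well past the point where the caller notices the pause.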
## The Audio Pipeline: Technical Specifications
Twilio's Media Streams deliver audio in a very specific format that the entire pipeline is built around:
- Encoding: 8-bit PCMU (G.711 mu-law) — the standard for telephony
- Sample rate: 8000 Hz — lower than modern audio, but universal across phone networks
- Channel: Mono
- Frame size: 160 bytes = 20ms of audio per WebSocket message
Deepgram Nova-2 handles 8kHz mu-law natively, so no resampling is needed on the way in; resampling on the fly would only add latency. The TTS output from Deepgram Aura is likewise chunked into 20ms frames for smooth playback through the telephony channel.
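The 160-byte figure falls straight out of the format: 8000 samples/s, 1 byte per mu-law sample, 20ms per frame. Here is a sketch of how TTS output might be chunked into Twilio's outbound media messages; the message shape follows Twilio's Media Streams documentation, but the function name is mine:

```python
import base64
import json

SAMPLE_RATE = 8000    # Hz, the telephony standard
BYTES_PER_SAMPLE = 1  # 8-bit mu-law
FRAME_MS = 20

FRAME_BYTES = SAMPLE_RATE * BYTES_PER_SAMPLE * FRAME_MS // 1000  # 160


def frames_for_twilio(mulaw_audio: bytes, stream_sid: str):
    """Chunk TTS output into 20ms frames wrapped as Twilio media messages."""
    for offset in range(0, len(mulaw_audio), FRAME_BYTES):
        chunk = mulaw_audio[offset:offset + FRAME_BYTES]
        yield json.dumps({
            "event": "media",
            "streamSid": stream_sid,
            "media": {"payload": base64.b64encode(chunk).decode("ascii")},
        })
```

Each yielded string goes out over the same WebSocket the inbound audio arrived on; Twilio plays the frames back to the caller in order.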
## Emergency Triage Logic
One of the most critical features for production deployment is the emergency fallback system. The LLM output monitor runs concurrently with response generation and watches for a configurable set of trigger phrases.
When a trigger is detected:
- Current audio playback is interrupted
- The system calls the Twilio REST API immediately
- The call is redirected to the configured `EMERGENCY_FALLBACK_NUMBER`
- The event is logged for auditing
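The detection-plus-redirect path can be sketched in a few lines. `find_trigger` and `redirect_call` are hypothetical names, and the TwiML-based call update assumes the official `twilio` Python helper library:

```python
def find_trigger(transcript: str, triggers: list[str]):
    """Return the first trigger phrase present in the LLM output, if any."""
    lowered = transcript.lower()
    for phrase in triggers:
        if phrase.lower() in lowered:
            return phrase
    return None


def redirect_call(client, call_sid: str, fallback_number: str) -> None:
    """Break into the live call and dial the human fallback line.

    `client` is a twilio.rest.Client. Updating an in-progress call with
    new TwiML interrupts current playback and executes <Dial> immediately.
    """
    client.calls(call_sid).update(
        twiml=f"<Response><Dial>{fallback_number}</Dial></Response>"
    )
```

Because `find_trigger` runs on each LLM token chunk as it streams, detection happens mid-response rather than after the full reply is generated.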
This is essential for any real-world deployment — medical triage, mental health lines, technical support escalation — where certain situations require immediate human intervention rather than continued AI interaction.
## Project Structure
```
AI-voice-agent/
├── app/
│   └── core/
│       └── config.py      # SYSTEM_PROMPT lives here — customize behavior
├── .env.example           # All required environment variables documented
├── requirements.txt
└── run.py                 # Single entry point — manages full lifecycle
```
The unified run.py entry point is a deliberate design decision: it manages ngrok tunnel setup, Twilio webhook synchronization, and FastAPI startup in the correct order — deployment is a single command.
## Customizing the Agent's Behavior
The agent's entire personality and domain expertise are controlled by a single system prompt in app/core/config.py. This makes it trivially easy to redeploy the same infrastructure for completely different use cases:
```python
# Medical triage agent
SYSTEM_PROMPT = """You are a medical intake assistant...
Emergency triggers: ['chest pain', 'can't breathe', 'unconscious']"""

# Technical support agent
SYSTEM_PROMPT = """You are a tier-1 technical support agent...
Escalation triggers: ['billing issue', 'data loss', 'security breach']"""

# Appointment scheduling agent
SYSTEM_PROMPT = """You are a scheduling assistant..."""
```
## Deployment
```bash
git clone https://github.com/Sameershahh/AI-voice-agent
cd AI-voice-agent
python -m venv .venv
source .venv/bin/activate   # Windows: .venv\Scripts\activate
pip install -r requirements.txt

# Configure credentials
cp .env.example .env
# Fill in GROQ_API_KEY, DEEPGRAM_API_KEY, TWILIO_ACCOUNT_SID,
# TWILIO_AUTH_TOKEN, TWILIO_PHONE_NUMBER, EMERGENCY_FALLBACK_NUMBER, PUBLIC_URL

# Start everything
python run.py
```
## Session Logging
All session interactions and transcripts are automatically persisted to the logs/ directory. This is non-optional for production — you need a complete audit trail for compliance, debugging, and performance analysis. Logs include full call transcripts, LLM responses, latency measurements, and any emergency triage events.
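As a rough illustration, per-call logging can be as simple as appending JSONL records keyed by call SID. The helper and field names below are illustrative, not the repo's actual schema:

```python
import json
import time
from pathlib import Path

LOG_DIR = Path("logs")


def log_event(call_sid: str, event_type: str, **fields) -> Path:
    """Append one timestamped event to this call's JSONL transcript file."""
    LOG_DIR.mkdir(exist_ok=True)
    path = LOG_DIR / f"{call_sid}.jsonl"
    record = {"ts": time.time(), "event": event_type, **fields}
    with path.open("a") as f:
        f.write(json.dumps(record) + "\n")
    return path
```

One file per call keeps the audit trail trivially greppable, and JSONL means latency measurements can be aggregated later without a parser.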
## Production Use Cases
- Medical triage — AI handles initial intake, escalates critical cases to on-call staff
- Technical support — tier-1 resolution with intelligent escalation
- Appointment scheduling — natural conversation flow for booking and rescheduling
- Lead qualification — automated inbound sales calls with CRM integration
- Emergency hotlines — AI-assisted triage with guaranteed human escalation path
## Resources
If you're building voice AI infrastructure or have questions about latency optimization, WebSocket audio pipelines, or the emergency triage implementation — drop a comment. There aren't many publicly documented implementations of this full stack yet, and I'm happy to go deeper on any part of it.
Built by Sameer Shah — AI & Full-Stack Developer | Portfolio