Building real-time AI voice agents over actual phone calls is one of the hardest engineering problems you can take on. The latency requirements are brutal (humans notice any delay over ~500ms), the audio pipeline is full of edge cases, and coordinating three different external services — telephony, speech recognition, and LLM — in real time requires careful architectural thinking.
I built a production-ready, low-latency AI telephony agent from scratch. Here's the full technical breakdown — architecture, implementation details, and the lessons learned along the way.
## What This System Does
When someone calls your Twilio phone number, this system:
- Captures the incoming audio stream in real time via Twilio Media Streams (WebSockets)
- Streams the audio to Deepgram Nova-2 for sub-second speech-to-text transcription
- Sends the transcript to Groq's Llama-3.3-70b for contextually aware response generation
- Converts the LLM response to natural-sounding speech using Deepgram Aura TTS
- Streams the audio back to the caller in 20ms frames
- Monitors LLM output for emergency trigger phrases — and redirects the call instantly if detected
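Under the hood, each call boils down to one WebSocket loop. Here is a minimal sketch of that loop; `parse_twilio_event` and `handle_call` are illustrative names rather than the repo's actual API, and the STT/LLM/TTS forwarding is elided:

```python
import base64
import json


def parse_twilio_event(raw: str):
    """Split one Twilio Media Streams JSON message into (event, audio bytes).

    "media" events carry base64-encoded 8kHz mu-law audio in
    message["media"]["payload"]; other events carry no audio.
    """
    message = json.loads(raw)
    if message["event"] == "media":
        return message["event"], base64.b64decode(message["media"]["payload"])
    return message["event"], None


async def handle_call(ws):
    """Per-call loop; `ws` is an accepted FastAPI WebSocket (or similar)."""
    while True:
        event, audio = parse_twilio_event(await ws.receive_text())
        if event == "media":
            pass  # forward `audio` to the Deepgram STT socket here
        elif event == "stop":
            break  # caller hung up; tear down STT/LLM/TTS tasks
```

In the real system this loop runs concurrently with the STT, LLM, and TTS tasks; the sketch only shows the demultiplexing of Twilio's event stream.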
## System Architecture

### Tech Stack Breakdown
| Layer | Technology | Why |
|---|---|---|
| Telephony | Twilio Media Streams | Battle-tested, global infrastructure with WSS support |
| STT | Deepgram Nova-2 | Best-in-class accuracy + sub-second latency on 8kHz audio |
| LLM | Groq Llama-3.3-70b | Fastest inference available — critical for real-time voice |
| TTS | Deepgram Aura | Low-latency, natural-sounding speech synthesis |
| Server | FastAPI + WebSockets | Async-first, handles concurrent connections cleanly |
## Why Groq? The Latency Problem
This is the most important architectural decision in the whole system. In a voice conversation, you have maybe 300–400ms of budget for the entire round-trip from when speech ends to when the response starts playing. Between STT finalization, LLM time-to-first-token, TTS time-to-first-byte, and network hops, typical stage latencies already add up to roughly 310ms, leaving zero slack. Standard LLM APIs would blow this budget entirely. Groq's purpose-built LPU (Language Processing Unit) hardware is what makes real-time voice agents feasible: token generation is genuinely 10–20x faster than GPU-based inference.
Key insight: for voice AI, inference speed matters more than raw model size. A capable model served fast (Llama-3.3-70b on Groq) beats a larger model that takes seconds to respond, every time, in real-time telephony.
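To make the budget concrete, here is the arithmetic. The per-stage numbers below are illustrative assumptions chosen to show how quickly the budget disappears, not measurements from this system:

```python
# Illustrative latency budget for one conversational turn.
# Stage numbers are assumptions, not measured values.
budget_ms = 400

stages_ms = {
    "STT finalization (endpointing)": 150,
    "LLM time-to-first-token": 60,
    "TTS time-to-first-byte": 80,
    "network round trips": 20,
}

total = sum(stages_ms.values())
print(f"{total}ms used, {budget_ms - total}ms slack")  # 310ms used, 90ms slack
```

With numbers like these, even 200ms of extra LLM latency (typical for a GPU-served model under load) pushes the turn well past the point where the caller notices the pause.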
## The Audio Pipeline: Technical Specifications
Twilio's Media Streams deliver audio in a very specific format that the entire pipeline is built around:
- Encoding: 8-bit PCMU (G.711 mu-law) — the standard for telephony
- Sample rate: 8000 Hz — lower than modern audio, but universal across phone networks
- Channel: Mono
- Frame size: 160 bytes = 20ms of audio per WebSocket message
Deepgram Nova-2 handles 8kHz mu-law natively, so no resampling is needed on the way in; resampling on the fly would only add latency. The TTS output from Deepgram Aura is likewise chunked into 20ms frames for smooth playback through the telephony channel.
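The 160-byte figure falls straight out of the format: 8000 samples/s, 1 byte per mu-law sample, 20ms per frame. Here is a sketch of how TTS output might be chunked into Twilio's outbound media messages; the message shape follows Twilio's Media Streams documentation, but the function name is mine:

```python
import base64
import json

SAMPLE_RATE = 8000    # Hz, the telephony standard
BYTES_PER_SAMPLE = 1  # 8-bit mu-law
FRAME_MS = 20

FRAME_BYTES = SAMPLE_RATE * BYTES_PER_SAMPLE * FRAME_MS // 1000  # 160


def frames_for_twilio(mulaw_audio: bytes, stream_sid: str):
    """Chunk TTS output into 20ms frames wrapped as Twilio media messages."""
    for offset in range(0, len(mulaw_audio), FRAME_BYTES):
        chunk = mulaw_audio[offset:offset + FRAME_BYTES]
        yield json.dumps({
            "event": "media",
            "streamSid": stream_sid,
            "media": {"payload": base64.b64encode(chunk).decode("ascii")},
        })
```

Each yielded string goes out over the same WebSocket the inbound audio arrived on; Twilio plays the frames back to the caller in order.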
## Emergency Triage Logic
One of the most critical features for production deployment is the emergency fallback system. The LLM output monitor runs concurrently with response generation and watches for a configurable set of trigger phrases.
When a trigger is detected:
- Current audio playback is interrupted
- The system calls the Twilio REST API immediately
- The call is redirected to the configured `EMERGENCY_FALLBACK_NUMBER`
- The event is logged for auditing
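The detection-plus-redirect path can be sketched in a few lines. `find_trigger` and `redirect_call` are hypothetical names, and the TwiML-based call update assumes the official `twilio` Python helper library:

```python
def find_trigger(transcript: str, triggers: list[str]):
    """Return the first trigger phrase present in the LLM output, if any."""
    lowered = transcript.lower()
    for phrase in triggers:
        if phrase.lower() in lowered:
            return phrase
    return None


def redirect_call(client, call_sid: str, fallback_number: str) -> None:
    """Break into the live call and dial the human fallback line.

    `client` is a twilio.rest.Client. Updating an in-progress call with
    new TwiML interrupts current playback and executes <Dial> immediately.
    """
    client.calls(call_sid).update(
        twiml=f"<Response><Dial>{fallback_number}</Dial></Response>"
    )
```

Because `find_trigger` runs on each LLM token chunk as it streams, detection happens mid-response rather than after the full reply is generated.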
This is essential for any real-world deployment — medical triage, mental health lines, technical support escalation — where certain situations require immediate human intervention rather than continued AI interaction.
## Project Structure
```
AI-voice-agent/
├── app/
│   └── core/
│       └── config.py      # SYSTEM_PROMPT lives here — customize behavior
├── .env.example           # All required environment variables documented
├── requirements.txt
└── run.py                 # Single entry point — manages full lifecycle
```
The unified run.py entry point is a deliberate design decision: it manages ngrok tunnel setup, Twilio webhook synchronization, and FastAPI startup in the correct order — deployment is a single command.
## Customizing the Agent's Behavior
The agent's entire personality and domain expertise are controlled by a single system prompt in app/core/config.py. This makes it trivially easy to redeploy the same infrastructure for completely different use cases:
```python
# Medical triage agent
SYSTEM_PROMPT = """You are a medical intake assistant...
Emergency triggers: ['chest pain', 'can't breathe', 'unconscious']"""

# Technical support agent
SYSTEM_PROMPT = """You are a tier-1 technical support agent...
Escalation triggers: ['billing issue', 'data loss', 'security breach']"""

# Appointment scheduling agent
SYSTEM_PROMPT = """You are a scheduling assistant..."""
```
## Deployment
```bash
git clone https://github.com/Sameershahh/AI-voice-agent
cd AI-voice-agent
python -m venv .venv
source .venv/bin/activate   # Windows: .venv\Scripts\activate
pip install -r requirements.txt

# Configure credentials
cp .env.example .env
# Fill in GROQ_API_KEY, DEEPGRAM_API_KEY, TWILIO_ACCOUNT_SID,
# TWILIO_AUTH_TOKEN, TWILIO_PHONE_NUMBER, EMERGENCY_FALLBACK_NUMBER, PUBLIC_URL

# Start everything
python run.py
```
## Session Logging
All session interactions and transcripts are automatically persisted to the logs/ directory. This is non-optional for production — you need a complete audit trail for compliance, debugging, and performance analysis. Logs include full call transcripts, LLM responses, latency measurements, and any emergency triage events.
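As a rough illustration, per-call logging can be as simple as appending JSONL records keyed by call SID. The helper and field names below are illustrative, not the repo's actual schema:

```python
import json
import time
from pathlib import Path

LOG_DIR = Path("logs")


def log_event(call_sid: str, event_type: str, **fields) -> Path:
    """Append one timestamped event to this call's JSONL transcript file."""
    LOG_DIR.mkdir(exist_ok=True)
    path = LOG_DIR / f"{call_sid}.jsonl"
    record = {"ts": time.time(), "event": event_type, **fields}
    with path.open("a") as f:
        f.write(json.dumps(record) + "\n")
    return path
```

One file per call keeps the audit trail trivially greppable, and JSONL means latency measurements can be aggregated later without a parser.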
## Production Use Cases
- Medical triage — AI handles initial intake, escalates critical cases to on-call staff
- Technical support — tier-1 resolution with intelligent escalation
- Appointment scheduling — natural conversation flow for booking and rescheduling
- Lead qualification — automated inbound sales calls with CRM integration
- Emergency hotlines — AI-assisted triage with guaranteed human escalation path
## Resources
If you're building voice AI infrastructure or have questions about latency optimization, WebSocket audio pipelines, or the emergency triage implementation — drop a comment. There aren't many publicly documented implementations of this full stack yet, and I'm happy to go deeper on any part of it.
Built by Sameer Shah — AI & Full-Stack Developer | Portfolio