Mart Schweiger

Originally published at assemblyai.com

How to Easily Build a Voice Agent with AssemblyAI

What Is an AI Voice Agent?

Voice agents are software systems that engage in natural speech conversations with users. Unlike traditional phone menus requiring button presses, these agents process your speech as you talk, understanding your words before you finish speaking.

Core capabilities:

  • Real-time streaming speech-to-text
  • Language model comprehension
  • Text-to-speech synthesis
  • Conversation flow orchestration

How AI Voice Agents Work

The system processes conversations through a real-time pipeline with target response times under one second:

Component        Duration     Purpose
Audio capture    <50ms        Clean input
Speech-to-text   200-400ms    Accuracy foundation
LLM processing   300-600ms    Intelligent responses
Text-to-speech   200-400ms    Natural output
Audio playback   <50ms        Smooth flow
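
A quick sanity check: summing the midpoints of these ranges already lands over one second, which is why production pipelines stream and overlap stages rather than run them strictly in sequence. A rough back-of-envelope sketch (the midpoint values are illustrative, not measured):

# Midpoints of the latency ranges above, in milliseconds (illustrative only).
BUDGET_MS = {
    "audio_capture": 50,
    "speech_to_text": 300,
    "llm_processing": 450,
    "text_to_speech": 300,
    "audio_playback": 50,
}

total = sum(BUDGET_MS.values())
print(f"Strictly sequential: {total}ms")  # 1150ms -- over the 1000ms target
# Overlapping STT, LLM, and TTS via streaming is what pulls perceived
# latency back under one second.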

Core Components

1. Streaming Speech-to-Text

AssemblyAI's Universal-3 Pro Streaming achieves approximately 94% accuracy across varying audio conditions. Accuracy thresholds matter (see the measurement sketch after this list):

  • Below 90%: Users experience frustration
  • 90-93%: Functional but with occasional errors
  • 93%+: Natural conversations with rare corrections
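
To know which band you are in, measure word accuracy on your own recordings against a hand-checked reference transcript. A minimal sketch using the jiwer package (an extra dependency, not part of this tutorial's install list):

# pip install jiwer
import jiwer

reference = "please move my appointment to thursday at three pm"
hypothesis = "please move my appointment to thursday at 3 pm"

# Word accuracy = 1 - word error rate (WER).
accuracy = 1 - jiwer.wer(reference, hypothesis)
print(f"Word accuracy: {accuracy:.1%}")  # aim for 93%+ on real call audio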

2. LLM and Orchestration Layer

The language model serves as the agent's "brain," managing:

  • Intent recognition
  • Context tracking
  • Function calling for system integration (see the sketch after this list)
  • Response generation
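
Function calling is what turns the agent from a talker into an actor. A minimal sketch using OpenAI's chat-completions tools format; check_order_status, its parameters, and messages are hypothetical placeholders, and openai_client is the client created in the setup below:

# Hypothetical tool definition; swap in whatever backend function you expose.
tools = [{
    "type": "function",
    "function": {
        "name": "check_order_status",
        "description": "Look up the shipping status of a customer's order.",
        "parameters": {
            "type": "object",
            "properties": {
                "order_id": {"type": "string", "description": "The order number."}
            },
            "required": ["order_id"],
        },
    },
}]

response = openai_client.chat.completions.create(
    model="gpt-4",
    messages=messages,
    tools=tools,  # the model may answer with a tool call instead of text
)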

3. Text-to-Speech Synthesis

ElevenLabs, Google Cloud, and OpenAI all offer natural-sounding voices. The key to real-time flow is starting speech synthesis before the language model has finished generating the complete response.
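
One common way to achieve that overlap is to stream tokens from the LLM and flush each completed sentence to the TTS engine the moment it appears. A minimal sketch of the pattern; it reuses openai_client from the setup below and the speak_text helper from Step 5, and the sentence splitting is deliberately naive:

import re

def stream_response_to_tts(messages):
    # Hand each finished sentence to TTS while the LLM is still generating.
    buffer = ""
    completion = openai_client.chat.completions.create(
        model="gpt-4", messages=messages, stream=True
    )
    for chunk in completion:
        buffer += chunk.choices[0].delta.content or ""
        # Naive split on ., !, ? -- good enough for a sketch.
        parts = re.split(r"(?<=[.!?])\s+", buffer)
        for sentence in parts[:-1]:
            speak_text(sentence)  # speak completed sentences immediately
        buffer = parts[-1]        # keep the unfinished tail
    if buffer.strip():
        speak_text(buffer)        # flush whatever remains at the end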

4. Integration and Business Logic

Voice agents connect to existing systems (CRM, calendars, payment processors, inventory management) with security considerations for API keys, encryption, and user authentication.
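
When the model returns a tool call, the agent has to route it to real business logic. A minimal dispatch sketch; both handlers are hypothetical stand-ins for actual CRM or calendar integrations:

import json

# Hypothetical handlers standing in for real CRM/calendar calls.
def check_order_status(order_id: str) -> str:
    return f"Order {order_id} shipped yesterday."

def book_appointment(date: str, time: str) -> str:
    return f"Booked for {date} at {time}."

HANDLERS = {
    "check_order_status": check_order_status,
    "book_appointment": book_appointment,
}

def dispatch(tool_call) -> str:
    # Route an OpenAI tool call to the matching handler; the model sends
    # its arguments as a JSON string.
    handler = HANDLERS[tool_call.function.name]
    return handler(**json.loads(tool_call.function.arguments))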

Performance Requirements

Target metrics for production agents (see the timing sketch after this list):

  • Total response time: <1000ms
  • Speech accuracy: 93%+
  • Name recognition: 95%+
  • Number accuracy: 95%+
  • Voice quality: Human-like
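
You can only hit targets you measure. A minimal timing sketch built on time.perf_counter; the sleep stands in for a real pipeline stage:

import time
from contextlib import contextmanager

@contextmanager
def timed(stage, metrics):
    # Record the wall-clock duration of one pipeline stage, in milliseconds.
    start = time.perf_counter()
    yield
    metrics[stage] = round((time.perf_counter() - start) * 1000, 1)

metrics = {}
with timed("llm", metrics):
    time.sleep(0.4)  # stand-in for the real LLM call
print(metrics)  # e.g. {'llm': 400.1} -- compare against the targets above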

Common Use Cases

  • Customer support automation: Answer FAQs, check order status, escalate complex issues
  • Appointment scheduling: Check availability, confirm details, send confirmations
  • Lead qualification: Gather information, understand needs, route appropriately
  • After-hours service: Extend availability beyond business hours

Implementation Guide

Step 1: Environment Setup

mkdir voice-agent
cd voice-agent
python -m venv venv
source venv/bin/activate  # Mac/Linux
# venv\Scripts\activate  # Windows

pip install assemblyai openai elevenlabs websockets pyaudio python-dotenv

Create a .env file:

ASSEMBLYAI_API_KEY=your_key
OPENAI_API_KEY=your_key
ELEVENLABS_API_KEY=your_key

Step 2: Audio Capture Class

class AudioCapture:
    def __init__(self, sample_rate=16000):
        self.sample_rate = sample_rate
        self.chunk_size = 8000
        self.audio_queue = Queue()
        self.recording = False
        self.audio = pyaudio.PyAudio()
        self.stream = None

    def start_recording(self):
        self.recording = True
        self.stream = self.audio.open(
            format=pyaudio.paInt16,
            channels=1,
            rate=self.sample_rate,
            input=True,
            frames_per_buffer=self.chunk_size
        )
        thread = threading.Thread(target=self._capture_audio, daemon=True)
        thread.start()
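The class above starts capture but never exposes the audio or a shutdown path. The complete code later in this post relies on the remaining methods; here is the rest of the class (Empty is imported from Python's queue module):

    def _capture_audio(self):
        # Producer loop: push raw 16-bit PCM chunks onto the queue.
        while self.recording:
            try:
                self.audio_queue.put(
                    self.stream.read(self.chunk_size, exception_on_overflow=False)
                )
            except Exception as e:
                print(f"Audio error: {e}")
                break

    def get_audio_data(self):
        # Non-blocking read so the send loop never stalls.
        try:
            return self.audio_queue.get(timeout=0.1)
        except Empty:
            return None

    def stop_recording(self):
        self.recording = False
        if self.stream:
            self.stream.stop_stream()
            self.stream.close()
        self.audio.terminate()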

Step 3: Streaming Speech-to-Text Integration

def handle_transcript(self, transcript: aai.RealtimeTranscript):
    if not transcript.text:
        return

    # Partial transcripts stream in continuously; act only on finalized text.
    if isinstance(transcript, aai.RealtimeFinalTranscript):
        print(f"You: {transcript.text}")
        self.conversation_history.append(
            {"role": "user", "content": transcript.text}
        )

Step 4: LLM Response Generation

def generate_and_speak_response(self):
    messages = [{
        "role": "system",
        "content": "Keep responses conversational and concise—aim for 1-2 sentences."
    }] + self.conversation_history

    response = openai_client.chat.completions.create(
        model="gpt-4",
        messages=messages,
        temperature=0.7,
        max_tokens=150
    )
    ai_response = response.choices[0].message.content

Step 5: Text-to-Speech

def speak_text(self, text):
    audio = elevenlabs_client.generate(
        text=text,
        voice="Rachel",
        model="eleven_monolingual_v1"
    )
    stream(audio)

Complete Working Code

import os
import threading
from queue import Queue, Empty
from dotenv import load_dotenv
import assemblyai as aai
from openai import OpenAI
from elevenlabs.client import ElevenLabs
from elevenlabs import stream
import pyaudio

load_dotenv()

aai.settings.api_key = os.getenv('ASSEMBLYAI_API_KEY')
openai_client = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))
elevenlabs_client = ElevenLabs(api_key=os.getenv('ELEVENLABS_API_KEY'))

class AudioCapture:
    def __init__(self, sample_rate=16000):
        self.sample_rate = sample_rate
        self.chunk_size = 8000
        self.audio_queue = Queue()
        self.recording = False
        self.audio = pyaudio.PyAudio()
        self.stream = None

    def start_recording(self):
        self.recording = True
        self.stream = self.audio.open(
            format=pyaudio.paInt16, channels=1, rate=self.sample_rate,
            input=True, frames_per_buffer=self.chunk_size
        )
        thread = threading.Thread(target=self._capture_audio, daemon=True)
        thread.start()

    def _capture_audio(self):
        # Producer loop: push raw 16-bit PCM chunks onto the queue.
        while self.recording:
            try:
                self.audio_queue.put(
                    self.stream.read(self.chunk_size, exception_on_overflow=False)
                )
            except Exception as e:
                print(f"Audio error: {e}")
                break

    def get_audio_data(self):
        # Non-blocking read so the send loop never stalls.
        try:
            return self.audio_queue.get(timeout=0.1)
        except Empty:
            return None

    def stop_recording(self):
        self.recording = False
        if self.stream:
            self.stream.stop_stream()
            self.stream.close()
        self.audio.terminate()

class VoiceAgent:
    def __init__(self):
        self.conversation_history = []
        self.is_processing = False
        self.audio_capture = AudioCapture()

    def handle_transcript(self, transcript: aai.RealtimeTranscript):
        if not transcript.text:
            return

        if isinstance(transcript, aai.RealtimeFinalTranscript):
            print(f"You: {transcript.text}")
            self.conversation_history.append(
                {"role": "user", "content": transcript.text}
            )
            if not self.is_processing:
                self.is_processing = True
                threading.Thread(
                    target=self.generate_and_speak_response, daemon=True
                ).start()

    def generate_and_speak_response(self):
        try:
            messages = [{
                "role": "system",
                "content": "You are a helpful voice assistant. Keep responses conversational."
            }] + self.conversation_history

            response = openai_client.chat.completions.create(
                model="gpt-4", messages=messages, temperature=0.7, max_tokens=150
            )

            ai_response = response.choices[0].message.content
            print(f"Agent: {ai_response}")
            self.conversation_history.append(
                {"role": "assistant", "content": ai_response}
            )

            audio = elevenlabs_client.generate(
                text=ai_response,
                voice="Rachel",
                model="eleven_monolingual_v1"
            )
            stream(audio)
        finally:
            self.is_processing = False

    def start_conversation(self):
        self.transcriber = aai.RealtimeTranscriber(
            sample_rate=16000,
            on_data=self.handle_transcript,
            on_error=lambda e: print(f"Speech error: {e}")
        )

        try:
            self.transcriber.connect()
            self.audio_capture.start_recording()
            print("Voice Agent ready - start speaking!")
            while True:
                audio_chunk = self.audio_capture.get_audio_data()
                if audio_chunk:
                    self.transcriber.stream(audio_chunk)
        except KeyboardInterrupt:
            self.audio_capture.stop_recording()
            self.transcriber.close()

if __name__ == "__main__":
    VoiceAgent().start_conversation()

Save the file as voice_agent.py, then run: python voice_agent.py

Production Considerations

Key requirements beyond prototypes:

  • Telephony integration (Twilio, SIP trunking)
  • Concurrent conversation handling
  • Error recovery for network failures
  • Performance monitoring and metrics
  • Security compliance and data protection

Infrastructure scaling involves WebSocket connection pooling, load balancing, database integration, and comprehensive monitoring.
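
As one concrete piece of that, a semaphore can cap concurrent sessions per process so overload queues new callers instead of degrading active ones. A minimal asyncio sketch; the cap and the handle_session body are placeholder assumptions:

import asyncio

MAX_CONCURRENT_SESSIONS = 50  # assumed cap; tune to CPU and network headroom
session_limit = asyncio.Semaphore(MAX_CONCURRENT_SESSIONS)

async def handle_session(session_id: int):
    await asyncio.sleep(1)  # placeholder for the capture->STT->LLM->TTS loop

async def handle_call(session_id: int):
    # Callers beyond the cap wait here rather than slowing active sessions.
    async with session_limit:
        await handle_session(session_id)

async def main():
    await asyncio.gather(*(handle_call(i) for i in range(100)))

if __name__ == "__main__":
    asyncio.run(main())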

Frequently Asked Questions

What response time targets matter?
Target under 1000ms total: ~200-400ms for speech recognition, ~300-600ms for language processing, ~200-400ms for synthesis, and ~100-200ms for network delays.

How do I handle user interruptions?
Implement barge-in detection by monitoring audio streams during agent responses and stopping text-to-speech when speech is detected. AssemblyAI's streaming API enables smooth interruption handling.
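
A minimal sketch of the detection half: compute the microphone's RMS energy while the agent is speaking and flag an interrupt when it crosses a threshold. The threshold is an assumption to tune per microphone and room:

import array
import math

SPEECH_THRESHOLD = 500  # RMS energy; an assumed value, tune for your setup

def detect_barge_in(chunk: bytes, agent_speaking: bool) -> bool:
    # Simple energy gate over 16-bit signed PCM samples.
    if not agent_speaking:
        return False
    samples = array.array("h", chunk)
    rms = math.sqrt(sum(s * s for s in samples) / max(len(samples), 1))
    return rms > SPEECH_THRESHOLD

When this returns True, stop TTS playback and keep streaming the user's audio to the transcriber.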

Which language model is best?
GPT-4 handles complex conversations requiring nuanced understanding; GPT-3.5 Turbo works for simpler interactions with lower latency and cost.

Do I need custom speech recognition models?
Modern pre-trained models handle most use cases without custom training. Universal-3 Pro achieves production accuracy out-of-the-box. Use custom vocabulary features for specialized terminology.
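
For example, assuming the realtime transcriber used in this tutorial accepts a word_boost parameter, domain terms can be biased at connection time:

# Sketch: bias recognition toward domain terms (assumes word_boost support
# in the realtime transcriber used elsewhere in this post).
transcriber = aai.RealtimeTranscriber(
    sample_rate=16000,
    word_boost=["AssemblyAI", "ElevenLabs", "SIP trunking"],
    on_data=handle_transcript,
    on_error=lambda e: print(f"Speech error: {e}"),
)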

How do I integrate with phone systems?
Cloud telephony providers like Twilio offer the easiest integration, or implement SIP trunking for direct infrastructure connection.
