Mart Schweiger

Originally published at assemblyai.com

How to Easily Build a Voice Agent with AssemblyAI

What Is an AI Voice Agent?

Voice agents are software systems that engage in natural speech conversations with users. Unlike traditional phone menus requiring button presses, these agents process your speech as you talk, understanding your words before you finish speaking.

Core capabilities:

  • Real-time streaming speech-to-text
  • Language model comprehension
  • Text-to-speech synthesis
  • Conversation flow orchestration

How AI Voice Agents Work

The system processes conversations through a real-time pipeline with target response times under one second:

Component        Duration     Purpose
Audio capture    <50ms        Clean input
Speech-to-text   200-400ms    Accuracy foundation
LLM processing   300-600ms    Intelligent responses
Text-to-speech   200-400ms    Natural output
Audio playback   <50ms        Smooth flow
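
A quick sanity check: summing the midpoints of these ranges already lands over one second, which is why production pipelines stream and overlap stages rather than run them strictly in sequence. A rough back-of-envelope sketch (the midpoint values are illustrative, not measured):

# Midpoints of the latency ranges above, in milliseconds (illustrative only).
BUDGET_MS = {
    "audio_capture": 50,
    "speech_to_text": 300,
    "llm_processing": 450,
    "text_to_speech": 300,
    "audio_playback": 50,
}

total = sum(BUDGET_MS.values())
print(f"Strictly sequential: {total}ms")  # 1150ms -- over the 1000ms target
# Overlapping STT, LLM, and TTS via streaming is what pulls perceived
# latency back under one second.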

Core Components

1. Streaming Speech-to-Text

AssemblyAI's Universal-3 Pro Streaming achieves approximately 94% accuracy across varying audio conditions. Accuracy thresholds matter (see the measurement sketch after this list):

  • Below 90%: Users experience frustration
  • 90-93%: Functional but with occasional errors
  • 93%+: Natural conversations with rare corrections
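
To know which band you are in, measure word accuracy on your own recordings against a hand-checked reference transcript. A minimal sketch using the jiwer package (an extra dependency, not part of this tutorial's install list):

# pip install jiwer
import jiwer

reference = "please move my appointment to thursday at three pm"
hypothesis = "please move my appointment to thursday at 3 pm"

# Word accuracy = 1 - word error rate (WER).
accuracy = 1 - jiwer.wer(reference, hypothesis)
print(f"Word accuracy: {accuracy:.1%}")  # aim for 93%+ on real call audio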

2. LLM and Orchestration Layer

The language model serves as the agent's "brain," managing:

  • Intent recognition
  • Context tracking
  • Function calling for system integration (see the sketch after this list)
  • Response generation
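
Function calling is what turns the agent from a talker into an actor. A minimal sketch using OpenAI's chat-completions tools format; check_order_status, its parameters, and messages are hypothetical placeholders, and openai_client is the client created in the setup below:

# Hypothetical tool definition; swap in whatever backend function you expose.
tools = [{
    "type": "function",
    "function": {
        "name": "check_order_status",
        "description": "Look up the shipping status of a customer's order.",
        "parameters": {
            "type": "object",
            "properties": {
                "order_id": {"type": "string", "description": "The order number."}
            },
            "required": ["order_id"],
        },
    },
}]

response = openai_client.chat.completions.create(
    model="gpt-4",
    messages=messages,
    tools=tools,  # the model may answer with a tool call instead of text
)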

3. Text-to-Speech Synthesis

ElevenLabs, Google Cloud, and OpenAI all offer natural-sounding voices. The key to real-time flow is starting speech synthesis before the language model has finished generating the complete response.
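
One common way to achieve that overlap is to stream tokens from the LLM and flush each completed sentence to the TTS engine the moment it appears. A minimal sketch of the pattern; it reuses openai_client from the setup below and the speak_text helper from Step 5, and the sentence splitting is deliberately naive:

import re

def stream_response_to_tts(messages):
    # Hand each finished sentence to TTS while the LLM is still generating.
    buffer = ""
    completion = openai_client.chat.completions.create(
        model="gpt-4", messages=messages, stream=True
    )
    for chunk in completion:
        buffer += chunk.choices[0].delta.content or ""
        # Naive split on ., !, ? -- good enough for a sketch.
        parts = re.split(r"(?<=[.!?])\s+", buffer)
        for sentence in parts[:-1]:
            speak_text(sentence)  # speak completed sentences immediately
        buffer = parts[-1]        # keep the unfinished tail
    if buffer.strip():
        speak_text(buffer)        # flush whatever remains at the end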

4. Integration and Business Logic

Voice agents connect to existing systems (CRM, calendars, payment processors, inventory management) with security considerations for API keys, encryption, and user authentication.
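
When the model returns a tool call, the agent has to route it to real business logic. A minimal dispatch sketch; both handlers are hypothetical stand-ins for actual CRM or calendar integrations:

import json

# Hypothetical handlers standing in for real CRM/calendar calls.
def check_order_status(order_id: str) -> str:
    return f"Order {order_id} shipped yesterday."

def book_appointment(date: str, time: str) -> str:
    return f"Booked for {date} at {time}."

HANDLERS = {
    "check_order_status": check_order_status,
    "book_appointment": book_appointment,
}

def dispatch(tool_call) -> str:
    # Route an OpenAI tool call to the matching handler; the model sends
    # its arguments as a JSON string.
    handler = HANDLERS[tool_call.function.name]
    return handler(**json.loads(tool_call.function.arguments))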

Performance Requirements

Target metrics for production agents (see the timing sketch after this list):

  • Total response time: <1000ms
  • Speech accuracy: 93%+
  • Name recognition: 95%+
  • Number accuracy: 95%+
  • Voice quality: Human-like
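
You can only hit targets you measure. A minimal timing sketch built on time.perf_counter; the sleep stands in for a real pipeline stage:

import time
from contextlib import contextmanager

@contextmanager
def timed(stage, metrics):
    # Record the wall-clock duration of one pipeline stage, in milliseconds.
    start = time.perf_counter()
    yield
    metrics[stage] = round((time.perf_counter() - start) * 1000, 1)

metrics = {}
with timed("llm", metrics):
    time.sleep(0.4)  # stand-in for the real LLM call
print(metrics)  # e.g. {'llm': 400.1} -- compare against the targets above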

Common Use Cases

  • Customer support automation: Answer FAQs, check order status, escalate complex issues
  • Appointment scheduling: Check availability, confirm details, send confirmations
  • Lead qualification: Gather information, understand needs, route appropriately
  • After-hours service: Extend availability beyond business hours

Implementation Guide

Step 1: Environment Setup

mkdir voice-agent
cd voice-agent
python -m venv venv
source venv/bin/activate  # Mac/Linux
# venv\Scripts\activate  # Windows

pip install assemblyai openai elevenlabs websockets pyaudio python-dotenv

Create a .env file:

ASSEMBLYAI_API_KEY=your_key
OPENAI_API_KEY=your_key
ELEVENLABS_API_KEY=your_key

Step 2: Audio Capture Class

class AudioCapture:
    def __init__(self, sample_rate=16000):
        self.sample_rate = sample_rate
        self.chunk_size = 8000
        self.audio_queue = Queue()
        self.recording = False
        self.audio = pyaudio.PyAudio()
        self.stream = None

    def start_recording(self):
        self.recording = True
        self.stream = self.audio.open(
            format=pyaudio.paInt16,
            channels=1,
            rate=self.sample_rate,
            input=True,
            frames_per_buffer=self.chunk_size
        )
        thread = threading.Thread(target=self._capture_audio, daemon=True)
        thread.start()
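The class above starts capture but never exposes the audio or a shutdown path. The complete code later in this post relies on the remaining methods; here is the rest of the class (Empty is imported from Python's queue module):

    def _capture_audio(self):
        # Producer loop: push raw 16-bit PCM chunks onto the queue.
        while self.recording:
            try:
                self.audio_queue.put(
                    self.stream.read(self.chunk_size, exception_on_overflow=False)
                )
            except Exception as e:
                print(f"Audio error: {e}")
                break

    def get_audio_data(self):
        # Non-blocking read so the send loop never stalls.
        try:
            return self.audio_queue.get(timeout=0.1)
        except Empty:
            return None

    def stop_recording(self):
        self.recording = False
        if self.stream:
            self.stream.stop_stream()
            self.stream.close()
        self.audio.terminate()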

Step 3: Streaming Speech-to-Text Integration

def handle_transcript(self, transcript: aai.RealtimeTranscript):
    if not transcript.text:
        return

    # Partial transcripts stream in continuously; act only on finalized text.
    if isinstance(transcript, aai.RealtimeFinalTranscript):
        print(f"You: {transcript.text}")
        self.conversation_history.append(
            {"role": "user", "content": transcript.text}
        )

Step 4: LLM Response Generation

def generate_and_speak_response(self):
    messages = [{
        "role": "system",
        "content": "Keep responses conversational and concise—aim for 1-2 sentences."
    }] + self.conversation_history

    response = openai_client.chat.completions.create(
        model="gpt-4",
        messages=messages,
        temperature=0.7,
        max_tokens=150
    )
    ai_response = response.choices[0].message.content

Step 5: Text-to-Speech

def speak_text(self, text):
    audio = elevenlabs_client.generate(
        text=text,
        voice="Rachel",
        model="eleven_monolingual_v1"
    )
    stream(audio)

Complete Working Code

import os
import threading
from queue import Queue, Empty
from dotenv import load_dotenv
import assemblyai as aai
from openai import OpenAI
from elevenlabs.client import ElevenLabs
from elevenlabs import stream
import pyaudio

load_dotenv()

aai.settings.api_key = os.getenv('ASSEMBLYAI_API_KEY')
openai_client = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))
elevenlabs_client = ElevenLabs(api_key=os.getenv('ELEVENLABS_API_KEY'))

class AudioCapture:
    def __init__(self, sample_rate=16000):
        self.sample_rate = sample_rate
        self.chunk_size = 8000
        self.audio_queue = Queue()
        self.recording = False
        self.audio = pyaudio.PyAudio()
        self.stream = None

    def start_recording(self):
        self.recording = True
        self.stream = self.audio.open(
            format=pyaudio.paInt16, channels=1, rate=self.sample_rate,
            input=True, frames_per_buffer=self.chunk_size
        )
        thread = threading.Thread(target=self._capture_audio, daemon=True)
        thread.start()

    def _capture_audio(self):
        # Producer loop: push raw 16-bit PCM chunks onto the queue.
        while self.recording:
            try:
                self.audio_queue.put(
                    self.stream.read(self.chunk_size, exception_on_overflow=False)
                )
            except Exception as e:
                print(f"Audio error: {e}")
                break

    def get_audio_data(self):
        # Non-blocking read so the send loop never stalls.
        try:
            return self.audio_queue.get(timeout=0.1)
        except Empty:
            return None

    def stop_recording(self):
        self.recording = False
        if self.stream:
            self.stream.stop_stream()
            self.stream.close()
        self.audio.terminate()

class VoiceAgent:
    def __init__(self):
        self.conversation_history = []
        self.is_processing = False
        self.audio_capture = AudioCapture()

    def handle_transcript(self, transcript: aai.RealtimeTranscript):
        if not transcript.text:
            return

        if isinstance(transcript, aai.RealtimeFinalTranscript):
            print(f"You: {transcript.text}")
            self.conversation_history.append(
                {"role": "user", "content": transcript.text}
            )
            if not self.is_processing:
                self.is_processing = True
                threading.Thread(
                    target=self.generate_and_speak_response, daemon=True
                ).start()

    def generate_and_speak_response(self):
        try:
            messages = [{
                "role": "system",
                "content": "You are a helpful voice assistant. Keep responses conversational."
            }] + self.conversation_history

            response = openai_client.chat.completions.create(
                model="gpt-4", messages=messages, temperature=0.7, max_tokens=150
            )

            ai_response = response.choices[0].message.content
            print(f"Agent: {ai_response}")
            self.conversation_history.append(
                {"role": "assistant", "content": ai_response}
            )

            audio = elevenlabs_client.generate(
                text=ai_response,
                voice="Rachel",
                model="eleven_monolingual_v1"
            )
            stream(audio)
        finally:
            self.is_processing = False

    def start_conversation(self):
        self.transcriber = aai.RealtimeTranscriber(
            sample_rate=16000,
            on_data=self.handle_transcript,
            on_error=lambda e: print(f"Speech error: {e}")
        )

        try:
            self.transcriber.connect()
            self.audio_capture.start_recording()
            print("Voice Agent ready - start speaking!")
            while True:
                audio_chunk = self.audio_capture.get_audio_data()
                if audio_chunk:
                    self.transcriber.stream(audio_chunk)
        except KeyboardInterrupt:
            self.audio_capture.stop_recording()
            self.transcriber.close()

if __name__ == "__main__":
    VoiceAgent().start_conversation()

Save the file as voice_agent.py, then run: python voice_agent.py

Production Considerations

Key requirements beyond prototypes:

  • Telephony integration (Twilio, SIP trunking)
  • Concurrent conversation handling
  • Error recovery for network failures
  • Performance monitoring and metrics
  • Security compliance and data protection

Infrastructure scaling involves WebSocket connection pooling, load balancing, database integration, and comprehensive monitoring.
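
As one concrete piece of that, a semaphore can cap concurrent sessions per process so overload queues new callers instead of degrading active ones. A minimal asyncio sketch; the cap and the handle_session body are placeholder assumptions:

import asyncio

MAX_CONCURRENT_SESSIONS = 50  # assumed cap; tune to CPU and network headroom
session_limit = asyncio.Semaphore(MAX_CONCURRENT_SESSIONS)

async def handle_session(session_id: int):
    await asyncio.sleep(1)  # placeholder for the capture->STT->LLM->TTS loop

async def handle_call(session_id: int):
    # Callers beyond the cap wait here rather than slowing active sessions.
    async with session_limit:
        await handle_session(session_id)

async def main():
    await asyncio.gather(*(handle_call(i) for i in range(100)))

if __name__ == "__main__":
    asyncio.run(main())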

Frequently Asked Questions

What response time targets matter?
Target under 1000ms total: ~200-400ms for speech recognition, ~300-600ms for language processing, ~200-400ms for synthesis, and ~100-200ms for network delays.

How do I handle user interruptions?
Implement barge-in detection by monitoring audio streams during agent responses and stopping text-to-speech when speech is detected. AssemblyAI's streaming API enables smooth interruption handling.
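
A minimal sketch of the detection half: compute the microphone's RMS energy while the agent is speaking and flag an interrupt when it crosses a threshold. The threshold is an assumption to tune per microphone and room:

import array
import math

SPEECH_THRESHOLD = 500  # RMS energy; an assumed value, tune for your setup

def detect_barge_in(chunk: bytes, agent_speaking: bool) -> bool:
    # Simple energy gate over 16-bit signed PCM samples.
    if not agent_speaking:
        return False
    samples = array.array("h", chunk)
    rms = math.sqrt(sum(s * s for s in samples) / max(len(samples), 1))
    return rms > SPEECH_THRESHOLD

When this returns True, stop TTS playback and keep streaming the user's audio to the transcriber.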

Which language model is best?
GPT-4 handles complex conversations requiring nuanced understanding; GPT-3.5 Turbo works for simpler interactions with lower latency and cost.

Do I need custom speech recognition models?
Modern pre-trained models handle most use cases without custom training. Universal-3 Pro achieves production accuracy out-of-the-box. Use custom vocabulary features for specialized terminology.
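
For example, assuming the realtime transcriber used in this tutorial accepts a word_boost parameter, domain terms can be biased at connection time:

# Sketch: bias recognition toward domain terms (assumes word_boost support
# in the realtime transcriber used elsewhere in this post).
transcriber = aai.RealtimeTranscriber(
    sample_rate=16000,
    word_boost=["AssemblyAI", "ElevenLabs", "SIP trunking"],
    on_data=handle_transcript,
    on_error=lambda e: print(f"Speech error: {e}"),
)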

How do I integrate with phone systems?
Cloud telephony providers like Twilio offer the easiest integration, or implement SIP trunking for direct infrastructure connection.
