Overview
This tutorial shows how to build a complete voice agent that listens, thinks, and responds naturally using Python. The application transcribes speech in real time, generates intelligent responses, and speaks back to the user, all in under 100 lines of code.
The solution combines three APIs:
- AssemblyAI's Universal-3 Pro Streaming for speech-to-text
- OpenAI's GPT-4 for conversational AI
- ElevenLabs for voice synthesis
Prerequisites
Requirements:
- Python 3.9 or higher
- Three API keys (AssemblyAI, OpenAI, ElevenLabs)
- Computer with microphone and speakers
Installation
```bash
pip install "assemblyai>=1.0.0" openai "elevenlabs>=1.0.0" pyaudio python-dotenv
```
Package purposes:
- assemblyai: Real-time speech recognition with Universal-3 Pro Streaming
- openai: GPT model integration
- elevenlabs: Natural voice synthesis
- pyaudio: Microphone access
- python-dotenv: API key management
Configuration
Create a .env file:
```
ASSEMBLYAI_API_KEY=your_assemblyai_key_here
OPENAI_API_KEY=your_openai_key_here
ELEVENLABS_API_KEY=your_elevenlabs_key_here
```
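A missing or empty key otherwise surfaces later as a confusing authentication error, so it's worth failing fast at startup. Here is a minimal sketch; the helper name `missing_api_keys` is our own, not part of any SDK, and it assumes `load_dotenv()` has already populated the environment:

```python
import os

# The three keys the agent expects; load_dotenv() should run before this check.
REQUIRED_KEYS = ("ASSEMBLYAI_API_KEY", "OPENAI_API_KEY", "ELEVENLABS_API_KEY")


def missing_api_keys(env=os.environ) -> list:
    """Return the names of required keys that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not env.get(key)]
```

Call it once at startup and raise `SystemExit` with the returned names if the list is non-empty.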
Voice Agent Components
A voice agent operates through three interconnected stages:
| Component | Role | Why Streaming Matters |
|---|---|---|
| AssemblyAI | Speech-to-text | Transcribes audio as it arrives for faster LLM response |
| OpenAI GPT-4 | Language model | Generates responses token-by-token for immediate TTS |
| ElevenLabs | Text-to-speech | Plays audio while generating remaining content |
A streaming architecture eliminates the delays that make conversations feel unnatural; batch processing creates awkward pauses between the user's input and the agent's response.
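A toy latency model makes the difference concrete. The stage timings below are illustrative numbers we made up for the sketch, not measurements from any of these services:

```python
# Toy latency model (timings are made up for illustration, not measured).
# Batch: each stage waits for the previous one to finish completely.
# Streaming: each stage forwards its first chunk downstream immediately.

STT_TOTAL, LLM_TOTAL, TTS_TOTAL = 1.2, 2.0, 1.5   # seconds to fully finish
STT_FIRST, LLM_FIRST, TTS_FIRST = 0.3, 0.4, 0.2   # seconds to first chunk


def batch_first_audio() -> float:
    """User hears nothing until STT and the LLM have fully completed."""
    return STT_TOTAL + LLM_TOTAL + TTS_FIRST


def streaming_first_audio() -> float:
    """Each stage hands off its first chunk as soon as it exists."""
    return STT_FIRST + LLM_FIRST + TTS_FIRST


print(f"batch: {batch_first_audio():.1f}s, streaming: {streaming_first_audio():.1f}s")
```

Even in this simplified model, overlapping the stages cuts time-to-first-audio by several seconds, which is the gap users perceive as an awkward pause.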
Speech-to-Text Setup
Create voice_agent.py:
```python
import assemblyai as aai
from assemblyai.streaming.v3 import (
    BeginEvent,
    StreamingClient,
    StreamingClientOptions,
    StreamingError,
    StreamingEvents,
    StreamingParameters,
    TurnEvent,
    TerminationEvent,
)
from dotenv import load_dotenv
import os

load_dotenv()


class VoiceAgent:
    def __init__(self):
        self.client = StreamingClient(
            StreamingClientOptions(
                api_key=os.getenv("ASSEMBLYAI_API_KEY"),
                api_host="streaming.assemblyai.com",
            )
        )
        self.client.on(StreamingEvents.Begin, self.on_begin)
        self.client.on(StreamingEvents.Turn, self.on_turn)
        self.client.on(StreamingEvents.Termination, self.on_terminated)
        self.client.on(StreamingEvents.Error, self.on_error)
        self.is_processing = False

    def on_begin(self, event: BeginEvent):
        print("Listening... Start speaking!")

    def on_turn(self, turn: TurnEvent):
        if not turn.transcript:
            return
        if turn.end_of_turn:
            print(f"You said: {turn.transcript}")
        else:
            print(f"Hearing: {turn.transcript}", end="\r")

    def on_error(self, error: StreamingError):
        print(f"Error: {error}")

    def on_terminated(self, event: TerminationEvent):
        print("Connection closed")
```
The system provides two transcript types: partial transcripts (in-progress text updated in real time as you speak) and final transcripts (delivered once the end of a turn is detected).
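The distinction matters for deciding when to invoke the LLM: only final transcripts should trigger a response. The routing can be sketched SDK-free; the `Turn` dataclass below is a hypothetical stand-in for the two `TurnEvent` fields the handler uses:

```python
from dataclasses import dataclass


@dataclass
class Turn:
    # Hypothetical stand-in for the SDK's TurnEvent fields used here.
    transcript: str
    end_of_turn: bool


def route_turn(turn: Turn, on_final, on_partial) -> None:
    """Send final transcripts to the LLM callback; show partials only."""
    if not turn.transcript:
        return
    if turn.end_of_turn:
        on_final(turn.transcript)
    else:
        on_partial(turn.transcript)
```

This is the same branching the `on_turn` handler above performs: partials update the display, finals kick off processing.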
Language Model Integration
Add OpenAI processing to the VoiceAgent class:
```python
from openai import OpenAI


class VoiceAgent:
    def __init__(self):
        # Previous code...
        self.openai_client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
        self.conversation = [
            {"role": "system", "content": """You are a helpful voice assistant.
            Keep responses short and conversational.
            Talk like you're having a normal conversation with someone."""}
        ]

    def process_with_llm(self, user_text):
        self.conversation.append({"role": "user", "content": user_text})
        response_text = ""
        stream = self.openai_client.chat.completions.create(
            model="gpt-4",
            messages=self.conversation,
            stream=True,
            temperature=0.7,
            max_tokens=150,
        )
        print("Assistant: ", end="")
        for chunk in stream:
            if chunk.choices[0].delta.content:
                content = chunk.choices[0].delta.content
                response_text += content
                print(content, end="", flush=True)
        print()
        self.conversation.append({"role": "assistant", "content": response_text})
        self.speak(response_text)
```
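Note that this version waits for the entire response before calling speak(). A common refinement is to flush complete sentences to TTS as they stream in, so playback starts on the first sentence while the LLM is still generating. Here is a sketch of that chunking logic; `sentences_from_stream` is our own helper, not part of any of these SDKs:

```python
import re

# Matches a sentence-ending punctuation mark followed by whitespace.
_SENTENCE_END = re.compile(r"([.!?])\s")


def sentences_from_stream(chunks):
    """Accumulate streamed text chunks and yield complete sentences.

    Feeding each yielded sentence to TTS lets audio start on sentence
    one while the model is still generating the rest of the reply.
    """
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        while True:
            match = _SENTENCE_END.search(buffer)
            if not match:
                break
            end = match.end(1)
            yield buffer[:end].strip()
            buffer = buffer[end:]
    if buffer.strip():  # flush whatever remains when the stream ends
        yield buffer.strip()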
Text-to-Speech Output
```python
from elevenlabs.client import ElevenLabs
from elevenlabs import stream as play_stream
import threading


class VoiceAgent:
    def __init__(self):
        # Previous code...
        self.elevenlabs_client = ElevenLabs(api_key=os.getenv("ELEVENLABS_API_KEY"))
        self.voice_id = "EXAVITQu4vr4xnSDxMaL"  # Sarah voice

    def speak(self, text):
        def generate_and_play():
            try:
                audio_stream = self.elevenlabs_client.text_to_speech.stream(
                    voice_id=self.voice_id,
                    text=text,
                    model_id="eleven_turbo_v2_5",
                )
                play_stream(audio_stream)
            except Exception as e:
                print(f"Voice error: {e}")

        thread = threading.Thread(target=generate_and_play, daemon=True)
        thread.start()
```
Available voices:
- Sarah (EXAVITQu4vr4xnSDxMaL): Clear, professional female
- Josh (TxGEqnHWrfWFTfGW9XjX): Warm, friendly male
- Elli (MF3mGyEYCl7XYWbV9V6O): Young, energetic female
Complete Implementation
```python
import assemblyai as aai
import os
import sys
import threading

from assemblyai.streaming.v3 import (
    BeginEvent,
    StreamingClient,
    StreamingClientOptions,
    StreamingError,
    StreamingEvents,
    StreamingParameters,
    TurnEvent,
    TerminationEvent,
)
from elevenlabs.client import ElevenLabs
from elevenlabs import stream as play_stream
from openai import OpenAI
from dotenv import load_dotenv

load_dotenv()


class VoiceAgent:
    def __init__(self):
        self.client = StreamingClient(
            StreamingClientOptions(
                api_key=os.getenv("ASSEMBLYAI_API_KEY"),
                api_host="streaming.assemblyai.com",
            )
        )
        self.client.on(StreamingEvents.Begin, self.on_begin)
        self.client.on(StreamingEvents.Turn, self.on_turn)
        self.client.on(StreamingEvents.Termination, self.on_terminated)
        self.client.on(StreamingEvents.Error, self.on_error)
        self.elevenlabs_client = ElevenLabs(api_key=os.getenv("ELEVENLABS_API_KEY"))
        self.openai_client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
        self.is_processing = False
        self.voice_id = "EXAVITQu4vr4xnSDxMaL"
        self.conversation = [
            {"role": "system", "content": "You are a helpful voice assistant. Keep responses short and conversational."}
        ]

    def on_begin(self, event: BeginEvent):
        print("\nVoice Agent Ready! Start speaking...\n")

    def on_turn(self, turn: TurnEvent):
        if not turn.transcript:
            return
        if turn.end_of_turn:
            print(f"You: {turn.transcript}")
            if not self.is_processing:
                self.is_processing = True
                self.process_with_llm(turn.transcript)
                self.is_processing = False
        else:
            print(f"Listening: {turn.transcript}...", end="\r")

    def on_error(self, error: StreamingError):
        print(f"\nError: {error}\n")

    def on_terminated(self, event: TerminationEvent):
        print("\nVoice Agent stopped\n")

    def process_with_llm(self, user_text):
        self.conversation.append({"role": "user", "content": user_text})
        response_text = ""
        stream = self.openai_client.chat.completions.create(
            model="gpt-4",
            messages=self.conversation,
            stream=True,
            temperature=0.7,
            max_tokens=150,
        )
        print("Agent: ", end="")
        for chunk in stream:
            if chunk.choices[0].delta.content:
                content = chunk.choices[0].delta.content
                response_text += content
                print(content, end="", flush=True)
        print()
        self.conversation.append({"role": "assistant", "content": response_text})
        self.speak(response_text)

    def speak(self, text):
        def generate_and_play():
            try:
                audio_stream = self.elevenlabs_client.text_to_speech.stream(
                    voice_id=self.voice_id,
                    text=text,
                    model_id="eleven_turbo_v2_5",
                )
                play_stream(audio_stream)
            except Exception as e:
                print(f"Voice error: {e}")

        voice_thread = threading.Thread(target=generate_and_play, daemon=True)
        voice_thread.start()

    def start(self):
        self.client.connect(
            StreamingParameters(
                sample_rate=16000,
                speech_model="u3-rt-pro",
            )
        )
        try:
            self.client.stream(aai.extras.MicrophoneStream(sample_rate=16000))
        except KeyboardInterrupt:
            self.stop()

    def stop(self):
        print("\nStopping voice agent...")
        self.client.disconnect(terminate=True)
        sys.exit(0)


if __name__ == "__main__":
    agent = VoiceAgent()
    agent.start()
```
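One caveat: process_with_llm appends every turn to self.conversation, so the prompt (and token cost) grows without bound over a long session. A simple mitigation, sketched below with a hypothetical helper name, is to keep the system message plus only the most recent messages:

```python
def trim_conversation(conversation, max_messages=20):
    """Keep the system message plus the most recent messages.

    `conversation` is the list of {"role": ..., "content": ...} dicts the
    agent maintains; calling this after each append bounds token usage.
    """
    system, rest = conversation[:1], conversation[1:]
    return system + rest[-max_messages:]
```

In the agent, assigning `self.conversation = trim_conversation(self.conversation)` after appending the assistant reply keeps context focused on the recent exchange.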
Running the Agent
```bash
python voice_agent.py
```
When "Voice Agent Ready! Start speaking..." appears, speak naturally into your microphone.
Test prompts:
- "What's the weather like today?"
- "Tell me a quick joke"
- "Help me plan dinner"
- "Explain how WiFi works simply"
Cost Estimate
Approximate hourly costs: $0.50–$1.00
- AssemblyAI Universal-3 Pro Streaming: ~$0.45/hour
- OpenAI GPT-4: ~$0.30/hour
- ElevenLabs voice synthesis: ~$0.20/hour
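Using the per-hour figures above (rough estimates; check current provider pricing before budgeting for real), projecting spend is simple arithmetic:

```python
# Rough per-hour rates from the estimates above, in USD; these change
# with provider pricing, so treat them as placeholders.
RATES_PER_HOUR = {
    "assemblyai_streaming": 0.45,
    "openai_gpt4": 0.30,
    "elevenlabs_tts": 0.20,
}


def monthly_cost(hours_per_day: float, days: int = 30) -> float:
    """Project monthly cost in USD for a given daily usage."""
    hourly_total = sum(RATES_PER_HOUR.values())
    return round(hourly_total * hours_per_day * days, 2)
```

For example, one hour of conversation per day comes to roughly $28.50 a month at these rates.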
Frequently Asked Questions
Do I need WebSocket knowledge?
No. The AssemblyAI Python SDK handles WebSocket connections, reconnection logic, and audio streaming protocols automatically.
Can I replace AssemblyAI with another STT service?
Technically possible, but you would lose the built-in turn detection and need to implement the WebSocket handling and audio streaming yourself.
Does this work with languages other than English?
Yes. Pass a transcription prompt, e.g. `StreamingParameters(speech_model="u3-rt-pro", sample_rate=16_000, prompt="Transcribe Spanish.")`.
Can I use this with frameworks like Pipecat or LiveKit?
Yes. AssemblyAI offers first-party integrations for Pipecat, LiveKit, Vapi, and Twilio.