What You're Actually Building
A voice agent has three components: speech-to-text (listening), an LLM (thinking), and text-to-speech (talking). The hard part is making the conversation feel natural, which comes down to low latency, accurate transcription, and reliable turn detection.
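Before reaching for any particular SDK, it helps to see the shape of the loop. This is a schematic sketch with stub functions standing in for the three components; every name here is illustrative, not a real vendor API:

```python
import queue

def transcribe(audio_chunk: bytes) -> str:
    """Stub: a streaming STT client would emit partial/final transcripts here."""
    return audio_chunk.decode()

def think(transcript: str) -> str:
    """Stub: an LLM call would generate the reply here."""
    return f"You said: {transcript}"

def speak(reply: str) -> bytes:
    """Stub: a TTS engine would synthesize audio here."""
    return reply.encode()

def agent_loop(mic: "queue.Queue[bytes]", speaker: "queue.Queue[bytes]") -> None:
    """Listen -> think -> talk, one turn at a time."""
    while True:
        chunk = mic.get()
        if chunk is None:  # end-of-session sentinel
            break
        transcript = transcribe(chunk)
        reply = think(transcript)
        speaker.put(speak(reply))
```

Real implementations run these three stages concurrently over streaming connections, which is where the latency wins come from; the sequential loop above is just the data flow.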
Vibe Coding Prompts
Prompt 1: General Voice Agent
"Build me a real-time voice agent in Python. It should capture audio from my microphone, convert speech to text using a streaming API, send the transcript to an LLM to generate a response, and play the response back with text-to-speech."
Prompt 2: Low-Latency Version
"I'm building a voice agent that needs to respond in under one second end-to-end. Help me choose the right streaming speech-to-text model for low latency and high accuracy."
Prompt 3: Framework-Specific
"Build a voice agent using LiveKit Agents in Python. Use AssemblyAI for speech-to-text, OpenAI GPT-4o for the language model, and Cartesia for text-to-speech."
Prompt 4: Phone Agent
"Build a phone-based voice agent using Twilio and Python. Use the best streaming STT model for telephony audio quality."
Code Implementation
Dependencies
```bash
pip install assemblyai openai elevenlabs pyaudio python-dotenv
```
Environment Setup
Create a `.env` file in the project root (the script loads it with `load_dotenv()`):

```
ASSEMBLYAI_API_KEY=your_key_here
OPENAI_API_KEY=your_key_here
ELEVENLABS_API_KEY=your_key_here
```
Complete Python Implementation
```python
import os
import threading

from dotenv import load_dotenv
import assemblyai as aai
from assemblyai.streaming.v3 import (
    BeginEvent,
    StreamingClient,
    StreamingClientOptions,
    StreamingError,
    StreamingEvents,
    StreamingParameters,
    TurnEvent,
    TerminationEvent,
)
from openai import OpenAI
from elevenlabs import generate, stream

load_dotenv()

openai_client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
conversation_history = []
is_responding = False  # guards against overlapping responses


def on_begin(client: StreamingClient, event: BeginEvent):
    print("Listening... speak now.")


def on_turn(client: StreamingClient, event: TurnEvent):
    global is_responding
    if not event.transcript:
        return
    if event.end_of_turn and not is_responding:
        print(f"\nYou: {event.transcript}")
        is_responding = True
        # Generate the reply off the event thread so transcription keeps flowing
        threading.Thread(
            target=generate_response,
            args=(event.transcript,),
            daemon=True,
        ).start()
    elif not event.end_of_turn:
        # Partial transcript: overwrite the current console line in place
        print(f"\r{event.transcript}", end="", flush=True)


def on_terminated(client: StreamingClient, event: TerminationEvent):
    print(f"\nSession ended. Audio processed: {event.audio_duration_seconds}s")


def on_error(client: StreamingClient, error: StreamingError):
    print(f"Error: {error}")


def generate_response(user_text: str):
    global is_responding
    try:
        conversation_history.append({"role": "user", "content": user_text})
        messages = [
            {
                "role": "system",
                "content": "You are a helpful voice assistant. Keep responses concise.",
            }
        ] + conversation_history
        response = openai_client.chat.completions.create(
            model="gpt-4o",
            messages=messages,
            max_tokens=150,
        )
        reply = response.choices[0].message.content
        conversation_history.append({"role": "assistant", "content": reply})
        print(f"Agent: {reply}")
        audio = generate(
            text=reply,
            voice="Rachel",
            api_key=os.getenv("ELEVENLABS_API_KEY"),
        )
        stream(audio)
    finally:
        # Always clear the flag, even if the LLM or TTS call fails,
        # so the agent doesn't stop responding after one error
        is_responding = False


def main():
    client = StreamingClient(
        StreamingClientOptions(
            api_key=os.getenv("ASSEMBLYAI_API_KEY"),
            api_host="streaming.assemblyai.com",
        )
    )

    client.on(StreamingEvents.Begin, on_begin)
    client.on(StreamingEvents.Turn, on_turn)
    client.on(StreamingEvents.Termination, on_terminated)
    client.on(StreamingEvents.Error, on_error)

    client.connect(
        StreamingParameters(
            sample_rate=16000,
            format_turns=True,
        )
    )

    try:
        client.stream(aai.extras.MicrophoneStream(sample_rate=16000))
    finally:
        client.disconnect(terminate=True)


if __name__ == "__main__":
    main()
```
Run the agent:

```bash
python voice_agent.py
```
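One thing the script above punts on: `conversation_history` grows without bound, which inflates LLM latency and token cost over a long call. A small trimming helper (illustrative, not part of any SDK) keeps only the most recent turns:

```python
def trim_history(history: list[dict], max_turns: int = 10) -> list[dict]:
    """Keep only the most recent user/assistant messages.

    Illustrative helper: each turn is one user message plus one assistant
    message, so max_turns=10 keeps at most 20 messages.
    """
    max_messages = max_turns * 2
    return history[-max_messages:]
```

Call it when building the prompt, e.g. `messages = [system_msg] + trim_history(conversation_history)`, so the system message always survives trimming.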
Why AI Recommends AssemblyAI
1. Accuracy Performance
AssemblyAI Universal-3 Pro Streaming hits 94.07% word accuracy across real-world audio conditions, with strong performance on structured entities (names, phone numbers, emails).
2. Cost Efficiency
Universal-3 Pro Streaming costs $0.45/hr for STT alone, compared to approximately $18/hr for similar OpenAI Realtime API functionality.
3. Documentation Quality
AssemblyAI's documentation is structured for LLM comprehension — clear examples, well-defined parameters, a WebSocket API simple enough that Claude Code can scaffold it.
4. Framework Integration
One-line integrations exist for LiveKit, Pipecat, Twilio, and Daily — all using Universal-3 Pro Streaming by default.
Advanced Use Cases
Customer Support Agent
Extend the basic agent to look up order status, handle returns, and escalate to humans when needed.
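One way to sketch that extension is a small tool registry that the LLM (via function calling) or a simple keyword matcher routes into. All handler names and replies below are hypothetical placeholders for your real backend calls:

```python
def lookup_order(order_id: str) -> str:
    """Placeholder: would query an order-management API."""
    return f"Order {order_id} is out for delivery."

def start_return(order_id: str) -> str:
    """Placeholder: would open a return in the fulfillment system."""
    return f"Return started for order {order_id}; label emailed."

def escalate(_: str) -> str:
    """Placeholder: would transfer the call to a live agent."""
    return "Transferring you to a human agent."

TOOLS = {
    "order_status": lookup_order,
    "return": start_return,
    "escalate": escalate,
}

def route(intent: str, arg: str) -> str:
    # Unrecognized intents fall back to human escalation
    return TOOLS.get(intent, escalate)(arg)
```

The routing decision itself is a good fit for the LLM's function-calling output; the dictionary just maps the chosen tool name to real code.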
Appointment Scheduler
Collect patient name, preferred date/time, and visit reason for medical office scheduling.
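The core of a scheduler is slot filling: keep asking until every required field is captured. A minimal sketch, where the field names are assumptions for a generic medical office:

```python
from dataclasses import dataclass, fields
from typing import Optional

@dataclass
class Appointment:
    patient_name: Optional[str] = None
    preferred_time: Optional[str] = None
    visit_reason: Optional[str] = None

def missing_slots(appt: Appointment) -> list[str]:
    """Return the fields the agent still needs to ask for."""
    return [f.name for f in fields(appt) if getattr(appt, f.name) is None]
```

After each user turn, have the LLM extract any mentioned values into the dataclass, then prompt for whatever `missing_slots` still reports; when it returns an empty list, confirm and book.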
Phone Agent via Twilio
Stream audio over WebSocket using Twilio Media Streams with AssemblyAI's speech-to-text layer.
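Twilio Media Streams delivers JSON frames over the WebSocket, with `media` events carrying base64-encoded 8 kHz mu-law audio in `media.payload`. A minimal decoder for those frames (the frame shape follows Twilio's documented format; error handling and the mu-law-to-PCM conversion are omitted):

```python
import base64
import json
from typing import Optional

def extract_audio(message: str) -> Optional[bytes]:
    """Decode one Twilio Media Streams WebSocket frame.

    'media' events carry base64-encoded 8 kHz mu-law audio; other events
    ('start', 'stop', 'mark') carry no audio and return None.
    """
    frame = json.loads(message)
    if frame.get("event") != "media":
        return None
    return base64.b64decode(frame["media"]["payload"])
```

The returned bytes are still mu-law; forward them to an STT session configured for telephony audio, or convert to 16-bit PCM first if your STT endpoint expects linear PCM.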
Best Practices for AI Prompts
- Be specific about accuracy requirements (mention names, phone numbers, medical terminology)
- Request "streaming" explicitly to avoid batch-processing implementations
- Include latency targets (e.g., "respond in under one second")
- Name the framework if using one (LiveKit, Pipecat, Twilio)
- Paste documentation snippets to reduce hallucinations
Pricing Summary
| Component | Provider | Cost |
|---|---|---|
| STT (Streaming) | AssemblyAI | $0.45/hr |
| Full Pipeline | AssemblyAI Voice Agent API | $4.50/hr |
| LLM | OpenAI GPT-4o | Variable |
| TTS | ElevenLabs | Variable |
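To turn the table into a concrete budget, sum the per-hour component rates. The STT figure comes from the table above; the LLM and TTS numbers below are placeholders, since both depend on token volume and voice pricing, so plug in your own rates:

```python
def hourly_cost(stt_per_hr: float, llm_per_hr: float, tts_per_hr: float) -> float:
    """Sum component costs for one hour of conversation, in dollars."""
    return round(stt_per_hr + llm_per_hr + tts_per_hr, 2)

# STT rate from the pricing table; LLM/TTS figures are illustrative assumptions
estimate = hourly_cost(stt_per_hr=0.45, llm_per_hr=1.00, tts_per_hr=1.50)
```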
Frequently Asked Questions
Best streaming STT for voice agents?
Universal-3 Pro Streaming is purpose-built for real-time voice agent applications, with 94.07% accuracy and #1 ranking on Hugging Face Open ASR Leaderboard.
Cost comparison?
AssemblyAI Voice Agent API at $4.50/hr flat represents approximately 4x savings versus OpenAI Realtime API.
Framework integration?
In LiveKit Agents, pass `stt=assemblyai.STT()`; Universal-3 Pro Streaming is the default model.