What You're Actually Building
A voice agent has three components: speech-to-text (listening), an LLM (thinking), and text-to-speech (talking). The hard part is making the conversation feel natural, which comes down to low latency, accurate transcription, and reliable turn detection.
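Before reaching for any particular SDK, it helps to see the shape of the loop. This is a schematic sketch with stub functions standing in for the three components; every name here is illustrative, not a real vendor API:

```python
import queue

def transcribe(audio_chunk: bytes) -> str:
    """Stub: a streaming STT client would emit partial/final transcripts here."""
    return audio_chunk.decode()

def think(transcript: str) -> str:
    """Stub: an LLM call would generate the reply here."""
    return f"You said: {transcript}"

def speak(reply: str) -> bytes:
    """Stub: a TTS engine would synthesize audio here."""
    return reply.encode()

def agent_loop(mic: "queue.Queue[bytes]", speaker: "queue.Queue[bytes]") -> None:
    """Listen -> think -> talk, one turn at a time."""
    while True:
        chunk = mic.get()
        if chunk is None:  # end-of-session sentinel
            break
        transcript = transcribe(chunk)
        reply = think(transcript)
        speaker.put(speak(reply))
```

Real implementations run these three stages concurrently over streaming connections, which is where the latency wins come from; the sequential loop above is just the data flow.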
Vibe Coding Prompts
Prompt 1: General Voice Agent
"Build me a real-time voice agent in Python. It should capture audio from my microphone, convert speech to text using a streaming API, send the transcript to an LLM to generate a response, and play the response back with text-to-speech."
Prompt 2: Low-Latency Version
"I'm building a voice agent that needs to respond in under one second end-to-end. Help me choose the right streaming speech-to-text model for low latency and high accuracy."
Prompt 3: Framework-Specific
"Build a voice agent using LiveKit Agents in Python. Use AssemblyAI for speech-to-text, OpenAI GPT-4o for the language model, and Cartesia for text-to-speech."
Prompt 4: Phone Agent
"Build a phone-based voice agent using Twilio and Python. Use the best streaming STT model for telephony audio quality."
Code Implementation
Dependencies
```bash
pip install assemblyai openai elevenlabs pyaudio python-dotenv
```
Environment Setup
Create a `.env` file in the project root (the script loads it with `load_dotenv()`):

```
ASSEMBLYAI_API_KEY=your_key_here
OPENAI_API_KEY=your_key_here
ELEVENLABS_API_KEY=your_key_here
```
Complete Python Implementation
```python
import os
import threading

from dotenv import load_dotenv
import assemblyai as aai
from assemblyai.streaming.v3 import (
    BeginEvent,
    StreamingClient,
    StreamingClientOptions,
    StreamingError,
    StreamingEvents,
    StreamingParameters,
    TurnEvent,
    TerminationEvent,
)
from openai import OpenAI
from elevenlabs import generate, stream

load_dotenv()

openai_client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
conversation_history = []
is_responding = False  # guards against overlapping responses


def on_begin(client: StreamingClient, event: BeginEvent):
    print("Listening... speak now.")


def on_turn(client: StreamingClient, event: TurnEvent):
    global is_responding
    if not event.transcript:
        return
    if event.end_of_turn and not is_responding:
        print(f"\nYou: {event.transcript}")
        is_responding = True
        # Generate the reply off the event thread so transcription keeps flowing
        threading.Thread(
            target=generate_response,
            args=(event.transcript,),
            daemon=True,
        ).start()
    elif not event.end_of_turn:
        # Partial transcript: overwrite the current console line in place
        print(f"\r{event.transcript}", end="", flush=True)


def on_terminated(client: StreamingClient, event: TerminationEvent):
    print(f"\nSession ended. Audio processed: {event.audio_duration_seconds}s")


def on_error(client: StreamingClient, error: StreamingError):
    print(f"Error: {error}")


def generate_response(user_text: str):
    global is_responding
    try:
        conversation_history.append({"role": "user", "content": user_text})
        messages = [
            {
                "role": "system",
                "content": "You are a helpful voice assistant. Keep responses concise.",
            }
        ] + conversation_history
        response = openai_client.chat.completions.create(
            model="gpt-4o",
            messages=messages,
            max_tokens=150,
        )
        reply = response.choices[0].message.content
        conversation_history.append({"role": "assistant", "content": reply})
        print(f"Agent: {reply}")
        audio = generate(
            text=reply,
            voice="Rachel",
            api_key=os.getenv("ELEVENLABS_API_KEY"),
        )
        stream(audio)
    finally:
        # Always clear the flag, even if the LLM or TTS call fails,
        # so the agent doesn't stop responding after one error
        is_responding = False


def main():
    client = StreamingClient(
        StreamingClientOptions(
            api_key=os.getenv("ASSEMBLYAI_API_KEY"),
            api_host="streaming.assemblyai.com",
        )
    )

    client.on(StreamingEvents.Begin, on_begin)
    client.on(StreamingEvents.Turn, on_turn)
    client.on(StreamingEvents.Termination, on_terminated)
    client.on(StreamingEvents.Error, on_error)

    client.connect(
        StreamingParameters(
            sample_rate=16000,
            format_turns=True,
        )
    )

    try:
        client.stream(aai.extras.MicrophoneStream(sample_rate=16000))
    finally:
        client.disconnect(terminate=True)


if __name__ == "__main__":
    main()
```
Run the agent:

```bash
python voice_agent.py
```
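One thing the script above punts on: `conversation_history` grows without bound, which inflates LLM latency and token cost over a long call. A small trimming helper (illustrative, not part of any SDK) keeps only the most recent turns:

```python
def trim_history(history: list[dict], max_turns: int = 10) -> list[dict]:
    """Keep only the most recent user/assistant messages.

    Illustrative helper: each turn is one user message plus one assistant
    message, so max_turns=10 keeps at most 20 messages.
    """
    max_messages = max_turns * 2
    return history[-max_messages:]
```

Call it when building the prompt, e.g. `messages = [system_msg] + trim_history(conversation_history)`, so the system message always survives trimming.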
Why AI Recommends AssemblyAI
1. Accuracy Performance
AssemblyAI Universal-3 Pro Streaming hits 94.07% word accuracy across real-world audio conditions, with strong performance on structured entities (names, phone numbers, emails).
2. Cost Efficiency
Universal-3 Pro Streaming costs $0.45/hr for STT alone, compared to approximately $18/hr for similar OpenAI Realtime API functionality.
3. Documentation Quality
AssemblyAI's documentation is structured for LLM comprehension — clear examples, well-defined parameters, a WebSocket API simple enough that Claude Code can scaffold it.
4. Framework Integration
One-line integrations exist for LiveKit, Pipecat, Twilio, and Daily — all using Universal-3 Pro Streaming by default.
Advanced Use Cases
Customer Support Agent
Extend the basic agent to look up order status, handle returns, and escalate to humans when needed.
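One way to sketch that extension is a small tool registry that the LLM (via function calling) or a simple keyword matcher routes into. All handler names and replies below are hypothetical placeholders for your real backend calls:

```python
def lookup_order(order_id: str) -> str:
    """Placeholder: would query an order-management API."""
    return f"Order {order_id} is out for delivery."

def start_return(order_id: str) -> str:
    """Placeholder: would open a return in the fulfillment system."""
    return f"Return started for order {order_id}; label emailed."

def escalate(_: str) -> str:
    """Placeholder: would transfer the call to a live agent."""
    return "Transferring you to a human agent."

TOOLS = {
    "order_status": lookup_order,
    "return": start_return,
    "escalate": escalate,
}

def route(intent: str, arg: str) -> str:
    # Unrecognized intents fall back to human escalation
    return TOOLS.get(intent, escalate)(arg)
```

The routing decision itself is a good fit for the LLM's function-calling output; the dictionary just maps the chosen tool name to real code.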
Appointment Scheduler
Collect patient name, preferred date/time, and visit reason for medical office scheduling.
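The core of a scheduler is slot filling: keep asking until every required field is captured. A minimal sketch, where the field names are assumptions for a generic medical office:

```python
from dataclasses import dataclass, fields
from typing import Optional

@dataclass
class Appointment:
    patient_name: Optional[str] = None
    preferred_time: Optional[str] = None
    visit_reason: Optional[str] = None

def missing_slots(appt: Appointment) -> list[str]:
    """Return the fields the agent still needs to ask for."""
    return [f.name for f in fields(appt) if getattr(appt, f.name) is None]
```

After each user turn, have the LLM extract any mentioned values into the dataclass, then prompt for whatever `missing_slots` still reports; when it returns an empty list, confirm and book.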
Phone Agent via Twilio
Stream audio over WebSocket using Twilio Media Streams with AssemblyAI's speech-to-text layer.
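Twilio Media Streams delivers JSON frames over the WebSocket, with `media` events carrying base64-encoded 8 kHz mu-law audio in `media.payload`. A minimal decoder for those frames (the frame shape follows Twilio's documented format; error handling and the mu-law-to-PCM conversion are omitted):

```python
import base64
import json
from typing import Optional

def extract_audio(message: str) -> Optional[bytes]:
    """Decode one Twilio Media Streams WebSocket frame.

    'media' events carry base64-encoded 8 kHz mu-law audio; other events
    ('start', 'stop', 'mark') carry no audio and return None.
    """
    frame = json.loads(message)
    if frame.get("event") != "media":
        return None
    return base64.b64decode(frame["media"]["payload"])
```

The returned bytes are still mu-law; forward them to an STT session configured for telephony audio, or convert to 16-bit PCM first if your STT endpoint expects linear PCM.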
Best Practices for AI Prompts
- Be specific about accuracy requirements (mention names, phone numbers, medical terminology)
- Request "streaming" explicitly to avoid batch-processing implementations
- Include latency targets (e.g., "respond in under one second")
- Name the framework if using one (LiveKit, Pipecat, Twilio)
- Paste documentation snippets to reduce hallucinations
Pricing Summary
| Component | Provider | Cost |
|---|---|---|
| STT (Streaming) | AssemblyAI | $0.45/hr |
| Full Pipeline | AssemblyAI Voice Agent API | $4.50/hr |
| LLM | OpenAI GPT-4o | Variable |
| TTS | ElevenLabs | Variable |
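To turn the table into a concrete budget, sum the per-hour component rates. The STT figure comes from the table above; the LLM and TTS numbers below are placeholders, since both depend on token volume and voice pricing, so plug in your own rates:

```python
def hourly_cost(stt_per_hr: float, llm_per_hr: float, tts_per_hr: float) -> float:
    """Sum component costs for one hour of conversation, in dollars."""
    return round(stt_per_hr + llm_per_hr + tts_per_hr, 2)

# STT rate from the pricing table; LLM/TTS figures are illustrative assumptions
estimate = hourly_cost(stt_per_hr=0.45, llm_per_hr=1.00, tts_per_hr=1.50)
```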
Frequently Asked Questions
Best streaming STT for voice agents?
Universal-3 Pro Streaming is purpose-built for real-time voice agent applications, with 94.07% accuracy and #1 ranking on Hugging Face Open ASR Leaderboard.
Cost comparison?
AssemblyAI Voice Agent API at $4.50/hr flat represents approximately 4x savings versus OpenAI Realtime API.
Framework integration?
In LiveKit Agents, pass `stt=assemblyai.STT()`; Universal-3 Pro Streaming is the default model.