DEV Community

Mart Schweiger

Posted on • Originally published at assemblyai.com

How to Build a Voice Agent with Python in 5 Minutes

Overview

This tutorial shows how to build a complete voice agent in Python that listens, thinks, and responds naturally. The application transcribes speech in real time, generates intelligent responses, and speaks back to the user, all in under 100 lines of code.

The solution combines three APIs:

  • AssemblyAI's Universal-3 Pro Streaming for speech-to-text
  • OpenAI's GPT-4 for conversational AI
  • ElevenLabs for voice synthesis

Prerequisites

Requirements:

  • Python 3.9 or higher
  • Three API keys (AssemblyAI, OpenAI, ElevenLabs)
  • Computer with microphone and speakers

Installation

pip install "assemblyai>=1.0.0" openai "elevenlabs>=1.0.0" pyaudio python-dotenv
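Note that pyaudio builds against the PortAudio system library, so on a fresh machine the pip install may fail until PortAudio is present. A typical setup (the exact package names below depend on your platform):

```shell
# macOS (Homebrew)
brew install portaudio

# Debian/Ubuntu
sudo apt-get install portaudio19-dev
```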

Package purposes:

  • assemblyai: Real-time speech recognition with Universal-3 Pro Streaming
  • openai: GPT model integration
  • elevenlabs: Natural voice synthesis
  • pyaudio: Microphone access
  • python-dotenv: API key management

Configuration

Create a .env file:

ASSEMBLYAI_API_KEY=your_assemblyai_key_here
OPENAI_API_KEY=your_openai_key_here
ELEVENLABS_API_KEY=your_elevenlabs_key_here
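A missing or misspelled key in .env tends to surface later as a confusing authentication error, so it can help to fail fast at startup. A minimal sanity check, assuming the keys are loaded from .env or the environment:

```python
import os

# Optional: load a .env file if python-dotenv is installed; keys may also
# come straight from the environment.
try:
    from dotenv import load_dotenv
    load_dotenv()
except ImportError:
    pass

REQUIRED_KEYS = ("ASSEMBLYAI_API_KEY", "OPENAI_API_KEY", "ELEVENLABS_API_KEY")

def missing_keys(env=os.environ):
    """Return the names of any required API keys that are unset or empty."""
    return [key for key in REQUIRED_KEYS if not env.get(key)]

if __name__ == "__main__":
    missing = missing_keys()
    if missing:
        raise SystemExit(f"Missing API keys in .env: {', '.join(missing)}")
    print("All API keys loaded.")
```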

Voice Agent Components

A voice agent operates through three interconnected stages:

Component      Role            Why streaming matters
AssemblyAI     Speech-to-text  Transcribes audio as it arrives for faster LLM response
OpenAI GPT-4   Language model  Generates responses token-by-token for immediate TTS
ElevenLabs     Text-to-speech  Plays audio while generating remaining content

Streaming architecture eliminates delays that make conversations feel unnatural. Batch processing creates awkward pauses between user input and agent response.
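To see why overlap matters, here is some back-of-the-envelope latency arithmetic. The per-stage timings are assumed round numbers for illustration, not measurements of these services:

```python
# Hypothetical per-stage timings, in seconds.
STT_FULL, LLM_FULL, TTS_FULL = 0.8, 1.5, 0.7   # time for each stage to finish
FIRST_CHUNK = 0.25                              # assumed time to first output per stage

def batch_latency():
    """Batch pipeline: each stage waits for the previous one to finish completely."""
    return STT_FULL + LLM_FULL + TTS_FULL

def streaming_latency():
    """Streaming pipeline: stages overlap, so audio starts once each stage
    has emitted its first chunk rather than its full output."""
    return 3 * FIRST_CHUNK
```

Under these assumptions the user waits 3.0 seconds for the first audio in the batch case but only about 0.75 seconds with streaming, which is the difference between an awkward pause and a natural reply.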

Speech-to-Text Setup

Create voice_agent.py:

import assemblyai as aai
from assemblyai.streaming.v3 import (
    BeginEvent,
    StreamingClient,
    StreamingClientOptions,
    StreamingError,
    StreamingEvents,
    StreamingParameters,
    TurnEvent,
    TerminationEvent,
)
from dotenv import load_dotenv
import os

load_dotenv()

class VoiceAgent:
    def __init__(self):
        self.client = StreamingClient(
            StreamingClientOptions(
                api_key=os.getenv('ASSEMBLYAI_API_KEY'),
                api_host="streaming.assemblyai.com",
            )
        )

        self.client.on(StreamingEvents.Begin, self.on_begin)
        self.client.on(StreamingEvents.Turn, self.on_turn)
        self.client.on(StreamingEvents.Termination, self.on_terminated)
        self.client.on(StreamingEvents.Error, self.on_error)

        self.is_processing = False

    def on_begin(self, event: BeginEvent):
        print("Listening... Start speaking!")

    def on_turn(self, turn: TurnEvent):
        if not turn.transcript:
            return

        if turn.end_of_turn:
            print(f"You said: {turn.transcript}")
        else:
            print(f"Hearing: {turn.transcript}", end="\r")

    def on_error(self, error: StreamingError):
        print(f"Error: {error}")

    def on_terminated(self, event: TerminationEvent):
        print("Connection closed")

The streaming API emits two transcript types: partial transcripts (in-progress text updated in real time) and final transcripts (the complete utterance, delivered once a punctuation-triggered pause marks the end of the turn).

Language Model Integration

Add OpenAI processing to the VoiceAgent class:

from openai import OpenAI

class VoiceAgent:
    def __init__(self):
        # Previous code...

        self.openai_client = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))

        self.conversation = [
            {"role": "system", "content": """You are a helpful voice assistant.

Keep responses short and conversational.

Talk like you're having a normal conversation with someone."""}
        ]

    def process_with_llm(self, user_text):
        self.conversation.append({"role": "user", "content": user_text})

        response_text = ""

        stream = self.openai_client.chat.completions.create(
            model="gpt-4",
            messages=self.conversation,
            stream=True,
            temperature=0.7,
            max_tokens=150
        )

        print("Assistant: ", end="")

        for chunk in stream:
            if chunk.choices[0].delta.content:
                content = chunk.choices[0].delta.content
                response_text += content
                print(content, end="", flush=True)

        print()

        self.conversation.append({"role": "assistant", "content": response_text})

        self.speak(response_text)
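One thing to watch: self.conversation grows with every exchange, and a long session will eventually inflate token usage (and cost) on each request. A simple mitigation is to cap the history before sending it to the model. The max_turns cap below is an assumed value, not anything the OpenAI API requires:

```python
def trim_history(conversation, max_turns=10):
    """Keep the system prompt plus the most recent user/assistant messages.

    max_turns is an assumed cap that bounds token usage as the conversation
    grows; each turn contributes one user and one assistant message.
    """
    system, rest = conversation[:1], conversation[1:]
    return system + rest[-max_turns * 2:]
```

Calling self.conversation = trim_history(self.conversation) at the top of process_with_llm keeps requests bounded while preserving recent context.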

Text-to-Speech Output

from elevenlabs.client import ElevenLabs
from elevenlabs import stream as play_stream
import threading

class VoiceAgent:
    def __init__(self):
        # Previous code...

        self.elevenlabs_client = ElevenLabs(api_key=os.getenv('ELEVENLABS_API_KEY'))
        self.voice_id = "EXAVITQu4vr4xnSDxMaL"  # Sarah voice

    def speak(self, text):
        def generate_and_play():
            try:
                audio_stream = self.elevenlabs_client.text_to_speech.stream(
                    voice_id=self.voice_id,
                    text=text,
                    model_id="eleven_turbo_v2_5",
                )

                play_stream(audio_stream)

            except Exception as e:
                print(f"Voice error: {e}")

        thread = threading.Thread(target=generate_and_play, daemon=True)
        thread.start()

Available voices:

  • Sarah (EXAVITQu4vr4xnSDxMaL): Clear, professional female
  • Josh (TxGEqnHWrfWFTfGW9XjX): Warm, friendly male
  • Elli (MF3mGyEYCl7XYWbV9V6O): Young, energetic female
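If you want to switch voices without copy-pasting raw IDs around, a small lookup table works well. The friendly names below are just labels for this sketch; only the IDs are what the ElevenLabs API actually consumes:

```python
# Friendly-name lookup for the voice IDs listed above.
VOICES = {
    "sarah": "EXAVITQu4vr4xnSDxMaL",
    "josh": "TxGEqnHWrfWFTfGW9XjX",
    "elli": "MF3mGyEYCl7XYWbV9V6O",
}

def voice_id(name, default="sarah"):
    """Return the ElevenLabs voice ID for a friendly name, falling back to Sarah."""
    return VOICES.get(name.lower(), VOICES[default])
```

With this in place, self.voice_id = voice_id("josh") reads better than a bare ID string.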

Complete Implementation

import assemblyai as aai
import os
import sys
import threading

from assemblyai.streaming.v3 import (
    BeginEvent,
    StreamingClient,
    StreamingClientOptions,
    StreamingError,
    StreamingEvents,
    StreamingParameters,
    TurnEvent,
    TerminationEvent,
)

from elevenlabs.client import ElevenLabs
from elevenlabs import stream as play_stream
from openai import OpenAI
from dotenv import load_dotenv

load_dotenv()

class VoiceAgent:

    def __init__(self):
        self.client = StreamingClient(
            StreamingClientOptions(
                api_key=os.getenv('ASSEMBLYAI_API_KEY'),
                api_host="streaming.assemblyai.com",
            )
        )

        self.client.on(StreamingEvents.Begin, self.on_begin)
        self.client.on(StreamingEvents.Turn, self.on_turn)
        self.client.on(StreamingEvents.Termination, self.on_terminated)
        self.client.on(StreamingEvents.Error, self.on_error)

        self.elevenlabs_client = ElevenLabs(api_key=os.getenv('ELEVENLABS_API_KEY'))
        self.openai_client = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))

        self.is_processing = False
        self.voice_id = "EXAVITQu4vr4xnSDxMaL"

        self.conversation = [
            {"role": "system", "content": "You are a helpful voice assistant. Keep responses short and conversational."}
        ]

    def on_begin(self, event: BeginEvent):
        print("\nVoice Agent Ready! Start speaking...\n")

    def on_turn(self, turn: TurnEvent):
        if not turn.transcript:
            return

        if turn.end_of_turn:
            print(f"You: {turn.transcript}")

            if not self.is_processing:
                self.is_processing = True
                self.process_with_llm(turn.transcript)
                self.is_processing = False
        else:
            print(f"Listening: {turn.transcript}...", end="\r")

    def on_error(self, error: StreamingError):
        print(f"\n Error: {error}\n")

    def on_terminated(self, event: TerminationEvent):
        print("\n Voice Agent stopped\n")

    def process_with_llm(self, user_text):
        self.conversation.append({"role": "user", "content": user_text})

        response_text = ""

        stream = self.openai_client.chat.completions.create(
            model="gpt-4",
            messages=self.conversation,
            stream=True,
            temperature=0.7,
            max_tokens=150
        )

        print("Agent: ", end="")

        for chunk in stream:
            if chunk.choices[0].delta.content:
                content = chunk.choices[0].delta.content
                response_text += content
                print(content, end="", flush=True)

        print()

        self.conversation.append({"role": "assistant", "content": response_text})

        self.speak(response_text)

    def speak(self, text):
        def generate_and_play():
            try:
                audio_stream = self.elevenlabs_client.text_to_speech.stream(
                    voice_id=self.voice_id,
                    text=text,
                    model_id="eleven_turbo_v2_5",
                )
                play_stream(audio_stream)
            except Exception as e:
                print(f"Voice error: {e}")

        voice_thread = threading.Thread(target=generate_and_play)
        voice_thread.daemon = True
        voice_thread.start()

    def start(self):
        self.client.connect(
            StreamingParameters(
                sample_rate=16000,
                speech_model="u3-rt-pro",
            )
        )

        try:
            self.client.stream(aai.extras.MicrophoneStream(sample_rate=16000))
        except KeyboardInterrupt:
            self.stop()

    def stop(self):
        print("\nStopping voice agent...")
        self.client.disconnect(terminate=True)
        sys.exit(0)

if __name__ == "__main__":
    agent = VoiceAgent()
    agent.start()

Running the Agent

python voice_agent.py

When "Voice Agent Ready! Start speaking..." appears, speak naturally into your microphone.

Test prompts:

  • "What's the weather like today?"
  • "Tell me a quick joke"
  • "Help me plan dinner"
  • "Explain how WiFi works simply"

Cost Estimate

Approximate hourly costs: $0.50–$1.00

  • AssemblyAI Universal-3 Pro Streaming: ~$0.45/hour
  • OpenAI GPT-4: ~$0.30/hour
  • ElevenLabs voice synthesis: ~$0.20/hour
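The per-service figures above can be folded into a quick cost calculator. These are the approximate hourly averages from this article; actual billing is per minute of audio, per token, or per character depending on the provider:

```python
# Approximate hourly rates in USD, taken from the estimates above.
RATES_PER_HOUR = {"assemblyai": 0.45, "openai_gpt4": 0.30, "elevenlabs": 0.20}

def estimate_cost(hours):
    """Estimate the total cost in dollars for a given number of conversation hours."""
    return round(sum(RATES_PER_HOUR.values()) * hours, 2)
```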

Frequently Asked Questions

Do I need WebSocket knowledge?
No. The AssemblyAI Python SDK handles WebSocket connections, reconnection logic, and audio streaming protocols automatically.

Can I replace AssemblyAI with another STT service?
Technically possible, but you would lose the built-in punctuation-based turn detection and would have to implement the WebSocket handling and audio streaming yourself.

Does this work with languages other than English?
Yes. Configure: StreamingParameters(speech_model="u3-rt-pro", sample_rate=16_000, prompt="Transcribe Spanish.").

Can I use this with frameworks like Pipecat or LiveKit?
Yes. AssemblyAI offers first-party integrations for Pipecat, LiveKit, Vapi, and Twilio.
