# Raw WebSocket Voice Agent with AssemblyAI Universal-3 Pro Streaming
The simplest possible voice agent: no frameworks, no abstraction layers. Just raw WebSockets, a microphone, and the AssemblyAI Universal-3 Pro Streaming model (`u3-rt-pro`).

This shows exactly what LiveKit Agents, Pipecat, and Vapi do underneath. If you want full control over every byte, or you're embedding a voice agent in a custom application, start here.
## The Pipeline
```
Microphone
  │ float32 audio (sounddevice)
  ▼ convert → int16 PCM
AssemblyAI WebSocket (wss://streaming.assemblyai.com/v3/ws)
  │ ?speech_model=u3-rt-pro&encoding=pcm_s16le&sample_rate=16000
  │ Turn message (end_of_turn=true) — neural turn detection
  ▼
OpenAI GPT-4o → text response
  ▼
ElevenLabs TTS → PCM audio → Speakers (sounddevice)
```
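The float32 → int16 hop at the top of the pipeline is a couple of lines of NumPy. A minimal sketch, assuming `sounddevice` delivers float32 frames in [-1.0, 1.0] (the function name is illustrative, not from the repo):

```python
import numpy as np

def float32_to_pcm16(frames: np.ndarray) -> bytes:
    """Convert float32 samples in [-1.0, 1.0] to pcm_s16le bytes."""
    clipped = np.clip(frames, -1.0, 1.0)          # guard against overshoot
    # '<i2' forces little-endian int16, matching encoding=pcm_s16le
    return (clipped * 32767).astype("<i2").tobytes()

# Three samples become six bytes (2 bytes per sample)
pcm = float32_to_pcm16(np.array([0.0, 0.5, -1.0], dtype=np.float32))
```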
## Prerequisites
- Python 3.11+
- Microphone and speakers
- AssemblyAI API key
- OpenAI API key
- ElevenLabs API key
- macOS: `brew install portaudio`
## Quick Start
```bash
git clone https://github.com/kelseyefoster/voice-agent-websocket-universal-3-pro
cd voice-agent-websocket-universal-3-pro
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
cp .env.example .env   # then add your API keys
python agent.py
```
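The `.env` file holds the three keys (the variable names below are an assumption; check `.env.example` for the exact names the script reads):

```
ASSEMBLYAI_API_KEY=...
OPENAI_API_KEY=...
ELEVENLABS_API_KEY=...
```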
## WebSocket Connection
```python
AAI_WS_URL = (
    "wss://streaming.assemblyai.com/v3/ws"
    "?speech_model=u3-rt-pro"
    "&encoding=pcm_s16le"
    "&sample_rate=16000"
    "&end_of_turn_confidence_threshold=0.4"
    f"&token={ASSEMBLYAI_API_KEY}"
)
```
## Message Types
AssemblyAI v3 streams three event types:
```json
{ "type": "Begin", "id": "session_abc123" }
{ "type": "Turn", "transcript": "how do I", "end_of_turn": false }
{ "type": "Turn", "transcript": "how do I get started?", "end_of_turn": true }
{ "type": "Termination" }
```
## Sending Audio
```python
# Raw PCM bytes — no wrapper, no base64
await ws.send(pcm_bytes)

# Terminate cleanly
await ws.send(json.dumps({"type": "Terminate"}))
```
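How much audio to put in each `ws.send` is a sizing decision. The arithmetic is simple; a sketch (the 50 ms chunk is a common choice for streaming STT, not a figure from this README):

```python
SAMPLE_RATE = 16000   # must match the sample_rate query parameter
CHANNELS = 1          # mono
SAMPLE_BYTES = 2      # pcm_s16le = 2 bytes per sample

def chunk_size_bytes(duration_ms: int) -> int:
    """Bytes of raw PCM covering duration_ms of audio at the settings above."""
    return SAMPLE_RATE * CHANNELS * SAMPLE_BYTES * duration_ms // 1000

# 50 ms of 16 kHz mono int16 audio:
# chunk_size_bytes(50) == 1600
```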
## Turn Detection Tuning
| Setting | Effect |
|---|---|
| Lower `end_of_turn_confidence_threshold` (0.3) | Faster response, more false triggers |
| Higher `end_of_turn_confidence_threshold` (0.6) | More patient, better for noisy environments |
| Lower `min_turn_silence` (200 ms) | Snappier for fast-paced conversation |
| Higher `max_turn_silence` (2000 ms) | Better for deliberate speech |
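All of these knobs are query parameters on the connection URL, and `urllib.parse.urlencode` keeps the string readable as they accumulate. A sketch (parameter names are the ones used in the table and the connection snippet above; the helper itself is illustrative):

```python
from urllib.parse import urlencode

def build_ws_url(api_key: str, **tuning) -> str:
    """Assemble the v3 streaming URL with optional turn-detection overrides."""
    params = {
        "speech_model": "u3-rt-pro",
        "encoding": "pcm_s16le",
        "sample_rate": 16000,
        **tuning,            # e.g. end_of_turn_confidence_threshold=0.6
        "token": api_key,
    }
    return "wss://streaming.assemblyai.com/v3/ws?" + urlencode(params)

url = build_ws_url("YOUR_KEY", end_of_turn_confidence_threshold=0.6, min_turn_silence=200)
```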
## Swapping Components
Use Claude instead of GPT-4o:

```python
import os

from anthropic import AsyncAnthropic

client = AsyncAnthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
response = await client.messages.create(model="claude-opus-4-6", max_tokens=150, ...)
```
Use Cartesia for lower TTS latency:

```python
import cartesia
```