# Raw WebSocket Voice Agent with AssemblyAI Universal-3 Pro Streaming
The simplest possible voice agent: no frameworks, no abstraction layers. Just raw WebSockets, a microphone, and the AssemblyAI Universal-3 Pro Streaming model (`u3-rt-pro`).

This shows exactly what LiveKit Agents, Pipecat, and Vapi do underneath. If you want full control over every byte, or you're embedding a voice agent in a custom application, start here.
## The Pipeline
```
Microphone
  │ float32 audio (sounddevice)
  ▼ convert → int16 PCM
AssemblyAI WebSocket (wss://streaming.assemblyai.com/v3/ws)
  │ ?speech_model=u3-rt-pro&encoding=pcm_s16le&sample_rate=16000
  │ Turn message (end_of_turn=true) — neural turn detection
  ▼
OpenAI GPT-4o → text response
  ▼
ElevenLabs TTS → PCM audio → Speakers (sounddevice)
```
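The float32 → int16 hop at the top of the pipeline is a couple of lines of NumPy. A minimal sketch, assuming `sounddevice` delivers float32 frames in [-1.0, 1.0] (the function name is illustrative, not from the repo):

```python
import numpy as np

def float32_to_pcm16(frames: np.ndarray) -> bytes:
    """Convert float32 samples in [-1.0, 1.0] to pcm_s16le bytes."""
    clipped = np.clip(frames, -1.0, 1.0)          # guard against overshoot
    # '<i2' forces little-endian int16, matching encoding=pcm_s16le
    return (clipped * 32767).astype("<i2").tobytes()

# Three samples become six bytes (2 bytes per sample)
pcm = float32_to_pcm16(np.array([0.0, 0.5, -1.0], dtype=np.float32))
```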
## Prerequisites
- Python 3.11+
- Microphone and speakers
- AssemblyAI API key
- OpenAI API key
- ElevenLabs API key
- macOS: `brew install portaudio`
## Quick Start
```bash
git clone https://github.com/kelseyefoster/voice-agent-websocket-universal-3-pro
cd voice-agent-websocket-universal-3-pro
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
cp .env.example .env   # then add your API keys
python agent.py
```
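The `.env` file holds the three keys (the variable names below are an assumption; check `.env.example` for the exact names the script reads):

```
ASSEMBLYAI_API_KEY=...
OPENAI_API_KEY=...
ELEVENLABS_API_KEY=...
```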
## WebSocket Connection
```python
AAI_WS_URL = (
    "wss://streaming.assemblyai.com/v3/ws"
    "?speech_model=u3-rt-pro"
    "&encoding=pcm_s16le"
    "&sample_rate=16000"
    "&end_of_turn_confidence_threshold=0.4"
    f"&token={ASSEMBLYAI_API_KEY}"
)
```
## Message Types
AssemblyAI v3 streams three event types:
```json
{ "type": "Begin", "id": "session_abc123" }
{ "type": "Turn", "transcript": "how do I", "end_of_turn": false }
{ "type": "Turn", "transcript": "how do I get started?", "end_of_turn": true }
{ "type": "Termination" }
```
## Sending Audio
```python
# Raw PCM bytes — no wrapper, no base64
await ws.send(pcm_bytes)

# Terminate cleanly
await ws.send(json.dumps({"type": "Terminate"}))
```
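How much audio to put in each `ws.send` is a sizing decision. The arithmetic is simple; a sketch (the 50 ms chunk is a common choice for streaming STT, not a figure from this README):

```python
SAMPLE_RATE = 16000   # must match the sample_rate query parameter
CHANNELS = 1          # mono
SAMPLE_BYTES = 2      # pcm_s16le = 2 bytes per sample

def chunk_size_bytes(duration_ms: int) -> int:
    """Bytes of raw PCM covering duration_ms of audio at the settings above."""
    return SAMPLE_RATE * CHANNELS * SAMPLE_BYTES * duration_ms // 1000

# 50 ms of 16 kHz mono int16 audio:
# chunk_size_bytes(50) == 1600
```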
## Turn Detection Tuning
| Setting | Effect |
|---|---|
| Lower `end_of_turn_confidence_threshold` (0.3) | Faster response, more false triggers |
| Higher `end_of_turn_confidence_threshold` (0.6) | More patient, better for noisy environments |
| Lower `min_turn_silence` (200 ms) | Snappier for fast-paced conversation |
| Higher `max_turn_silence` (2000 ms) | Better for deliberate speech |
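All of these knobs are query parameters on the connection URL, and `urllib.parse.urlencode` keeps the string readable as they accumulate. A sketch (parameter names are the ones used in the table and the connection snippet above; the helper itself is illustrative):

```python
from urllib.parse import urlencode

def build_ws_url(api_key: str, **tuning) -> str:
    """Assemble the v3 streaming URL with optional turn-detection overrides."""
    params = {
        "speech_model": "u3-rt-pro",
        "encoding": "pcm_s16le",
        "sample_rate": 16000,
        **tuning,            # e.g. end_of_turn_confidence_threshold=0.6
        "token": api_key,
    }
    return "wss://streaming.assemblyai.com/v3/ws?" + urlencode(params)

url = build_ws_url("YOUR_KEY", end_of_turn_confidence_threshold=0.6, min_turn_silence=200)
```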
## Swapping Components
Use Claude instead of GPT-4o:

```python
import os

from anthropic import AsyncAnthropic

client = AsyncAnthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
response = await client.messages.create(model="claude-opus-4-6", max_tokens=150, ...)
```
Use Cartesia for lower TTS latency:

```python
import cartesia
```