Mart Schweiger

Posted on • Originally published at assemblyai.com

Raw WebSocket Voice Agent with AssemblyAI Universal-3 Pro Streaming

The simplest possible voice agent — no frameworks, no abstraction layers. Just raw WebSockets, a microphone, and the AssemblyAI Universal-3 Pro Streaming model (u3-rt-pro).

This is the same kind of plumbing that frameworks such as LiveKit Agents, Pipecat, and Vapi handle for you underneath. If you want full control over every byte, or you're embedding a voice agent into a custom application, start here.

The Pipeline

Microphone
    │ float32 audio (sounddevice)
    ▼ convert → int16 PCM
AssemblyAI WebSocket (wss://streaming.assemblyai.com/v3/ws)
    │ ?speech_model=u3-rt-pro&encoding=pcm_s16le&sample_rate=16000
    │ Turn message (end_of_turn=true) — neural turn detection
    ▼
OpenAI GPT-4o → text response
    ▼
ElevenLabs TTS → PCM audio → Speakers (sounddevice)
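The "convert → int16 PCM" step in the pipeline can be sketched as follows. This is a minimal, stdlib-only illustration of the arithmetic, not the repository's code: the function name float_to_pcm16 is hypothetical, and with sounddevice you would more likely do the same scaling on the NumPy array it hands you.

```python
import struct
from typing import Iterable


def float_to_pcm16(samples: Iterable[float]) -> bytes:
    """Convert float samples in [-1.0, 1.0] to little-endian signed 16-bit PCM.

    Hypothetical helper: it mirrors the float32 -> pcm_s16le step in the
    pipeline diagram using only the standard library.
    """
    ints = []
    for s in samples:
        s = max(-1.0, min(1.0, s))          # clip so loud input cannot overflow int16
        ints.append(int(round(s * 32767)))  # scale to the signed 16-bit range
    # "<" = little-endian, "h" = signed 16-bit integer
    return struct.pack(f"<{len(ints)}h", *ints)
```

Each sample becomes two bytes, so a 16 kHz mono stream produces 32,000 bytes per second of audio.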

Prerequisites

  • Python 3.11+
  • Microphone and speakers
  • AssemblyAI API key
  • OpenAI API key
  • ElevenLabs API key

macOS: brew install portaudio

Quick Start

git clone https://github.com/kelseyefoster/voice-agent-websocket-universal-3-pro
cd voice-agent-websocket-universal-3-pro

python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt

cp .env.example .env   # then add your AssemblyAI, OpenAI, and ElevenLabs API keys to .env
python agent.py

WebSocket Connection

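import os

# Assumes the key is available as the ASSEMBLYAI_API_KEY environment variable
ASSEMBLYAI_API_KEY = os.environ["ASSEMBLYAI_API_KEY"]

AAI_WS_URL = (
    "wss://streaming.assemblyai.com/v3/ws"
    "?speech_model=u3-rt-pro"
    "&encoding=pcm_s16le"
    "&sample_rate=16000"
    "&end_of_turn_confidence_threshold=0.4"
    f"&token={ASSEMBLYAI_API_KEY}"
)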
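If you would rather not concatenate the query string by hand, the same URL can be assembled with urllib.parse, which also escapes any special characters in the token. This is a sketch under the assumption that the endpoint accepts exactly the parameters shown above; build_ws_url and its defaults are illustrative names, not part of any SDK.

```python
from urllib.parse import urlencode

BASE_URL = "wss://streaming.assemblyai.com/v3/ws"


def build_ws_url(token: str, threshold: float = 0.4, sample_rate: int = 16000) -> str:
    """Assemble the streaming URL from the query parameters used in this post.

    Hypothetical helper: the parameter names are copied from the URL above.
    """
    params = {
        "speech_model": "u3-rt-pro",
        "encoding": "pcm_s16le",
        "sample_rate": sample_rate,
        "end_of_turn_confidence_threshold": threshold,
        "token": token,
    }
    # urlencode percent-escapes values, so a token with "&" or spaces stays intact
    return f"{BASE_URL}?{urlencode(params)}"
```

Keeping the parameters in a dict makes it easier to vary one setting per session without editing a long string literal.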

Message Types

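AssemblyAI v3 streams three event types: Begin when the session opens, Turn for each transcript update (partial until end_of_turn is true), and Termination when the session closes: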

{ "type": "Begin", "id": "session_abc123" }
{ "type": "Turn", "transcript": "how do I", "end_of_turn": false }
{ "type": "Turn", "transcript": "how do I get started?", "end_of_turn": true }
{ "type": "Termination" }
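A receive loop mostly needs to know when a turn has finished so it can hand the text to the LLM. The sketch below shows one way to filter the messages above; final_transcript is a hypothetical helper, and it assumes each WebSocket text frame carries exactly one JSON object shaped like the examples.

```python
import json
from typing import Optional


def final_transcript(raw: str) -> Optional[str]:
    """Return the transcript when a Turn message closes the turn, else None.

    Hypothetical helper: partial Turn messages, Begin, and Termination
    all yield None, so the caller only reacts to completed turns.
    """
    msg = json.loads(raw)
    if msg.get("type") == "Turn" and msg.get("end_of_turn"):
        return msg.get("transcript", "")
    return None
```

In a loop over incoming frames, you would call this on each message and forward any non-None result to the language model.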

Sending Audio

# Raw PCM bytes — no wrapper, no base64
await ws.send(pcm_bytes)

# Terminate cleanly
await ws.send(json.dumps({"type": "Terminate"}))
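One way to connect the microphone callback to the socket is a queue that a sender task drains. The sketch below assumes a connection object with an async send() method (as in the snippet above) and uses None as an end-of-stream sentinel; pump_audio, FakeSocket, and demo are hypothetical names, and FakeSocket exists only so the example runs without a network.

```python
import asyncio
import json


async def pump_audio(ws, chunks: asyncio.Queue) -> None:
    """Forward PCM chunks to the socket until a None sentinel, then terminate."""
    while True:
        chunk = await chunks.get()
        if chunk is None:          # sentinel: the capture side has stopped
            break
        await ws.send(chunk)       # raw PCM bytes, no wrapper, no base64
    await ws.send(json.dumps({"type": "Terminate"}))


class FakeSocket:
    """Offline stand-in for a real connection (hypothetical, for the demo only)."""

    def __init__(self) -> None:
        self.sent = []

    async def send(self, data) -> None:
        self.sent.append(data)


async def demo() -> list:
    queue: asyncio.Queue = asyncio.Queue()
    for item in (b"\x00\x01", b"\x02\x03", None):
        queue.put_nowait(item)
    ws = FakeSocket()
    await pump_audio(ws, queue)
    return ws.sent
```

Running asyncio.run(demo()) returns the two audio chunks followed by the JSON Terminate message, which is the order the server should see them in.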

Turn Detection Tuning

Setting                                          Effect
Lower end_of_turn_confidence_threshold (0.3)     Faster response, more false triggers
Higher end_of_turn_confidence_threshold (0.6)    More patient, better for noisy environments
Lower min_turn_silence (200ms)                   Snappier for fast-paced conversation
Higher max_turn_silence (2000ms)                 Better for deliberate speech
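The settings in the table can be kept as named presets and merged into an existing connection URL. This is only a sketch: it assumes min_turn_silence and max_turn_silence are passed as query parameters in milliseconds, the same way end_of_turn_confidence_threshold is, and the preset names and apply_preset are made up for illustration.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical presets built from the values in the table above.
# Assumption: the silence settings are query parameters expressed in milliseconds.
PRESETS = {
    "snappy": {"end_of_turn_confidence_threshold": 0.3, "min_turn_silence": 200},
    "patient": {"end_of_turn_confidence_threshold": 0.6, "max_turn_silence": 2000},
}


def apply_preset(url: str, name: str) -> str:
    """Return url with the named preset merged into its query string."""
    parts = urlsplit(url)
    query = dict(parse_qsl(parts.query))
    # Existing keys are overwritten in place; new keys are appended
    query.update({key: str(value) for key, value in PRESETS[name].items()})
    return urlunsplit(parts._replace(query=urlencode(query)))
```

Starting from one base URL and switching presets per session makes it easy to compare how each setting feels in a real conversation.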

Swapping Components

Use Claude instead of GPT-4o:

import os
from anthropic import AsyncAnthropic

client = AsyncAnthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
response = await client.messages.create(model="claude-opus-4-6", max_tokens=150, ...)

Use Cartesia for lower TTS latency:

import cartesia

Resources
