Daily.co Voice Agent with AssemblyAI Universal-3 Pro Streaming

#python #webrtc #ai #tutorial

Build a WebRTC voice agent using Daily.co for real-time audio transport and the AssemblyAI Universal-3 Pro Streaming model for speech-to-text — without Pipecat.

This is the bare-metal Daily.co integration. It shows exactly how Daily's audio tracks connect to the AssemblyAI WebSocket — useful when you want to embed a voice agent into a custom Daily.co application without pulling in a full pipeline framework.

Architecture

Browser / Phone (Daily.co room participant)
        │ WebRTC audio
        ▼
  Daily.co room
        │ PCM audio via daily-python SDK
        ▼
  This bot (daily-python)
        │ raw PCM bytes
        ▼
  AssemblyAI Universal-3 Pro WebSocket
        │ transcript + neural turn signal
        ▼
  OpenAI GPT-4o → Cartesia TTS → PCM audio
        ▼
  Bot sends audio back into Daily room

Prerequisites

You need Python 3 and API keys for the services used in the architecture above: Daily.co, AssemblyAI, OpenAI, and Cartesia.

Quick Start

git clone https://github.com/kelseyefoster/voice-agent-dailyco-universal-3-pro
cd voice-agent-dailyco-universal-3-pro

python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt

cp .env.example .env
# Edit .env with your API keys

# Create a room and bot token
python create_room.py

# Start the bot
python bot.py --room-url https://yourname.daily.co/room --token <bot-token>

Open the room URL in your browser and start speaking.
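
The bot is started with a --room-url and a --token flag. As a rough sketch of how those two flags could be parsed (the function name parse_args, the help strings, and making --token optional are assumptions, not taken from the repository's bot.py):

```python
import argparse


def parse_args(argv=None):
    """Parse the two flags shown in the Quick Start command."""
    parser = argparse.ArgumentParser(description="Daily.co voice agent bot")
    # argparse exposes --room-url as args.room_url (dash becomes underscore).
    parser.add_argument("--room-url", required=True,
                        help="Full URL of the Daily.co room to join")
    # Assumption: the token is optional here; private rooms would need it.
    parser.add_argument("--token", default=None,
                        help="Meeting token created for the bot")
    return parser.parse_args(argv)
```

Calling parse_args() with no arguments reads sys.argv, so the same function works from the command line and from tests that pass an explicit list.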

How Audio Flows

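The daily-python SDK calls on_audio_data with audio from remote participants. The bot forwards the raw PCM bytes directly to the AssemblyAI WebSocket. No conversion is needed as long as the audio arrives as 16-bit PCM at 16 kHz, matching the encoding and sample_rate given to AssemblyAI: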

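def on_audio_data(self, participant_id, audio_data, sample_rate, num_channels):
    # SDK callbacks may run outside the asyncio thread, where create_task() fails,
    # so schedule the send on the bot's loop (self.loop is assumed to be stored at startup).
    if self.aai_ws and not self.aai_ws.closed:
        asyncio.run_coroutine_threadsafe(self.aai_ws.send(audio_data), self.loop)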
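
Audio callbacks can deliver buffers of varying size, and streaming speech-to-text services generally work best with evenly sized chunks. The sketch below is one way to re-chunk 16-bit PCM into fixed-duration pieces before sending. The class name PcmChunker and the 100 ms default are illustrative assumptions, not part of the original bot or a requirement stated by either SDK:

```python
class PcmChunker:
    """Accumulate 16-bit PCM bytes and emit fixed-duration chunks."""

    def __init__(self, sample_rate=16000, num_channels=1, chunk_ms=100):
        # 2 bytes per sample for pcm_s16le.
        bytes_per_second = sample_rate * num_channels * 2
        self.chunk_bytes = bytes_per_second * chunk_ms // 1000
        self._buffer = bytearray()

    def feed(self, pcm):
        """Add raw PCM bytes; return a list of complete chunks (possibly empty)."""
        self._buffer.extend(pcm)
        chunks = []
        while len(self._buffer) >= self.chunk_bytes:
            chunks.append(bytes(self._buffer[:self.chunk_bytes]))
            del self._buffer[:self.chunk_bytes]
        return chunks

    def pending_bytes(self):
        """Number of buffered bytes that do not yet fill a whole chunk."""
        return len(self._buffer)
```

In the callback, each chunk returned by feed(audio_data) would be sent over the WebSocket in place of the raw buffer. Leftover bytes stay buffered until the next call.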

When Universal-3 Pro detects an end-of-turn, the bot generates a response with GPT-4o, synthesizes audio with Cartesia, and injects it back into the room.
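
To make the end-of-turn step concrete, here is a minimal sketch of how the bot might decide that a WebSocket message marks a finished turn. It assumes the server sends JSON messages with a "type" of "Turn", a boolean "end_of_turn", and a "transcript" string. Those field names and the helper name final_transcript are assumptions to check against the AssemblyAI streaming documentation:

```python
import json


def final_transcript(raw_message):
    """Return the transcript of a completed turn, or None for anything else."""
    try:
        msg = json.loads(raw_message)
    except (TypeError, ValueError):
        # Not JSON (or not text at all): ignore it.
        return None
    if not isinstance(msg, dict):
        return None
    # Assumed message shape: {"type": "Turn", "end_of_turn": true, "transcript": "..."}
    if msg.get("type") != "Turn" or not msg.get("end_of_turn"):
        return None
    text = (msg.get("transcript") or "").strip()
    return text or None
```

A receive loop would call this on every incoming message and pass any non-None result to the GPT-4o step. Partial transcripts and other message types fall through as None.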

AssemblyAI Connection Parameters

AAI_WS_URL = (
    "wss://streaming.assemblyai.com/v3/ws"
    "?speech_model=u3-rt-pro"
    "&encoding=pcm_s16le"
    f"&sample_rate={SAMPLE_RATE}"
    "&end_of_turn_confidence_threshold=0.4"
    "&min_turn_silence=300"
    f"&token={ASSEMBLYAI_API_KEY}"
)
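
The same URL can also be assembled with urllib.parse.urlencode, which escapes any special characters in the values. This is a sketch under the assumption that the parameter names and defaults above are the ones you want. The function name build_aai_ws_url and its keyword arguments are hypothetical:

```python
from urllib.parse import urlencode

AAI_WS_BASE = "wss://streaming.assemblyai.com/v3/ws"


def build_aai_ws_url(token, sample_rate=16000,
                     eot_threshold=0.4, min_turn_silence_ms=300):
    """Build the streaming URL from the query parameters shown above."""
    params = {
        "speech_model": "u3-rt-pro",
        "encoding": "pcm_s16le",
        "sample_rate": sample_rate,
        "end_of_turn_confidence_threshold": eot_threshold,
        "min_turn_silence": min_turn_silence_ms,
        "token": token,
    }
    # urlencode keeps dict order and percent-escapes each value.
    return f"{AAI_WS_BASE}?{urlencode(params)}"
```

Keeping the tunable values as arguments makes it easier to experiment with the end-of-turn threshold and the silence window without editing a long string literal.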

Why Direct Daily.co Instead of Pipecat?

Pipecat excels for production voice agents with complex pipelines (VAD, context management, interruption handling). This tutorial targets those who want bare-metal access to primitives — useful when embedding a voice agent into a custom Daily.co application without full framework overhead.