Mart Schweiger

Originally published at assemblyai.com

Daily.co Voice Agent with AssemblyAI Universal-3 Pro Streaming

Build a WebRTC voice agent using Daily.co for real-time audio transport and the AssemblyAI Universal-3 Pro Streaming model for speech-to-text — without Pipecat.

This is the bare-metal Daily.co integration. It shows exactly how Daily's audio tracks connect to the AssemblyAI WebSocket — useful when you want to embed a voice agent into a custom Daily.co application without pulling in a full pipeline framework.

Architecture

Browser / Phone (Daily.co room participant)
        │ WebRTC audio
        ▼
  Daily.co room
        │ PCM audio via daily-python SDK
        ▼
  This bot (daily-python)
        │ raw PCM bytes
        ▼
  AssemblyAI Universal-3 Pro WebSocket
        │ transcript + neural turn signal
        ▼
  OpenAI GPT-4o → Cartesia TTS → PCM audio
        ▼
  Bot sends audio back into Daily room
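
The bottom half of the diagram (transcript in, synthesized audio out) can be sketched as a single coroutine. This is a minimal illustration rather than the repository's code: `run_turn`, `generate_reply`, `synthesize`, and `send_audio` are hypothetical names standing in for the GPT-4o call, the Cartesia call, and whatever the bot uses to write PCM into the Daily room.

```python
import asyncio
from typing import Awaitable, Callable, Optional


async def run_turn(
    transcript: str,
    generate_reply: Callable[[str], Awaitable[str]],  # hypothetical LLM wrapper
    synthesize: Callable[[str], Awaitable[bytes]],    # hypothetical TTS wrapper
    send_audio: Callable[[bytes], Awaitable[None]],   # hypothetical room writer
) -> Optional[str]:
    """Run one user turn through LLM -> TTS -> room, in that order."""
    if not transcript.strip():
        return None  # nothing was said, so skip the LLM and TTS calls
    reply = await generate_reply(transcript)
    pcm = await synthesize(reply)
    await send_audio(pcm)
    return reply
```

Passing the three stages in as callables keeps the sketch independent of any particular SDK, so each stage can be swapped or stubbed out while testing.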

Prerequisites

Quick Start

git clone https://github.com/kelseyefoster/voice-agent-dailyco-universal-3-pro
cd voice-agent-dailyco-universal-3-pro

python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt

cp .env.example .env
# Edit .env with your API keys

# Create a room and bot token
python create_room.py

# Start the bot
python bot.py --room-url https://yourname.daily.co/room --token <bot-token>

Open the room URL in your browser and start speaking.

How Audio Flows

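Daily.co calls on_audio_data whenever a remote participant speaks. The bot forwards the raw PCM bytes directly to the AssemblyAI WebSocket. No conversion is needed as long as the audio arrives as 16-bit PCM at 16 kHz, matching the encoding and sample_rate parameters in the connection URL: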

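def on_audio_data(self, participant_id, audio_data, sample_rate, num_channels):
    # Assumes Daily invokes this callback on its own thread (not the asyncio loop
    # thread) and that self.loop holds the bot's running event loop.
    if self.aai_ws and not self.aai_ws.closed:
        asyncio.run_coroutine_threadsafe(self.aai_ws.send(audio_data), self.loop)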
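
The "no conversion needed" shortcut only holds when the incoming audio already matches what the WebSocket was told to expect. A small guard can make that assumption explicit. The sketch below is not part of the repository: `TARGET_RATE`, `to_mono_s16le`, and `prepare_chunk` are hypothetical names, and it assumes interleaved signed 16-bit little-endian samples and that a single channel is what should be sent.

```python
import array
import sys

TARGET_RATE = 16_000  # assumed to equal sample_rate in the AssemblyAI URL


def to_mono_s16le(pcm: bytes, num_channels: int) -> bytes:
    """Average interleaved 16-bit little-endian channels into one channel."""
    if num_channels == 1:
        return pcm
    samples = array.array("h")  # native-endian signed 16-bit
    samples.frombytes(pcm)
    if sys.byteorder == "big":
        samples.byteswap()  # interpret the input as little-endian
    mono = array.array("h", (
        sum(samples[i:i + num_channels]) // num_channels
        for i in range(0, len(samples), num_channels)
    ))
    if sys.byteorder == "big":
        mono.byteswap()  # emit little-endian again
    return mono.tobytes()


def prepare_chunk(pcm: bytes, sample_rate: int, num_channels: int) -> bytes:
    """Reject mismatched sample rates and downmix to mono before sending."""
    if sample_rate != TARGET_RATE:
        raise ValueError(f"expected {TARGET_RATE} Hz audio, got {sample_rate} Hz")
    return to_mono_s16le(pcm, num_channels)
```

Failing loudly on a rate mismatch is a deliberate choice here: silently forwarding 48 kHz audio labelled as 16 kHz would likely produce poor transcripts that are hard to trace back to the cause.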

When Universal-3 Pro detects an end-of-turn, the bot generates a response with GPT-4o, synthesizes audio with Cartesia, and injects it back into the room.
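
Detecting the end of a turn on the receiving side might look like the sketch below. It assumes each WebSocket message is a JSON object with a `type` of `"Turn"`, a boolean `end_of_turn`, and a `transcript` string; those field names should be checked against AssemblyAI's streaming API reference, and `final_transcript` is a hypothetical helper rather than a function from the repository.

```python
import json
from typing import Optional


def final_transcript(raw_message: str) -> Optional[str]:
    """Return the transcript of a completed turn, or None for anything else."""
    msg = json.loads(raw_message)
    # Assumed message shape: {"type": "Turn", "end_of_turn": bool, "transcript": str}
    if msg.get("type") != "Turn" or not msg.get("end_of_turn"):
        return None  # session events and partial turns are ignored
    text = (msg.get("transcript") or "").strip()
    return text or None  # treat an empty transcript as "nothing to answer"
```

The bot's receive loop would call this on every incoming message and start the GPT-4o request only when it returns a non-empty string.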

AssemblyAI Connection Parameters

AAI_WS_URL = (
    "wss://streaming.assemblyai.com/v3/ws"
    "?speech_model=u3-rt-pro"
    "&encoding=pcm_s16le"
    f"&sample_rate={SAMPLE_RATE}"
    "&end_of_turn_confidence_threshold=0.4"
    "&min_turn_silence=300"
    f"&token={ASSEMBLYAI_API_KEY}"
)
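
The same URL can also be assembled from a dictionary, which percent-encodes the values and makes individual parameters easier to tune. This is an equivalent sketch, not the repository's code: `build_aai_url` is a hypothetical helper, and the parameter names and values are copied from the snippet above.

```python
from urllib.parse import urlencode

AAI_WS_BASE = "wss://streaming.assemblyai.com/v3/ws"


def build_aai_url(token: str, sample_rate: int = 16000) -> str:
    """Build the streaming URL with the same parameters as AAI_WS_URL."""
    params = {
        "speech_model": "u3-rt-pro",
        "encoding": "pcm_s16le",
        "sample_rate": sample_rate,
        "end_of_turn_confidence_threshold": 0.4,
        "min_turn_silence": 300,
        "token": token,
    }
    # urlencode keeps insertion order and escapes reserved characters
    return f"{AAI_WS_BASE}?{urlencode(params)}"
```

Because the credential ends up in the query string, it is worth keeping the resulting URL out of logs.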

Why Direct Daily.co Instead of Pipecat?

Pipecat is a good fit for production voice agents with complex pipelines (VAD, context management, interruption handling). This tutorial is for readers who want direct access to the underlying primitives, namely Daily's audio callbacks and the AssemblyAI WebSocket, which is useful when embedding a voice agent into a custom Daily.co application without the overhead of a full framework.

Resources