Daily.co Voice Agent with AssemblyAI Universal-3 Pro Streaming
Build a WebRTC voice agent using Daily.co for real-time audio transport and the AssemblyAI Universal-3 Pro Streaming model for speech-to-text — without Pipecat.
This is the bare-metal Daily.co integration. It shows exactly how Daily's audio tracks connect to the AssemblyAI WebSocket — useful when you want to embed a voice agent into a custom Daily.co application without pulling in a full pipeline framework.
Architecture
Browser / Phone (Daily.co room participant)
│ WebRTC audio
▼
Daily.co room
│ PCM audio via daily-python SDK
▼
This bot (daily-python)
│ raw PCM bytes
▼
AssemblyAI Universal-3 Pro WebSocket
│ transcript + neural turn signal
▼
OpenAI GPT-4o → Cartesia TTS → PCM audio
▼
Bot sends audio back into Daily room
Prerequisites
- Python 3.11+
- AssemblyAI API key
- Daily.co API key
- OpenAI API key
- Cartesia API key
Quick Start
git clone https://github.com/kelseyefoster/voice-agent-dailyco-universal-3-pro
cd voice-agent-dailyco-universal-3-pro
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
cp .env.example .env
# Edit .env with your API keys
# Create a room and bot token
python create_room.py
# Start the bot
python bot.py --room-url https://yourname.daily.co/room --token <bot-token>
Open the room URL in your browser and start speaking.
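If the bot fails at startup, a missing key in .env is a likely cause. The following is a minimal sketch of a pre-flight check, not part of the repository. Only ASSEMBLYAI_API_KEY appears in this tutorial's code; the other three variable names are assumptions and should be matched to whatever .env.example actually defines.

```python
import os

# Only ASSEMBLYAI_API_KEY is confirmed by the tutorial's code; the other
# names are hypothetical and should mirror the keys in .env.example.
REQUIRED_KEYS = (
    "ASSEMBLYAI_API_KEY",
    "DAILY_API_KEY",
    "OPENAI_API_KEY",
    "CARTESIA_API_KEY",
)


def missing_keys(env=None):
    """Return the required keys that are unset or blank in `env`."""
    env = os.environ if env is None else env
    return [key for key in REQUIRED_KEYS if not env.get(key, "").strip()]
```

Calling missing_keys() before connecting and exiting with a clear message when the list is non-empty tends to be easier to debug than an authentication error from one of the four services.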
How Audio Flows
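Daily.co calls on_audio_data whenever a remote participant speaks. The bot forwards the raw PCM bytes directly to the AssemblyAI WebSocket. No conversion is needed as long as Daily delivers 16 kHz audio, which matches the pcm_s16le encoding and sample_rate set on the AssemblyAI connection: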
def on_audio_data(self, participant_id, audio_data, sample_rate, num_channels):
    if self.aai_ws and not self.aai_ws.closed:
        asyncio.create_task(self.aai_ws.send(audio_data))
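One caveat: asyncio.create_task only works when called from the thread running the event loop. If the SDK invokes on_audio_data on its own thread (an assumption worth verifying for your daily-python version), the chunks need a thread-safe handoff. Below is a minimal sketch of that handoff using only the standard library. AudioBridge and forward_audio are hypothetical names, and the empty-bytes stop sentinel is an illustrative choice.

```python
import asyncio


class AudioBridge:
    """Moves PCM chunks from an SDK callback thread onto an asyncio queue."""

    def __init__(self, loop, max_chunks=100):
        self._loop = loop
        self.queue = asyncio.Queue(maxsize=max_chunks)

    def on_audio_data(self, audio_data):
        # May be called from a non-asyncio thread: schedule the enqueue
        # on the event loop thread instead of touching the queue directly.
        self._loop.call_soon_threadsafe(self._enqueue, audio_data)

    def _enqueue(self, chunk):
        # Runs on the event loop thread. Drop the chunk when the queue is
        # full so a slow WebSocket cannot grow memory without bound.
        if not self.queue.full():
            self.queue.put_nowait(chunk)


async def forward_audio(bridge, send):
    """Drain the queue in arrival order; `send` is any async callable taking bytes."""
    while True:
        chunk = await bridge.queue.get()
        if chunk == b"":  # empty chunk is used here as a stop sentinel
            break
        await send(chunk)
```

In the bot, `send` would be the WebSocket's send method, and forward_audio would run as a single long-lived task. A single consumer also keeps chunks in order, which one task per chunk does not guarantee.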
When Universal-3 Pro detects an end-of-turn, the bot generates a response with GPT-4o, synthesizes audio with Cartesia, and injects it back into the room.
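Detecting the end of a turn comes down to inspecting each JSON message from the WebSocket. The sketch below assumes the messages carry a `type` of "Turn", a boolean `end_of_turn`, and a `transcript` string. These field names are an assumption based on AssemblyAI's v3 streaming message format and should be checked against the current API reference. final_transcript is a hypothetical helper name.

```python
import json
from typing import Optional


def final_transcript(raw_message: str) -> Optional[str]:
    """Return the transcript if this message closes a turn, else None."""
    try:
        msg = json.loads(raw_message)
    except json.JSONDecodeError:
        return None  # ignore anything that is not valid JSON
    if not isinstance(msg, dict):
        return None
    # Assumed schema: {"type": "Turn", "end_of_turn": bool, "transcript": str}
    if msg.get("type") != "Turn" or not msg.get("end_of_turn"):
        return None
    text = (msg.get("transcript") or "").strip()
    return text or None
```

Only a non-None result should trigger the GPT-4o call. Partial turns and session messages return None and can be ignored or logged.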
AssemblyAI Connection Parameters
AAI_WS_URL = (
    "wss://streaming.assemblyai.com/v3/ws"
    "?speech_model=u3-rt-pro"
    "&encoding=pcm_s16le"
    f"&sample_rate={SAMPLE_RATE}"
    "&end_of_turn_confidence_threshold=0.4"
    "&min_turn_silence=300"
    f"&token={ASSEMBLYAI_API_KEY}"
)
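The same URL can be assembled with urllib.parse.urlencode, which escapes values and makes the turn-detection settings easy to tune. This is a sketch of an alternative, not the repository's code. build_aai_ws_url is a hypothetical helper, and the parameter names and defaults are copied from the URL above without independent verification.

```python
from urllib.parse import urlencode

AAI_WS_BASE = "wss://streaming.assemblyai.com/v3/ws"


def build_aai_ws_url(token, sample_rate=16000,
                     end_of_turn_confidence_threshold=0.4,
                     min_turn_silence=300):
    """Build the streaming URL from the query parameters used in this tutorial."""
    params = {
        "speech_model": "u3-rt-pro",
        "encoding": "pcm_s16le",
        "sample_rate": sample_rate,
        "end_of_turn_confidence_threshold": end_of_turn_confidence_threshold,
        "min_turn_silence": min_turn_silence,
        "token": token,
    }
    # urlencode percent-escapes each value, so a token containing
    # reserved characters cannot corrupt the query string.
    return f"{AAI_WS_BASE}?{urlencode(params)}"
```

For example, build_aai_ws_url(ASSEMBLYAI_API_KEY, end_of_turn_confidence_threshold=0.6) makes it possible to experiment with a different threshold without editing string literals.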
Why Direct Daily.co Instead of Pipecat?
Pipecat excels at production voice agents with complex pipelines (VAD, context management, interruption handling). This tutorial is for readers who want direct access to the underlying primitives, namely Daily's audio callbacks and the AssemblyAI WebSocket, for example when embedding a voice agent into a custom Daily.co application without the overhead of a full framework.