Mart Schweiger

Posted on • Originally published at assemblyai.com

Agora Voice Agent with AssemblyAI Universal-3 Pro Streaming

Build a real-time transcription bot that joins Agora channels, captures participant audio as PCM frames, and streams it to AssemblyAI Universal-3 Pro Streaming, which offers 307 ms P50 latency and support for 99+ languages.

Architecture

Browser/Mobile clients
        │ WebRTC (Agora SDK)
        ▼
   Agora Channel
        │ server subscribes as bot user
        ▼
  Python Server Bot
  (agora-python-server-sdk)
        │ PcmAudioFrame per participant
        │ sample_rate=16000, pcm_s16le
        ▼
  AssemblyAI Universal-3 Pro Streaming
  wss://streaming.assemblyai.com/v3/ws
        │ Turn events with transcript
        ▼
  Your application logic
  (drive LLM, store transcript, trigger webhook)

Why Agora + AssemblyAI?

| Metric | AssemblyAI Universal-3 Pro | Agora Built-in STT |
|---|---|---|
| P50 latency | 307ms | ~600–900ms |
| Word Error Rate | 8.9% | ~14–18% |
| Speaker diarization | ✅ Real-time | |
| LLM Gateway | ✅ 20+ models | |
| Languages | 99+ | Limited |
| Audio formats | PCM, μ-law, Opus | PCM only |

Prerequisites

Quick Start

git clone https://github.com/kelseyefoster/voice-agent-agora-universal-3-pro
cd voice-agent-agora-universal-3-pro

pip install -r requirements.txt
cp .env.example .env
# Fill in AGORA_APP_ID, AGORA_APP_CERT, ASSEMBLYAI_API_KEY

python bot.py --channel my-channel

Environment Setup

AGORA_APP_ID=your_agora_app_id
AGORA_APP_CERT=your_agora_certificate
AGORA_CHANNEL=my-channel
AGORA_BOT_UID=9999
ASSEMBLYAI_API_KEY=your_assemblyai_api_key

Obtain Agora credentials from the Agora Console and your AssemblyAI API key from the AssemblyAI dashboard.
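A small loader can check these variables once at startup instead of failing later inside the bot. The sketch below is illustrative only: `BotConfig` and `load_config` are hypothetical names, the defaults for the channel and bot uid simply mirror the sample values above, and it assumes the `.env` file has already been loaded into the process environment.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class BotConfig:
    agora_app_id: str
    agora_app_cert: str
    channel: str
    bot_uid: int
    assemblyai_api_key: str


def load_config(env: Mapping[str, str] = os.environ) -> BotConfig:
    """Build a BotConfig from the variables listed above, failing fast on gaps."""
    required = ("AGORA_APP_ID", "AGORA_APP_CERT", "ASSEMBLYAI_API_KEY")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return BotConfig(
        agora_app_id=env["AGORA_APP_ID"],
        agora_app_cert=env["AGORA_APP_CERT"],
        # Defaults mirror the sample .env values; they are assumptions.
        channel=env.get("AGORA_CHANNEL", "my-channel"),
        bot_uid=int(env.get("AGORA_BOT_UID", "9999")),
        assemblyai_api_key=env["ASSEMBLYAI_API_KEY"],
    )
```

Passing a plain dict instead of `os.environ` makes the loader easy to exercise in tests.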

Core Integration

For each participant, the bot runs two tasks concurrently over a single WebSocket: one pulls audio frames from Agora and forwards them to AssemblyAI, the other handles incoming transcript events.

import asyncio
import json
import os
import websockets
from agora.rtc.agora_service import AgoraService, AgoraServiceConfig
from agora.rtc.rtc_connection import RTCConnConfig
from agora.rtc.agora_base import (
    ClientRoleType,
    ChannelProfileType,
    AudioScenarioType,
)

SAMPLE_RATE = 16000
CHANNELS    = 1
AAI_WS_URL  = (
    "wss://streaming.assemblyai.com/v3/ws"
    f"?sample_rate={SAMPLE_RATE}"
    "&speech_model=u3-rt-pro"
    "&format_turns=true"
)

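async def stream_participant(agora_channel, uid: int, api_key: str):
    """Stream one participant's audio to AssemblyAI and print finished turns."""
    headers = {"Authorization": api_key}
    # Recent websockets releases use `additional_headers`; older ones call it `extra_headers`.
    async with websockets.connect(AAI_WS_URL, additional_headers=headers) as ws:
        # The first message on a new session carries the session id.
        begin = json.loads(await ws.recv())
        print(f"[uid={uid}] AAI session: {begin['id']}")

        async def send_audio():
            # Forward raw PCM frames as binary WebSocket messages.
            async for frame in agora_channel.get_audio_frames(uid):
                await ws.send(frame.data)
            # Audio ended: ask AssemblyAI to flush the final turn so the
            # receive loop below can finish instead of waiting indefinitely.
            await ws.send(json.dumps({"type": "Terminate"}))

        async def recv_transcripts():
            async for message in ws:
                event = json.loads(message)
                # Only completed turns are printed; partial updates are skipped.
                if event.get("type") == "Turn" and event.get("end_of_turn"):
                    print(f"[uid={uid}] {event['transcript']}")

        # Run both directions concurrently on the same socket.
        await asyncio.gather(send_audio(), recv_transcripts())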
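Because each participant gets its own stream, the bot needs some bookkeeping to start a stream when a user joins and stop it when they leave. The following is a minimal sketch of that idea, not the repository's implementation: `ParticipantStreams` is a hypothetical helper, and how Agora's join and leave notifications reach `on_joined` and `on_left` depends on the SDK's callback wiring, which is not shown here.

```python
import asyncio
from typing import Any, Callable, Coroutine, Dict, Set


class ParticipantStreams:
    """Tracks one asyncio task per Agora uid (hypothetical helper)."""

    def __init__(self, start: Callable[[int], Coroutine[Any, Any, None]]):
        # `start` would typically wrap stream_participant for a given uid.
        self._start = start
        self._tasks: Dict[int, asyncio.Task] = {}

    def on_joined(self, uid: int) -> None:
        # Ignore duplicate join notifications for a uid that is already streaming.
        if uid not in self._tasks:
            self._tasks[uid] = asyncio.create_task(self._start(uid))

    async def on_left(self, uid: int) -> None:
        task = self._tasks.pop(uid, None)
        if task is None:
            return
        task.cancel()
        # Wait for cancellation to complete so the stream can clean up.
        await asyncio.gather(task, return_exceptions=True)

    @property
    def active_uids(self) -> Set[int]:
        return set(self._tasks)
```

`on_joined` must be called from inside the running event loop, since it uses `asyncio.create_task`.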

Audio Format

Configure Agora to deliver 16 kHz mono audio before subscribing. This avoids resampling in the bot and matches the sample_rate=16000 declared in the AssemblyAI WebSocket URL:

agora_channel.set_playback_audio_frame_before_mixing_parameters(
    num_of_channels=1,
    sample_rate=16000,
)
agora_channel.subscribe_all_audio()

Each PcmAudioFrame contains 160 samples (10 ms at 16 kHz) of 16-bit little-endian PCM. The bot forwards each frame to AssemblyAI as it arrives, without buffering.
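The frame size follows from the format: 16,000 samples per second × 0.010 s = 160 samples, and at 2 bytes per sample that is 320 bytes per frame. If you would rather send fewer, larger WebSocket messages (check AssemblyAI's streaming documentation for any expectations on chunk duration), frames can be grouped first. The sketch below assumes a 50 ms chunk size purely for illustration, and `FrameBatcher` is a hypothetical helper, not part of either SDK.

```python
SAMPLE_RATE = 16000      # Hz, as configured above
BYTES_PER_SAMPLE = 2     # 16-bit PCM
FRAME_MS = 10            # duration of one Agora frame

# 160 samples * 2 bytes = 320 bytes per 10 ms frame
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * BYTES_PER_SAMPLE


class FrameBatcher:
    """Accumulates small PCM frames and emits fixed-duration chunks."""

    def __init__(self, chunk_ms: int = 50):
        # chunk_ms is an illustrative default, not a documented requirement.
        self.chunk_bytes = SAMPLE_RATE * chunk_ms // 1000 * BYTES_PER_SAMPLE
        self._buffer = bytearray()

    def push(self, frame: bytes) -> list:
        """Add one frame; return zero or more complete chunks."""
        self._buffer.extend(frame)
        chunks = []
        while len(self._buffer) >= self.chunk_bytes:
            chunks.append(bytes(self._buffer[: self.chunk_bytes]))
            del self._buffer[: self.chunk_bytes]
        return chunks

    def flush(self) -> bytes:
        """Return any leftover audio, e.g. to send before terminating."""
        rest = bytes(self._buffer)
        self._buffer.clear()
        return rest
```

In the send loop, each chunk returned by `push()` would be passed to `ws.send()` in place of the raw frame.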

Handling Transcripts

A Turn event with end_of_turn set marks a natural speech boundary. Route its transcript to your LLM, database, or webhook:

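async def recv_transcripts(ws, uid: int):
    async for message in ws:
        event = json.loads(message)
        # Act only on completed turns; other events are ignored here.
        if event.get("type") == "Turn" and event.get("end_of_turn"):
            transcript = event["transcript"]
            print(f"[uid={uid}] {transcript}")
            # send_to_llm is your own coroutine; it is not defined in this article.
            await send_to_llm(uid, transcript)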
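Pulling the filtering logic into a pure function makes it easy to test without a live WebSocket. This is a sketch using only the field names that appear in the handler above (`type`, `end_of_turn`, `transcript`); `final_transcript` is a hypothetical helper name.

```python
import json
from typing import Optional


def final_transcript(message: str) -> Optional[str]:
    """Return the transcript of a completed turn, or None for anything else."""
    try:
        event = json.loads(message)
    except json.JSONDecodeError:
        return None
    if not isinstance(event, dict):
        return None
    if event.get("type") != "Turn" or not event.get("end_of_turn"):
        return None
    # Treat empty or whitespace-only transcripts as "nothing to route".
    text = (event.get("transcript") or "").strip()
    return text or None
```

The receive loop can then call `final_transcript(message)` and route the result only when it is not `None`.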

Terminating Cleanly

Send a Terminate message to flush the final turn:

async def close_stream(ws):
    await ws.send(json.dumps({"type": "Terminate"}))
    async for message in ws:
        event = json.loads(message)
        if event["type"] == "Termination":
            print(f"Audio processed: {event['audio_duration_seconds']}s")
            break
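If the Termination event never arrives (for example, because the network dropped), the loop above would wait indefinitely. One way to guard against that is to bound the wait. The sketch below is an assumption-laden variant, not the article's code: `close_with_timeout` and the five-second default are illustrative, and `ws` is any object with an async `send()` that can be iterated with `async for`.

```python
import asyncio
import json


async def close_with_timeout(ws, timeout: float = 5.0) -> bool:
    """Send Terminate; return True only if Termination arrives in time."""

    async def _wait_for_termination() -> bool:
        async for message in ws:
            if json.loads(message).get("type") == "Termination":
                return True
        # The stream ended without a Termination event.
        return False

    await ws.send(json.dumps({"type": "Terminate"}))
    try:
        return await asyncio.wait_for(_wait_for_termination(), timeout)
    except asyncio.TimeoutError:
        return False
```

A `False` result tells the caller that the final turn may not have been flushed, which is worth logging.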

Production Token Generation

pip install agora-token-builder
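
Generate a short-lived token for the bot and pass it when connecting:

import os
import time

from agora_token_builder import RtcTokenBuilder, Role_Subscriber

def generate_bot_token(app_id: str, app_cert: str, channel: str, uid: int) -> str:
    # The expiry is an absolute Unix timestamp, one hour from now.
    expire = int(time.time()) + 3600
    return RtcTokenBuilder.buildTokenWithUid(
        app_id, app_cert, channel, uid, Role_Subscriber, expire
    )

# Channel and bot uid come from the same .env values used earlier.
channel = os.environ["AGORA_CHANNEL"]
bot_uid = int(os.environ["AGORA_BOT_UID"])

token = generate_bot_token(
    os.environ["AGORA_APP_ID"],
    os.environ["AGORA_APP_CERT"],
    channel,
    bot_uid,
)
# `connection` is the bot's existing Agora RTC connection object.
connection.connect(token, channel, str(bot_uid))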
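Since the token above expires after an hour, a long-running bot needs to issue a new one before then. The helper below only computes when to do so; it is a sketch, the five-minute safety margin is an assumption, and how the fresh token is applied to a live connection depends on the Agora SDK and is not shown.

```python
import time
from typing import Optional

TOKEN_TTL_SECONDS = 3600      # matches the one-hour expiry used above
RENEW_MARGIN_SECONDS = 300    # assumption: renew five minutes early


def seconds_until_renewal(
    issued_at: float,
    now: Optional[float] = None,
    ttl: int = TOKEN_TTL_SECONDS,
    margin: int = RENEW_MARGIN_SECONDS,
) -> float:
    """Return how long to wait before generating a replacement token."""
    current = time.time() if now is None else now
    renew_at = issued_at + ttl - margin
    # Never return a negative delay; 0 means "renew immediately".
    return max(0.0, renew_at - current)
```

A background task could sleep for this many seconds, call the token generator again, and hand the result to the connection.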
