Mart Schweiger

Posted on Apr 3 • Originally published at assemblyai.com

Agora Voice Agent with AssemblyAI Universal-3 Pro Streaming

#python #ai #tutorial #assemblyai

Build a real-time transcription bot that joins Agora channels, captures participant audio as PCM frames, and streams it to AssemblyAI Universal-3 Pro Streaming — with 307ms P50 latency and support for 99+ languages.

Architecture

Browser/Mobile clients
        │ WebRTC (Agora SDK)
        ▼
   Agora Channel
        │ server subscribes as bot user
        ▼
  Python Server Bot
  (agora-python-server-sdk)
        │ PcmAudioFrame per participant
        │ sample_rate=16000, pcm_s16le
        ▼
  AssemblyAI Universal-3 Pro Streaming
  wss://streaming.assemblyai.com/v3/ws
        │ Turn events with transcript
        ▼
  Your application logic
  (drive LLM, store transcript, trigger webhook)

Why Agora + AssemblyAI?

Metric	AssemblyAI Universal-3 Pro	Agora Built-in STT
P50 latency	307ms	~600–900ms
Word Error Rate	8.9%	~14–18%
Speaker diarization	✅ Real-time	❌
LLM Gateway	✅ 20+ models	❌
Languages	99+	Limited
Audio formats	PCM, μ-law, Opus	PCM only

Prerequisites

Python 3.9+
Agora account — App ID and App Certificate
AssemblyAI API key

Quick Start

git clone https://github.com/kelseyefoster/voice-agent-agora-universal-3-pro
cd voice-agent-agora-universal-3-pro

pip install -r requirements.txt
cp .env.example .env
# Fill in AGORA_APP_ID, AGORA_APP_CERT, ASSEMBLYAI_API_KEY

python bot.py --channel my-channel

Environment Setup

AGORA_APP_ID=your_agora_app_id
AGORA_APP_CERT=your_agora_certificate
AGORA_CHANNEL=my-channel
AGORA_BOT_UID=9999
ASSEMBLYAI_API_KEY=your_assemblyai_api_key

Obtain Agora credentials from the Agora Console and your AssemblyAI API key from the AssemblyAI dashboard.
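
A minimal sketch of reading these variables at startup, assuming they are already present in the process environment (exported by your shell or loaded by a dotenv tool). The `BotConfig` dataclass and `load_config` helper are illustrative names, not part of the repository, and the fallback values for the channel and bot UID simply reuse the example values above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Variables the bot cannot run without.
REQUIRED = ("AGORA_APP_ID", "AGORA_APP_CERT", "ASSEMBLYAI_API_KEY")

@dataclass(frozen=True)
class BotConfig:
    app_id: str
    app_cert: str
    channel: str
    bot_uid: int
    aai_api_key: str

def load_config(env: Optional[Mapping[str, str]] = None) -> BotConfig:
    """Build a BotConfig from environment variables, failing fast on gaps."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return BotConfig(
        app_id=env["AGORA_APP_ID"],
        app_cert=env["AGORA_APP_CERT"],
        channel=env.get("AGORA_CHANNEL", "my-channel"),  # assumed default
        bot_uid=int(env.get("AGORA_BOT_UID", "9999")),   # assumed default
        aai_api_key=env["ASSEMBLYAI_API_KEY"],
    )
```

Failing at startup with the full list of missing names is usually easier to debug than a `KeyError` raised later, when the first participant joins.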

Core Integration

The bot runs one streaming task per participant. Within each task, two coroutines run concurrently: one pulls audio frames from Agora and forwards them to AssemblyAI, and the other handles the transcript events that come back.

import asyncio
import json
import os
import websockets
from agora.rtc.agora_service import AgoraService, AgoraServiceConfig
from agora.rtc.rtc_connection import RTCConnConfig
from agora.rtc.agora_base import (
    ClientRoleType,
    ChannelProfileType,
    AudioScenarioType,
)

SAMPLE_RATE = 16000
CHANNELS    = 1
AAI_WS_URL  = (
    "wss://streaming.assemblyai.com/v3/ws"
    f"?sample_rate={SAMPLE_RATE}"
    "&speech_model=u3-rt-pro"
    "&format_turns=true"
)

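async def stream_participant(agora_channel, uid: int, api_key: str):
    """Stream one participant's audio to AssemblyAI and print completed turns."""
    headers = {"Authorization": api_key}
    # additional_headers needs websockets 14+; older releases call it extra_headers.
    async with websockets.connect(AAI_WS_URL, additional_headers=headers) as ws:
        # The first message on a new session is the Begin event.
        begin = json.loads(await ws.recv())
        print(f"[uid={uid}] AAI session: {begin['id']}")

        async def send_audio():
            # Forward raw PCM bytes as binary WebSocket messages.
            async for frame in agora_channel.get_audio_frames(uid):
                await ws.send(frame.data)
            # Audio ended: ask the server to finish so the receive loop can exit.
            await ws.send(json.dumps({"type": "Terminate"}))

        async def recv_transcripts():
            async for message in ws:
                event = json.loads(message)
                # Only print turns the server has marked as complete.
                if event.get("type") == "Turn" and event.get("end_of_turn"):
                    print(f"[uid={uid}] {event['transcript']}")

        # Run both directions until the socket closes.
        await asyncio.gather(send_audio(), recv_transcripts())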
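
Since each participant gets an independent stream, one way to manage them is a registry of asyncio tasks keyed by uid. The sketch below assumes your Agora event handlers call `start` when a user joins and `stop` when they leave. `ParticipantStreams` is a hypothetical helper, and `stream_fn` stands in for a coroutine function such as `stream_participant` with its other arguments already bound.

```python
import asyncio
from typing import Any, Callable, Coroutine, Dict, List

class ParticipantStreams:
    """Tracks one streaming task per participant uid (illustrative helper)."""

    def __init__(self, stream_fn: Callable[[int], Coroutine[Any, Any, None]]):
        self._stream_fn = stream_fn
        self._tasks: Dict[int, asyncio.Task] = {}

    def start(self, uid: int) -> None:
        # Ignore duplicate join events for a uid that is already streaming.
        existing = self._tasks.get(uid)
        if existing is not None and not existing.done():
            return
        self._tasks[uid] = asyncio.create_task(self._stream_fn(uid))

    async def stop(self, uid: int) -> None:
        # Cancel the participant's task and wait for it to unwind.
        task = self._tasks.pop(uid, None)
        if task is None:
            return
        task.cancel()
        try:
            await task
        except asyncio.CancelledError:
            pass

    @property
    def active_uids(self) -> List[int]:
        return sorted(uid for uid, t in self._tasks.items() if not t.done())
```

With `functools.partial(stream_participant, agora_channel, api_key=api_key)` you would need a small wrapper to put `uid` in the right position; a lambda such as `lambda uid: stream_participant(agora_channel, uid, api_key)` is the simpler option.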

Audio Format

Configure Agora to output 16 kHz mono before subscribing — this eliminates resampling and matches AssemblyAI's preferred format:

agora_channel.set_playback_audio_frame_before_mixing_parameters(
    num_of_channels=1,
    sample_rate=16000,
)
agora_channel.subscribe_all_audio()

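Each PcmAudioFrame contains 160 samples (10 ms) of 16-bit little-endian PCM, which is 320 bytes. Because the format already matches what AssemblyAI expects, the bot can forward frame.data as raw bytes with no resampling or re-encoding.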
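
The frame size follows from the format: 16,000 samples/s × 0.010 s = 160 samples, and at 2 bytes per sample that is 320 bytes per frame. If the streaming endpoint rejects very short chunks (an assumption to verify against AssemblyAI's current streaming documentation for its chunk-duration limits), a small buffer can coalesce 10 ms frames into larger chunks before sending. The sketch below uses 50 ms as an assumed target, and `FrameBuffer` is an illustrative name.

```python
from typing import Optional

SAMPLE_RATE = 16000
CHANNELS = 1
BYTES_PER_SAMPLE = 2  # 16-bit PCM

def bytes_for(duration_ms: int) -> int:
    """Number of PCM bytes that cover duration_ms of audio."""
    return SAMPLE_RATE * CHANNELS * BYTES_PER_SAMPLE * duration_ms // 1000

class FrameBuffer:
    """Coalesces small PCM frames into chunks of at least chunk_ms."""

    def __init__(self, chunk_ms: int = 50):  # 50 ms is an assumed target
        self._target = bytes_for(chunk_ms)
        self._buf = bytearray()

    def push(self, frame: bytes) -> Optional[bytes]:
        """Add a frame; return a chunk once enough audio is buffered, else None."""
        self._buf.extend(frame)
        if len(self._buf) < self._target:
            return None
        chunk = bytes(self._buf)
        self._buf.clear()
        return chunk

    def flush(self) -> bytes:
        """Return whatever audio is left, for example before sending Terminate."""
        chunk = bytes(self._buf)
        self._buf.clear()
        return chunk
```

Inside `send_audio`, this would look like `chunk = buf.push(frame.data)` followed by `await ws.send(chunk)` whenever `chunk` is not `None`. The trade-off is up to one chunk of added latency per participant.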

Handling Transcripts

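Turn events arrive while a participant is speaking; a Turn with end_of_turn set marks a natural speech boundary and carries the finished transcript. Route those completed turns to your LLM, database, or webhook: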

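async def recv_transcripts(ws, uid: int):
    """Forward each completed turn to downstream application logic."""
    async for message in ws:
        event = json.loads(message)
        # Skip partial turns and non-Turn events.
        if event.get("type") == "Turn" and event.get("end_of_turn"):
            transcript = event["transcript"]
            print(f"[uid={uid}] {transcript}")
            # send_to_llm is a placeholder for your own coroutine.
            await send_to_llm(uid, transcript)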
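
To keep transcript handling testable apart from the socket, the filtering step can be pulled into a pure function. This sketch relies only on the fields used above (`type`, `end_of_turn`, `transcript`); `final_transcript` is a hypothetical helper name, and dropping empty transcripts is a design choice rather than something the API requires.

```python
import json
from typing import Optional

def final_transcript(message: str) -> Optional[str]:
    """Return the transcript of a completed turn, or None for anything else."""
    try:
        event = json.loads(message)
    except ValueError:
        return None  # ignore malformed payloads rather than crash the stream
    if not isinstance(event, dict):
        return None
    if event.get("type") != "Turn" or not event.get("end_of_turn"):
        return None  # partial turn or a non-Turn event
    text = (event.get("transcript") or "").strip()
    return text or None  # skip empty turns
```

The receive loop then reduces to calling `final_transcript(message)` and acting only when the result is not `None`.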

Terminating Cleanly

Send a Terminate message to flush the final turn:

async def close_stream(ws):
    await ws.send(json.dumps({"type": "Terminate"}))
    async for message in ws:
        event = json.loads(message)
        if event["type"] == "Termination":
            print(f"Audio processed: {event['audio_duration_seconds']}s")
            break

Production Token Generation

pip install agora-token-builder

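import os
import time

from agora_token_builder import RtcTokenBuilder, Role_Subscriber

def generate_bot_token(app_id: str, app_cert: str, channel: str, uid: int) -> str:
    # The token is valid for one hour from now.
    expire = int(time.time()) + 3600
    return RtcTokenBuilder.buildTokenWithUid(
        app_id, app_cert, channel, uid, Role_Subscriber, expire
    )

# Channel name and bot UID come from the variables in Environment Setup.
channel = os.environ["AGORA_CHANNEL"]
bot_uid = int(os.environ["AGORA_BOT_UID"])

token = generate_bot_token(
    os.environ["AGORA_APP_ID"],
    os.environ["AGORA_APP_CERT"],
    channel,
    bot_uid,
)
# `connection` is the bot's existing Agora RTC connection object.
connection.connect(token, channel, str(bot_uid))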
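
The token above expires after an hour, so a long-running bot needs a fresh one before then. The sketch below only computes how long to wait before regenerating. The 300-second safety margin and the `seconds_until_renewal` name are assumptions, and applying the new token to a live connection depends on the Agora SDK and is not shown here.

```python
import time
from typing import Optional

TOKEN_TTL_SECONDS = 3600     # matches the expiry used above
RENEW_MARGIN_SECONDS = 300   # assumed safety margin before expiry

def seconds_until_renewal(
    expire_ts: int,
    now: Optional[float] = None,
    margin: int = RENEW_MARGIN_SECONDS,
) -> float:
    """Seconds to wait before generating a fresh token (never negative)."""
    now = time.time() if now is None else now
    return max(0.0, expire_ts - margin - now)
```

A background task could `await asyncio.sleep(seconds_until_renewal(expire))`, call `generate_bot_token` again, and hand the result to the connection.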
