DEV Community: Mart Schweiger

Best API for building a speech-to-speech voice agent in 2026

Mart Schweiger — Tue, 19 May 2026 19:28:14 +0000

A speech-to-speech voice agent API replaces the three separate components most teams used to wire together—streaming speech-to-text, a language model, and text-to-speech—with a single API that takes audio in and returns audio out. In 2026, that category has gone from "interesting demo" to "default way to ship a production voice agent," and the gap between providers is now measurable in latency, accuracy, and what they let you do with tool calls.

This guide compares the speech-to-speech voice agent APIs developers actually pick from in 2026, what each one is best at, and how to choose between a true speech-to-speech API and a chained STT-LLM-TTS pipeline. We'll cover AssemblyAI's Voice Agent API, OpenAI Realtime, Google Gemini Live, Deepgram, ElevenLabs Conversational AI, Retell, Bland, and Hume, plus where Vapi and Pipecat fit if you'd rather orchestrate the components yourself—covered in our orchestration tools comparison.

What is a speech-to-speech voice agent API?

A speech-to-speech voice agent API is a single API endpoint—usually a WebSocket—that accepts a user's audio stream and returns the agent's audio response, with everything in between (transcription, reasoning, tool calls, voice synthesis) hidden behind one connection. You send mic audio in. You get the agent's voice back. You don't manage three providers, three sets of API keys, or three sets of latency budgets.

That's the practical definition. Under the hood, there are two architectural patterns:

Chained (cascading) speech-to-speech APIs : Internally pipe streaming STT → LLM → streaming TTS, but expose a single API. The advantage is you can swap each layer for best-in-class models. AssemblyAI's Voice Agent API is the leading example.
Native speech-to-speech models : A single model trained end-to-end on audio that takes audio tokens in and emits audio tokens out, with no intermediate text in some cases. OpenAI Realtime, Google Gemini Live, and Hume's EVI fall here. The pitch is lower latency and richer audio understanding (laughter, tone). The trade-off is less transparency, smaller language support, and weaker text reasoning than a frontier text LLM.

Both expose the same developer surface—one connection, audio in/audio out—so the choice is about which trade-offs match your application.

Best speech-to-speech voice agent APIs in 2026

API	Architecture	Speech accuracy	P50 latency	Tool calling	Languages	Pricing	Best for
AssemblyAI Voice Agent API	Chained, single WebSocket	Industry-leading on phone audio, alphanumerics (16.7% missed entity rate)	307ms STT + sub-second end-to-end	Yes, model-routed, with intermediate speech (no silence during tool calls)	6 streaming (EN/ES/FR/DE/IT/PT), expanding	$4.50/hr flat	Production voice agents where speech accuracy decides whether it ships
OpenAI Realtime API	Native speech-to-speech (GPT-4o audio)	Strong on clean studio audio, weaker on telephony (23.3% missed entity rate)	~500–800ms end-to-end	Yes, OpenAI tool format (goes silent during tool calls)	~50 (varies by feature)	~$18/hr per-token billing across 30+ event types	Demos, browser-first apps, conversational toys
Deepgram Voice Agent API	Chained, cascading	Good general accuracy, weaker on entities (25.5% missed entity rate)	~1–1.5 seconds end-to-end	Yes, custom functions supported (goes silent during tool calls)	EN, ES, NL, FR, DE, IT, JA	~$4.50/hr, concurrency commitments required	Teams already invested in Deepgram's ecosystem
Google Gemini Live API	Native speech-to-speech (Gemini 2 audio)	Strong on Google's voice eval set	~600–900ms end-to-end	Yes	30+	Usage-based, varies by tier	Apps already on GCP / Gemini, multimodal (vision + voice) demos
ElevenLabs Conversational AI	Chained, ElevenLabs-orchestrated	Depends on STT chosen (configurable)	Sub-second end-to-end	Yes	30+	Per-minute, ~$0.09–0.30/min	Teams that want premium TTS as the headline and don't want to tune STT
Retell	Chained, orchestrated	Configurable STT	Sub-500ms voice-to-voice on phone	Yes	20+	Per-minute, ~$0.07–0.17/min	Phone-first agents prioritizing turn-taking naturalness
Bland	Chained, self-hostable	Configurable	Sub-second	Yes	10+	Per-minute or self-hosted	Enterprises with strict data residency / on-prem requirements
Hume EVI	Native speech-to-speech, emotion-aware	Decent	Sub-second	Yes	English-focused	Per-minute	Emotion-sensitive use cases (mental health, coaching)
Vapi	Orchestration (not S2S, but feels like it)	Depends on chosen STT	Sub-second when tuned	Yes	Wide	Per-minute + pass-through provider costs	Teams that want to swap STT/LLM/TTS per-deployment
Pipecat / LiveKit Agents	Open-source orchestration	Depends on STT	Sub-second when tuned	Yes	Wide	Compute + provider costs	Teams who want full ownership of the pipeline

A few things stand out in 2026. End-to-end latency under one second is now table stakes, not a differentiator: every provider on this list will get you there with a reasonable network. What separates them is speech accuracy on real-world audio (phone calls, accents, alphanumerics), how tool calling behaves under load, and whether the pricing model survives contact with a real customer base.

In our Voice Agent Report, 76% of respondents rated speech-to-text accuracy as the single most important non-negotiable when building voice agents—above latency, cost, and integration capabilities. That finding maps directly to what we see in the comparison data: the accuracy gap between providers on real-world entities (phone numbers, emails, confirmation codes) is where production agents succeed or fail.

How to choose the best speech-to-speech voice agent API

The voice agent you ship depends on four decisions. Get any of them wrong and the agent feels off, even if the demo was great.

1. Speech-to-text accuracy on your actual audio

Most providers benchmark on studio audio. Your users are on phones, in cars, in drive-thrus, and rattling off order numbers and email addresses. The two accuracy metrics that actually matter:

Alphanumeric accuracy : How well the model captures phone numbers, confirmation codes, emails, order IDs. This is where the gap between providers shows up most clearly. In head-to-head testing, AssemblyAI's Universal-3 Pro Streaming delivers a 16.7% alphanumeric missed error rate, compared to 23.3% for OpenAI and 25.5% for Deepgram. That's the difference between capturing "RX-7704132" correctly on the first try and hearing "dash seven seven zero four one three two." AssemblyAI's Universal-3 Pro Streaming also delivers 21% fewer alphanumeric errors and 28% better accuracy on consecutive numbers than the previous generation. This is the single most under-measured metric in voice agent demos.
Entity accuracy on proper nouns : Company names, people's names, drug names, product titles. If your agent writes "Corel" instead of "Coral" into the CRM, the lead is unreachable.

Native speech-to-speech models like OpenAI Realtime and Gemini Live were trained more on clean conversational audio than on telephony, which shows up the moment you put them on a Twilio call.

2. Turn-taking and interruption handling

Poor turn detection is the most common reason voice agents feel unnatural. The agent either talks over the user or sits in awkward silence. The best implementations handle turn detection at the model level, not as an afterthought bolted on with a fixed silence timer.

AssemblyAI's Universal-3 Pro Streaming includes acoustic turn detection built directly into the model, with semantic endpointing that combines acoustic pauses with intent signals—using a semantic + neural network + VAD approach rather than basic silence-based VAD. Retell ships its own proprietary turn-taking model. OpenAI Realtime's server VAD is competent but configurable timeouts still trip up agents on calls with hesitant speakers. Deepgram relies on traditional VAD only, without the semantic or neural layers.

3. Tool calling reliability

Real voice agents don't just talk—they book the appointment, look up the order, charge the card. That means the underlying LLM has to call tools mid-conversation, fast enough that the silence doesn't become obvious.

The bar to clear: tool calls under 500ms round-trip, structured outputs that don't hallucinate parameters, and the ability to call multiple tools in a single turn. But there's a UX dimension most teams overlook: what happens while the tool call is executing? AssemblyAI's Voice Agent API generates intermediate speech during tool execution—the agent says something like "Let me look that up for you" rather than going silent. Both OpenAI Realtime and Deepgram go silent during tool calls, which creates an awkward dead-air gap that makes users wonder if the connection dropped.

AssemblyAI's Voice Agent API exposes a clean function-calling surface that routes through the underlying model with structured-output guarantees. OpenAI Realtime supports tool calling natively. Some orchestration platforms add their own retry and validation logic on top.

If your agent's job is "capture data and put it somewhere"—booking a meeting, qualifying a lead, taking an order, scheduling a callback—tool calling reliability is what decides whether the agent actually does its job.

4. Pricing model and unit economics

This is the trap most teams fall into during pilots. Per-minute pricing looks cheap until you're running 500 simultaneous calls during a support spike. Per-token audio pricing (OpenAI Realtime) is unpredictable because audio output tokens are 10–20x text tokens and a chatty TTS voice burns through your budget.

A few patterns:

Flat hourly pricing : AssemblyAI's Voice Agent API at $4.50/hour covers STT, LLM inference, TTS, and tool calling. One bill, one line of math to model what a 5-minute call costs. No separate meters for audio in, audio out, text in, text out. No concurrency commitments. Easy to forecast.
Per-minute, all-in : Retell, Bland, ElevenLabs Conversational AI. Predictable, but adds up at scale.
Flat hourly with concurrency commitments : Deepgram's voice agent API is also ~$4.50/hour, but requires concurrency-metered billing—meaning you're committing to a certain number of simultaneous sessions. That changes the economics at scale.
Per-token audio : OpenAI Realtime. ~$18/hour with 30+ billing event types. Best for low-volume; hard to forecast at scale.
Pass-through + platform fee : Vapi, LiveKit. You pay each underlying provider plus a platform fee—flexible but more accounting overhead.

Forecast what 100 hours of conversation actually costs across the providers you're considering. The spread is real (roughly 4x between the $4.50/hour and ~$18/hour figures above), and it matters most once you stop being charged for demo calls and start being charged for production.
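To make that forecast concrete, here is a minimal sketch that turns the rates quoted in this guide into a cost for a given number of conversation hours. The dictionary keys and function names are illustrative, the rates are the approximate figures from the table above, and real invoices will depend on tiers, commitments, and usage patterns this sketch ignores.

```python
# Approximate rates quoted in this guide (USD). Treat them as placeholders
# and replace them with the numbers from your own quotes.
HOURLY_RATES_USD = {
    "flat_hourly": 4.50,       # flat $4.50/hour
    "per_token_audio": 18.00,  # ~$18/hour estimate for per-token audio billing
}
PER_MINUTE_RATES_USD = {
    "per_minute_low": 0.07,    # low end of the per-minute range
    "per_minute_high": 0.30,   # high end of the per-minute range
}


def cost_for_hours(hours: float) -> dict[str, float]:
    """Return the estimated cost of `hours` of conversation per pricing model."""
    costs = {name: rate * hours for name, rate in HOURLY_RATES_USD.items()}
    for name, per_minute in PER_MINUTE_RATES_USD.items():
        costs[name] = per_minute * 60 * hours
    return {name: round(cost, 2) for name, cost in costs.items()}


def cost_per_call(hourly_rate: float, call_minutes: float) -> float:
    """Cost of a single call of `call_minutes` at a flat hourly rate."""
    return round(hourly_rate * call_minutes / 60, 4)


if __name__ == "__main__":
    for model, cost in cost_for_hours(100).items():
        print(f"{model}: ${cost:,.2f}")
    print("5-minute call at $4.50/hour:", cost_per_call(4.50, 5))
```

Swapping in your own expected monthly hours is usually enough to see which pricing model dominates at your volume.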

AssemblyAI Voice Agent API: one WebSocket, flat-rate, built on Universal-3 Pro Streaming

AssemblyAI's Voice Agent API is a single WebSocket that takes user audio in and streams agent audio out, with STT, LLM, TTS, turn detection, and tool calling handled inside the connection. It replaces three separate providers with one bill, one set of logs, and one set of latency variables to tune.

What makes it work as a speech-to-speech voice agent API:

Speech accuracy that survives phone audio. The STT layer is Universal-3 Pro Streaming, the same model trusted by enterprise voice agent teams for production deployments—307ms P50 latency, native 8kHz mulaw support, immutable transcripts, and a 16.7% alphanumeric missed error rate that's measurably better than OpenAI (23.3%) and Deepgram (25.5%). When the STT is this accurate, the whole conversation is better because the agent is actually responding to what was said.
Tool calling that doesn't go silent. Define your tools, the model calls them, results stream back into the conversation. Unlike OpenAI Realtime and Deepgram, the agent generates intermediate speech during tool execution—natural transition phrases like "Let me check on that"—so your users never hear dead air. Useful for the lead-qualification, appointment-setting, and structured-data-capture use cases where voice agents have the strongest product-market fit today.
Mid-session updates without reconnecting. Update the system prompt, voice, tools, and VAD settings mid-conversation with a JSON message—no reconnection, no redeployment. OpenAI Realtime only supports updating prompt and tools. Deepgram supports prompt and voice only. AssemblyAI is the only provider that lets you update all four mid-session.
Session resumption. If the WebSocket drops, reconnect within 30 seconds and pick up where the conversation left off. Context is preserved. Neither OpenAI Realtime nor Deepgram offers session resumption—a dropped connection means starting over.
Flat-rate pricing. $4.50/hour of session time, no per-token audio surprises, no per-provider invoices, no concurrency commitments. This includes STT, LLM, TTS, turn detection, and tool calling.
One API to learn. The Voice Agent API is one WebSocket. You don't wire together a streaming STT WebSocket, an LLM HTTP endpoint, a TTS streaming connection, and your own turn-detection logic. The plumbing is in the API.
Built for production. Unlimited concurrency, session resumption, structured logs per session, and the same SOC 2 / BAA-eligible infrastructure that already runs AssemblyAI's speech-to-text platform.

Where it fits in the landscape: AssemblyAI's Voice Agent API is the choice when speech accuracy decides whether the agent ships. If your agent is taking phone calls, capturing structured data, or operating in a regulated industry where you need a BAA, this is the speech-to-speech voice agent API to build on.

When to use a chained pipeline instead

A speech-to-speech voice agent API is the right answer for most teams in 2026. But there are three cases where chaining the layers yourself still wins:

You need a specific LLM : A frontier text LLM like Claude or Gemini that isn't exposed inside any S2S API yet. Most S2S APIs let you choose, but if you need a model that isn't on the list, chain it yourself.
You need a specific TTS voice : A cloned voice, a specific accent, or a non-standard language model. Most S2S APIs let you bring your own TTS, but if you need fine control, a chained pipeline is more flexible.
You have regulated data residency : Some industries require every layer to run in your VPC. A chained, self-hosted pipeline (with Bland for the orchestration, or fully self-built) is the only path.

If you're chaining, the layer that decides whether the agent works is still the streaming STT. Choosing the best streaming speech-to-text model for voice agents comes down to the same accuracy and latency criteria covered above.

Common use cases for speech-to-speech voice agents

The pattern in 2026 is consistent: speech-to-speech voice agents work best on high-volume, structured calls where the agent's job is to capture or look up data rather than reason open-endedly. The teams shipping production agents converge on these use cases:

Lead qualification and outbound sales discovery : Ask BANT questions, book qualified meetings, sync to the CRM. Turn-taking quality is the differentiator.
Appointment scheduling and confirmations : Medical offices, salons, service businesses. Alphanumeric accuracy on dates, times, and confirmation codes is non-negotiable.
Food ordering and reservations : High-accuracy data capture on menu items, special requests, payment info.
Customer support tier-1 deflection : Order status, account questions, basic troubleshooting. Best paired with explicit escalation paths. See our guide to Voice AI for customer service.
Insurance verification and benefits lookup : Getting plan numbers, group IDs, and member info right the first time—the same accuracy bar that drives voice agents in healthcare.
Outbound reminders and surveys : Post-visit follow-ups, payment reminders, satisfaction surveys.

The common thread across all of these: the agent is capturing or retrieving specific data, the conversation has a predictable structure, and the cost of a transcription error is concrete. That's where a speech-to-speech voice agent API earns its keep over a human agent or an IVR.

How to evaluate a speech-to-speech voice agent API before you commit

Demos are unreliable. Vendor benchmarks are unreliable. Here's the evaluation loop teams actually use before signing a contract:

1. Record 50 real or representative calls for your use case, including accents, background noise, alphanumeric content, and interruptions.
2. Run them through each API's playground or trial. Measure word error rate (WER) on the alphanumeric tokens specifically: phone numbers, confirmation codes, emails, dollar amounts. General WER is misleading. Look at the missed entity rates: AssemblyAI sits at 16.7%, OpenAI at 23.3%, Deepgram at 25.5%. Run your own audio to see how those numbers hold on your data.
3. Time the turn-taking. Mark every "caller-stops-speaking" moment and measure how long until the agent starts responding. Sub-800ms is the threshold for natural-feeling conversation. Pay attention to how each provider handles turn detection: semantic + neural approaches outperform basic VAD on hesitant or accented speakers.
4. Test tool calling under load. Define three real tools and have the agent call them mid-conversation. Measure round-trip time and error rate. Also note whether the agent speaks naturally during tool execution or goes silent, since this makes a bigger UX difference than most teams expect.
5. Read every transcript. You'll catch prompt failures, silently wrong transcriptions, and hallucinated tool parameters that you'd never notice by listening.

Most teams skip step 2 and ship with a model that fumbles confirmation codes silently. Don't.
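One way to score the alphanumeric check on your own recordings is sketched below. It assumes you have hand-labeled the entities each call should contain, and it counts an entity as missed when its normalized form does not appear in the transcript. This is a simplified, hypothetical scoring rule, not the methodology behind the vendor figures quoted above, and it will count spelled-out digits ("seven seven zero") as misses unless you normalize them first.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def missed_entity_rate(reference_entities: list[str], transcript: str) -> float:
    """Fraction of labeled entities that do not appear in the transcript."""
    if not reference_entities:
        return 0.0
    haystack = normalize(transcript)
    missed = [e for e in reference_entities if normalize(e) not in haystack]
    return len(missed) / len(reference_entities)


if __name__ == "__main__":
    labeled = ["RX-7704132", "555-0142"]
    transcript = "my code is rx 7704132 and my number is 555 0143"
    # The confirmation code matches; the phone number is off by one digit.
    print(missed_entity_rate(labeled, transcript))
```

Averaging this rate over all 50 calls per provider gives a number you can compare directly, on your audio rather than a vendor's.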

Final words

The right speech-to-speech voice agent API in 2026 depends less on the marketing material and more on what your agent has to actually hear. If your users are on phones, capturing structured data, or operating in regulated environments, the bar is speech accuracy first, latency second, and pricing predictability third—in that order. The chained-architecture S2S APIs (with AssemblyAI's Voice Agent API as the leading example for accuracy-critical use cases) tend to outperform native speech-to-speech models on real-world telephony, even when the native models look better in studio-audio demos.

For most teams shipping a production voice agent this year, the AssemblyAI Voice Agent API is the right starting point. One WebSocket, $4.50/hour, Universal-3 Pro Streaming for the parts that matter, and flat-rate pricing you can forecast. Teams that need finer control over the stack can drop our Streaming Speech-to-Text product into their existing voice agent orchestrator.

Frequently asked questions

What is a speech-to-speech voice agent API?

A speech-to-speech voice agent API is a single API—usually a WebSocket—that accepts a user's audio stream and returns the agent's audio response. It hides the streaming speech-to-text, language model, tool calling, and text-to-speech behind one connection, so developers don't have to manage three separate providers, three API keys, or three latency budgets to ship a voice agent.

What is the best speech-to-speech voice agent API in 2026?

The best speech-to-speech voice agent API in 2026 is AssemblyAI's Voice Agent API for production deployments where speech accuracy matters—it's a single WebSocket built on Universal-3 Pro Streaming with 307ms P50 latency, native phone-audio support, tool calling, and flat $4.50/hour pricing. In our Voice Agent Report, 76% of builders rated transcription accuracy as the most important non-negotiable, and AssemblyAI delivers the lowest alphanumeric missed error rate (16.7%) compared to OpenAI (23.3%) and Deepgram (25.5%). OpenAI Realtime is competitive for browser-first demos. Retell is competitive for phone-first agents prioritizing turn-taking naturalness. The right choice depends on whether your users are on phones, what data the agent has to capture, and how predictable you need pricing to be.

How does a speech-to-speech voice agent API differ from chaining STT, LLM, and TTS yourself?

A speech-to-speech voice agent API gives you one API endpoint that takes audio in and returns audio out, with STT, LLM, TTS, turn detection, and tool calling handled inside the API. Chaining the layers yourself gives you full control over each component—choice of LLM, choice of TTS voice, on-prem deployment—but you own the plumbing: the WebSocket bridge, turn detection logic, retry handling, and three separate provider relationships. Most teams in 2026 default to a speech-to-speech voice agent API and only chain when they need a specific LLM, voice, or data residency setup.

Which speech-to-speech voice agent API is cheapest?

AssemblyAI's Voice Agent API at $4.50/hour flat-rate is the most predictable and one of the lowest unit costs in the category—one bill, no concurrency commitments, and you can model what a 5-minute call costs in one line of math. Per-minute APIs like Retell and ElevenLabs Conversational AI typically land between $0.07 and $0.30 per minute depending on tier, which works out to ~$4.20–$18/hour. Deepgram's voice agent API is also ~$4.50/hour but requires concurrency-metered billing, which changes the economics at scale. OpenAI Realtime runs ~$18/hour with per-token billing across 30+ event types—cheaper for low-volume but significantly more expensive and less predictable at scale.

Can I use a speech-to-speech voice agent API with Twilio?

Most speech-to-speech voice agent APIs can be bridged to Twilio Voice with a WebSocket server that forwards Twilio's 8kHz mulaw audio into the speech-to-speech API and streams the agent's audio response back as mulaw frames for Twilio to play. The cleanest setup uses an API that accepts mulaw natively at 8kHz—AssemblyAI's Voice Agent API and Universal-3 Pro Streaming both support this without resampling, which saves latency. Some providers like Retell ship a Twilio adapter directly.

Do speech-to-speech voice agent APIs support multiple languages?

Yes, but coverage varies widely. AssemblyAI's Voice Agent API launched with 6 streaming languages (English, Spanish, French, German, Italian, Portuguese) with native code-switching, and language coverage is expanding. OpenAI Realtime supports around 50 languages but has hallucination and language-switching issues mid-call. Google Gemini Live covers 30+. If you need a specific language combination, test with real audio in those languages before you commit—language support varies significantly between studio benchmarks and real-world phone audio.

How do I evaluate which speech-to-speech voice agent API is best for my use case?

Record 50 representative calls for your use case, run them through each API's playground or trial, and measure four things: word error rate on the entities that matter (phone numbers, confirmation codes, names, emails), end-to-end turn-taking latency, tool call round-trip time, and unit cost at your expected volume. General WER and marketing benchmarks are misleading—the only evaluation that predicts production behavior is the one that uses your audio, your tools, and your scale.
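For the turn-taking part of that measurement, a small script is usually enough. The sketch below assumes you have annotated each turn with two timestamps in seconds (when the caller stopped speaking and when the agent's audio started); the function name and output keys are illustrative, and the 800ms default comes from the threshold used in this guide.

```python
from statistics import median


def turn_latency_summary(
    turns: list[tuple[float, float]], budget_ms: float = 800.0
) -> dict[str, float]:
    """Summarize turn latency from (caller_stopped_at, agent_started_at) pairs."""
    latencies_ms = [(started - stopped) * 1000 for stopped, started in turns]
    within_budget = sum(1 for ms in latencies_ms if ms < budget_ms)
    return {
        "p50_ms": round(median(latencies_ms), 1),
        "max_ms": round(max(latencies_ms), 1),
        "share_within_budget": within_budget / len(latencies_ms),
    }


if __name__ == "__main__":
    annotated = [(1.0, 1.5), (4.0, 4.7), (9.0, 10.2)]
    print(turn_latency_summary(annotated))
```

Reporting the worst turn alongside the median matters because a single slow turn is what callers tend to remember.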

How to build a voice agent with Twilio and AssemblyAI

Mart Schweiger — Tue, 19 May 2026 19:27:30 +0000

Building a voice agent on Twilio with AssemblyAI takes one WebSocket server that bridges Twilio Voice Media Streams into Universal-3 Pro Streaming, your LLM of choice, and a text-to-speech model — all under an 800ms turn budget. This tutorial walks through every piece: the TwiML to open the audio stream, the FastAPI WebSocket bridge that handles 8kHz mulaw audio in both directions, the LLM loop with tool calling, and the deployment considerations that decide whether your agent feels human or obviously robotic on a real phone call.

By the end of this guide, you'll have a working inbound phone-based voice agent that answers a Twilio number, transcribes the caller in real time, calls tools (order lookup, callback scheduling, human transfer), and speaks back — all with code you can fork and ship today. The full repository is at the end of this post.

Why Twilio + AssemblyAI works for phone-based voice agents

Twilio is the most common telephony layer for voice agents because it handles the PSTN connection, gives you a phone number in minutes, and exposes the call audio as a Media Stream you can bridge into your own backend over a WebSocket. The audio comes in at 8kHz mulaw — the standard telephony format, not the 16kHz PCM most audio tools assume.

AssemblyAI's Universal-3 Pro Streaming model is built specifically for this. It accepts pcm_mulaw at sample_rate=8000 natively, so you don't pay the round-trip latency cost of resampling phone audio into 16kHz PCM and back. Combined with 307ms P50 latency, immutable transcripts, and 21% fewer alphanumeric errors than the previous generation of streaming speech-to-text models, it's the speech-to-text layer that decides whether your agent captures a confirmation code on the first try or makes the caller repeat it.
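Because mulaw stores each sample in a single byte, the duration of any chunk of 8kHz phone audio follows directly from its length. The helper below is a small sketch (the function name is illustrative) that can be handy when logging how much audio has been forwarded or when checking latency numbers against real payloads.

```python
import base64

MULAW_SAMPLE_RATE_HZ = 8000  # telephony sample rate discussed above
BYTES_PER_SAMPLE = 1         # mu-law encodes each sample in 8 bits


def mulaw_duration_ms(payload_b64: str) -> float:
    """Duration in milliseconds of a base64-encoded 8kHz mulaw payload."""
    raw = base64.b64decode(payload_b64)
    return len(raw) / (MULAW_SAMPLE_RATE_HZ * BYTES_PER_SAMPLE) * 1000


if __name__ == "__main__":
    # 160 bytes of mulaw at 8kHz is 20 ms of audio.
    frame = base64.b64encode(b"\xff" * 160).decode()
    print(mulaw_duration_ms(frame))
```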

The architecture is straightforward:

  Caller's phone
       │
   Twilio Voice (PSTN)
       │  TwiML → open WebSocket
       ▼
  Your FastAPI server (this tutorial)
   ┌────┴────┐
   ▼         ▲
 AssemblyAI    ElevenLabs TTS
 Universal-3   (ulaw_8000 output)
 Pro Streaming
   │             ▲
   │ transcript  │ audio
   ▼             │
   GPT-4o + tool calling
     │
     └─► action + spoken reply

Audio flows in two directions continuously. Twilio sends inbound audio (caller → your server → AssemblyAI). Your server generates an LLM response, runs it through ElevenLabs, and streams the synthesized audio back to Twilio as mulaw frames. On the Twilio side, both directions share one WebSocket per call.

Before you start

You'll need:

An AssemblyAI account with API key access to Universal-3 Pro Streaming
A Twilio account with a Voice-enabled phone number
An OpenAI API key (or another LLM provider)
An ElevenLabs API key (or another streaming TTS provider with mulaw output)
Python 3.11+
ngrok for exposing your local server to Twilio during development

Install the dependencies:

pip install fastapi uvicorn websockets python-dotenv openai elevenlabs twilio

Step 1: Configure the Twilio TwiML for an inbound call

When someone calls your Twilio number, Twilio fetches a TwiML document from your server and uses it to decide what to do with the call. To stream the call audio to your WebSocket, you return TwiML with a <Connect><Stream> block:

# server.py
from fastapi import FastAPI, Request
from fastapi.responses import Response

app = FastAPI()

@app.post("/twilio/voice")
async def twilio_voice(request: Request):
    host = request.url.hostname
    twiml = f"""<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Connect>
    <Stream url="wss://{host}/media-stream" />
  </Connect>
</Response>"""
    return Response(content=twiml, media_type="application/xml")

In the Twilio console, set the phone number's voice webhook to POST https://your-host/twilio/voice. When a call comes in, Twilio will hit this endpoint, parse the TwiML, and open a WebSocket to /media-stream that carries the call audio.

Step 2: Bridge Twilio Media Streams to Universal-3 Pro Streaming

This is the core of the agent. The WebSocket handler receives Twilio's audio frames, forwards them to AssemblyAI, listens for transcripts, and routes them into the LLM loop.

# server.py (continued)
import asyncio
import base64
import json
import os
import websockets
from fastapi import WebSocket

ASSEMBLY_WS = "wss://streaming.assemblyai.com/v3/ws"

@app.websocket("/media-stream")
async def media_stream(twilio_ws: WebSocket):
    await twilio_ws.accept()
    stream_sid = None

    # Open AssemblyAI streaming session. Note: pcm_mulaw, 8kHz
    aai_url = (
        f"{ASSEMBLY_WS}"
        "?speech_model=u3-rt-pro"
        "&encoding=pcm_mulaw"
        "&sample_rate=8000"
    )
    aai_ws = await websockets.connect(
        aai_url,
        # websockets >= 14 names this argument additional_headers;
        # older releases call it extra_headers.
        additional_headers={"Authorization": os.environ["ASSEMBLYAI_API_KEY"]},
    )

    async def pump_twilio_to_aai():
        nonlocal stream_sid
        async for raw in twilio_ws.iter_text():
            event = json.loads(raw)
            if event["event"] == "start":
                stream_sid = event["start"]["streamSid"]
            elif event["event"] == "media":
                audio_b64 = event["media"]["payload"]
                # Twilio sends base64-encoded mulaw. AssemblyAI accepts raw bytes.
                await aai_ws.send(base64.b64decode(audio_b64))
            elif event["event"] == "stop":
                await aai_ws.close()
                return

    async def pump_aai_to_llm():
        async for message in aai_ws:
            data = json.loads(message)
            if data.get("type") == "Turn" and data.get("end_of_turn"):
                transcript = data.get("transcript", "").strip()
                if transcript:
                    await handle_user_turn(transcript, twilio_ws, stream_sid)

    await asyncio.gather(pump_twilio_to_aai(), pump_aai_to_llm())

The critical settings:

speech_model=u3-rt-pro selects Universal-3 Pro Streaming
encoding=pcm_mulaw and sample_rate=8000 tell AssemblyAI to expect raw mulaw without resampling
format_turns=true (not set in the URL above; append &format_turns=true to enable it) gives you properly cased and punctuated transcripts ready for the LLM

When end_of_turn is true, the caller has finished speaking and you have a complete utterance to send to the LLM.

Step 3: Run the LLM loop with tool calling

handle_user_turn is where the conversation logic lives. It takes the transcript, sends it to the LLM with the available tools, and either calls a tool or responds with text that becomes the agent's spoken reply.

# server.py (continued)
from openai import AsyncOpenAI

openai = AsyncOpenAI()

TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "get_order_status",
            "description": "Look up the status of a customer order by order ID.",
            "parameters": {
                "type": "object",
                "properties": {
                    "order_id": {"type": "string", "description": "e.g. AB3792"}
                },
                "required": ["order_id"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "transfer_to_human",
            "description": "Transfer the caller to a human agent.",
            "parameters": {
                "type": "object",
                "properties": {
                    "reason": {"type": "string"}
                },
                "required": ["reason"],
            },
        },
    },
]

conversation = [
    {
        "role": "system",
        "content": (
            "You are a friendly phone-based voice agent for a shoe retailer. "
            "Keep replies short — one or two sentences. "
            "Use get_order_status to look up orders. "
            "Use transfer_to_human if the caller asks for a person or is upset."
        ),
    }
]

async def handle_user_turn(transcript, twilio_ws, stream_sid):
    conversation.append({"role": "user", "content": transcript})
    response = await openai.chat.completions.create(
        model="gpt-4o",
        messages=conversation,
        tools=TOOLS,
        tool_choice="auto",
    )
    msg = response.choices[0].message

    if msg.tool_calls:
        # Record the assistant's tool request, then run each requested tool.
        conversation.append(msg.model_dump())
        for call in msg.tool_calls:
            result = await dispatch_tool(call.function.name, call.function.arguments)
            conversation.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": result,
            })
        # Ask the model again so it can turn the tool results into a spoken reply.
        followup = await openai.chat.completions.create(
            model="gpt-4o", messages=conversation
        )
        reply = followup.choices[0].message.content
    else:
        reply = msg.content

    conversation.append({"role": "assistant", "content": reply})
    await speak(reply, twilio_ws, stream_sid)

The tool dispatcher is where your business logic lives. For a real deployment, replace the stubs with calls to your CRM, order management system, or scheduling backend.
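The handler above calls dispatch_tool without showing it, so here is a minimal sketch of what such a dispatcher could look like. Everything in it is a placeholder: FAKE_ORDERS, the two handler functions, and the returned fields are hypothetical stand-ins for your real backend, and the only contract assumed is the one visible in handle_user_turn (a tool name plus a JSON string of arguments in, a string out).

```python
import asyncio
import json

# Hypothetical in-memory data standing in for a real order system.
FAKE_ORDERS = {"AB3792": "shipped"}


async def get_order_status(order_id: str) -> dict:
    """Stub lookup; replace with a call to your order management system."""
    status = FAKE_ORDERS.get(order_id.upper())
    if status is None:
        return {"found": False, "order_id": order_id}
    return {"found": True, "order_id": order_id, "status": status}


async def transfer_to_human(reason: str) -> dict:
    """Stub transfer; a real version would redirect the live call."""
    return {"transferring": True, "reason": reason}


TOOL_HANDLERS = {
    "get_order_status": get_order_status,
    "transfer_to_human": transfer_to_human,
}


async def dispatch_tool(name: str, arguments: str) -> str:
    """Run the named tool with JSON-encoded arguments; return a JSON string."""
    handler = TOOL_HANDLERS.get(name)
    if handler is None:
        return json.dumps({"error": f"unknown tool: {name}"})
    try:
        kwargs = json.loads(arguments or "{}")
        result = await handler(**kwargs)
    except (json.JSONDecodeError, TypeError) as exc:
        # Malformed JSON or wrong/missing parameters from the model.
        return json.dumps({"error": str(exc)})
    return json.dumps(result)


if __name__ == "__main__":
    print(asyncio.run(dispatch_tool("get_order_status", '{"order_id": "ab3792"}')))
```

Returning errors as JSON instead of raising lets the model see what went wrong and apologize or retry, rather than dropping the call on an exception.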

Step 4: Stream the TTS audio back to Twilio as mulaw

Twilio expects audio frames as base64-encoded mulaw at 8kHz. ElevenLabs supports a ulaw_8000 output format that produces exactly this — which means no resampling, no conversion, just stream the bytes back.

# server.py (continued)
from elevenlabs.client import AsyncElevenLabs

eleven = AsyncElevenLabs(api_key=os.environ["ELEVENLABS_API_KEY"])

async def speak(text, twilio_ws, stream_sid):
    audio_stream = eleven.text_to_speech.stream(
        voice_id=os.environ.get("ELEVENLABS_VOICE_ID", "EXAVITQu4vr4xnSDxMaL"),
        text=text,
        model_id="eleven_turbo_v2_5",
        output_format="ulaw_8000",
    )
    async for chunk in audio_stream:
        payload = base64.b64encode(chunk).decode()
        await twilio_ws.send_text(json.dumps({
            "event": "media",
            "streamSid": stream_sid,
            "media": {"payload": payload},
        }))

Each chunk gets streamed to Twilio as a media event. Twilio plays the audio to the caller as it arrives, which means the caller hears the first word of the agent's reply while the rest is still being synthesized.
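To make the wire format concrete, here is a minimal sketch of the messages that travel back to Twilio. The helper names and the 20 ms frame size are assumptions for illustration (the speak function above simply forwards whatever chunk sizes the TTS stream produces). The clear message is, as far as the Media Streams documentation describes it, the way to ask Twilio to drop audio it has buffered but not yet played, which is handy when the caller barges in.

```python
import base64
import json

MULAW_BYTES_PER_MS = 8  # 8 kHz mulaw is 1 byte per sample, so 8 bytes per millisecond


def media_events(mulaw: bytes, stream_sid: str, frame_ms: int = 20):
    """Yield Twilio 'media' messages, one per frame of mulaw audio."""
    frame_bytes = MULAW_BYTES_PER_MS * frame_ms
    for start in range(0, len(mulaw), frame_bytes):
        frame = mulaw[start:start + frame_bytes]
        yield json.dumps({
            "event": "media",
            "streamSid": stream_sid,
            "media": {"payload": base64.b64encode(frame).decode()},
        })


def clear_event(stream_sid: str) -> str:
    """Message asking Twilio to discard audio it has buffered but not played."""
    return json.dumps({"event": "clear", "streamSid": stream_sid})
```

Sending clear_event(stream_sid) before the next reply is one way to stop the agent talking over a caller who has interrupted.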

Step 5: Run it and connect Twilio

Start your server and expose it through ngrok:

uvicorn server:app --port 8000

ngrok http 8000

Copy the https://*.ngrok-free.dev URL ngrok prints. In the Twilio console:

Buy or pick a Voice-enabled phone number
Open the number's configuration
Under "A call comes in," set the webhook to https://your-ngrok-url/twilio/voice with method POST
Save

Call the number from your phone. You should hear the agent pick up and respond in natural conversation.

Latency budget: where your milliseconds go

A natural-feeling phone agent answers in under 800ms from when the caller stops speaking to when the caller hears the first audio of the reply. Here's where that budget gets spent on a Twilio + AssemblyAI stack:

Stage	Typical latency
AssemblyAI end-of-turn finalization	~150–250ms
LLM first-token generation (GPT-4o)	~200–400ms
TTS first-byte (ElevenLabs streaming)	~200–400ms
Twilio round-trip	~50–100ms
Total perceived latency	~600–1100ms
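
If you want to track this budget in code, a rough sketch like the one below sums the per-stage ranges from the table. The stage names are made up for illustration, and the numbers are just the table's typical values, not measurements. Note that a plain sum of the upper bounds comes to 1150 ms, a little above the rounded total in the table.

```python
# (low_ms, high_ms) per stage, copied from the table above
STAGES = {
    "stt_end_of_turn": (150, 250),
    "llm_first_token": (200, 400),
    "tts_first_byte": (200, 400),
    "twilio_round_trip": (50, 100),
}
TARGET_MS = 800  # the "feels natural" threshold from this section


def budget(stages):
    """Return (best_case_ms, worst_case_ms) by summing each stage's range."""
    low = sum(lo for lo, _ in stages.values())
    high = sum(hi for _, hi in stages.values())
    return low, high


low, high = budget(STAGES)
print(f"best case {low} ms, worst case {high} ms, target {TARGET_MS} ms")
print("headroom at best case:", TARGET_MS - low, "ms")
```

Swapping in your own measured percentiles per stage shows quickly which component is eating the headroom.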

Three things blow the budget the moment you stop being careful:

Resampling audio. Anything that converts 8kHz mulaw to 16kHz PCM (and back) costs 50–150ms each way. AssemblyAI's Universal-3 Pro Streaming and ElevenLabs's ulaw_8000 output both keep audio in mulaw end-to-end.
Non-streaming LLMs. Waiting for the full response before TTS starts is a guaranteed dead zone. Stream tokens from the LLM and chunk them to TTS sentence-by-sentence.
Cold-start tools. A tool call that hits a slow database eats your entire turn. Cache hot data and time out slow lookups aggressively.
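
For the second point, sentence-by-sentence chunking can be sketched as a small generator. This is an illustrative helper, not code from the repository: the function name is hypothetical, and the regex is a deliberately simple heuristic that will split on abbreviations such as "e.g. " as if they ended a sentence.

```python
import re

# Split after ., ! or ? when followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def sentences_from_tokens(tokens):
    """Group streamed text fragments into complete sentences."""
    buffer = ""
    for token in tokens:
        buffer += token
        parts = _SENTENCE_END.split(buffer)
        # Everything except the last part is a finished sentence.
        for sentence in parts[:-1]:
            if sentence.strip():
                yield sentence.strip()
        buffer = parts[-1]
    # Flush whatever is left when the stream ends.
    if buffer.strip():
        yield buffer.strip()


tokens = ["Your order", " shipped.", " It arrives", " Friday", ". Anything else?"]
print(list(sentences_from_tokens(tokens)))
```

In a real handler you would feed this from the LLM's streaming response and pass each sentence to the TTS call as soon as it is yielded, so synthesis of the first sentence overlaps with generation of the rest.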

What about the AssemblyAI Voice Agent API?

If your voice agent doesn't need Twilio specifically — for example a browser-based assistant, a mobile app, or an embedded device — the Voice Agent API wraps STT, LLM, TTS, turn detection, and tool calling behind a single WebSocket at a flat $4.50/hour (announcement). You skip the three-provider plumbing entirely.

For Twilio-bridged phone calls today, the chained architecture in this tutorial is still the most flexible path — it lets you pick exactly the LLM, TTS voice, and tool definitions you want. The Voice Agent API is the right choice for everything that isn't a PSTN inbound call, and Twilio integration through the Voice Agent API is on the roadmap.

The complete repository

Fork the runnable repo at github.com/kelsey-aai/twilio-voice-agent-assemblyai. It includes the FastAPI server, tool dispatcher, sample tools (get_order_status, transfer_to_human), a .env.example, and ngrok setup instructions. Total length: ~250 lines of Python.

Frequently asked questions

How do I build a voice agent with Twilio and AssemblyAI?

To build a voice agent with Twilio and AssemblyAI, point your Twilio phone number at a TwiML endpoint that opens a Media Streams connection to your server's WebSocket. In the WebSocket handler, forward Twilio's 8kHz mulaw audio frames to AssemblyAI's Universal-3 Pro Streaming API using encoding=pcm_mulaw and sample_rate=8000. When AssemblyAI returns a finalized turn, pass the transcript to an LLM (GPT-4o, Claude) with your tool definitions (see our function calling tutorial for a deeper walkthrough), then stream the LLM's reply through a TTS model that supports ulaw_8000 output (like ElevenLabs) back to Twilio as base64-encoded media events.

Why use AssemblyAI for a Twilio voice agent?

AssemblyAI's Universal-3 Pro Streaming model is built for the audio Twilio actually sends (8kHz mulaw) without requiring resampling, which costs latency. It delivers 307ms P50 latency, immutable transcripts your downstream LLM can trust, and 21% fewer alphanumeric errors than the previous generation, which matters when the agent is capturing confirmation codes, phone numbers, or email addresses over a phone line. For an overview of the broader category, see AI voice agents in 2026.

Does the Voice Agent API work with Twilio?

The AssemblyAI Voice Agent API is the simplest path for voice agents that don't need Twilio specifically — a single WebSocket replaces STT, LLM, and TTS at $4.50/hour. Native Twilio integration through the Voice Agent API is on the roadmap. Today, the chained architecture in this tutorial (Universal-3 Pro Streaming + your LLM + your TTS, bridged through a Twilio Media Streams WebSocket) is the standard path for Twilio-based phone agents.

What latency should I expect from a Twilio voice agent?

A well-tuned Twilio voice agent built on AssemblyAI Universal-3 Pro Streaming, GPT-4o, and ElevenLabs typically hits 600–1100ms from caller-stops-talking to caller-hears-reply. The biggest latency killers are resampling audio (use native mulaw end-to-end), non-streaming LLM responses (stream tokens), and slow tool calls (cache and time out aggressively).

How much does it cost to run a phone-based voice agent?

The cost breaks down across four components: Twilio voice (per-minute, varies by country), AssemblyAI Universal-3 Pro Streaming ($0.15/hour of session time), the LLM (varies by provider — typically a few cents per minute of conversation for GPT-4o), and TTS (per-character or per-minute). End-to-end you're looking at a few cents per minute at scale, with the exact number driven by which LLM and TTS you choose.

Can a Twilio voice agent handle multiple simultaneous calls?

Yes. AssemblyAI's Universal-3 Pro Streaming supports unlimited concurrent streams at a flat $0.15/hour with no separate negotiation required. Twilio handles concurrency per-account based on your plan. The constraint at scale is usually your own server's WebSocket concurrency limits — FastAPI with uvicorn workers handles hundreds of concurrent calls comfortably on modest hardware.

Build an AI voice agent for customer support that can look up orders

Mart Schweiger — Tue, 19 May 2026 19:27:21 +0000

Tier-1 customer support is mostly the same five conversations on repeat: where's my order, can I change my address, can I get a refund, when does this ship, can I talk to a human. They're predictable, they're high-volume, and they don't need a person — they need a voice agent that can actually look things up.

This tutorial walks you through building one. By the end, you'll have a Python voice agent that answers calls, listens for an order ID or email, calls into your backend to check the status, and reads the result back to the customer in real time. When something goes off-script, it transfers to a human with the full conversation context attached.

We're using AssemblyAI's Voice Agent API — one WebSocket that handles the speech understanding, LLM reasoning, voice generation, turn detection, and tool calling in a single connection. Total time to a working prototype: about an afternoon.

Why most support voice agents fail

Before we build, it's worth knowing where these things break. The pattern is almost always the same:

Customer says "my order ID is A-B-3-7-9-2"
STT mishears it as "a b 37 92" or "ABE 379 to"
The LLM calls get_order_status("ab3792") or worse, asks the customer to repeat
Customer hangs up

The agent didn't fail because the LLM was wrong. It failed because the speech-to-text layer couldn't capture the entity correctly. This is why entity accuracy on alphanumerics, emails, and phone numbers matters more than overall WER for support agents — and why we're building on Universal-3 Pro Streaming, which has a 16.7% mixed-entity error rate vs. 23–25% for competing models.
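
A small normalization step in front of the lookup can absorb the harmless variations, such as separators and casing. The sketch below is an assumption about how you might do that (the function name is hypothetical, and it suits ID schemes where hyphens carry no meaning). It cannot repair characters the STT layer actually misheard, which is the point of this section.

```python
import re


def normalize_spoken_id(text: str) -> str:
    """Collapse spelled-out IDs like 'A-B-3-7-9-2' into 'AB3792'."""
    return re.sub(r"[^A-Za-z0-9]", "", text).upper()


print(normalize_spoken_id("A-B-3-7-9-2"))  # AB3792
print(normalize_spoken_id("a b 37 92"))    # AB3792
print(normalize_spoken_id("ABE 379 to"))   # ABE379TO (still wrong: the audio was misheard)
```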

The second-most-common failure: dead air during tool calls. The customer asks a question, the agent calls a backend, and there's a 2–3 second silence while the lookup runs. The Voice Agent API solves this by speaking a natural transition phrase ("let me check that for you") while the tool runs — no dead air, no awkward pauses.

What you'll build

A Python voice support agent that handles three real workflows:

Order status lookup — customer says "where's my order?" → agent asks for the ID → looks it up → reads back status, ETA, tracking number
Customer info verification — customer provides email or phone number → agent looks up the account → confirms identity before proceeding
Human escalation — customer asks for a person, or the agent gets stuck → graceful transfer with conversation context preserved

Stack:

AssemblyAI Voice Agent API (one WebSocket: STT + LLM + TTS)
Python 3.9+
A backend with order data — we'll mock it; replace with your real CRM or order management system

Setup

pip install "websockets>=14" pyaudio python-dotenv

Create .env:

ASSEMBLYAI_API_KEY=your_key_here

The Voice Agent API uses a single endpoint: wss://agents.assemblyai.com/v1/ws. One key, one connection, no separate STT or TTS providers to wire in.

Step 1: Define the support tools

Tools are the agent's interface to your backend. The Voice Agent API uses standard JSON Schema, so anything you can describe with a schema, the agent can call.

For a support agent, you typically want four tools:

import json

TOOLS = [
    {
        "type": "function",
        "name": "get_order_status",
        "description": "Look up an order's current status, shipping ETA, and tracking number by order ID.",
        "parameters": {
            "type": "object",
            "properties": {
                "order_id": {
                    "type": "string",
                    "description": "The customer's order ID, e.g. ORD-12345 or 78231-ABC.",
                },
            },
            "required": ["order_id"],
        },
    },
    {
        "type": "function",
        "name": "lookup_account_by_email",
        "description": "Find a customer account using their email address.",
        "parameters": {
            "type": "object",
            "properties": {
                "email": {"type": "string", "description": "The customer's email address."},
            },
            "required": ["email"],
        },
    },
    {
        "type": "function",
        "name": "list_recent_orders",
        "description": "List the customer's most recent orders. Use after the account is verified.",
        "parameters": {
            "type": "object",
            "properties": {
                "account_id": {"type": "string"},
                "limit": {"type": "integer", "description": "Max number of orders to return.", "default": 5},
            },
            "required": ["account_id"],
        },
    },
    {
        "type": "function",
        "name": "transfer_to_human",
        "description": "Transfer the call to a human agent. Use when the customer asks, when you can't help, or when the issue is sensitive.",
        "parameters": {
            "type": "object",
            "properties": {
                "reason": {"type": "string", "description": "Short reason for the transfer."},
                "summary": {"type": "string", "description": "Brief summary of the conversation so far."},
            },
            "required": ["reason", "summary"],
        },
    },
]

Now implement the actual functions. Replace these stubs with calls to your real backend:

ORDERS_DB = {
    "ORD-12345": {"status": "shipped", "eta": "2026-05-09", "tracking": "1Z999AA10123456784"},
    "ORD-67890": {"status": "processing", "eta": "2026-05-12", "tracking": None},
}

ACCOUNTS_DB = {
    "jane@example.com": {"account_id": "ACC-001", "name": "Jane Doe"},
}

ACCOUNT_ORDERS = {
    "ACC-001": [
        {"order_id": "ORD-12345", "date": "2026-05-01", "total": "$84.99"},
        {"order_id": "ORD-12100", "date": "2026-04-22", "total": "$42.00"},
    ],
}

def run_tool(name: str, args: dict) -> dict:
    if name == "get_order_status":
        order = ORDERS_DB.get(args["order_id"].upper())
        if not order:
            return {"error": "order_not_found", "order_id": args["order_id"]}
        return order

    if name == "lookup_account_by_email":
        account = ACCOUNTS_DB.get(args["email"].lower())
        if not account:
            return {"error": "account_not_found"}
        return account

    if name == "list_recent_orders":
        orders = ACCOUNT_ORDERS.get(args["account_id"], [])
        return {"orders": orders[: args.get("limit", 5)]}

    if name == "transfer_to_human":
        # In production: trigger your call routing / queue handoff here
        return {"transferred": True, "queue": "support-tier-2"}

    return {"error": f"unknown_tool: {name}"}

The error-shape pattern matters. When get_order_status can't find an order, it returns a structured error rather than throwing — that gives the LLM the context it needs to apologize and ask the customer to verify the ID, instead of crashing the conversation.
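
The same idea extends to failures you didn't plan for, such as a missing argument or a backend exception. The wrapper below is a sketch under that assumption. The name safe_run_tool and the error codes are hypothetical, and demo_run_tool is only a stand-in for the tutorial's run_tool so the example runs on its own.

```python
def safe_run_tool(run_tool, name: str, args: dict) -> dict:
    """Call run_tool and convert exceptions into structured errors."""
    try:
        return run_tool(name, args)
    except KeyError as exc:
        # Most often: a required argument was missing from the model's tool call.
        return {"error": "missing_argument", "argument": str(exc.args[0])}
    except Exception as exc:  # backend failure, bad data, and so on
        return {"error": "tool_failed", "detail": type(exc).__name__}


# Tiny stand-in for the tutorial's run_tool, for demonstration only.
def demo_run_tool(name, args):
    return {"order_id": args["order_id"], "status": "shipped"}


print(safe_run_tool(demo_run_tool, "get_order_status", {}))
# {'error': 'missing_argument', 'argument': 'order_id'}
```

Either way the LLM receives a dictionary it can reason about, rather than the WebSocket handler dying mid-call.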

Step 2: Write the system prompt

The system prompt is where you encode the agent's behavior. For support, you want a few things every time: identity and tone, when to ask for verification before sharing details, when to use which tool, when to transfer to a human, and specific phrasing for transition moments (the "let me check that" line).

SYSTEM_PROMPT = """
You are Avery, a customer support agent for Acme Corp. Your goal is to help
customers quickly and accurately. You have access to tools that let you look up
orders and accounts.

Behavior rules:
- Greet warmly and ask how you can help.
- For order questions, ask for the order ID first if the customer hasn't given it.
- If a customer gives an email or phone number, use lookup_account_by_email to
  verify.
- Read order status, ETA, and tracking number clearly. Don't read raw timestamps —
  say dates naturally (e.g., "Friday, May 9th").
- When you need to call a tool, say a brief transition like "Let me check on that"
  or "One moment while I pull that up."
- If the customer asks for a human, sounds frustrated, or has a complex issue
  (refund disputes, damaged product, billing errors), use transfer_to_human and
  include a short summary.
- Never make up an order ID, status, or tracking number. If a tool returns an
  error, apologize, ask the customer to verify the ID, and try again.
- Keep replies short and conversational. This is a phone call, not an email.
"""

The "never make up" line is the most important sentence in the prompt. Without it, LLMs sometimes invent plausible-sounding tracking numbers when the lookup fails. With it, they ask for clarification instead.

Step 3: Connect to the Voice Agent API

Now the WebSocket connection. The pattern is:

Open wss://agents.assemblyai.com/v1/ws with your API key
Send session.update with the system prompt, tools, voice, and greeting
Wait for session.ready, then start streaming microphone audio
Handle incoming events: tool.call, reply.audio, transcript.user, reply.done

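import asyncio
import base64
import json
import os

import pyaudio
import websockets
from dotenv import load_dotenv

load_dotenv()  # read ASSEMBLYAI_API_KEY from the .env file created during setup

API_KEY = os.getenv("ASSEMBLYAI_API_KEY")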
WS_URL = "wss://agents.assemblyai.com/v1/ws"
SAMPLE_RATE = 24000

async def run_agent():
    async with websockets.connect(
        WS_URL,
        additional_headers={"Authorization": f"Bearer {API_KEY}"},
    ) as ws:
        # Configure the agent
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "system_prompt": SYSTEM_PROMPT,
                "greeting": "Hi, this is Avery from Acme support. How can I help?",
                "output": {"voice": "ivy"},
                "tools": TOOLS,
            },
        }))

        # Set up microphone capture and speaker playback
        pa = pyaudio.PyAudio()
        mic = pa.open(format=pyaudio.paInt16, channels=1, rate=SAMPLE_RATE,
                      input=True, frames_per_buffer=1024)
        speaker = pa.open(format=pyaudio.paInt16, channels=1, rate=SAMPLE_RATE,
                          output=True)

        ready = asyncio.Event()
        pending_tools = []

        async def send_audio():
            await ready.wait()
            import base64
            while True:
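                # Read in a worker thread so the blocking call doesn't stall the event loop
                audio = await asyncio.to_thread(
                    mic.read, 1024, exception_on_overflow=False
                )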
                await ws.send(json.dumps({
                    "type": "input.audio",
                    "audio": base64.b64encode(audio).decode(),
                }))
                await asyncio.sleep(0)

        async def handle_messages():
            async for raw in ws:
                event = json.loads(raw)
                t = event.get("type")

                if t == "session.ready":
                    ready.set()
                    print("Agent ready. Start speaking.")

                elif t == "transcript.user":
                    print(f"\nUser: {event['text']}")

                elif t == "transcript.agent":
                    print(f"Agent: {event['text']}")

                elif t == "reply.audio":
                    import base64
                    speaker.write(base64.b64decode(event["data"]))

                elif t == "tool.call":
                    name = event["name"]
                    args = event.get("arguments", {})
                    print(f"  [tool] {name}({args})")
                    result = run_tool(name, args)
                    pending_tools.append(
                        {"call_id": event["call_id"], "result": result}
                    )

                elif t == "reply.done":
                    if event.get("status") == "interrupted":
                        pending_tools.clear()
                    elif pending_tools:
                        for tool in pending_tools:
                            await ws.send(json.dumps({
                                "type": "tool.result",
                                "call_id": tool["call_id"],
                                "result": json.dumps(tool["result"]),
                            }))
                        pending_tools.clear()

        await asyncio.gather(send_audio(), handle_messages())

if __name__ == "__main__":
    asyncio.run(run_agent())

A few details that the docs flag and you'd otherwise debug for an hour:

Don't send tool.result immediately when you receive tool.call. Accumulate results and send them inside the reply.done handler. Sending too early causes timing issues.
Discard pending tool results on interruption. If the user speaks while the agent is generating a transition phrase, you'll get reply.done with status: "interrupted" — clear the buffer and wait for the next turn.
Voice names are case-sensitive. Use lowercase: ivy, claire, dawn. An invalid voice returns session.error.
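
The first two rules are easy to get subtly wrong when they live inline in the event loop, so one option is to pull them into a small helper. This is a sketch, not part of the API: the class name is hypothetical, and the message shape simply mirrors the tool.result payload used in the script above.

```python
import json


class ToolResultBuffer:
    """Holds tool results until the current reply finishes."""

    def __init__(self):
        self._pending = []

    def add(self, call_id: str, result: dict) -> None:
        self._pending.append((call_id, result))

    def on_reply_done(self, status=None) -> list:
        """Return the tool.result messages to send, then reset the buffer."""
        pending, self._pending = self._pending, []
        if status == "interrupted":
            return []  # the user barged in, so drop stale results
        return [
            json.dumps({
                "type": "tool.result",
                "call_id": call_id,
                "result": json.dumps(result),
            })
            for call_id, result in pending
        ]
```

In the handler, tool.call would call add(...), and reply.done would send every message returned by on_reply_done(event.get("status")).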

Step 4: Test the three workflows

Run the script and walk through each support scenario. You should hear:

Workflow 1 — Order lookup:

You: "Hi, I'm trying to check on order O-R-D 1-2-3-4-5"
Agent: "Sure, let me check on that... I see order ORD-12345. It shipped and is
        on its way. You should have it by Saturday, May 9th. The tracking number
        is 1Z999AA10123456784."

Workflow 2 — Email-based account lookup:

You: "I forgot my order ID. Can you look me up by email?"
Agent: "Of course. What's the email on the account?"
You: "It's jane at example dot com."
Agent: "One moment... Got it, you're Jane Doe. I see two recent orders:
        ORD-12345 from May 1st for $84.99, and ORD-12100 from April 22nd
        for $42.00. Which one are you asking about?"

Workflow 3 — Human transfer:

You: "I just want to talk to a person."
Agent: "I understand. Let me get you over to a teammate now."
[tool.call: transfer_to_human({"reason": "user requested human", "summary": "..."})]

Speak the order ID with hesitation, mumbles, accents, and natural disfluencies — that's where Universal-3 Pro Streaming earns its keep. The agent should still extract the ID correctly because it's tuned for the alphanumeric tokens that voice agents act on.

Step 5: Take it to the phone

This works on your laptop through your microphone, but real customer support runs on phones. Twilio Media Streams is the standard bridge: your server accepts the inbound call from Twilio and opens a parallel connection to the Voice Agent API, forwarding audio in both directions.

The Voice Agent API supports audio/pcmu (G.711 μ-law at 8 kHz) natively, which matches Twilio's codec exactly. No transcoding, no resampling. The Twilio integration guide walks through the full bridge in about 100 lines of TypeScript.
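
Because both sides speak base64 mulaw, the heart of such a bridge is just message translation. The Python sketch below is not the official guide's code (that one is TypeScript). The function names are hypothetical, and it assumes the session has already been configured for audio/pcmu, which is a setting this tutorial doesn't show. The event shapes are the ones used elsewhere in this article: input.audio and reply.audio on the Voice Agent side, media events on the Twilio side.

```python
import json


def twilio_to_agent(twilio_msg: str):
    """Map a Twilio 'media' message to a Voice Agent input.audio event."""
    msg = json.loads(twilio_msg)
    if msg.get("event") != "media":
        return None  # start/stop/mark events carry no audio
    # Both sides use base64 mulaw, so the payload passes through untouched.
    return json.dumps({"type": "input.audio", "audio": msg["media"]["payload"]})


def agent_to_twilio(agent_msg: str, stream_sid: str):
    """Map a Voice Agent reply.audio event to a Twilio 'media' message."""
    event = json.loads(agent_msg)
    if event.get("type") != "reply.audio":
        return None  # transcripts, tool calls, etc. are handled elsewhere
    return json.dumps({
        "event": "media",
        "streamSid": stream_sid,
        "media": {"payload": event["data"]},
    })
```

A full bridge would run two loops, one per WebSocket, and pass every non-None result to the other side.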

What to harden before production

Three things you'll want to nail down before pointing this at real customers:

Replace the in-memory mocks with calls to your actual CRM or order management system. Add timeouts and error handling so a slow backend doesn't kill the conversation.
Log everything. Save user transcripts, tool calls, results, and the agent's responses tied to a session ID. Conversation logs are your debugging tool when something goes wrong on call #4,712.
Tune turn detection for your acoustic environment. The defaults work for most use cases. For phone audio with background noise, you may want to raise min_end_of_turn_silence_ms slightly so the agent doesn't cut off thoughtful pauses.
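
For the timeout part of the first item, a minimal sketch using asyncio.wait_for might look like this. The helper name, the error shape, and the 2-second default are assumptions, and slow_lookup is a made-up stand-in for a real backend call.

```python
import asyncio


async def call_with_timeout(tool_coro, timeout_s: float = 2.0) -> dict:
    """Await a tool coroutine, returning a structured error if it is too slow."""
    try:
        return await asyncio.wait_for(tool_coro, timeout=timeout_s)
    except asyncio.TimeoutError:
        return {"error": "backend_timeout", "timeout_s": timeout_s}


# Hypothetical slow backend call, for demonstration only.
async def slow_lookup(order_id: str) -> dict:
    await asyncio.sleep(0.5)
    return {"order_id": order_id, "status": "shipped"}


if __name__ == "__main__":
    print(asyncio.run(call_with_timeout(slow_lookup("ORD-12345"), timeout_s=0.1)))
    # {'error': 'backend_timeout', 'timeout_s': 0.1}
```

If your tool functions are synchronous, like the run_tool stub in this tutorial, one option is to wrap them with asyncio.to_thread(run_tool, name, args) before passing them in. The worker thread keeps running after the timeout, but the conversation no longer waits on it.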

Where to go from there

Once the basic order-lookup loop works, the same tool-calling pattern extends to every other support workflow you have: cancel an order, update a shipping address, request a refund, schedule a callback, fetch FAQ answers from a knowledge base. Add the function, describe it in the system prompt, and the agent picks it up — no new infrastructure.

The compounding win: every conversation goes through the same Voice Agent API connection, the same transcription model, the same billing relationship. You're not assembling a new vendor stack; you're adding tools to an agent that already works.

Try the Voice Agent API live on the product page — it's the same API you'd ship with — or grab a free API key with $50 in starter credits and have your first agent answering calls by end of day.

Frequently asked questions

How do I build an AI voice agent for customer support that can look up orders?

Build it on AssemblyAI's Voice Agent API, register a get_order_status function as a tool with JSON Schema, and connect to the WebSocket at wss://agents.assemblyai.com/v1/ws. The agent transcribes the customer's speech, decides when to call your function, executes it through your backend, and speaks the result back — all on a single connection. Most developers ship a working agent in an afternoon because there's no SDK to learn and no separate STT, LLM, or TTS providers to wire together.

Why does speech-to-text accuracy matter so much for support voice agents?

Support agents constantly need to capture alphanumeric tokens — order IDs, account numbers, email addresses, phone numbers — and a single transcription error breaks the workflow. If the STT layer mishears "ORD-12345" as "or 12 three 45," your get_order_status function gets a garbled ID and returns nothing. AssemblyAI's Voice Agent API is built on Universal-3 Pro Streaming, which has a 16.7% mixed-entity error rate vs. 23–25% for competing models — that's the difference between tool calls that succeed and tool calls that silently fail.

How does tool calling work with the AssemblyAI Voice Agent API?

You register tools by passing an array of function definitions in session.tools on a session.update event. When the agent decides to call a tool, it emits a tool.call event with the function name and arguments. You execute the function and accumulate results, then send tool.result events inside your reply.done handler — not immediately on tool.call. While the tool runs, the agent speaks a brief transition phrase like "let me check that for you" so the conversation never goes silent.

Can I connect AssemblyAI's Voice Agent API to phone calls with Twilio?

Yes — the Voice Agent API supports audio/pcmu (G.711 μ-law at 8 kHz) natively, which matches Twilio's codec exactly with no transcoding needed. You set up a server that accepts the inbound Twilio Media Streams call, opens a parallel WebSocket to the Voice Agent API, and forwards audio in both directions. The official Twilio integration guide walks through inbound and outbound calling in about 100 lines of TypeScript.

What's the best way to handle escalation to a human in a customer support voice agent?

Register a transfer_to_human tool with parameters for reason and summary, and instruct the agent in the system prompt to call it when the customer asks for a person, sounds frustrated, or has a complex issue (refund disputes, billing errors, damaged products). The agent generates a short summary of the conversation that you forward to your human queue, so the receiving agent doesn't have to ask the customer to repeat themselves. This is one of the most important workflows to design well — a poor handoff feels worse than no AI at all.

How much does it cost to run a customer support voice agent on AssemblyAI?

The Voice Agent API is $4.50/hr flat — covering speech understanding, LLM reasoning, voice generation, turn detection, and tool calling all in one bill. There are no per-token surcharges, no concurrency caps, and no separate invoices for STT, LLM, and TTS providers. Pricing is billed by the minute on actual conversation duration, and a free tier is available for testing.

Do voice agents built with AssemblyAI work with healthcare workflows subject to HIPAA?

Yes — AssemblyAI offers a Business Associate Addendum (BAA) for customers processing protected health information (PHI) and is SOC 2 Type 2, ISO 27001:2022, and PCI DSS v4.0 certified. For clinical use cases (medical front-office voice agents, healthcare contact centers), enable Medical Mode with domain="medical-v1" to improve transcription accuracy on medication names, procedures, conditions, and dosages. Do not point the agent at real PHI without a signed BAA in place.

Build a real-time voice AI agent in Python with the AssemblyAI Voice Agent API

Mart Schweiger — Tue, 19 May 2026 19:26:37 +0000

You can build a working real-time voice agent in Python in well under 100 lines of code if you use the right primitive. This tutorial walks through building one on the AssemblyAI Voice Agent API — a single WebSocket that wraps streaming speech-to-text, an LLM, text-to-speech, turn detection, and tool calling at $4.50/hour flat. No three-provider pipeline to wire up, no separate STT WebSocket plus LLM HTTP plus TTS stream to coordinate. Audio in, audio out, tool calls in between.

By the end of this guide, you'll have a runnable Python voice agent that listens through your microphone, holds a real conversation, and calls Python functions to take actions. The companion repository is linked at the end. If you'd rather chain streaming STT, an LLM, and a TTS provider yourself, our Python voice agent tutorial covers that path, or see the 5-minute Voice Agent API quickstart for an even faster path.

Why use the Voice Agent API for a Python voice agent

The traditional "voice agent in Python" tutorial wires together a streaming STT API, an LLM HTTP endpoint, and a TTS streaming connection — three providers, three sets of credentials, three sets of latency variables to tune, and your own turn detection logic to write. The result works, but it's a lot of plumbing.

The Voice Agent API replaces all of that with one WebSocket. You connect once, send audio frames, and receive both audio output and tool call events on the same stream. Three properties make it useful for production Python voice agents:

One bill, one set of logs. $4.50/hour of session time covers STT, LLM inference, TTS, turn detection, and tool calling. You're not pasting three invoices into a cost spreadsheet.
Speech accuracy that works on real audio. Universal-3 Pro Streaming sits underneath — 307ms P50 latency, immutable transcripts, native 8kHz mulaw support for telephony, and 21% fewer alphanumeric errors than the previous generation of streaming STT.
Tool calling that maps to Python functions cleanly. Define tools as JSON schemas, the LLM calls them, results stream back into the conversation. No separate function-calling API or LLM provider to manage.
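
To turn the flat rate into a budget figure, a back-of-the-envelope sketch is enough. The function names are hypothetical, and the calculation ignores any rounding or minimums the actual bill may apply.

```python
VOICE_AGENT_RATE_PER_HOUR = 4.50  # flat rate quoted in this article


def session_cost(seconds: float, rate_per_hour: float = VOICE_AGENT_RATE_PER_HOUR) -> float:
    """Cost in dollars for one session, ignoring any billing-side rounding."""
    return round(seconds / 3600 * rate_per_hour, 4)


def monthly_cost(calls: int, avg_minutes: float,
                 rate_per_hour: float = VOICE_AGENT_RATE_PER_HOUR) -> float:
    """Cost in dollars for a month of calls with the given average length."""
    return round(calls * avg_minutes / 60 * rate_per_hour, 2)


print(session_cost(300))      # 5-minute call -> 0.375
print(monthly_cost(1000, 4))  # 1,000 four-minute calls -> 300.0
```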

Architecture

  Microphone
     │  PCM16 24kHz mono
     ▼
  Your Python script
     │  WebSocket: input.audio frames
     ▼
  AssemblyAI Voice Agent API
   ┌────────────────────────────────┐
   │  STT + Turn detection           │
   │      ↓                          │
   │  LLM + tool calling             │
   │      ↓                          │
   │  TTS                            │
   └────────────────────────────────┘
     │
     │  WebSocket: reply.audio + tool.call events
     ▼
  Your Python script
     ├─► Speaker playback
     └─► Dispatch tool calls back to LLM

Audio flows in two directions on the same WebSocket. Your script captures mic audio, base64-encodes it, and sends it as input.audio events. The API returns audio playback chunks as reply.audio events and structured tool.call events when the LLM decides to invoke one of your tools. You dispatch the tool, send back a tool.result, and the conversation continues.

Before you start

You'll need:

An AssemblyAI account with Voice Agent API access
Python 3.11+
A working microphone and speakers (use headphones for clean barge-in — desktop mics pick up the agent's own voice and cause it to interrupt itself)
portaudio installed system-wide (brew install portaudio on macOS, apt install portaudio19-dev on Debian/Ubuntu)

Install the dependencies:

pip install "websockets>=14" python-dotenv pyaudio

Drop your API key into a .env file:

ASSEMBLYAI_API_KEY=your_key_here

Step 1: Capture microphone audio

PyAudio captures raw PCM audio. The Voice Agent API's default audio/pcm encoding is 24 kHz, 16-bit, mono — the audio format docs recommend ~50 ms chunks for low latency.

# audio.py
import threading
from queue import Queue
import pyaudio

SAMPLE_RATE = 24000
CHUNK_SIZE = 1200  # 50ms at 24kHz 16-bit mono

class Mic:
    def __init__(self):
        self._pa = pyaudio.PyAudio()
        self.queue = Queue()
        self._running = False

    def start(self):
        self._running = True
        self._stream = self._pa.open(
            format=pyaudio.paInt16, channels=1, rate=SAMPLE_RATE,
            input=True, frames_per_buffer=CHUNK_SIZE,
        )
        threading.Thread(target=self._loop, daemon=True).start()

    def _loop(self):
        while self._running:
            self.queue.put(
                self._stream.read(CHUNK_SIZE, exception_on_overflow=False)
            )

    def stop(self):
        self._running = False
        self._stream.stop_stream(); self._stream.close()
        self._pa.terminate()

class Speaker:
    def __init__(self):
        self._pa = pyaudio.PyAudio()
        self._stream = self._open()

    def _open(self):
        return self._pa.open(
            format=pyaudio.paInt16, channels=1, rate=SAMPLE_RATE, output=True,
        )

    def play(self, audio_bytes):
        self._stream.write(audio_bytes)

    def flush_and_restart(self):
        # Called on barge-in: drop any queued speech and reopen the stream.
        try:
            self._stream.stop_stream(); self._stream.close()
        except Exception:
            pass
        self._stream = self._open()

    def close(self):
        self._stream.stop_stream(); self._stream.close()
        self._pa.terminate()
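
The CHUNK_SIZE constant follows directly from the audio format. As a quick check (helper names are illustrative), PyAudio counts buffer sizes in frames, so 50 ms at 24 kHz is 1200 frames, or 2400 bytes of 16-bit mono audio:

```python
def chunk_frames(sample_rate_hz: int, chunk_ms: int) -> int:
    """Number of audio frames (samples per channel) in one chunk."""
    return sample_rate_hz * chunk_ms // 1000


def chunk_bytes(sample_rate_hz: int, chunk_ms: int,
                bytes_per_sample: int = 2, channels: int = 1) -> int:
    """Size in bytes of one chunk for the given sample width and channel count."""
    return chunk_frames(sample_rate_hz, chunk_ms) * bytes_per_sample * channels


print(chunk_frames(24000, 50))                    # 1200
print(chunk_bytes(24000, 50))                     # 2400
print(chunk_bytes(8000, 20, bytes_per_sample=1))  # 160 (20 ms of 8 kHz mulaw)
```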

Step 2: Open the Voice Agent API session

The Voice Agent API connection starts with a session.update message that declares your system prompt, the tools you want available, the agent's voice, and an opening greeting. The API picks audio/pcm (24 kHz) by default, so you don't need to specify input/output format explicitly.

# agent.py
import asyncio, base64, json, os
import websockets
from dotenv import load_dotenv

from audio import Mic, Speaker
from tools import TOOLS, dispatch_tool

load_dotenv()

VOICE_AGENT_WS = "wss://agents.assemblyai.com/v1/ws"

SYSTEM_PROMPT = """You are a helpful voice assistant.
Keep replies short and conversational — one or two sentences.
Use the available tools to answer questions when relevant."""

async def open_session(ws):
    await ws.send(json.dumps({
        "type": "session.update",
        "session": {
            "system_prompt": SYSTEM_PROMPT,
            "greeting": "Hi! How can I help?",
            "tools": TOOLS,
            "output": {"voice": "ivy"},
        },
    }))

A few details worth flagging up front, because they're the easy ones to get wrong:

The auth header for the Voice Agent API uses Authorization: Bearer YOUR_KEY — note the Bearer prefix. This is different from every other AssemblyAI endpoint, which accepts the raw API key with no prefix.
The first message you send is session.update, not session.start. All config nests under a session object.
The voice field is a named voice from the Voice Agent API catalog (e.g. ivy, james, sophie) — not an ElevenLabs voice ID. See the voices reference for the full list.
You must wait for the server's session.ready event before sending any audio.
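
The details in the list above can be captured in two small helpers so they are harder to get wrong. This is a sketch: build_session_update and auth_headers are illustrative names, and the field names simply mirror the session.update example shown earlier.

```python
import json


def auth_headers(api_key: str) -> dict:
    """Authorization header with the Bearer prefix described above."""
    return {"Authorization": f"Bearer {api_key}"}


def build_session_update(system_prompt, tools, voice="ivy", greeting=None) -> str:
    """Serialize a session.update message with all config under "session"."""
    for tool in tools:
        if tool.get("type") != "function":
            name = tool.get("name")
            raise ValueError(f'tool {name!r} must set "type": "function"')
    session = {
        "system_prompt": system_prompt,
        "tools": tools,
        "output": {"voice": voice},
    }
    if greeting is not None:
        session["greeting"] = greeting
    return json.dumps({"type": "session.update", "session": session})
```

Checking the tool definitions locally gives you a clear error in your own code rather than a rejected session at connect time.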

Step 3: Pump audio in, route events out

Two coroutines run concurrently: one sends mic chunks once the session is ready, the other reads events as they arrive.

async def run_agent():
    mic = Mic()
    speaker = Speaker()

    async with websockets.connect(
        VOICE_AGENT_WS,
        additional_headers={"Authorization": f"Bearer {os.environ['ASSEMBLYAI_API_KEY']}"},
    ) as ws:
        await open_session(ws)

        ready = asyncio.Event()
        pending_tools = []
        loop = asyncio.get_running_loop()

        async def send_audio():
            await ready.wait()
            mic.start()
            while True:
                chunk = await loop.run_in_executor(None, mic.queue.get)
                await ws.send(json.dumps({
                    "type": "input.audio",
                    "audio": base64.b64encode(chunk).decode(),
                }))

        async def receive():
            async for raw in ws:
                event = json.loads(raw)
                kind = event["type"]

                if kind == "session.ready":
                    ready.set()
                    print(f"Session ready: {event.get('session_id')}")

                elif kind == "reply.audio":
                    speaker.play(base64.b64decode(event["data"]))

                elif kind == "tool.call":
                    # Accumulate — flush on reply.done, not now.
                    result = dispatch_tool(event["name"], event.get("arguments", {}))
                    pending_tools.append({"call_id": event["call_id"], "result": result})

                elif kind == "reply.done":
                    if event.get("status") == "interrupted":
                        pending_tools.clear()
                        speaker.flush_and_restart()
                    elif pending_tools:
                        for tool in pending_tools:
                            value = tool["result"]
                            if not isinstance(value, str):
                                value = json.dumps(value)
                            await ws.send(json.dumps({
                                "type": "tool.result",
                                "call_id": tool["call_id"],
                                "result": value,
                            }))
                        pending_tools.clear()

                elif kind == "transcript.user":
                    print(f"You:   {event['text']}")

                elif kind == "transcript.agent":
                    print(f"Agent: {event['text']}")

        await asyncio.gather(send_audio(), receive())

if __name__ == "__main__":
    asyncio.run(run_agent())

That's the entire voice agent loop. The Voice Agent API handles every layer of the pipeline (STT, LLM, TTS, turn detection) inside the WebSocket. Your job is to feed it audio, play what comes back, and dispatch tool calls.

Two more easy-to-miss details:

Tool result timing. Per the tool calling docs, accumulate tool results when tool.call fires and send them inside the reply.done handler — not immediately. The agent generates a short transition phrase ("let me check on that") while the tools run; sending results too early can cause timing issues.
Interruption handling. When the user barges in, the server sends reply.done with status: "interrupted". Drop any queued tool results and flush the speaker so the caller doesn't keep hearing the previous reply.
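
The accumulate-then-flush rule is easier to test when it lives in its own small class instead of inline in the event loop. The sketch below assumes the same event and field names as the receive() loop above; ToolResultBuffer is a hypothetical name.

```python
import json


class ToolResultBuffer:
    """Holds tool results until reply.done says whether to send or drop them."""

    def __init__(self):
        self._pending = []

    def on_tool_call(self, call_id, result):
        # Results are sent as strings, so serialize anything else as JSON.
        if not isinstance(result, str):
            result = json.dumps(result)
        self._pending.append(
            {"type": "tool.result", "call_id": call_id, "result": result}
        )

    def on_reply_done(self, status=None):
        """Return the messages to send now; the buffer is always emptied."""
        messages = [] if status == "interrupted" else self._pending
        self._pending = []
        return messages
```

In the receive() loop you would call on_tool_call when tool.call arrives and send each message returned by on_reply_done when reply.done arrives.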

Step 4: Implement the tools

The dispatch_tool function is where your agent does real work. The Voice Agent API delivers tool.call events with arguments already parsed as a Python dict — no json.loads() needed.

# tools.py
TOOLS = [
    {
        "type": "function",
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
    {
        "type": "function",
        "name": "remember",
        "description": "Save something the user wants you to remember.",
        "parameters": {
            "type": "object",
            "properties": {"fact": {"type": "string"}},
            "required": ["fact"],
        },
    },
]

_memory = []

def dispatch_tool(name, args):
    if name == "get_weather":
        # In production: call a real weather API.
        return f"It's 68°F and partly cloudy in {args['city']}."
    if name == "remember":
        _memory.append(args["fact"])
        return f"Got it. I'll remember: {args['fact']}"
    return f"Unknown tool: {name}"

The "type": "function" field on each tool is required. Forget it and the API will reject the session.update with a validation error.

In production, replace the stubs with calls to a real weather API, your CRM, a database, or whatever your application actually does. The tool dispatcher is pure Python — anything you can do from a Python function, the voice agent can do.

Step 5: Run it

python agent.py

The agent greets you. Try:

"What's the weather in San Francisco?"
"Remember that my passport expires in March."
"What did I just ask you to remember?"

The full flow: your speech → STT → LLM (with tools available) → tool call (if applicable) → tool result → LLM continues → TTS → speaker. All in under a second, on one WebSocket.

Latency: getting under 500ms perceived

A natural-feeling voice agent responds in under 800ms from when you stop talking to when you hear the reply. Best-in-class teams target sub-500ms. Where your milliseconds go on the Voice Agent API:

Stage	Typical latency
Mic chunk → server	~50–100ms
End-of-turn detection	~100–200ms
LLM first-token	~200–400ms
TTS first-byte → speaker	~100–250ms
Perceived total	~450–950ms
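
The perceived total in the table is just the sum of the per-stage ranges. A quick sketch for re-running the arithmetic with your own measurements (the stage figures are copied from the table; the function name is illustrative):

```python
# (best case, worst case) in milliseconds, taken from the table above
STAGES_MS = {
    "mic chunk -> server": (50, 100),
    "end-of-turn detection": (100, 200),
    "LLM first token": (200, 400),
    "TTS first byte -> speaker": (100, 250),
}


def latency_budget(stages):
    """Sum the best-case and worst-case latency across all stages."""
    best = sum(low for low, _ in stages.values())
    worst = sum(high for _, high in stages.values())
    return best, worst


print(latency_budget(STAGES_MS))
```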

The Voice Agent API streams audio output as it's generated, so the user hears the first word of the reply while the rest is still being synthesized. The biggest latency wins on the Python side:

Don't buffer mic audio. Send 50ms chunks as they arrive — that's what the audio.py example does.
Don't block in the tool dispatcher. If a tool call takes more than 500ms, the silence becomes audible. Cache hot data, set aggressive timeouts, and consider returning a placeholder ("Let me check on that") while the real call resolves.
Use the streaming audio output. Play reply.audio chunks as they arrive; never wait for the full response.
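
One way to keep a slow tool from stalling the conversation is to run it in a worker thread with a deadline. This is a minimal sketch, assuming tools are plain synchronous functions that take an args dict; run_tool_with_timeout and the fallback text are hypothetical.

```python
import asyncio


async def run_tool_with_timeout(func, args, timeout_s=0.5,
                                fallback="Sorry, that lookup is taking too long."):
    """Run a blocking tool off the event loop; return fallback on timeout."""
    try:
        return await asyncio.wait_for(asyncio.to_thread(func, args), timeout=timeout_s)
    except asyncio.TimeoutError:
        # The worker thread is not killed; it finishes in the background
        # and its result is discarded.
        return fallback
```

Because the thread keeps running after a timeout, the tool itself should also set a timeout on any network call it makes.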

Handling interruptions

Real conversations include interruptions. The user changes their mind, asks a follow-up while the agent is still talking, or says "wait, no, the other one." The Voice Agent API handles this server-side: barge-in is semantic — back-channels like "uh-huh" don't trigger an interruption, but "wait, stop" does.

When the user actually interrupts, the server sends reply.done with status: "interrupted" (and transcript.agent with interrupted: true and the trimmed text). Your client should flush any queued speaker audio and drop any pending tool results, exactly as shown in the receive() loop above.
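
A cheap way to flush speaker audio is to keep decoded reply.audio chunks in a queue that a playback thread drains, so a barge-in only has to empty the queue. The class below is a sketch with a hypothetical name; it is not part of PyAudio.

```python
import queue


class PlaybackQueue:
    """Thread-safe buffer between the WebSocket reader and the speaker."""

    def __init__(self):
        self._chunks = queue.Queue()

    def put(self, audio_bytes):
        self._chunks.put(audio_bytes)

    def get(self, timeout=0.1):
        """Next chunk for the playback thread, or None if nothing is queued."""
        try:
            return self._chunks.get(timeout=timeout)
        except queue.Empty:
            return None

    def flush(self):
        """Drop everything still queued; returns the number of chunks dropped."""
        dropped = 0
        while True:
            try:
                self._chunks.get_nowait()
            except queue.Empty:
                return dropped
            dropped += 1
```

A playback thread would loop on get() and write each chunk to the output stream, while the receive() loop calls flush() when it sees an interrupted reply.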

Going to production

The agent above runs against your local microphone. To deploy it, swap the audio transport:

Phone calls (PSTN) — Bridge through Twilio Media Streams. The Voice Agent API supports audio/pcmu (G.711 μ-law at 8 kHz) natively, so phone audio stays in μ-law end-to-end with no resampling. See our LiveKit voice agent guide if you'd rather use an orchestrator.
Web apps — Capture audio in the browser with AudioWorklet, then stream it to the Voice Agent API. See Browser integration for the temporary-token flow that keeps your API key off the client.
Mobile — Same pattern. The native audio capture APIs (iOS AVAudioEngine, Android AudioRecord) emit PCM you can forward through your server.

For all production deployments, add:

Session persistence (save the session_id from session.ready and use session.resume to reconnect within 30 seconds without losing context)
Per-session structured logs (user transcript, agent transcript, tool calls, tool results)
PII redaction on transcripts before they hit your warehouse
A timeout-and-retry policy for tool calls so a slow backend doesn't kill the call
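
Structured logging and redaction can start out very small. The sketch below writes one JSON record per event and masks emails and phone-like digit runs first. The regular expressions are illustrative only and will miss many kinds of PII; treat them as a placeholder for a real redaction service.

```python
import json
import re
import time

# Illustrative patterns, not production-grade PII detection.
_EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
_PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text):
    """Mask email addresses and phone-like numbers in a transcript line."""
    text = _EMAIL.sub("[EMAIL]", text)
    return _PHONE.sub("[PHONE]", text)


def log_record(session_id, kind, text, now=None):
    """Build one JSON log line for a transcript or tool event."""
    return json.dumps({
        "ts": time.time() if now is None else now,
        "session_id": session_id,
        "kind": kind,
        "text": redact(text),
    })
```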

The complete repository

Fork the runnable Python repo at github.com/kelsey-aai/python-voice-agent-api. It includes mic capture, speaker playback, the WebSocket loop, the tool dispatcher, and example tools you can swap for your own. Around 200 lines of Python end-to-end.

Frequently asked questions

How do I build a real-time voice agent in Python?

The fastest way to build a real-time voice agent in Python in 2026 is to open a WebSocket to the AssemblyAI Voice Agent API at wss://agents.assemblyai.com/v1/ws, stream microphone audio in as input.audio events, and play the reply.audio events you get back. The Voice Agent API handles streaming speech-to-text, the LLM, text-to-speech, turn detection, and tool calling on a single connection at $4.50/hour, so you don't need to wire up three separate providers. With PyAudio for microphone access and the websockets library, the entire agent, including audio helpers and example tools, fits in around 200 lines of Python.

What's the difference between the Voice Agent API and chaining STT-LLM-TTS in Python?

The chained approach uses three providers: a streaming STT API like AssemblyAI Universal-3 Pro Streaming, an LLM like GPT-4o, and a streaming TTS like ElevenLabs. You write the WebSocket bridge, turn detection logic, and retry handling yourself. The Voice Agent API replaces all of that with a single WebSocket — one provider, one bill, one set of logs. Chained pipelines give you finer control over each layer; the Voice Agent API is faster to ship and easier to operate at scale.

How do I add tool calling to a Python voice agent?

Define tools as JSON schemas in the tools field of your session.update message — each tool needs "type": "function", a name, a description, and a parameter schema. When the LLM decides to call a tool, the Voice Agent API emits a tool.call event on the WebSocket with the tool name, arguments (as a Python dict), and a call_id. Your Python dispatcher runs the actual function, then you send back a tool.result event with that call_id and the result. Send tool results inside your reply.done handler, not immediately on tool.call — the agent speaks a transition phrase while the tools run.

How low can latency go on a Python voice agent?

A well-tuned Python voice agent on the Voice Agent API typically lands at 450–950ms perceived latency from end-of-turn to first audio out. The biggest wins are: (1) keep mic chunks small (~50ms) so end-of-turn detection fires fast, (2) don't block in your tool dispatcher — cache and timeout aggressively, and (3) play reply.audio chunks as they arrive instead of buffering. Universal-3 Pro Streaming alone hits 307ms P50 for transcription, which is the floor for the STT layer.

Can I use a different LLM with the Voice Agent API?

The Voice Agent API ships with frontier-quality LLMs under the hood, selected for low-latency conversational performance. If you specifically need a model that isn't available through the Voice Agent API, you can fall back to a chained architecture where you use AssemblyAI Universal-3 Pro Streaming for the STT layer and bring your own LLM and TTS. Most teams find the Voice Agent API model selection meets their needs and prefer the simpler architecture.

How do I handle interruptions in a Python voice agent?

The Voice Agent API detects barge-in semantically: back-channels like "uh-huh" don't interrupt, but "wait, stop" does. When the user actually interrupts, the server emits reply.done with status: "interrupted" and transcript.agent with interrupted: true. Your Python client should flush the speaker buffer (close and reopen the PyAudio output stream, or call abort() on a sounddevice output stream), drop any pending tool results, and continue listening for the user's new turn. This is what makes interruptions feel natural — the agent stops talking immediately instead of waiting for the previous reply to finish.

How to create an AI cold-calling agent with the Voice Agent API

Mart Schweiger — Tue, 19 May 2026 19:26:28 +0000

A well-deployed AI cold-calling agent runs 500 lead-qualification calls in parallel for the cost of a single SDR. Deployed poorly, it sounds like a robocall and gets hung up on in five seconds. The difference between the two isn't the LLM or the TTS — it's the speech accuracy on phone audio, the turn-taking that decides whether the agent interrupts a hesitant prospect, and the compliance layer that keeps you out of TCPA trouble.

This tutorial walks through building an AI cold-calling agent on the AssemblyAI Voice Agent API for the conversation layer, with Twilio for outbound dialing. The Voice Agent API gives you one WebSocket for STT, LLM, TTS, turn detection, and tool calling — you don't wire three providers together. You write the outbound dialer, the compliance gate, and the function dispatcher. The companion repository is linked at the end.

If you're looking for the chained STT + LLM + TTS architecture instead, our original AI cold-calling agent guide covers that path with Universal-3 Pro Streaming directly.

What an AI cold-calling agent does

An AI cold-calling agent is an outbound voice AI system that dials a prospect, delivers a pitch in natural conversation, adapts in real time based on what the prospect says, and books qualified meetings or gathers disposition data. Unlike a robocall (one-way recorded message) or a power dialer with a human rep, it conducts a two-way conversation autonomously.

The use cases where AI cold-calling agents work well today share three traits — high volume, structured pitch, and concrete success criteria (see our outbound calls walkthrough for the simpler "agent dials a single number" pattern):

Outbound SDR prospecting: open with a relevant hook, qualify BANT, book a demo
Appointment setting for field sales, financial advisors, home services
Re-engagement of lapsed leads in a CRM
Survey and research calls at scale
Event follow-up and RSVP confirmation
Renewal and upsell motions for existing customers

The common thread: one script, thousands of conversations, a measurable booking rate or disposition. That's where the Voice Agent API's combination of speech accuracy, tool calling, and flat-rate pricing pays for itself.

Architecture

  CRM / lead list (Salesforce, HubSpot, CSV)
       │
       ▼
  dialer.py
       │  compliance_gate()  ← TCPA, DNC, state laws, time windows
       ▼
  Twilio outbound dial
       │  TwiML → open Media Stream
       ▼
  bridge_server.py
       │  Twilio Media Stream ↔ Voice Agent API WebSocket
       ▼
  AssemblyAI Voice Agent API
   ┌──────────────────────────────────┐
   │  STT + Turn detection             │
   │      ↓                            │
   │  LLM with sales prompt + tools    │
   │      ↓                            │
   │  TTS                              │
   └──────────────────────────────────┘
       │
       │  tool calls
       ▼
  - book_meeting    (calendar API)
  - log_disposition (CRM update)
  - honor_dnc       (suppression list)
  - mark_callback   (scheduling)

The Voice Agent API handles the conversation. Your code handles three things outside the conversation: the dialer (who to call, when, at what concurrency), the compliance gate (TCPA, DNC, state consent), and the tool dispatcher (book a meeting, update the CRM, honor a do-not-call request).

Why use the Voice Agent API for cold-calling

Three things make the Voice Agent API a strong fit for outbound voice agents:

Speech accuracy on phone audio. Cold calls capture emails, phone numbers, company names, and job titles — "five one five, nine eight two, four zero zero zero," "J at acme dot io," "director of rev ops." Universal-3 Pro Streaming (the STT layer under the Voice Agent API) delivers 21% fewer alphanumeric errors and 28% better accuracy on consecutive numbers than the previous generation. That's the difference between a booked meeting in your calendar and a typo you never catch.
Tool calling that maps to the booking moment. When a prospect says "yes, Tuesday at 2pm works," the agent has to fire book_meeting immediately — not in the next turn. The Voice Agent API's tool calling reliably produces structured output, which matters when one missed booking is the whole point of the call.

Flat $4.50/hour pricing. Outbound is bursty by nature. You don't want per-token surprises when the dialer fires 500 simultaneous calls. The Voice Agent API's flat hourly rate covers STT, LLM, TTS, and tool calls all-in.

Before you start

You'll need:

An AssemblyAI account with Voice Agent API access
A Twilio account with an outbound-capable phone number (and a verified caller ID if your trial requires it)
A list of leads with consent to be contacted (CSV is fine for testing — production should integrate your real CRM)
Python 3.11+

Install:

pip install fastapi uvicorn "websockets>=14" python-dotenv twilio

Step 1: Build the compliance gate first

Compliance is where AI cold-calling teams burn the most money — TCPA fines run $500–$1,500 per violating call. Build the gate before you write a line of dialer code.

# compliance.py
from datetime import datetime
from zoneinfo import ZoneInfo

DNC_LIST = set(open("suppression.txt").read().split())  # internal DNC

def compliance_gate(lead):
    # 1. Internal suppression (previous DNC requests, unsubscribes)
    if lead["phone"] in DNC_LIST:
        return False, "internal DNC"

    # 2. Federal DNC registry — integrate a real provider in production
    if on_federal_dnc(lead["phone"]):
        return False, "federal DNC"

    # 3. Time window — TCPA bans calls before 8am or after 9pm local.
    #    Fail closed when the lead has no timezone rather than guessing one.
    if not lead.get("timezone"):
        return False, "unknown timezone"
    local_hour = datetime.now(ZoneInfo(lead["timezone"])).hour
    if local_hour < 8 or local_hour >= 21:
        return False, f"outside TCPA window ({local_hour}:00 local)"

    # 4. State consent — these states require two-party (all-party) consent to record
    if lead.get("state") in {"CA", "FL", "PA", "WA", "IL", "MD", "MT", "NH"}:
        # Agent must disclose recording at the top of the call.
        lead["needs_recording_disclosure"] = True

    return True, "ok"

Build this as a hard gate. No call goes out if any check fails.
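
The time-window check is the part of the gate most worth unit testing, and that is easier when the current time is passed in rather than read from the clock. A sketch, with within_calling_window as a hypothetical helper:

```python
from datetime import datetime, time, timedelta, timezone


def within_calling_window(now_utc, lead_tz, start=time(8, 0), end=time(21, 0)):
    """True if the lead's local time falls inside [start, end)."""
    if now_utc.tzinfo is None:
        raise ValueError("now_utc must be timezone-aware")
    local = now_utc.astimezone(lead_tz).time()
    return start <= local < end


# Fixed UTC-8 offset for illustration; pass ZoneInfo(lead["timezone"]) in real use.
pacific = timezone(timedelta(hours=-8))
print(within_calling_window(datetime(2026, 1, 15, 15, 30, tzinfo=timezone.utc), pacific))
```

Passing a ZoneInfo object as lead_tz keeps daylight saving time correct, which a fixed offset does not.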

Step 2: Define the agent's tools

Four tools the agent can call mid-conversation. In production, replace the stubs with real CRM, calendar, and DNC API calls. Each tool needs "type": "function" at the top level — the Voice Agent API validates this on session.update.

# tools.py
TOOLS = [
    {
        "type": "function",
        "name": "book_meeting",
        "description": "Book a meeting on the rep's calendar.",
        "parameters": {
            "type": "object",
            "properties": {
                "lead_id": {"type": "string"},
                "preferred_time": {"type": "string"},
                "email": {"type": "string"},
            },
            "required": ["lead_id", "preferred_time", "email"],
        },
    },
    {
        "type": "function",
        "name": "log_disposition",
        "description": "Record the call outcome in the CRM.",
        "parameters": {
            "type": "object",
            "properties": {
                "lead_id": {"type": "string"},
                "disposition": {
                    "type": "string",
                    "enum": ["booked", "not_now", "not_interested",
                             "wrong_person", "left_voicemail", "dnc"],
                },
                "notes": {"type": "string"},
            },
            "required": ["lead_id", "disposition"],
        },
    },
    {
        "type": "function",
        "name": "honor_dnc",
        "description": "Add the prospect to the do-not-call list immediately.",
        "parameters": {
            "type": "object",
            "properties": {"lead_id": {"type": "string"}, "phone": {"type": "string"}},
            "required": ["lead_id", "phone"],
        },
    },
    {
        "type": "function",
        "name": "mark_callback",
        "description": "Schedule a callback at the prospect's preferred time.",
        "parameters": {
            "type": "object",
            "properties": {
                "lead_id": {"type": "string"},
                "preferred_time": {"type": "string"},
            },
            "required": ["lead_id", "preferred_time"],
        },
    },
]

The honor_dnc tool is the most important one. If the prospect says anything that sounds like a do-not-call request — "take me off your list," "don't call me again," "remove me" — the agent must call this tool immediately, acknowledge, and end the call politely. No upselling, no "can I just ask one question." TCPA violations on DNC requests are the most expensive mistake a cold-calling agent can make.
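
The bridge server later imports a dispatch_tool function from tools.py. A minimal sketch of what it could look like, using in-memory stores as stand-ins for a real CRM, calendar, and suppression list (all of the store names here are hypothetical):

```python
# In-memory stand-ins; replace with real CRM, calendar, and DNC integrations.
DNC_NUMBERS = set()
DISPOSITIONS = []
MEETINGS = []
CALLBACKS = []


def dispatch_tool(name, args):
    """Route a tool.call to its handler and return a short result string."""
    if name == "honor_dnc":
        # Suppress the number first, then record why the call ended.
        DNC_NUMBERS.add(args["phone"])
        DISPOSITIONS.append({"lead_id": args["lead_id"], "disposition": "dnc"})
        return "Added to the do-not-call list."
    if name == "book_meeting":
        MEETINGS.append(dict(args))
        return f"Meeting booked for {args['preferred_time']}."
    if name == "mark_callback":
        CALLBACKS.append(dict(args))
        return f"Callback scheduled for {args['preferred_time']}."
    if name == "log_disposition":
        DISPOSITIONS.append(dict(args))
        return "Disposition logged."
    return f"Unknown tool: {name}"
```

In production the honor_dnc branch should write to the same suppression source the compliance gate reads, so the number is blocked before the next dial.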

Step 3: Write the system prompt

The system prompt is where the script lives. Four sections every cold-calling prompt needs:

# prompts.py
SYSTEM_PROMPT = """You are an AI sales development representative for Datafold.
You are calling {prospect_name}, {prospect_title} at {prospect_company}.

DISCLOSURE (required):
- Open every call by stating: "Hi {first_name}, this is an AI assistant calling
  on behalf of Datafold."
- This is non-negotiable and legally required in CA, FL, TX, and several other states.

OPENER (15 seconds):
- "I'm reaching out because we help data teams catch breaking changes before
  they hit production. Do you have 30 seconds for me to explain why I'm calling?"
- If yes, continue. If no, ask when's better and call mark_callback.

DISCOVERY (ask only 2 questions, max):
1. "How is your team handling data quality today — manual review, dbt tests,
   or something else?"
2. "How often does a broken model make it to production?"

PITCH (one sentence):
- "Datafold gives data teams CI for their pipelines. Customers like Patreon
  and Faire catch 90% of regressions before they ship."

CTA:
- Offer two specific times in the prospect's time zone.
- Call book_meeting with their email when they accept.

OBJECTION MAP:
- "How did you get my number?" → "You opted in on our website last month."
- "Send me an email" → "Happy to. What's the best address?" (call mark_callback)
- "Not the right person" → "Who handles data quality on your team?"
- "We already use [X]" → "Got it. Most of our customers use [X] alongside Datafold."
- "Not interested" → "No problem. Mind if I ask why?" (then call log_disposition)

DNC HANDLING (highest priority):
- If the prospect says ANYTHING like "take me off your list," "don't call me
  again," "remove me," "stop calling": call honor_dnc IMMEDIATELY, say "Of
  course, you're removed from our list. Sorry to bother you. Have a good day,"
  and end the call. Do NOT try to recover the conversation.

STYLE:
- One or two sentences per turn. Conversational, not formal.
- Listen for tone. If they sound annoyed, wrap up gracefully.
- Never claim to be human. If asked, confirm you're AI.
"""

That prompt is the entire sales playbook. The Voice Agent API will follow it turn by turn, calling tools when the conversation hits the right moments.
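
The prompt contains placeholders such as {prospect_name} and {first_name}, so a lead record with a missing field would fail when the template is formatted. A small sketch that checks the fields up front (render_prompt and required_fields are hypothetical helpers):

```python
from string import Formatter


def required_fields(template):
    """Names of all {placeholders} used in the template."""
    return {name for _, name, _, _ in Formatter().parse(template) if name}


def render_prompt(template, lead):
    """Fill the template, raising a clear error if the lead lacks a field."""
    missing = sorted(required_fields(template) - set(lead))
    if missing:
        raise KeyError(f"lead is missing prompt fields: {missing}")
    return template.format(**lead)
```

Running this check in the dialer, before a call is placed, means a bad lead record is skipped rather than failing mid-call.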

Step 4: Wire up the dialer

The dialer pulls leads from your list, runs each through the compliance gate, and places Twilio calls. It controls concurrency and respects time-of-day rules.

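# dialer.py
import asyncio
import csv
import os
from twilio.rest import Client

from compliance import compliance_gate

twilio = Client(os.environ["TWILIO_SID"], os.environ["TWILIO_TOKEN"])

def log_disposition(lead_id, disposition, notes=""):
    # Placeholder: replace with a real CRM write.
    print(f"[{lead_id}] {disposition}: {notes}")

async def dial_lead(lead, callback_url):
    ok, reason = compliance_gate(lead)
    if not ok:
        log_disposition(lead["lead_id"], "skipped", notes=reason)
        return

    # The Twilio client is synchronous; run it in a thread so it
    # doesn't block the event loop.
    call = await asyncio.to_thread(
        twilio.calls.create,
        to=lead["phone"],
        from_=os.environ["TWILIO_FROM"],
        url=f"{callback_url}/twilio/voice?lead_id={lead['lead_id']}",
        machine_detection="Enable",  # Twilio reports AnsweredBy to the webhook
        record=True,                  # Required for compliance/QA
    )
    print(f"Dialing {lead['lead_id']}: {call.sid}")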

async def run_dialer(leads_csv, max_concurrent=10):
    sem = asyncio.Semaphore(max_concurrent)
    with open(leads_csv) as f:
        leads = list(csv.DictReader(f))

    async def with_limit(lead):
        async with sem:
            await dial_lead(lead, os.environ["PUBLIC_URL"])
            await asyncio.sleep(2)  # pace
    await asyncio.gather(*(with_limit(l) for l in leads))

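The machine_detection="Enable" flag turns on Twilio's answering machine detection. Twilio doesn't hang up on its own: it reports the result as an AnsweredBy parameter on the request to your /twilio/voice webhook, and your handler should return a Hangup response for machine answers rather than wasting a Voice Agent API session on a robot. Important: never leave a recorded message — that's a TCPA violation in most contexts.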
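
Twilio's answering machine detection reports its verdict as an AnsweredBy field on the webhook request, so the hang-up decision belongs in your handler. The sketch below shows the decision as a pure function. twiml_for_answer is a hypothetical helper, and treating "unknown" as a human is a judgment call you may want to reverse.

```python
from xml.sax.saxutils import quoteattr

# AnsweredBy values that indicate a machine rather than a person.
MACHINE_VALUES = {
    "machine_start", "machine_end_beep", "machine_end_silence",
    "machine_end_other", "fax",
}


def twiml_for_answer(answered_by, stream_url):
    """Return TwiML that hangs up on machines and streams audio otherwise."""
    header = '<?xml version="1.0" encoding="UTF-8"?>'
    if answered_by in MACHINE_VALUES:
        return header + "<Response><Hangup/></Response>"
    # quoteattr escapes the URL and adds the surrounding quotes.
    return (
        header
        + f"<Response><Connect><Stream url={quoteattr(stream_url)} /></Connect></Response>"
    )
```

In a FastAPI handler, AnsweredBy arrives as a form field on the POST body, so reading it requires form parsing to be available in your app.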

Step 5: Bridge Twilio Media Streams to the Voice Agent API

The bridge server is what connects Twilio's outbound call audio to the Voice Agent API WebSocket. Twilio sends G.711 μ-law at 8 kHz; the Voice Agent API accepts it natively when you set the encoding to audio/pcmu.

A few details that are easy to get wrong on this endpoint specifically:

The auth header is Authorization: Bearer YOUR_KEY — note the Bearer prefix. This is unique to the Voice Agent API; the rest of AssemblyAI accepts the raw key.
The first WebSocket message is a session.update event with all config nested under a session object. There is no session.start.
The agent's voice is a named voice from the Voice Agent API catalog (ivy, james, sophie, etc.) — not an ElevenLabs voice ID.
The telephony audio encoding is audio/pcmu (G.711 μ-law). Sample rate is implicit (8 kHz). Don't pass pcm_mulaw or a sample_rate field — the API ignores them.

You must wait for session.ready before sending any input.audio frames.

# bridge_server.py
import asyncio, json, os
import websockets
from fastapi import FastAPI, Query, Request, WebSocket
from fastapi.responses import Response

from prompts import SYSTEM_PROMPT
from tools import TOOLS, dispatch_tool

VOICE_AGENT_WS = "wss://agents.assemblyai.com/v1/ws"
ASSEMBLYAI_KEY = os.environ["ASSEMBLYAI_API_KEY"]

app = FastAPI()

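@app.post("/twilio/voice")
async def twilio_voice(request: Request, lead_id: str = Query(...)):
    host = request.url.hostname
    # Twilio's <Stream> url does not support query strings, so pass lead_id
    # as a custom <Parameter>; it arrives in the Media Stream "start" event.
    twiml = f"""<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Connect>
    <Stream url="wss://{host}/media-stream">
      <Parameter name="lead_id" value="{lead_id}" />
    </Stream>
  </Connect>
</Response>"""
    return Response(content=twiml, media_type="application/xml")

@app.websocket("/media-stream")
async def media_stream(twilio_ws: WebSocket):
    await twilio_ws.accept()
    stream_sid = {"value": None}

    # Wait for Twilio's "start" event to learn the stream SID and lead_id.
    lead_id = None
    while lead_id is None:
        event = json.loads(await twilio_ws.receive_text())
        if event.get("event") == "start":
            stream_sid["value"] = event["start"]["streamSid"]
            lead_id = event["start"]["customParameters"]["lead_id"]
    # LEAD_CACHE: assumed lead_id -> lead dict mapping filled by your dialer.
    lead = LEAD_CACHE[lead_id]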

    session_config = {
        "type": "session.update",
        "session": {
            "system_prompt": SYSTEM_PROMPT.format(**lead),
            "tools": TOOLS,
            "input": {"format": {"encoding": "audio/pcmu"}},
            "output": {
                "voice": "ivy",
                "format": {"encoding": "audio/pcmu"},
            },
        },
    }

    async with websockets.connect(
        VOICE_AGENT_WS,
        additional_headers={"Authorization": f"Bearer {ASSEMBLYAI_KEY}"},
    ) as va_ws:
        await va_ws.send(json.dumps(session_config))

        ready = asyncio.Event()
        pending_tools = []

        async def pump_twilio_to_va():
            async for raw in twilio_ws.iter_text():
                event = json.loads(raw)
                kind = event.get("event")
                if kind == "start":
                    stream_sid["value"] = event["start"]["streamSid"]
                elif kind == "media":
                    if not ready.is_set():
                        continue
                    # Twilio sends base64 mulaw; AAI accepts it directly.
                    await va_ws.send(json.dumps({
                        "type": "input.audio",
                        "audio": event["media"]["payload"],
                    }))
                elif kind == "stop":
                    return

        async def pump_va_to_twilio():
            async for raw in va_ws:
                event = json.loads(raw)
                t = event.get("type")

                if t == "session.ready":
                    ready.set()

                elif t == "reply.audio" and stream_sid["value"]:
                    await twilio_ws.send_text(json.dumps({
                        "event": "media",
                        "streamSid": stream_sid["value"],
                        "media": {"payload": event["data"]},
                    }))

                elif t == "tool.call":
                    result = dispatch_tool(event["name"], event.get("arguments", {}))
                    pending_tools.append({"call_id": event["call_id"], "result": result})

                elif t == "reply.done":
                    if event.get("status") == "interrupted":
                        pending_tools.clear()
                    else:
                        for tool in pending_tools:
                            value = tool["result"]
                            if not isinstance(value, str):
                                value = json.dumps(value)
                            await va_ws.send(json.dumps({
                                "type": "tool.result",
                                "call_id": tool["call_id"],
                                "result": value,
                            }))
                        pending_tools.clear()

                elif t == "transcript.user":
                    print(f"[{lead_id}] User: {event['text']}")
                elif t == "transcript.agent":
                    print(f"[{lead_id}] Agent: {event['text']}")

        await asyncio.gather(pump_twilio_to_va(), pump_va_to_twilio())

Two subtleties worth understanding:

Tool result timing. Per the tool calling docs, accumulate tool results when tool.call fires and send them inside reply.done — not immediately. The agent speaks a transition phrase ("let me check") while the tools run; sending too early causes timing issues.
Audio pass-through. Twilio's media.payload and AssemblyAI's input.audio.audio (and reply.audio.data) are all base64-encoded μ-law strings, so the bridge moves bytes through without any decode/re-encode step.
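
The message shapes on both sides of the bridge are small enough to write as pure functions, which makes the pass-through easy to unit test. This sketch also includes Twilio's clear message, which empties audio Twilio has already buffered. Sending it when a reply is interrupted is a suggestion here, not something the bridge above does. The function names are illustrative.

```python
import json


def agent_input_audio(twilio_event):
    """Twilio media event -> Voice Agent API input.audio (payload untouched)."""
    return json.dumps({
        "type": "input.audio",
        "audio": twilio_event["media"]["payload"],
    })


def twilio_media(stream_sid, payload_b64):
    """Voice Agent API reply.audio data -> Twilio media message."""
    return json.dumps({
        "event": "media",
        "streamSid": stream_sid,
        "media": {"payload": payload_b64},
    })


def twilio_clear(stream_sid):
    """Ask Twilio to drop any audio it has buffered but not yet played."""
    return json.dumps({"event": "clear", "streamSid": stream_sid})
```

With these in place, the interrupted branch of the reply.done handler could send twilio_clear(stream_sid["value"]) so the prospect stops hearing the old reply right away.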

Compliance: the part most teams underweight

Three things separate a working AI cold-calling agent from a $50,000 TCPA settlement:

Scrub against the federal DNC registry before every call. Integrate a real provider; the FTC's National Do Not Call Registry (donotcall.gov) offers paid subscription access for telemarketers.
Honor state DNC lists. Several states maintain their own — California, Pennsylvania, Indiana, Tennessee. Your scrub vendor should cover these.
Two-party consent disclosure. In CA, FL, PA, WA, and several other states, you must disclose at the top of the call that the call is being recorded and that the caller is AI. Your system prompt's DISCLOSURE section is doing this work — never remove it.

Build all three as hard gates. If any check fails, the call doesn't go out. Log every disposition with a timestamp so you can prove compliance during an audit.
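
A minimal sketch of the hard-gate idea, assuming your scrub vendor integration has already loaded the federal and state DNC lists into in-memory sets. `Lead`, `compliance_gate`, and the log fields are hypothetical names for illustration, not the companion repo's code or any vendor's API.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class Lead:
    phone: str
    state: str  # two-letter state code, e.g. "CA"


def compliance_gate(lead, federal_dnc, state_dnc, system_prompt, audit_log):
    """Return True only if every check passes; log the outcome either way."""
    checks = {
        # Federal registry scrub (assumed pre-loaded set of phone numbers).
        "federal_dnc": lead.phone not in federal_dnc,
        # State list scrub (assumed dict of state code -> set of numbers).
        "state_dnc": lead.phone not in state_dnc.get(lead.state, set()),
        # The prompt must still carry its DISCLOSURE section.
        "disclosure": "DISCLOSURE" in system_prompt,
    }
    failed = [name for name, ok in checks.items() if not ok]
    audit_log.append({
        "phone": lead.phone,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "disposition": "blocked" if failed else "cleared",
        "failed_checks": failed,
    })
    return not failed


audit_log = []
prompt = "DISCLOSURE: This call is recorded and I am an AI assistant."
ok = compliance_gate(Lead("+15550100", "CA"), set(), {}, prompt, audit_log)
```

The dialer would place the call only when the function returns True. Because the log entry is written before the return, blocked calls leave the same timestamped trail as cleared ones.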

Measuring success

Three numbers tell you whether your AI cold-calling agent is working (see our broader AI voice agents guide for context on conversion metrics across use cases):

Connection rate : percentage of calls that reach a live human. Healthy: 30–50% with a local-presence dialer.
Conversation rate : percentage of connected calls that last more than 30 seconds. Healthy: 25–40%.
Book rate : percentage of conversations that end in a booked meeting. Healthy: 5–15% for warm/intent leads, 1–3% for cold lists.
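
As a rough sketch, the three rates can be computed from a list of call records. The record fields (`connected`, `duration_s`, `booked`) are assumed names for illustration; map them to whatever your dialer or CRM actually logs.

```python
def funnel_rates(calls):
    """Compute connection, conversation, and book rates as percentages."""
    connected = [c for c in calls if c["connected"]]
    # A "conversation" is a connected call lasting more than 30 seconds.
    conversations = [c for c in connected if c["duration_s"] > 30]
    booked = [c for c in conversations if c["booked"]]

    def pct(part, whole):
        # Guard against division by zero on empty stages.
        return round(100 * part / whole, 1) if whole else 0.0

    return {
        "connection_rate": pct(len(connected), len(calls)),
        "conversation_rate": pct(len(conversations), len(connected)),
        "book_rate": pct(len(booked), len(conversations)),
    }


sample_calls = (
    [{"connected": False, "duration_s": 0, "booked": False}] * 6
    + [{"connected": True, "duration_s": 12, "booked": False}] * 2
    + [{"connected": True, "duration_s": 95, "booked": False}]
    + [{"connected": True, "duration_s": 140, "booked": True}]
)
rates = funnel_rates(sample_calls)
```

Each rate uses the previous stage as its denominator, which matches the definitions above: conversation rate is a share of connected calls, and book rate is a share of conversations.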

Read every transcript for the first 500 calls. You'll catch prompt failures, silently wrong transcriptions on company names, and tool-call timing issues that you'd never notice listening to the audio.

The complete repository

Fork the runnable repo at github.com/kelsey-aai/cold-calling-voice-agent-api. It includes the dialer, the compliance gate, the bridge server, the tool dispatcher, the system prompt, and a sample leads.csv. Around 400 lines of Python total.

Frequently asked questions

How do I create an AI cold-calling agent with the Voice Agent API?

To create an AI cold-calling agent with the AssemblyAI Voice Agent API, build four pieces: a dialer that pulls leads from your CRM and places outbound Twilio calls, a compliance gate that scrubs against DNC registries and TCPA time windows, a bridge server that connects Twilio Media Streams to the Voice Agent API WebSocket at wss://agents.assemblyai.com/v1/ws, and a tool dispatcher with book_meeting, log_disposition, honor_dnc, and mark_callback. Define a sales-specific system prompt with disclosure, opener, discovery, pitch, CTA, objection map, and DNC handling rules. The Voice Agent API handles the conversation — your code handles dialing, compliance, and integrations.

Is AI cold-calling legal?

AI cold-calling is legal in most U.S. jurisdictions if you comply with TCPA (federal), state-level consent laws, and disclose that the caller is AI. Specifically: scrub against the federal DNC registry before every call, respect TCPA calling windows (no calls before 8am or after 9pm in the recipient's local time), get two-party consent for recording in states that require it (CA, FL, PA, WA, and others), and disclose AI identity at the top of the call. The cost of getting this wrong is steep — $500–$1,500 per violating call. Build the compliance gate as a hard barrier and consult legal counsel before scaling.

How much does it cost to run an AI cold-calling agent?

On the AssemblyAI Voice Agent API, you pay $4.50/hour of session time — STT, LLM, TTS, turn detection, and tool calls included. Twilio outbound voice adds a few cents per minute. A typical 90-second qualification call costs roughly $0.12–$0.18 all-in. At the typical 30–50% connection rate, the cost per actual conversation is closer to $0.30. Compare against a human SDR at fully-loaded $70–100/hour and the unit economics generally favor the agent for high-volume top-of-funnel motions.
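
The arithmetic behind those figures can be sketched as follows. The $4.50/hour session rate comes from the answer above; the telephony rate is an assumed placeholder, so substitute your actual Twilio pricing. Dividing by the connection rate is a simplification that charges every dial as if it ran the full call length.

```python
SESSION_RATE_PER_HOUR = 4.50    # Voice Agent API session rate quoted above
TELEPHONY_RATE_PER_MIN = 0.014  # assumed placeholder; check your Twilio rate


def cost_per_call(duration_s):
    """Session cost plus telephony cost for one call of the given length."""
    session = SESSION_RATE_PER_HOUR * duration_s / 3600
    telephony = TELEPHONY_RATE_PER_MIN * duration_s / 60
    return session + telephony


def cost_per_conversation(duration_s, connection_rate):
    """Spread the cost of all dials over the calls that reach a person."""
    return cost_per_call(duration_s) / connection_rate


per_call = cost_per_call(90)                          # 90-second call
per_conversation = cost_per_conversation(90, 0.45)    # 45% connection rate
print(f"per call: ${per_call:.2f}, per conversation: ${per_conversation:.2f}")
```

Under these assumptions a 90-second call lands at about $0.13 and a conversation at about $0.30, consistent with the ranges quoted above.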

What speech-to-text accuracy do I need for cold-calling?

The accuracy that matters for cold-calling is alphanumeric accuracy on phone audio — capturing emails, phone numbers, company names, and job titles correctly the first time. Universal-3 Pro Streaming, which is the STT layer under the Voice Agent API, delivers 21% fewer alphanumeric errors and 28% better accuracy on consecutive numbers than the previous generation. That accuracy is the difference between booking a meeting in the rep's calendar (alex@acme.io) and a typo your CRM never catches (alec@akme.io).

Can the Voice Agent API place outbound calls directly?

Today, you use Twilio (or another telephony provider) for the outbound dial, and bridge the resulting Media Stream into the Voice Agent API WebSocket. The Voice Agent API handles the conversation; Twilio handles the PSTN connection and the audio transport. Native outbound dialing through the Voice Agent API is on the roadmap — the bridge pattern in this tutorial is the standard path today, and the code in the companion repo handles it cleanly in about 100 lines.

Multi-language voice agents: Building agents that speak to anyone

Mart Schweiger — Tue, 19 May 2026 19:25:44 +0000

Building multilingual voice agents requires coordinating four critical components—speech-to-text, language models, text-to-speech, and orchestration software—all working together within strict timing constraints to maintain natural conversation flow. The challenge isn't just connecting these pieces; each component must handle multiple languages, accents, and real-time language switching while keeping responses under one second.

This guide walks you through the technical architecture, performance requirements, and implementation considerations for production multilingual voice agents. You'll learn how to handle automatic language detection, manage code-switching scenarios where users mix languages mid-sentence, and build systems that maintain conversation context across language transitions—essential knowledge for creating voice experiences that truly work for global audiences.

What are the core components of a multilingual voice agent?

A multilingual voice agent is an AI system that listens to speech in multiple languages, understands what you're saying, and responds back in natural conversation. This means it can handle a customer service call where someone starts speaking Spanish, switches to English for technical terms, then back to Spanish—all in real-time.

You need four components working together: speech-to-text converts your voice to text, language models understand and generate responses, text-to-speech converts responses back to speech, and orchestration software coordinates everything within milliseconds.

The challenge isn't just connecting these pieces. Each component must handle multiple languages while keeping the conversation feeling natural and fast.

Speech-to-text for multilingual support

Speech-to-text (STT) is the foundation that converts spoken words into text that AI models can understand. This means turning "¿Puedes ayudarme?" into text that the system can process, regardless of accent or speaking speed.

You have two main processing options: streaming transcription, which processes speech as you speak, and batch processing, which waits for a complete utterance or recording before transcribing. Voice agents need streaming transcription because users expect a response almost as soon as they finish talking.

Here's what makes multilingual STT challenging:

Language detection: The system must identify which language you're speaking within seconds
Accent handling: Spanish from Mexico sounds different from Spanish from Argentina
Code-switching: When you mix languages mid-sentence like "Can you check mi cuenta"

If your speech-to-text gets "schedule appointment" wrong as "cancel appointment," even perfect AI models downstream can't fix that error.

Language models and multilingual reasoning

Language models take the transcribed text and figure out what you actually want, then generate appropriate responses. Large Language Models (LLMs) handle multiple languages through two approaches: translating everything to one language internally, or processing multiple languages directly.

Direct multilingual processing works better because it keeps cultural context intact. "How can I help you?" and "¿En qué puedo ayudarle?" aren't just translations—they carry different levels of formality that matter for customer experience.

Your language model also needs to remember context when you switch languages. If you start in Spanish, use English technical terms, then return to Spanish, the model must follow along without losing track of what you're trying to accomplish.

Text-to-speech synthesis across languages

Text-to-speech (TTS) turns the AI's written response back into natural speech. This isn't just pronunciation—it's matching the rhythm, emotion, and cultural tone appropriate for each language.

Modern TTS systems offer multiple voice options per language:

Demographics: Different ages, genders, and speaking styles
Regional accents: British vs American English, European vs Latin American Spanish
Tone matching: Professional for banking, casual for shopping, empathetic for support

Some languages create unique challenges. Mandarin uses pitch to change word meaning, while Arabic connects words in complex ways that affect pronunciation.

Real-time orchestration and coordination

Orchestration software acts like air traffic control for your voice agent. This means managing timing between components, handling interruptions when users start speaking again, and keeping conversation state—all while staying under one second response time.

Think of orchestration as the conductor making sure your voice agent doesn't talk over users, doesn't lose context, and recovers gracefully from errors.

Key responsibilities include:

Pipeline management: Moving data smoothly between STT, LLM, and TTS
Interruption handling: Stopping playback when users interrupt
State tracking: Remembering conversation history and language preferences
Error recovery: Handling network issues without breaking the conversation
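
Interruption handling is the responsibility that most often trips up first implementations. The sketch below shows one way to model it with `asyncio`: playback runs as a task that is cancelled when the user barges in. `play_reply` and `run_turn` are hypothetical helpers, and the sleep calls stand in for real audio playback and voice-activity detection.

```python
import asyncio


async def play_reply(chunks, played, chunk_seconds):
    """Pretend to play audio chunks, recording each one as it starts."""
    for chunk in chunks:
        played.append(chunk)
        await asyncio.sleep(chunk_seconds)  # stands in for playback time


async def run_turn(chunks, interrupt_after=None, chunk_seconds=0.05):
    """Play a reply; cancel it if the user speaks after `interrupt_after` s."""
    played = []
    playback = asyncio.create_task(play_reply(chunks, played, chunk_seconds))
    if interrupt_after is not None:
        await asyncio.sleep(interrupt_after)  # stands in for detecting speech
        playback.cancel()                     # stop talking over the user
    try:
        await playback
        interrupted = False
    except asyncio.CancelledError:
        interrupted = True
    return played, interrupted


played, interrupted = asyncio.run(
    run_turn(["Let me", "check", "that", "for", "you"], interrupt_after=0.12)
)
```

Tracking which chunks were actually played matters for state: the conversation history should record what the user heard, not the full reply the model generated.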

What are the performance requirements for multilingual voice agents?

Users expect voice agents to respond within one second of finishing their sentence. Anything longer makes conversations feel awkward and unnatural.

Here's where that crucial second gets spent:

Component	Time used	What happens
Speech-to-text	200–400ms	Converting your speech to text
LLM processing	100–300ms	Understanding and generating response
Text-to-speech	300–600ms	Converting response to speech
Network overhead	50–100ms	Data moving between systems
Total target	Under 1000ms	Must stay under one second

Multilingual support makes these targets harder to hit. Language detection adds time, some languages process slower than others, and translation (when needed) creates additional delays.
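
Summing the ranges in the table makes the constraint concrete: the best case totals 650ms and the worst case 1400ms, so not every component can sit at the top of its range. A small sketch of that check, using the table's numbers (the dictionary keys and function names are illustrative):

```python
# (low, high) latency in milliseconds, taken from the table above.
BUDGET_MS = {
    "speech_to_text": (200, 400),
    "llm": (100, 300),
    "text_to_speech": (300, 600),
    "network": (50, 100),
}
TARGET_MS = 1000


def budget_totals(budget):
    """Return (best_case, worst_case) totals across all components."""
    best = sum(low for low, _ in budget.values())
    worst = sum(high for _, high in budget.values())
    return best, worst


def headroom(measured_ms, target=TARGET_MS):
    """Milliseconds left under the target; negative means over budget."""
    return target - sum(measured_ms.values())


best, worst = budget_totals(BUDGET_MS)
remaining = headroom(
    {"speech_to_text": 300, "llm": 200, "text_to_speech": 400, "network": 75}
)
```

Running the same `headroom` check against measured per-component latencies for each supported language shows which languages leave no slack for detection or translation.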

Latency requirements for conversational quality

The one-second rule comes from natural human conversation patterns. People typically pause 200–500ms before responding, so a voice agent responding in 800ms feels natural while 1500ms creates awkward silence.

But perceived speed matters more than actual speed. If your agent starts responding quickly—even with "Let me check that for you"—users perceive faster service than an agent that stays silent for 800ms then gives a complete answer.

Streaming helps here. Instead of waiting for complete responses, you can start speaking as soon as the first few words are ready. This cuts perceived latency by 30–40% while keeping the same actual processing time.

Accuracy requirements across languages and accents

You need at least 90% word accuracy across all supported languages for reliable voice agents. The challenge? That 90% must work for English speakers from Boston, Spanish speakers from Mexico, and Mandarin speakers from Beijing—not just clear, neutral accents.

Errors compound through your pipeline. If speech-to-text achieves 85% accuracy and your language model correctly interprets 90% of that text, you're down to roughly 76% end-to-end accuracy (0.85 × 0.90 ≈ 0.765). At that level, about one request in four goes wrong somewhere along the way.

Critical accuracy areas include:

Names and addresses: Personal information must be captured exactly
Numbers: Account numbers, phone numbers, and dollar amounts can't have errors
Intent preservation: The core request must survive even if some words are wrong

High-quality speech-to-text models like AssemblyAI's Universal-2 model support 99 languages with industry-leading accuracy, creating a reliable foundation when errors can't be tolerated.

Key implementation considerations

Moving from prototype to production means solving practical challenges that don't show up in demos. These details often determine whether your voice agent delights users or frustrates them.

Language detection and real-time switching

Automatic language detection sounds straightforward—identify the language and proceed. Real conversations are messier. Users greet in one language then switch to another, use technical English terms while speaking Spanish, or have accents that confuse detection.

Most successful systems use a hybrid approach:

Initial detection: Identify language from the first 2–3 seconds of speech
Confidence scoring: Avoid false switches when detection isn't certain
Context clues: Use user profiles or phone number regions as hints
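
One way to combine these ideas is a small tracker that starts from a hint and only switches language after repeated high-confidence detections. This is a sketch under assumptions: the threshold, the two-hit rule, and the `LanguageTracker` class are illustrative choices, and the `(language, confidence)` pairs are assumed to come from whatever detection output your STT provider exposes.

```python
class LanguageTracker:
    """Hold the active language and resist low-confidence or one-off switches."""

    def __init__(self, initial, threshold=0.8, required_hits=2):
        self.current = initial          # e.g. from user profile or phone region
        self.threshold = threshold      # minimum confidence to count a detection
        self.required_hits = required_hits
        self._candidate = None
        self._hits = 0

    def update(self, detected, confidence):
        """Feed one detection result; return the language to use next."""
        if detected == self.current or confidence < self.threshold:
            # Same language, or not certain enough: drop any pending switch.
            self._candidate, self._hits = None, 0
            return self.current
        if detected == self._candidate:
            self._hits += 1
        else:
            self._candidate, self._hits = detected, 1
        if self._hits >= self.required_hits:
            self.current = detected
            self._candidate, self._hits = None, 0
        return self.current


tracker = LanguageTracker(initial="es")
tracker.update("en", 0.95)           # one confident English segment: stay in es
language = tracker.update("en", 0.90)  # second in a row: switch to en
```

Requiring consecutive detections trades a slightly slower switch for fewer false ones, which tends to matter when users drop a single English product name into a Spanish sentence.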

The trickiest scenario? Code-switching where users naturally mix languages mid-sentence. "Can you check mi cuenta, I think there's a problem" requires handling English and Spanish simultaneously without breaking conversation flow.

Testing multilingual voice agent accuracy

Testing multilingual voice agents requires systematic validation across language combinations, not just individual languages. A system perfect in English and Spanish separately might fail when users switch between them.

Start with single-language testing using native speakers with various accents and natural speaking styles. Record actual conversations, not scripted readings—natural speech includes hesitations, corrections, and informal phrases that scripts miss.

Then test language transitions:

Mixed conversations: Spanish speakers using English product names
Technical explanations: Users switching languages to explain complex issues
Cultural context: Different communication styles across cultures

Essential testing scenarios include accent variations across regions, background noise from realistic environments, different speaking speeds, and code-switching patterns common in your user base.
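
To put numbers on these tests, word error rate (WER) is a common starting metric: the word-level edit distance between a reference transcript and the model's output, divided by the reference length. A minimal sketch follows; it lowercases and splits on whitespace, which is an assumption that works for space-delimited languages but not for languages such as Mandarin without a tokenizer.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")
    # prev[j] = edit distance between the ref words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


wer = word_error_rate("schedule appointment", "cancel appointment")
```

Computing WER separately for each language, accent group, and code-switched test set keeps a strong English score from hiding a weak result elsewhere.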

Common use cases for multilingual voice agents

Multilingual voice agents excel where businesses need to serve diverse populations efficiently. Here are three high-impact applications you're likely to encounter.

Customer support automation

Customer support represents the biggest deployment of multilingual voice agents today. These systems handle routine requests—password resets, balance checks, order tracking—in dozens of languages without requiring multilingual human agents for every shift.

Success depends on seamless escalation to humans. When the voice agent can't resolve your issue, it must transfer you to a human agent while preserving conversation context and language preference. Nobody wants to repeat their problem in a different language.

Integration with existing systems matters here. The voice agent needs access to your account information and ability to update records in real-time. This means a Spanish-speaking customer can check order status, update delivery addresses, and receive confirmation without waiting for a Spanish-speaking human agent.

Voice assistants for global applications

Consumer apps use multilingual voice assistants to reach global markets. Think banking apps that let you check balances, transfer money, or report lost cards through voice commands in your preferred language.

These applications need cultural adaptation beyond translation. A voice assistant in Japan should understand indirect communication styles, while one in New York can be more direct. The same request gets phrased completely differently based on cultural expectations.

Privacy becomes critical with sensitive financial or personal information. Your voice agent must handle this data across different regulatory environments while maintaining consistent service quality.

Contact center automation

Enterprise contact centers deploy multilingual voice agents to handle peak call volumes and provide 24/7 coverage. Instead of staffing overnight shifts with multilingual agents, you deploy voice agents that handle routine calls in any supported language.

The business case is clear: one multilingual voice agent replaces dozens of language-specific phone menu systems while providing better service. Callers get natural conversation instead of pressing buttons through complex menus.

Compliance considerations vary by industry and caller location. Your voice agent must adapt its behavior for call recording requirements, data retention rules, and disclosure obligations based on applicable regulations.

Final words

Building reliable multilingual voice agents requires coordinating speech-to-text, language models, text-to-speech, and orchestration—all working within tight timing constraints that keep conversations natural. Your foundation starts with accurate speech recognition, because transcription errors cascade through every step, turning helpful interactions into frustrated customers.

The implementation challenges we've covered show why thoughtful architecture matters more than raw technology. With accurate transcription as your starting point, you can build voice agents that truly communicate with anyone, anywhere.

Frequently asked questions

What components do I need to build a multilingual voice agent?

You need four integrated components: speech-to-text for converting speech to text, language models for understanding and generating responses, text-to-speech for voice synthesis, and orchestration software to coordinate everything in real-time within one second.

How quickly do multilingual voice agents need to respond?

Target under 1000ms end-to-end latency for natural conversation flow. This includes 200–400ms for speech-to-text, 100–300ms for language model processing, and 300–600ms for text-to-speech synthesis.

Can voice agents detect language automatically during conversations?

Yes, modern speech-to-text models detect language within the first 2–3 seconds of speech and can handle language switches mid-conversation. The system maintains conversation context across language changes without requiring users to specify their language preference.

What speech accuracy do I need for multilingual voice agents?

Aim for at least 90% word accuracy across all supported languages and accents. Lower accuracy causes errors to compound through the pipeline, reducing end-to-end reliability below acceptable thresholds for production deployment.

How do I test multilingual voice agent performance before launch?

Test systematically with native speakers across regional accents, speaking speeds, and background noise conditions. Validate both single-language accuracy and language-switching scenarios, measuring word error rates, intent recognition, and task completion rates.

What infrastructure supports multilingual voice agents at scale?

You need streaming speech-to-text APIs, multilingual language model services, text-to-speech capabilities, and orchestration platforms that handle concurrent conversations. The infrastructure must scale horizontally without degrading response times.

Do multilingual voice agents handle mixed-language conversations?

Advanced speech-to-text models can transcribe code-switching where speakers mix languages mid-sentence. Success depends on training data that includes natural bilingual speech patterns and systems designed to maintain context across language transitions.

Build a voice agent for telehealth triage

Mart Schweiger — Tue, 19 May 2026 19:25:34 +0000

A telehealth triage voice agent answers a patient's call, captures symptoms in their own words, scores severity against a defined protocol, and routes the patient to the right care level — emergency, urgent care, virtual visit, or self-care guidance. It doesn't diagnose, doesn't prescribe, and doesn't decide; it triages, in the same way an experienced nurse on a phone line would, then hands off with structured notes attached.

This tutorial walks through building one on the AssemblyAI Voice Agent API with a clinical-specialty prompt and the architectural controls HIPAA requires — encrypted audio, BAA-backed deployment, PII redaction, and audit logging. We'll cover the triage protocol, symptom capture, severity scoring with tool calls, and the handoff that gets the patient to the right next step. The companion repository is linked at the end.

This is a triage agent, not a clinical decision-maker. Everything in this guide assumes a human clinician makes the final call — the voice agent's job is to capture the data, run the protocol, and route the patient.

What telehealth triage looks like as a voice agent

A triage call follows a predictable structure. The agent:

Greets the patient and confirms identity (name, date of birth)
Asks for the chief complaint in the patient's own words
Walks through a symptom protocol (when did it start, severity, associated symptoms)
Captures red-flag symptoms that escalate severity
Calls a score_severity tool that runs the captured symptoms through a triage algorithm
Routes the patient — ER (911), urgent care, scheduled visit, or self-care
Logs structured notes to the EHR for the receiving clinician

This pattern works for telehealth voice agents because it has a defined protocol, concrete success criteria (was the patient routed correctly?), and a clear failure mode (escalate to a human nurse if anything is unclear). It's not asking the voice agent to diagnose.

Why use the Voice Agent API for telehealth triage

Three properties matter specifically for healthcare:

Speech accuracy on medical terminology. Patients say "metoprolol" and "lisinopril" and "I have a history of A-fib." A model that mishears any of these creates a downstream safety issue. Universal-3 Pro Streaming, the STT layer under the Voice Agent API, performs strongly on medical conversations; for post-call note generation and billing-grade documentation, AssemblyAI's Medical Mode async API is purpose-built for clinical terminology.
BAA-backed deployment for processing PHI. AssemblyAI enables covered entities and their business associates subject to HIPAA to use AssemblyAI services to process protected health information (PHI), and offers a Business Associate Addendum (BAA) required under HIPAA. Without a BAA you legally cannot route PHI through the service, regardless of how good the model is. Contact our sales team to execute a BAA.
Tool calling for protocolized triage. The triage protocol lives in tool calls — score_severity, route_to_care_level, schedule_callback, escalate_to_nurse. The agent calls tools rather than generating free-form clinical guidance, which is what keeps the system inside the bounds of triage and out of the bounds of diagnosis.

Architecture

  Patient call (PSTN via Twilio, or telehealth app)
        │
        ▼
  Voice Agent API (one WebSocket)
   ┌────────────────────────────────────┐
   │  Universal-3 Pro Streaming (STT)    │
   │     ↓                               │
   │  LLM with triage protocol           │
   │     ↓                               │
   │  TTS                                │
   └────────────────────────────────────┘
        │
        │  tool calls
        ▼
   Tool dispatcher
    - capture_symptom         (structured)
    - score_severity          (runs triage algorithm)
    - route_to_care_level     (ER / urgent / scheduled / self-care)
    - escalate_to_nurse       (live RN handoff)
    - log_to_ehr              (encrypted PHI write)

  (post-call)
        │
        ▼
   Async Medical Mode API
   - billing-grade SOAP note
   - ICD-10 candidate codes
   - quality review

The Voice Agent API runs the patient-facing conversation. The protocol logic lives in your tools. Post-call documentation goes through the async Medical Mode API for clinical-quality notes.

Before you start

You need:

An AssemblyAI account — for healthcare deployments, contact our sales team to execute a BAA before processing any PHI
A defined triage protocol from your clinical team. This guide uses a simplified version for illustration; your real protocol should come from licensed clinicians and be reviewed against ESI (Emergency Severity Index) or your organization's equivalent
An EHR integration target (Epic, Cerner, athena, custom)
A licensed RN available for live escalations

Important: Don't deploy a telehealth triage agent into production without (1) a BAA executed with AssemblyAI, (2) clinical review of every prompt and tool, (3) an always-available escalation path to a human nurse, and (4) IRB or compliance review per your organization's policies. The agent in this tutorial is a working starter — not a production-ready clinical system.

Step 1: Define the triage protocol in the system prompt

The system prompt is where the protocol lives. Its CRITICAL RULES section is what makes the difference between a triage agent and a chatbot:

SYSTEM_PROMPT = """You are an AI telehealth triage assistant for ACME Health.

You are NOT a doctor. You do NOT diagnose. You do NOT prescribe. Your job is
to capture symptoms, run a triage protocol, and route the patient to the
right care level. A licensed clinician makes the final decision.

CALL FLOW:
1. Greet the patient. Confirm name and date of birth.
2. Ask the chief complaint in their own words. Capture it verbatim using
   capture_symptom(category='chief_complaint', detail=...).
3. Walk through the OPQRST protocol:
   - Onset (when did it start?)
   - Provocation/Palliation (what makes it worse or better?)
   - Quality (sharp, dull, throbbing?)
   - Region/Radiation (where, does it spread?)
   - Severity (1–10)
   - Timing (constant, intermittent?)
   Call capture_symptom for each.
4. Screen for red flags relevant to the complaint:
   - Chest pain / shortness of breath / arm pain → cardiac red flags
   - Severe headache / vision changes / weakness → stroke red flags
   - High fever / stiff neck → meningitis red flags
   - Severe abdominal pain / blood → surgical red flags
   - Suicidal ideation → mental health red flags
   If ANY red flag is present, call escalate_to_nurse IMMEDIATELY and
   say: "These symptoms need immediate attention. I'm connecting you to
   our on-call nurse right now."
5. Call score_severity with all captured symptoms.
6. Based on the result, call route_to_care_level with the recommendation.

CRITICAL RULES:
- Never tell the patient what they have. Use "your symptoms suggest..." not
  "you have...".
- Never recommend medication or dosage changes.
- If the patient asks medical questions outside triage, say:
  "I can't answer that. Let me connect you with our nurse line."
  and call escalate_to_nurse.
- If you're uncertain at any point, escalate.

STYLE:
- Speak calmly. One or two sentences per turn.
- Use plain language, not medical jargon. "Pressure in your chest" not
  "thoracic discomfort".
- Confirm critical details back: "You said the pain started Tuesday — is
  that right?"
"""

The escalate-on-uncertainty rule is the most important. A triage agent that confidently routes a heart attack to "schedule a visit" is dangerous. One that escalates to a human nurse the moment red flags appear is safe.

Step 2: Define the tools

Each tool needs "type": "function" at the top level — the Voice Agent API validates this on session.update.

TOOLS = [
    {
        "type": "function",
        "name": "capture_symptom",
        "description": "Record a symptom or piece of OPQRST data.",
        "parameters": {
            "type": "object",
            "properties": {
                "category": {
                    "type": "string",
                    "enum": ["chief_complaint", "onset", "provocation",
                             "quality", "region", "severity",
                             "timing", "red_flag"],
                },
                "detail": {"type": "string"},
            },
            "required": ["category", "detail"],
        },
    },
    {
        "type": "function",
        "name": "score_severity",
        "description": (
            "Score the patient's severity based on captured symptoms. "
            "Returns an ESI-style level (1=critical, 5=non-urgent)."
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "symptoms": {"type": "array", "items": {"type": "string"}},
            },
            "required": ["symptoms"],
        },
    },
    {
        "type": "function",
        "name": "route_to_care_level",
        "description": "Route the patient to the appropriate care level.",
        "parameters": {
            "type": "object",
            "properties": {
                "level": {
                    "type": "string",
                    "enum": ["emergency", "urgent_care", "scheduled_visit",
                             "self_care"],
                },
                "reason": {"type": "string"},
            },
            "required": ["level", "reason"],
        },
    },
    {
        "type": "function",
        "name": "escalate_to_nurse",
        "description": (
            "Connect the patient to a live registered nurse immediately. "
            "Call this for any red-flag symptom or any time the protocol "
            "is unclear."
        ),
        "parameters": {
            "type": "object",
            "properties": {"reason": {"type": "string"}},
            "required": ["reason"],
        },
    },
    {
        "type": "function",
        "name": "log_to_ehr",
        "description": "Write structured triage notes to the EHR.",
        "parameters": {
            "type": "object",
            "properties": {
                "patient_id": {"type": "string"},
                "symptoms": {"type": "object"},
                "severity": {"type": "integer"},
                "disposition": {"type": "string"},
            },
            "required": ["patient_id", "symptoms", "severity", "disposition"],
        },
    },
]

The score_severity tool is where your clinical algorithm lives. In the repo, it's a simple rule-based scorer for demonstration; in production, this is the function your clinical team reviews and signs off on.
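
On your side, each tool name has to map to a function. A minimal dispatcher sketch is shown below, assuming tool calls arrive with a name and a dict of arguments. The handler bodies are hypothetical stand-ins for illustration, not the repo's implementations.

```python
captured_symptoms = []


def capture_symptom(category, detail):
    """Store one structured symptom entry (in memory, for illustration)."""
    captured_symptoms.append({"category": category, "detail": detail})
    return {"ok": True, "count": len(captured_symptoms)}


def escalate_to_nurse(reason):
    """Stand-in for the live RN handoff."""
    return {"status": "escalated", "reason": reason}


HANDLERS = {
    "capture_symptom": capture_symptom,
    "escalate_to_nurse": escalate_to_nurse,
}


def dispatch_tool(name, arguments):
    """Run the named tool; return an error payload instead of raising."""
    handler = HANDLERS.get(name)
    if handler is None:
        return {"error": f"unknown tool: {name}"}
    try:
        return handler(**arguments)
    except TypeError as exc:
        # Missing or unexpected arguments from the model.
        return {"error": f"bad arguments for {name}: {exc}"}


result = dispatch_tool(
    "capture_symptom", {"category": "onset", "detail": "started Tuesday"}
)
```

Returning an error payload rather than raising keeps the call alive when the model sends a malformed tool call. In a triage setting you may prefer to treat any such error as a reason to escalate.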

Step 3: Severity scoring logic

import re

# Illustrative keyword lists only; real red flags come from your clinical team.
RED_FLAG_KEYWORDS = {
    "cardiac": ["chest pain", "pressure", "tight", "shortness of breath",
                "arm pain", "jaw pain", "sweating"],
    "stroke":  ["face drooping", "weakness", "slurred speech", "vision",
                "confusion"],
    "surgical":["severe abdominal", "blood in stool", "vomiting blood",
                "rigid abdomen"],
    "sepsis":  ["high fever", "stiff neck", "altered mental"],
    # "hurting myself" added so the self-harm test case in Step 5 is caught.
    "mental":  ["suicidal", "self-harm", "kill myself", "hurting myself"],
}

def score_severity(symptoms):
    """Return an ESI-style level (1=critical, 5=non-urgent) and a route."""
    text = " ".join(s.lower() for s in symptoms)
    # Normalize spoken scores ("5 out of 10") to the "5/10" form matched below.
    text = re.sub(r"(\d+)\s*out of\s*10", r"\1/10", text)
    # Any red-flag keyword short-circuits to the highest severity.
    for category, keywords in RED_FLAG_KEYWORDS.items():
        if any(kw in text for kw in keywords):
            return {"level": 1, "category": category, "route": "emergency"}
    if any(kw in text for kw in ["severe pain", "9/10", "10/10", "can't breathe"]):
        return {"level": 2, "route": "emergency"}
    if any(kw in text for kw in ["moderate pain", "7/10", "8/10", "fever 101", "fever 102"]):
        return {"level": 3, "route": "urgent_care"}
    if any(kw in text for kw in ["mild pain", "5/10", "6/10"]):
        return {"level": 4, "route": "scheduled_visit"}
    return {"level": 5, "route": "self_care"}

This is illustrative only. Real telehealth triage uses validated scoring (ESI, AMTS, organization-specific protocols) developed and reviewed by clinical staff. Don't ship anything to production without that review.

Step 4: Audit logging and PHI controls

Every transcript event from the Voice Agent API is PHI. Treat it as such:

Encrypt at rest. Use envelope encryption (KMS) for any persisted audio or transcripts.
Encrypt in transit. The Voice Agent API WebSocket runs over TLS (wss://), so no additional work is needed there.
Audit log every access. Who read which call, when, from where.
Apply PII redaction to anything that leaves your VPC. Phone numbers, addresses, SSNs, names should be redacted before transcripts hit analytics warehouses or training pipelines.
Set retention policies. Most healthcare orgs retain triage call transcripts for 7 years; configure your storage accordingly.
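As a rough illustration of the redaction step, here is a minimal regex-based sketch. The redact_pii name and both patterns are assumptions made for this example. They only cover US-style SSNs and phone numbers, and they do nothing for names or addresses, which need a proper redaction service rather than regexes.

```python
import re

# Hypothetical patterns for illustration only: US-style SSNs and phone numbers.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
PHONE_RE = re.compile(r"(?<!\d)\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}(?!\d)")

def redact_pii(text: str) -> str:
    """Replace SSN-like and phone-like substrings with placeholder tokens."""
    text = SSN_RE.sub("[SSN]", text)       # SSNs first, so they are labeled precisely
    return PHONE_RE.sub("[PHONE]", text)   # then 10-digit phone numbers

print(redact_pii("My SSN is 123-45-6789 and my number is (555) 123-4567."))
```

Run something like this at the boundary where transcripts leave your VPC, and keep the unredacted copy only in the encrypted store.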

The Voice Agent API's events (transcript.user, transcript.agent, tool.call, tool.result) are exactly what you'd write to the EHR. Build the log_to_ehr tool to flush a structured record at the end of every call.
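One way to picture that structured record is the sketch below. The event type names come from the list above; the payload shape (a dict with "type" and "text" keys) and the build_call_record helper are assumptions for illustration, so check the field names against the events you actually receive.

```python
from datetime import datetime, timezone

def build_call_record(call_id: str, events: list) -> dict:
    """Collapse a call's event stream into one record for the EHR adapter."""
    record = {
        "call_id": call_id,
        "closed_at": datetime.now(timezone.utc).isoformat(),
        "transcript": [],
        "tool_activity": [],
    }
    for event in events:
        kind = event.get("type", "")
        if kind in ("transcript.user", "transcript.agent"):
            # "transcript.user" -> speaker "user", "transcript.agent" -> "agent"
            speaker = kind.split(".", 1)[1]
            record["transcript"].append({"speaker": speaker, "text": event.get("text", "")})
        elif kind in ("tool.call", "tool.result"):
            # Keep tool events whole so reviewers can see arguments and results.
            record["tool_activity"].append(event)
    return record
```

Your log_to_ehr tool would then write this record once, at the end of the call, rather than streaming partial rows into the chart.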

Step 5: Test against representative cases

Before any patient calls the agent, run it against a clinical test suite:

Case	Expected route
"I have crushing chest pain and my left arm is numb"	emergency (cardiac red flag)
"I have a fever of 102 and a stiff neck"	emergency (sepsis red flag)
"I sprained my ankle yesterday, pain is 5 out of 10"	urgent_care or scheduled_visit
"I have a runny nose and slight cough for two days"	self_care
"I'm having thoughts of hurting myself"	escalate_to_nurse (mental health red flag)

Run at least 200 cases through the agent with clinician review of every disposition. The cost of a missed escalation is a clinical safety event; the cost of an over-escalation is overuse of the nurse line. Tune until both are within your organization's tolerance.
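A suite like this can start as a small harness that replays each case and collects mismatches for clinician review. The sketch below is a minimal version under stated assumptions: run_triage_suite and stub_scorer are hypothetical names, and the stub stands in for your real score_severity so the example runs on its own.

```python
def run_triage_suite(scorer, cases):
    """Return (utterance, expected_routes, actual_route) for every mismatch."""
    failures = []
    for utterance, expected_routes in cases:
        actual = scorer([utterance])["route"]
        if actual not in expected_routes:
            failures.append((utterance, expected_routes, actual))
    return failures

# A few rows from the table above; each case lists every acceptable route.
CASES = [
    ("I have crushing chest pain and my left arm is numb", {"emergency"}),
    ("I sprained my ankle yesterday, pain is 5 out of 10", {"urgent_care", "scheduled_visit"}),
    ("I have a runny nose and slight cough for two days", {"self_care"}),
]

def stub_scorer(symptoms):
    # Stand-in scorer for this example only; swap in your real score_severity.
    text = " ".join(symptoms).lower()
    if "chest pain" in text:
        return {"level": 1, "route": "emergency"}
    if "5/10" in text:
        return {"level": 4, "route": "scheduled_visit"}
    return {"level": 5, "route": "self_care"}

failures = run_triage_suite(stub_scorer, CASES)
for utterance, expected, actual in failures:
    print(f"MISMATCH: {utterance!r} expected {sorted(expected)} got {actual}")
```

The ankle case fails here on purpose: callers say "5 out of 10", and a scorer that only looks for "5/10" sends them to self_care. Gaps like that are what the suite is there to surface before a patient does.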

Step 6: Post-call documentation with Medical Mode

After the call, run the captured audio through AssemblyAI's Medical Mode async API for billing-grade clinical documentation. Enable it with the domain="medical-v1" parameter on a standard pre-recorded transcript request:

import assemblyai as aai

aai.settings.api_key = "YOUR_API_KEY"

config = aai.TranscriptionConfig(
    speech_models=["universal-3-pro", "universal-2"],
    domain="medical-v1",       # enables Medical Mode
    speaker_labels=True,        # provider/patient separation
    keyterms_prompt=["Lispro", "Humalog", "metoprolol"],
)
# call_audio_url is the URL of the recorded call audio, supplied by your application.
transcript = aai.Transcriber().transcribe(call_audio_url, config)
# Then send transcript.text through the LLM Gateway for SOAP generation.

Medical Mode is purpose-built for medication names, procedures, conditions, and dosages — it's billed as a separate add-on (see pricing). Combine it with LLM Gateway SOAP generation to produce structured chart entries from the transcript.

The complete repository

Fork the runnable repo at github.com/kelsey-aai/telehealth-triage-voice-agent. It includes the triage agent loop, the OPQRST protocol prompt, the red-flag scorer, the routing logic, and a sample EHR adapter stub. Around 350 lines of Python.

Frequently asked questions

How do I build a voice agent for telehealth triage?

To build a voice agent for telehealth triage, open an AssemblyAI Voice Agent API session with a clinical-specialty system prompt that walks the patient through an OPQRST symptom protocol, screens for red flags, and routes via tool calls. The agent should never diagnose or prescribe — it captures symptoms with capture_symptom, scores severity with score_severity (your clinical algorithm), routes via route_to_care_level, and escalates to a live RN through escalate_to_nurse whenever red flags appear or the protocol is unclear. All of this runs inside one WebSocket at wss://agents.assemblyai.com/v1/ws, with audit logging, encrypted transcripts, and a BAA executed with AssemblyAI before any PHI is processed.

Can I use the Voice Agent API for healthcare workflows subject to HIPAA?

AssemblyAI is considered a business associate under HIPAA and offers a standard Business Associate Addendum (BAA) for customers processing PHI. Before processing any PHI you need to execute the BAA with AssemblyAI — contact our sales team. The Voice Agent API uses TLS for transit, supports PII redaction, and provides per-session audit logs. Your application also needs its own architecture aligned to HIPAA — encryption at rest, role-based access controls, audit logging, retention policies — to meet your obligations end-to-end.

Can a telehealth voice agent diagnose patients?

No. A telehealth triage voice agent should never diagnose, prescribe, or provide clinical decisions. Its role is to capture symptoms, run a defined triage protocol developed by licensed clinicians, score severity, and route the patient to the appropriate care level — emergency, urgent care, scheduled visit, or self-care. A human clinician (nurse, physician, NP) makes the final clinical decision. The system prompt should explicitly forbid diagnostic statements ("you have..." — never; "your symptoms suggest..." — only when leading into a routing decision).

How does the Voice Agent API handle medical terminology?

The STT layer under the Voice Agent API is Universal-3 Pro Streaming, which performs well on conversational medical terminology like medication names and common conditions. For billing-grade clinical documentation — SOAP notes, ICD-10 candidate coding, structured chart entries — AssemblyAI's separate Medical Mode async API is purpose-built for clinical accuracy. Enable it with domain="medical-v1" on a pre-recorded transcript request. The common architecture is: real-time triage on the Voice Agent API, post-call documentation through Medical Mode async, both under the same BAA.

What happens when the agent encounters a red flag?

When the agent detects a red flag — cardiac symptoms (chest pain, arm pain, shortness of breath), stroke symptoms (facial drooping, slurred speech, weakness), surgical symptoms (severe abdominal pain), sepsis indicators (high fever with stiff neck), or mental health emergencies (suicidal ideation) — it should immediately call escalate_to_nurse with the reason, tell the patient "These symptoms need immediate attention. I'm connecting you to our on-call nurse right now," and hand off the call along with the captured symptoms. Red-flag escalation must be automatic, not conditional. Never let the agent continue triaging after a red flag is captured.

What's the difference between this and a healthcare scheduling voice agent?

A healthcare scheduling voice agent books appointments, verifies insurance, and handles prescription refills — administrative tasks where the worst-case error is a rescheduled appointment. A telehealth triage voice agent captures clinical symptoms and routes to care levels — clinical tasks where the worst-case error is a missed cardiac event. The two have different risk profiles, different prompts, different tools, and different review processes. A team building both should keep them as separate agents with separate audit trails. Our healthcare voice agents guide covers the scheduling/administrative side.

How to add automatic LLM fallbacks to your voice pipeline

Mart Schweiger — Tue, 12 May 2026 18:01:08 +0000

Your voice agent is mid-conversation when Anthropic's API returns a 529 overloaded error. The user is waiting. Your code throws. The call drops.

This is the failure mode most voice pipelines aren't built for—and it's getting worse, not better. As more applications move to a single LLM provider, a regional outage at any one of them stalls every downstream voice agent that depends on it. The fix isn't more retries on the same model; it's an automatic switch to a different one.

This tutorial walks you through adding automatic LLM fallbacks to a voice pipeline using AssemblyAI's LLM Gateway. With one extra parameter in your request, the Gateway will automatically retry failed calls on a backup model—Claude to Gemini to GPT—without you writing a line of retry logic. By the end, you'll have a runnable Python pipeline that transcribes live audio with Universal-3 Pro Streaming, routes the transcript through a primary LLM with a fallback chain, and stays online when any single provider does not.

Why fallbacks matter more for voice than for chat

In a chat app, an LLM error means a spinner and a retry button. In a Voice AI pipeline, it means dead air. The user is on the phone, waiting for a response, and a five-second silence while you reconnect to a different provider already feels like a hang-up.

Three failure modes that fallbacks solve:

Provider rate limits. OpenAI, Anthropic, and Google all enforce per-account TPM (tokens per minute) ceilings. A traffic spike on a Monday morning sales line can blow through your default tier before lunch.
Regional outages. Provider status pages list multi-hour incidents in a typical quarter. If your only LLM call is to a single model, your uptime is capped at theirs.
Model deprecations. A model gets sunset on short notice. Without a fallback configured, every voice session that hits the deprecated model fails until you ship a code change.

LLM Gateway sits in front of every supported provider. You point your client at one endpoint, specify a primary model, and list one or two fallbacks. When the primary fails—overloaded, rate-limited, or unavailable—the Gateway transparently retries on the next model in line and returns the response as if nothing went wrong.
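To make the control flow concrete, here is a local model of that behavior. It is a sketch of the idea, not the Gateway's implementation: call_with_fallbacks, ProviderError, and the call_model callback are hypothetical names, and the Gateway's own retry on the primary is left out.

```python
class ProviderError(Exception):
    """Raised by the stand-in call function when a model is unavailable."""

def call_with_fallbacks(primary, fallbacks, call_model, depth=None):
    """Try the primary, then up to `depth` fallbacks, returning the first success."""
    limit = len(fallbacks) if depth is None else depth
    chain = [primary] + list(fallbacks)[:limit]
    last_error = None
    for model in chain:
        try:
            # Report which model answered, mirroring the response's model field.
            return {"model": model, "content": call_model(model)}
        except ProviderError as exc:
            last_error = exc  # remember the failure and move down the chain
    raise RuntimeError(f"all models failed: {chain}") from last_error

def flaky_call(model):
    # Simulate an outage on the primary only.
    if model == "kimi-k2.5":
        raise ProviderError("overloaded")
    return f"reply from {model}"

result = call_with_fallbacks("kimi-k2.5", ["claude-sonnet-4-6", "gemini-2.5-flash"], flaky_call)
print(result["model"])
```

The caller makes one call and gets one result; the only visible trace of the failure is the model name on the response.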

What you'll build

A Python voice pipeline that:

Streams microphone audio to AssemblyAI's Universal-3 Pro streaming speech-to-text model
On end-of-turn, sends the final transcript to LLM Gateway with kimi-k2.5 as the primary model and claude-sonnet-4-6 as the fallback
Prints the agent's response—and logs which model actually handled the call

You'll also see how to chain multiple fallbacks, override prompts per fallback model, and tune retry behavior.

Stack:

AssemblyAI Universal-3 Pro Streaming (speech-to-text)
AssemblyAI LLM Gateway (LLM routing with fallbacks)
Python 3.9+

Setup

Install the dependencies:

pip install assemblyai requests python-dotenv pyaudio

Create a .env file with your API key:

ASSEMBLYAI_API_KEY=your_key_here

You only need one key. The same key authenticates both the streaming STT WebSocket and the LLM Gateway endpoint—no separate accounts with OpenAI, Anthropic, or Google required.

Step 1: Connect to Universal-3 Pro Streaming

For a voice agent, you want the lowest-latency path from speech to text, then immediately hand the transcript to the LLM. We'll use AssemblyAI's v3 streaming API, which returns immutable final transcripts in roughly 300ms.

import os
from dotenv import load_dotenv
from assemblyai.streaming.v3 import (
    StreamingClient,
    StreamingClientOptions,
    StreamingParameters,
    StreamingEvents,
    BeginEvent,
    TurnEvent,
    TerminationEvent,
    StreamingError,
)

load_dotenv()
ASSEMBLYAI_API_KEY = os.getenv("ASSEMBLYAI_API_KEY")

def on_begin(client: StreamingClient, event: BeginEvent):
    print(f"Session started: {event.id}")

def on_turn(client: StreamingClient, event: TurnEvent):
    if event.end_of_turn:
        print(f"\nUser: {event.transcript}")
        respond_with_fallback(event.transcript)
    else:
        print(f"\rPartial: {event.transcript}", end="")

def on_error(client: StreamingClient, error: StreamingError):
    print(f"STT error: {error}")

def on_terminated(client: StreamingClient, event: TerminationEvent):
    print("Session terminated")

The on_turn handler is where the LLM call happens. Every time the user finishes speaking, we hand the final transcript to respond_with_fallback—the function we're about to define.
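One caveat on the handler: respond_with_fallback makes a blocking HTTP call. If the SDK runs callbacks on the same thread that reads streaming events (an assumption worth checking against the SDK version you have installed), the next turn waits until the LLM replies. A minimal sketch of handing the call to a worker thread instead; submit_turn and _llm_pool are hypothetical names:

```python
from concurrent.futures import ThreadPoolExecutor

# A single worker keeps replies in the same order as the user's turns.
_llm_pool = ThreadPoolExecutor(max_workers=1)

def submit_turn(transcript: str, responder):
    """Queue the LLM call and return a Future immediately."""
    return _llm_pool.submit(responder, transcript)

# Stand-in responder so the example runs without network access.
future = submit_turn("hello there", lambda text: text.upper())
print(future.result(timeout=5))
```

Inside on_turn you would call submit_turn(event.transcript, respond_with_fallback) in place of the direct call.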

Step 2: Add the fallback chain

Here's the part that matters. A chat completions request with a fallback chain attached looks like this:

import requests

def respond_with_fallback(user_text: str):
    try:
        response = requests.post(
            "https://llm-gateway.assemblyai.com/v1/chat/completions",
            headers={"authorization": ASSEMBLYAI_API_KEY},
            json={
                "model": "kimi-k2.5",  # Primary: fast, low-latency
                "messages": [
                    {"role": "system", "content": "You are a helpful voice assistant. Keep responses to one or two short sentences."},
                    {"role": "user", "content": user_text},
                ],
                "max_tokens": 200,
                "fallbacks": [
                    {"model": "claude-sonnet-4-6"},   # First fallback
                    {"model": "gemini-2.5-flash"},    # Second fallback
                ],
                "fallback_config": {"depth": 2},      # Try up to two fallbacks
            },
            timeout=10,
        )
    except requests.RequestException as exc:
        # A timeout or connection error between you and the Gateway is a
        # client-side failure, so handle it here instead of crashing the handler.
        print(f"Gateway request failed: {exc}")
        return None

    if response.status_code != 200:
        print(f"All models failed: {response.text}")
        return None

    result = response.json()
    actual_model = result.get("model")  # the model that actually answered
    reply = result["choices"][0]["message"]["content"]

    print(f"Agent ({actual_model}): {reply}")
    return reply

A few details that matter for production:

The model field in the response reflects which model actually answered. If your primary failed and the Gateway used Claude instead, you'll see claude-sonnet-4-6 in the response—and you'll only be billed for that model.
Without a fallbacks array, the Gateway still does one automatic retry on the primary after 500ms (default fallback_config.retry: true). That handles transient blips. The fallback array handles outright failures.
fallback_config.depth controls how many fallbacks to try. Setting it to 2 means the Gateway will try the primary, then the first fallback, then the second.
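Because the response names the model that answered, you can record every time a fallback fires. A minimal sketch, assuming the Gateway echoes back the same identifier string you put in the request; fallback_fired is a hypothetical helper, not part of any SDK:

```python
def fallback_fired(requested_model: str, response_body: dict) -> bool:
    """True when the response was served by a model other than the one requested."""
    actual = response_body.get("model")
    return actual is not None and actual != requested_model

# Example response bodies, trimmed to the one field this check reads.
served_by_primary = {"model": "kimi-k2.5"}
served_by_backup = {"model": "claude-sonnet-4-6"}

print(fallback_fired("kimi-k2.5", served_by_primary))
print(fallback_fired("kimi-k2.5", served_by_backup))
```

Counting these per hour gives you an early signal that the primary is degrading, well before all models fail.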

Step 3: Choose the right primary and fallback models

Latency and capability vary widely across providers. For voice, you want a fast primary because the user is waiting in real time, and a more reliable secondary in case the fast one is overloaded.

Pulled from the LLM Gateway model list, here are sensible voice agent pairings:

Use case	Primary	Fallback 1	Fallback 2	Why
Latency-critical (phone agent)	kimi-k2.5 (~1.2s)	gemini-2.5-flash-lite (~1.1s)	gpt-5-nano (~3.2s)	All low latency; different providers
Quality-first (clinical, legal)	claude-sonnet-4-6	gemini-2.5-pro	gpt-5.1	Highest quality models in each provider
Balanced (most consumer apps)	gpt-5.2 (~1.6s)	claude-haiku-4-5-20251001 (~4.1s)	kimi-k2.5	Speed + cross-provider redundancy

The key constraint: your fallbacks should be on different providers from the primary. A Claude Sonnet to Claude Haiku fallback won't help during an Anthropic outage—both calls hit the same upstream.
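You can enforce that constraint with a small check at startup. The sketch below guesses the provider from the model name prefix, which is an assumption made for illustration: the prefix table and both function names are hypothetical, and a model list with different naming would need its own mapping.

```python
# Hypothetical prefix-to-provider table, based on the model names used in this post.
PROVIDER_PREFIXES = {
    "claude": "anthropic",
    "gpt": "openai",
    "gemini": "google",
    "qwen": "alibaba",
    "kimi": "moonshot",
}

def provider_of(model: str) -> str:
    """Guess the upstream provider from the model name prefix."""
    name = model.lower()
    for prefix, provider in PROVIDER_PREFIXES.items():
        if name.startswith(prefix):
            return provider
    return "unknown"

def same_provider_fallbacks(primary: str, fallbacks: list) -> list:
    """Return the fallbacks that share an upstream with the primary."""
    primary_provider = provider_of(primary)
    return [m for m in fallbacks if provider_of(m) == primary_provider]

print(same_provider_fallbacks("claude-sonnet-4-6",
                              ["claude-haiku-4-5-20251001", "gemini-2.5-flash"]))
```

A non-empty result means part of your chain goes down together with the primary. Unrecognized names all map to "unknown", so the check errs on the side of warning.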

Step 4: Override fields per fallback

Sometimes a fallback model needs a different prompt. Maybe your primary uses a 4,000-token system prompt that your cheaper fallback doesn't have the context window for. Or you want the fallback to be more concise to keep latency in check.

LLM Gateway lets you override any request field per fallback:

"fallbacks": [
    {
        "model": "claude-sonnet-4-6",
        "messages": [
            {"role": "system", "content": "Be very concise. One sentence max."},
            {"role": "user", "content": user_text},
        ],
        "max_tokens": 80,
    },
],

Any field you don't override is inherited from the original request. This is especially useful when your primary is tuned with a long, detailed system prompt and you want a stripped-down version on the backup.
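The inheritance rule is easy to reason about as a shallow merge. This sketch models it locally so you can preview what a fallback request would contain; effective_request is a hypothetical helper, and the assumption that the Gateway merges at the top level (rather than deep-merging nested fields) should be checked against the docs.

```python
def effective_request(base: dict, fallback: dict) -> dict:
    """Preview the request a fallback model would receive under a shallow merge."""
    # The fallback settings themselves are not part of what the model sees.
    inherited = {k: v for k, v in base.items() if k not in ("fallbacks", "fallback_config")}
    # Fields on the fallback entry win; everything else comes from the base request.
    return {**inherited, **fallback}

base = {
    "model": "kimi-k2.5",
    "messages": [{"role": "user", "content": "Where is my order?"}],
    "max_tokens": 200,
    "fallbacks": [{"model": "claude-sonnet-4-6", "max_tokens": 80}],
    "fallback_config": {"depth": 1},
}

preview = effective_request(base, base["fallbacks"][0])
print(preview["model"], preview["max_tokens"])
```

Here the fallback keeps the original messages but swaps the model and tightens max_tokens.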

Step 5: Putting it all together

Wire the streaming client to your fallback-enabled response function:

import assemblyai as aai

def main():
    client = StreamingClient(
        StreamingClientOptions(
            api_key=ASSEMBLYAI_API_KEY,
            api_host="streaming.assemblyai.com",
        )
    )
    client.on(StreamingEvents.Begin, on_begin)
    client.on(StreamingEvents.Turn, on_turn)
    client.on(StreamingEvents.Termination, on_terminated)
    client.on(StreamingEvents.Error, on_error)

    client.connect(
        StreamingParameters(
            speech_model="u3-rt-pro",
            sample_rate=16000,
        )
    )

    try:
        client.stream(aai.extras.MicrophoneStream(sample_rate=16000))
    finally:
        client.disconnect(terminate=True)

if __name__ == "__main__":
    main()

Run it, speak into your microphone, and watch the printed model name. Then—if you want to see the fallback fire—set the primary model parameter to a deliberately invalid string like "this-model-does-not-exist". The Gateway will fail the primary, immediately route to your first fallback, and return a normal response with the fallback model name in the output.

What this gets you in production

Three changes to your voice pipeline as soon as fallbacks are in place:

Provider outages stop being your incidents. When Anthropic, OpenAI, or Google has a regional issue, your voice sessions keep flowing—they just route through whichever provider is healthy. You don't get paged.
Rate-limit spikes self-heal. A traffic spike that would have hit your TPM ceiling on the primary now spreads across providers automatically.
Model migrations are zero-downtime. When a new model ships, you can flip the primary to the new model and keep the old one as a fallback. If anything goes wrong, traffic falls back automatically while you debug.

You can layer more on top of this—separate fallback chains per use case, EU-resident endpoints for GDPR compliance, prompt caching to amortize cost—but the single fallbacks array gets you 90% of the resilience for two extra lines of JSON.

What to build next

Pair fallbacks with streaming chat completions so the user hears the first sentence while the LLM is still generating the rest.
Add tool calling to let your voice agent look up orders, schedule callbacks, or transfer to a human—same fallback behavior carries through.
Consolidate to one API. If you're managing this on top of a separate STT provider, AssemblyAI's Voice Agent API bundles speech understanding, LLM reasoning, and voice generation into a single WebSocket—same fallback patterns apply at the LLM layer, and there's nothing to wire together.

Voice agents need to be built for the failures that will actually happen, not the happy path. Fallbacks turn LLM availability from a single point of failure into a non-event.

Frequently asked questions

What is an LLM fallback and why does my voice pipeline need one?

An LLM fallback is a backup model that automatically takes over when your primary model fails—whether from a provider outage, rate limit, or transient error. Voice pipelines need fallbacks because a failed LLM call means dead air on a live call, which is much worse than a failed text request that the user can retry. With AssemblyAI's LLM Gateway, you specify a fallbacks array in your request and the Gateway transparently retries on the next model if the primary fails—no custom retry logic required.

How does AssemblyAI's LLM Gateway handle automatic LLM failover?

LLM Gateway accepts a fallbacks array of up to two backup models per request. If the primary model fails, the Gateway automatically retries the request with the first fallback, then the second, until one succeeds. The response payload reflects the model that actually answered, and you're billed only for that model. By default, the Gateway also performs one automatic retry on the primary after 500 ms to handle transient errors before falling back to a different provider.

Which LLM providers does AssemblyAI's LLM Gateway support for fallback chains?

LLM Gateway supports 25+ models across Anthropic Claude (Opus 4.7, Sonnet 4.6, Haiku 4.5), OpenAI GPT (GPT-5.2, 5.1, 5, 4.1, mini, nano, gpt-oss), Google Gemini (3 Flash Preview, 2.5 Pro/Flash/Flash-Lite), Alibaba Cloud Qwen, and Moonshot AI Kimi. For voice fallback chains, the key constraint is to chain across different providers—a Claude to Claude fallback won't help during an Anthropic outage because both calls hit the same upstream.

How do I add automatic LLM fallbacks to my voice pipeline?

Add a fallbacks array to your chat/completions request body—that's it. The Gateway handles retries, model switching, and billing automatically. A typical voice agent pairing is kimi-k2.5 as the primary (~1.2s latency), claude-sonnet-4-6 as the first fallback for higher quality, and gemini-2.5-flash-lite as a second fallback for additional provider redundancy. Set fallback_config.depth: 2 to use both backups.

Can I customize the prompt or temperature for each fallback model?

Yes—LLM Gateway lets you override any request field per fallback. This is useful when your primary uses a long, detailed system prompt that a smaller fallback can't accommodate, or when you want the fallback to be more concise to keep latency in check. Any field you don't override on the fallback is inherited from the original request, so you only need to specify what changes.

How does billing work when an LLM Gateway request falls back to a different model?

You're charged only for the model that actually returned the response, at that model's per-token rate. If your primary fails and the Gateway retries with a fallback, you pay only for the fallback model's tokens—not for the failed primary attempt. All usage shows up on a single AssemblyAI invoice across providers, with no markup on top of model rates.

What's the difference between LLM Gateway fallbacks and writing my own retry logic?

LLM Gateway fallbacks handle the entire retry-and-route flow inside the Gateway, so your application code makes one request and gets one response—no custom timeout handling, no model-switching logic, no per-provider error mapping. Writing it yourself works for chat apps where a few seconds of retry latency is fine, but in a voice pipeline every second of dead air costs you, and built-in fallbacks fire faster than client-side retries because the Gateway is already inside the network path.

AssemblyAI LLM Gateway vs. OpenRouter vs. LLM Gateway.io: Pricing, security, and reliability compared

Mart Schweiger — Tue, 12 May 2026 18:00:32 +0000

Picking an LLM gateway used to be a niche infrastructure decision. In 2026, it's table stakes for any team running production AI workloads—especially voice agents, where a single provider outage means dead air on a live call.

Three names come up over and over again in this evaluation: AssemblyAI's LLM Gateway, OpenRouter, and LLM Gateway.io. They sound similar on the surface—all three give you a single API for routing requests across Claude, GPT, Gemini, and other major providers—but they're built for different workloads and they price, fail over, and handle data very differently.

This post compares the three head-to-head on the dimensions that actually matter when you're shipping: pricing model, reliability features, security posture, model coverage, and developer experience. By the end, you'll know which one fits your stack—and where the cheap-on-paper option will cost you more downstream.

Quick verdict

If you're building...	Pick
Voice agents, AI scribes, meeting tools, or anything on top of audio	AssemblyAI LLM Gateway: speech-native context, one billing relationship, sits next to your STT
A general-purpose LLM app, side project, or model marketplace UI	OpenRouter: widest model selection (300+), BYO-key option, strong for experimentation
A self-hosted gateway you fully control, with custom routing logic	LLM Gateway.io: open-source, self-hostable, maximum customization

The rest of this post unpacks why.

What each one actually is

AssemblyAI LLM Gateway

A managed, OpenAI-compatible chat completions API that routes to 25+ models across Anthropic, OpenAI, Google, Alibaba Cloud Qwen, and Moonshot AI Kimi. Available at llm-gateway.assemblyai.com/v1/chat/completions (US) or llm-gateway.eu.assemblyai.com/v1/chat/completions (EU). Built specifically for Voice AI workloads—designed to take transcripts from AssemblyAI's Universal-3 Pro Streaming or pre-recorded models and apply LLMs to them with native preservation of speaker labels, timestamps, and conversation structure.

Best fit: teams already using AssemblyAI for transcription, or any team building voice agents, conversation intelligence, AI medical scribes, or audio analytics.

OpenRouter

A model marketplace that aggregates 300+ models from dozens of providers behind a single OpenAI-compatible endpoint. OpenRouter operates as a billing intermediary—you pay OpenRouter, OpenRouter pays the upstream provider—typically at a small markup over direct API rates, with bring-your-own-API-key supported on most models for users who want to bypass the markup.

Best fit: general-purpose LLM applications, hobbyist and prosumer use cases, and teams that want access to long-tail or specialized open-source models that other gateways don't carry.

LLM Gateway.io

An open-source LLM gateway that you can self-host or use through their managed cloud. Focuses on infrastructure-level features: custom routing rules, observability, caching, rate limiting, and budget controls. Less of a marketplace and more of a control plane you put in front of your LLM traffic.

Best fit: teams with strict deployment requirements (air-gapped, on-prem, regulated industries) or teams that need deep customization of routing logic and want to own the infrastructure.

Pricing, head-to-head

This is where the differences are sharpest—and where the cheapest sticker price isn't always the cheapest total cost.

Pricing factor	AssemblyAI LLM Gateway	OpenRouter	LLM Gateway.io
Markup over provider rates	None — pay model-specific rates	Small markup on most models (BYOK avoids it)	None when self-hosted; managed plan has its own pricing
Billing	Unified with your AssemblyAI account (single invoice)	Separate OpenRouter account	Separate or self-hosted
Free tier	Yes — $50 in starter credits	Yes — limited free models	Open-source is free; managed has tiers
Volume discounts	Available via custom plans	Limited	Self-hosted: scale at infrastructure cost
Hidden costs to watch	None obvious	BYOK still pays small platform fee on some providers	Self-hosted ops overhead (hosting, monitoring, scaling)

The quiet cost of OpenRouter for high-volume production traffic is the per-token markup, which compounds across millions of tokens. The quiet cost of self-hosting LLM Gateway.io is the engineering time to keep it healthy. AssemblyAI's pricing is the most predictable: model-list rate, no markup, one bill.
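To see how a per-token markup compounds, run the arithmetic with your own numbers. Every figure in this sketch is a made-up placeholder, not a quoted price from any provider or gateway, and monthly_markup_cost is a hypothetical helper.

```python
def monthly_markup_cost(tokens: int, rate_per_million: float, markup: float) -> float:
    """Extra monthly spend caused by a fractional markup on the provider rate."""
    base_cost = tokens / 1_000_000 * rate_per_million
    return base_cost * markup

# Placeholder inputs: 500M tokens a month, $3.00 per million tokens, 5% markup.
extra = monthly_markup_cost(tokens=500_000_000, rate_per_million=3.0, markup=0.05)
print(f"${extra:,.2f} per month attributable to markup")
```

The absolute number matters less than the shape: the markup scales linearly with volume, so it grows with your traffic while a flat platform cost does not.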

For voice workloads specifically, the bigger pricing story is what's not on this table. If you're already paying for speech-to-text, LLM Gateway adds the LLM layer on the same bill—no second vendor relationship, no separate procurement.

Model coverage

Model family	AssemblyAI LLM Gateway	OpenRouter	LLM Gateway.io
Total models	25+	300+	Whatever you configure
Anthropic Claude	All major models (Opus 4.7, Sonnet 4.6, Haiku 4.5)	All major models	Yes (BYO)
OpenAI GPT	GPT-5.2, 5.1, 5, 4.1, GPT-5 mini/nano, gpt-oss	All major models	Yes (BYO)
Google Gemini	Gemini 3 Flash Preview, 2.5 Pro/Flash/Flash-Lite	All major Gemini models	Yes (BYO)
Open-source / specialty	Qwen3, Kimi K2.5, gpt-oss	Long tail (Mistral, Llama variants, Cohere, fine-tunes, etc.)	Yes (BYO)
New model availability	Same week as upstream release in most cases	Within hours-days	Depends on your config

OpenRouter wins on raw breadth—if you need an obscure fine-tune or a specific open-source variant, it's there. AssemblyAI's lineup is curated to the production-grade frontier and best-of-class fast models, which is what almost every voice agent or audio app actually needs. LLM Gateway.io, being the gateway layer rather than the model layer, gives you whatever you wire up.

Reliability features

For voice and real-time use cases, this is the table that matters most.

Reliability feature	AssemblyAI LLM Gateway	OpenRouter	LLM Gateway.io
Automatic fallback to backup model	Yes — built-in fallbacks array, up to 2 backups	Yes — fallback model parameter	Yes — configurable routing rules
Retry on transient failure	Yes — automatic 500ms retry by default	Yes	Yes (configurable)
Per-fallback field overrides	Yes — override prompt, temp, max_tokens per backup	Limited	Yes (custom logic)
Streaming support	Yes (OpenAI models)	Yes	Yes
Prompt caching	Yes — Anthropic and OpenAI caching supported	Provider-dependent	Provider-dependent
Multi-region failover	US + EU endpoints	Single global endpoint	Whatever you build

AssemblyAI's fallback model is worth a closer look. You can specify a chain of up to two backup models; if your primary fails, the Gateway transparently retries the next model in line and returns the response as if nothing happened. The response payload includes the actual model that handled the request, and you're only billed for that model. For voice pipelines where every second of dead air costs you, this is the feature that turns LLM availability from a single point of failure into a non-event.

OpenRouter's fallback support is similar in concept but implemented differently—you specify fallbacks at the request level and the platform handles routing. LLM Gateway.io gives you the most flexibility because you write the routing logic, but that flexibility is also work.

Security and compliance

Certification or control	AssemblyAI LLM Gateway	OpenRouter	LLM Gateway.io
SOC 2 Type 2	Yes	Yes	Self-hosted: depends on your setup
HIPAA BAA available	Yes	Limited (varies by provider)	Self-hosted: yours to maintain
EU data residency	Yes — dedicated EU endpoint	No dedicated EU endpoint	Self-hosted: yours to deploy
PCI DSS v4.0	Yes	No	Self-hosted: yours to certify
ISO 27001:2022	Yes	Limited	Self-hosted: yours to certify
Data retention controls	Configurable; opt-out of training	Provider-dependent	You control everything

For regulated industries—healthcare, financial services, legal—the compliance story is the deciding factor. AssemblyAI offers a Business Associate Agreement for HIPAA workloads and is SOC 2 Type 2, ISO 27001:2022, and PCI DSS v4.0 certified. The EU endpoint guarantees data never leaves the European Union, which matters under GDPR.

OpenRouter's compliance posture is thinner—it's a marketplace, and the underlying compliance ultimately depends on the provider you route to. LLM Gateway.io self-hosted shifts every compliance burden onto your team, which is either a feature (full control) or a bug (full responsibility) depending on your org.

Voice and audio: where the real differences show up

This is where AssemblyAI's gateway separates from the others, and the comparison stops being symmetric.

Speech-native context preservation. When you pass an AssemblyAI transcript to LLM Gateway, speaker labels, timestamps, and conversation structure are preserved in the prompt automatically. You don't flatten the transcript; the model receives the structured speech data. Generic LLM gateways can't do this because they're not aware of the upstream STT.

Same-account billing with transcription. If you're already using AssemblyAI for STT or the Voice Agent API, every LLM call shows up on the same invoice. No reconciling tokens with minutes-of-audio across two vendors.

Streaming integration. AssemblyAI's streaming API returns final transcripts in roughly 300 ms; you can hand each segment to LLM Gateway in real time for live summarization, translation, sentiment tagging, or agentic logic—no separate pipeline.

Built for audio-specific workloads. Meeting summarization, action item extraction, SOAP note generation for ambient AI scribes, sales call analytics, real-time translation—these are all first-class patterns in the docs and they work the same way you'd expect a chat completion to work.

OpenRouter and LLM Gateway.io can technically do all of this—you just have to glue the audio side together yourself. For one or two endpoints, that's fine. For a production voice product with complex prompts, multiple LLM tasks per call, and tight latency budgets, the integrated path saves real engineering time.
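For a sense of what that glue looks like, here is a minimal sketch that turns diarized utterances into prompt text with speaker labels and timestamps intact. The utterance shape (dicts with "speaker", "start" in milliseconds, and "text") and the format_utterances name are assumptions for this example.

```python
def format_utterances(utterances: list) -> str:
    """Render utterances as '[MM:SS] Speaker X: text' lines for an LLM prompt."""
    lines = []
    for u in utterances:
        seconds = u["start"] // 1000  # start time is assumed to be in milliseconds
        stamp = f"{seconds // 60:02d}:{seconds % 60:02d}"
        lines.append(f"[{stamp}] Speaker {u['speaker']}: {u['text']}")
    return "\n".join(lines)

utterances = [
    {"speaker": "A", "start": 0, "text": "Thanks for calling, how can I help?"},
    {"speaker": "B", "start": 65000, "text": "I need to move my appointment."},
]
print(format_utterances(utterances))
```

It is not much code for one endpoint. The cost shows up when every prompt in a product needs the same formatting, truncation, and speaker handling kept in sync.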

Developer experience

Aspect	AssemblyAI LLM Gateway	OpenRouter	LLM Gateway.io
API compatibility	OpenAI-compatible chat completions	OpenAI-compatible	OpenAI-compatible
Auth	Single AssemblyAI API key	OpenRouter key (or BYOK)	Self-managed
SDKs / docs	Official AssemblyAI SDKs (Python, Node, .NET, Java, etc.) + docs	Their own SDK + community libraries	Open-source repo + docs
Playground	Yes — test models side-by-side	Yes	Self-hosted only
Setup time	Minutes (just swap the base URL)	Minutes	Hours-days for self-host
Migration friction	Same OpenAI-compatible request schema	Same OpenAI-compatible request schema	Same OpenAI-compatible request schema

All three are easy to adopt because they all speak the same chat completions schema. Switching from one to another requires changing a base URL and an API key—not a rewrite. That's the right way to think about lock-in: low.
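In code, the swap can be reduced to a small config object. The sketch below builds the same request body for two gateways and changes only the URL and auth value. GatewayTarget and build_request are hypothetical names, the keys are placeholders, and model identifiers can differ between gateways, so treat the model string as something to verify on each side.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GatewayTarget:
    url: str
    auth_value: str  # the full Authorization header value for this gateway

# The AssemblyAI endpoint and raw-key header follow the earlier request example.
ASSEMBLYAI = GatewayTarget(
    url="https://llm-gateway.assemblyai.com/v1/chat/completions",
    auth_value="YOUR_ASSEMBLYAI_KEY",
)
# OpenRouter expects a Bearer token.
OPENROUTER = GatewayTarget(
    url="https://openrouter.ai/api/v1/chat/completions",
    auth_value="Bearer YOUR_OPENROUTER_KEY",
)

def build_request(target: GatewayTarget, model: str, user_text: str) -> dict:
    """Return keyword arguments suitable for requests.post(**kwargs)."""
    return {
        "url": target.url,
        "headers": {"authorization": target.auth_value},
        "json": {"model": model, "messages": [{"role": "user", "content": user_text}]},
    }

a = build_request(ASSEMBLYAI, "kimi-k2.5", "Hello")
b = build_request(OPENROUTER, "kimi-k2.5", "Hello")
print(a["json"] == b["json"], a["url"] != b["url"])
```

Gateway-specific extras, such as the fallbacks array, are the parts that do not carry over unchanged, so keep them out of the shared request builder.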

When to pick each one

Pick AssemblyAI LLM Gateway if:

You're building voice agents, AI scribes, conversation intelligence, or any audio-first product
You're already using AssemblyAI for transcription and want to consolidate
You need a BAA for HIPAA workloads, EU data residency, or PCI compliance
You want predictable pricing without per-token markups
You want fallbacks, prompt caching, and EU/US endpoints out of the box

Pick OpenRouter if:

You're building a chat app, agent product, or general LLM tool unrelated to audio
You need access to a long tail of open-source or specialty models
You want to experiment across many models before committing
You're a hobbyist or prosumer who values selection over enterprise compliance

Pick LLM Gateway.io if:

You have hard requirements to self-host or run air-gapped
You need to write custom routing logic (e.g., regulatory rules, cost-aware routing across BYO accounts)
You have engineering capacity to operate the infrastructure
You're standardizing across many internal teams and want one control plane

The hidden tradeoff

The real question isn't "which gateway has the most features." It's "which one will I regret picking in six months when my workload doubles."

For voice and audio workloads, that answer is almost always the gateway that's natively integrated with your speech stack. The marginal latency, the speech-aware context, the unified billing, the compliance—all of it adds up to engineering hours you don't spend wiring two vendors together.

Frequently asked questions

What is an LLM gateway and why would I use one?

An LLM gateway is a routing layer that sits between your application and multiple LLM providers, giving you one API endpoint for Claude, GPT, Gemini, and other models. You'd use one to avoid vendor lock-in, add automatic failover when a provider has an outage, unify billing across models, and switch models without rewriting client code. AssemblyAI's LLM Gateway, OpenRouter, and LLM Gateway.io are the three main options—they serve different workloads and price differently.

What's the difference between AssemblyAI's LLM Gateway and OpenRouter?

AssemblyAI's LLM Gateway is purpose-built for Voice AI workloads: it natively preserves speaker labels, timestamps, and conversation structure when you pass transcripts. OpenRouter is a general-purpose model marketplace that aggregates 300+ models with a per-token markup. For voice agents, AI scribes, and audio applications, the integrated approach offers advantages in handling speech context and unified billing.

Which LLM gateway is best for voice agents?

AssemblyAI's LLM Gateway is the strongest fit for voice agents because it integrates with Universal-3 Pro Streaming and the Voice Agent API through the same WebSocket layer. That gives you unified authentication, combined billing, automatic fallbacks across providers, and native speech context preservation, all of which take additional engineering to get from a generic gateway.

How does LLM Gateway pricing compare to calling LLM providers directly?

AssemblyAI's LLM Gateway charges model-specific rates with no markup, billed through your AssemblyAI account. OpenRouter adds a small per-token platform fee, though its bring-your-own-API-key option can reduce this. LLM Gateway.io is free, open-source software when self-hosted, with your team absorbing the infrastructure costs; it also offers a managed tier. For high-volume production, AssemblyAI and self-hosted LLM Gateway.io provide the most predictable cost structures.

Does AssemblyAI's LLM Gateway support EU data residency and HIPAA compliance?

Yes—a dedicated EU endpoint at llm-gateway.eu.assemblyai.com/v1/chat/completions keeps all request and response data inside the European Union, supporting Anthropic Claude and most Google Gemini models. AssemblyAI provides a Business Associate Agreement for HIPAA workloads and maintains SOC 2 Type 2, ISO 27001:2022, and PCI DSS v4.0 certification, representing the strictest compliance posture among the three platforms.

Can I switch between LLM gateways without rewriting my code?

Yes—all three gateways use OpenAI-compatible chat completions schemas, so switching typically requires changing only the base URL and API key. This means lock-in remains low; you can evaluate one platform against another and migrate without rewriting application code. Moving from direct OpenAI integration to any of these gateways involves similarly minimal changes.

Which LLM gateway should I use for HIPAA-regulated healthcare apps?

AssemblyAI's LLM Gateway is the most straightforward choice for HIPAA workloads, since the company offers a Business Associate Agreement and operates SOC 2 Type 2, ISO 27001:2022, and PCI DSS v4.0-certified infrastructure. If you need data isolation beyond the scope of a BAA, self-hosted LLM Gateway.io gives you complete deployment control but leaves your team responsible for maintaining compliance certification. OpenRouter is generally a poor fit for regulated healthcare data because compliance support varies across upstream providers.

Stream LLM responses in a voice pipeline: Tool calling, structured outputs, and real-time actions

Mart Schweiger — Tue, 12 May 2026 17:59:55 +0000

When a user finishes a sentence in a voice conversation, they expect to hear the agent start replying within roughly a second. Anything longer feels broken. The fastest way to hit that target isn't a faster LLM—it's not waiting for the LLM to finish before you start speaking.

Streaming the LLM response, sentence by sentence, into a TTS engine is the trick that turns a 4-second response time into a sub-second one. And once you're streaming, you can layer on tool calling for real-world actions and structured outputs for predictable downstream code—all without giving up that latency budget.

This tutorial walks through how to build that pipeline using AssemblyAI's LLM Gateway and Universal-3 Pro Streaming. By the end, you'll have a Python voice pipeline that:

Streams microphone audio into AssemblyAI for live transcription
Streams the LLM response token-by-token through LLM Gateway
Calls tools mid-conversation to look up data or trigger actions
Returns structured JSON when the workflow needs predictable output
Hands each completed sentence to TTS as it arrives

Why streaming matters more in voice than in chat

In a chat UI, streaming is a nice-to-have—you see the response appear word by word instead of all at once. In a voice agent, it's the difference between conversational and broken.

The math is simple. End-to-end voice latency is roughly:

STT finalization (200-500 ms)
+ LLM time-to-first-token (150-400 ms)
+ TTS time-to-first-audio (200-400 ms)
+ network overhead (50-150 ms)
= 600-1500 ms before the user hears anything

If you wait for the full LLM response before sending text to TTS, add another 1-3 seconds onto that. Users notice. Conversation breaks. They start over.

If you stream—flushing each completed sentence to TTS as soon as the LLM emits it—the user hears the first sentence while the LLM is still generating the second. End-to-end latency stays inside the 600-900 ms range that feels conversational.
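
To sanity-check that arithmetic, here is a small sketch that sums the per-stage ranges quoted above. The worst case comes to 1,450 ms, which the breakdown rounds to roughly 1.5 seconds. `BUDGET_MS` and `total_range` are illustrative names, and the 1-3 second penalty for waiting on the full reply is applied as a flat addition for simplicity.

```python
# Latency ranges in milliseconds, taken from the breakdown above.
BUDGET_MS = {
    "stt_finalization": (200, 500),
    "llm_time_to_first_token": (150, 400),
    "tts_time_to_first_audio": (200, 400),
    "network_overhead": (50, 150),
}

def total_range(budget: dict) -> tuple:
    """Sum the best-case and worst-case values of every stage."""
    best = sum(low for low, _ in budget.values())
    worst = sum(high for _, high in budget.values())
    return best, worst

best, worst = total_range(BUDGET_MS)
print(f"streamed: first audio after {best}-{worst} ms")

# Simplified assumption: a full reply takes an extra 1-3 s to generate.
unstreamed = (best + 1000, worst + 3000)
print(f"unstreamed: first audio after {unstreamed[0]}-{unstreamed[1]} ms")
```

Swapping in measurements from your own stack shows quickly which stage dominates the budget.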

What you'll build

A Python pipeline that handles three voice agent patterns:

Streamed conversational replies—the user asks a question; the agent's voice starts within ~1 second and flows naturally
Tool calling—the user says "what's my order status?"; the agent calls get_order_status(order_id) and speaks the result
Structured outputs—the agent returns a JSON object matching a schema (e.g., {intent, urgency, escalate}), which your code consumes directly without parsing freeform text

Stack:

AssemblyAI Universal-3 Pro Streaming (speech-to-text)
AssemblyAI LLM Gateway (streaming chat completions, tools, structured outputs)
A TTS engine of your choice (we'll use a placeholder—same pattern works with any streaming TTS)
Python 3.9+

Setup

pip install assemblyai requests python-dotenv pyaudio

Create .env:

ASSEMBLYAI_API_KEY=your_key_here

The same AssemblyAI API key authenticates both the streaming STT WebSocket and the LLM Gateway endpoint.

Step 1: Stream tokens from LLM Gateway

LLM Gateway supports OpenAI-style streaming on OpenAI models. Set stream: True in the request and read the response as a Server-Sent Events (SSE) stream. Each chunk contains a partial token; you stitch them together as they arrive.

The key trick for voice: don't wait for the full response. Buffer tokens, watch for sentence boundaries (., !, ?), and flush each completed sentence to TTS the instant it's ready.

import os
import json
import requests
from dotenv import load_dotenv

load_dotenv()
ASSEMBLYAI_API_KEY = os.getenv("ASSEMBLYAI_API_KEY")

def stream_llm_response(user_text: str):
    """
    Stream the LLM response. Yield each completed sentence as it's generated.
    """
    response = requests.post(
        "https://llm-gateway.assemblyai.com/v1/chat/completions",
        headers={"authorization": ASSEMBLYAI_API_KEY},
        json={
            "model": "gpt-5.2",
            "messages": [
                {"role": "system", "content": "You are a friendly voice assistant. Keep replies short."},
                {"role": "user", "content": user_text},
            ],
            "stream": True,
            "max_tokens": 300,
        },
        stream=True,
        timeout=15,
    )

    buffer = ""
    sentence_endings = (".", "!", "?")

    for line in response.iter_lines():
        if not line:
            continue
        line = line.decode("utf-8")
        if not line.startswith("data: "):
            continue
        data = line[6:]
        if data == "[DONE]":
            if buffer.strip():
                yield buffer.strip()
            return

        chunk = json.loads(data)
        delta = chunk["choices"][0]["delta"].get("content", "")
        if not delta:
            continue
        buffer += delta

        while any(buffer.find(p) != -1 for p in sentence_endings):
            split_idx = max(buffer.rfind(p) for p in sentence_endings)
            sentence = buffer[: split_idx + 1].strip()
            buffer = buffer[split_idx + 1 :]
            if sentence:
                yield sentence

The generator yields each completed sentence as it's ready. Your TTS engine consumes these one at a time:

def speak(sentence: str):
    """Send a sentence to your TTS engine. Replace with your provider's API."""
    print(f"  [TTS] {sentence}")
    # tts_client.stream(sentence)

def handle_final_transcript(user_text: str):
    print(f"User: {user_text}")
    for sentence in stream_llm_response(user_text):
        speak(sentence)

This single change—yielding sentences as they arrive instead of waiting for the full reply—typically cuts perceived response time by 60-80% for any reply longer than two sentences.
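
One caveat with the buffering loop above: it splits on any period, including the one inside a number such as 3.50, and it yields everything up to the last punctuation mark as a single chunk. A slightly stricter variant, sketched here as one option and not as part of any AssemblyAI SDK, only treats punctuation followed by whitespace as a sentence boundary. `split_sentences` and `sentences_from_tokens` are hypothetical helper names.

```python
import re

# A sentence ends at ., !, or ? followed by whitespace. Requiring the
# whitespace avoids splitting inside decimals such as "3.50".
_BOUNDARY = re.compile(r"(?<=[.!?])\s+")

def split_sentences(buffer: str):
    """Return (complete_sentences, remainder) for the text buffered so far."""
    parts = _BOUNDARY.split(buffer)
    # The last part may be an unfinished sentence, so keep it in the buffer.
    remainder = parts.pop()
    return [p.strip() for p in parts if p.strip()], remainder

def sentences_from_tokens(tokens):
    """Yield sentences one at a time from an iterable of token strings."""
    buffer = ""
    for token in tokens:
        buffer += token
        sentences, buffer = split_sentences(buffer)
        yield from sentences
    if buffer.strip():
        yield buffer.strip()  # flush whatever is left when the stream ends

demo = list(sentences_from_tokens(
    ["Your total is $3", ".5", "0. Any", "thing else?", " Bye!"]
))
```

The trade-off is that a sentence is only released once the following whitespace (or the end of the stream) arrives, which usually costs one extra token of delay.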

Step 2: Add tool calling

Voice agents become useful the moment they can do something—look up an order, check inventory, schedule a callback, transfer to a human. LLM Gateway supports OpenAI-compatible tool calling across every supported model (Claude, OpenAI, Gemini), so you write the code once and it works no matter which provider you route to.

Define your tools:

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_order_status",
            "description": "Look up the status of a customer order by order ID.",
            "parameters": {
                "type": "object",
                "properties": {
                    "order_id": {
                        "type": "string",
                        "description": "The order ID, e.g. ORD-12345",
                    }
                },
                "required": ["order_id"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "schedule_callback",
            "description": "Schedule a callback with a sales rep.",
            "parameters": {
                "type": "object",
                "properties": {
                    "phone_number": {"type": "string"},
                    "preferred_time": {"type": "string"},
                },
                "required": ["phone_number", "preferred_time"],
            },
        },
    },
]

Implement the actual functions:

def get_order_status(order_id: str) -> dict:
    return {"order_id": order_id, "status": "shipped", "eta": "2026-05-09"}

def schedule_callback(phone_number: str, preferred_time: str) -> dict:
    return {"confirmation": "CB-9982", "phone": phone_number, "time": preferred_time}

TOOL_REGISTRY = {
    "get_order_status": get_order_status,
    "schedule_callback": schedule_callback,
}

Now extend the LLM call to handle tool requests. The Gateway returns a tool_calls field on the assistant message; you execute each tool, append the result to the conversation history, and call again to let the model produce its spoken response:

def stream_llm_response_with_history(history: list, model: str = "gpt-5.2"):
    """Stream a follow-up reply using the existing conversation history."""
    response = requests.post(
        "https://llm-gateway.assemblyai.com/v1/chat/completions",
        headers={"authorization": ASSEMBLYAI_API_KEY},
        json={"model": model, "messages": history, "stream": True, "max_tokens": 300},
        stream=True,
        timeout=15,
    )

    buffer = ""
    sentence_endings = (".", "!", "?")
    for line in response.iter_lines():
        if not line:
            continue
        line = line.decode("utf-8")
        if not line.startswith("data: "):
            continue
        data = line[6:]
        if data == "[DONE]":
            if buffer.strip():
                yield buffer.strip()
            return
        chunk = json.loads(data)
        delta = chunk["choices"][0]["delta"].get("content", "")
        if not delta:
            continue
        buffer += delta
        while any(p in buffer for p in sentence_endings):
            split_idx = max(buffer.rfind(p) for p in sentence_endings)
            sentence = buffer[: split_idx + 1].strip()
            buffer = buffer[split_idx + 1 :]
            if sentence:
                yield sentence

def respond_with_tools(user_text: str, history: list):
    history.append({"role": "user", "content": user_text})

    response = requests.post(
        "https://llm-gateway.assemblyai.com/v1/chat/completions",
        headers={"authorization": ASSEMBLYAI_API_KEY},
        json={
            "model": "claude-sonnet-4-6",
            "messages": history,
            "tools": tools,
            "max_tokens": 500,
        },
        timeout=15,  # fail fast instead of letting a stalled request hang the call
    ).json()

    message = response["choices"][0]["message"]
    history.append(message)

    if message.get("tool_calls"):
        for tool_call in message["tool_calls"]:
            fn_name = tool_call["function"]["name"]
            fn_args = json.loads(tool_call["function"]["arguments"])
            result = TOOL_REGISTRY[fn_name](**fn_args)

            history.append({
                "role": "tool",
                "tool_call_id": tool_call["id"],
                "content": json.dumps(result),
            })

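        # Stream the spoken follow-up on an OpenAI model, since LLM Gateway
        # streaming is supported on OpenAI models.
        return stream_llm_response_with_history(history, model="gpt-5.2")

    # No tool call: the first response already contains the full reply.
    def _yield_once():
        yield message.get("content") or ""
    return _yield_once()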

stream_llm_response_with_history is the same streaming function from Step 1, except it sends the full conversation history (which now includes the tool result) so the model can speak the answer naturally.

The clean part: tool calling and streaming compose. The model thinks for a moment ("let me check that for you"), executes the tool, and then streams the spoken result token by token—exactly the conversational rhythm users expect.
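
The dispatch line in `respond_with_tools` assumes the model always names a registered tool and sends valid arguments. A more defensive version, sketched below as an assumption about how you might harden it, returns a structured error in the tool message so the model can recover in its spoken reply. `execute_tool_call` is a hypothetical helper. It expects the same OpenAI-style `tool_calls` shape used above.

```python
import json

def execute_tool_call(tool_call: dict, registry: dict) -> dict:
    """Run one OpenAI-style tool call and return a `tool` message for the history."""
    name = tool_call["function"]["name"]
    try:
        args = json.loads(tool_call["function"]["arguments"] or "{}")
        if name not in registry:
            result = {"error": f"unknown_tool: {name}"}
        else:
            result = registry[name](**args)
    except (json.JSONDecodeError, TypeError) as exc:
        # Malformed JSON or wrong/missing arguments: report instead of raising.
        # Note that a TypeError raised inside the tool itself lands here too.
        result = {"error": "invalid_arguments", "detail": str(exc)}
    return {
        "role": "tool",
        "tool_call_id": tool_call["id"],
        "content": json.dumps(result),
    }

# Usage with a stand-in registry (replace with your real tools):
registry = {"get_order_status": lambda order_id: {"order_id": order_id, "status": "shipped"}}
call = {"id": "call_1", "function": {"name": "get_order_status", "arguments": '{"order_id": "ORD-12345"}'}}
tool_message = execute_tool_call(call, registry)
```

Appending `tool_message` to the history in place of the inline dict keeps the rest of the loop unchanged.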

A note for entity-heavy use cases: if your tool parameters include order IDs, phone numbers, or email addresses, your speech-to-text accuracy on those tokens is what determines whether tool calls succeed. Universal-3 Pro Streaming has roughly 16.7% mixed-entity error rate vs. 23-25% for competing models—that's the difference between ORD-12345 and or 12 three 45 getting passed to your function.

Step 3: Use structured outputs for predictable JSON

Sometimes you don't want a spoken reply—you want machine-readable output your downstream code can act on. Routing decisions, intent classification, sentiment scoring, escalation flags. LLM Gateway supports structured outputs via JSON schema, which guarantees the model returns exactly the shape you specified.

Define the schema:

classification_schema = {
    "type": "object",
    "properties": {
        "intent": {
            "type": "string",
            "enum": ["billing", "support", "sales", "cancel", "other"],
        },
        "urgency": {
            "type": "string",
            "enum": ["low", "medium", "high"],
        },
        "escalate": {"type": "boolean"},
        "summary": {"type": "string"},
    },
    "required": ["intent", "urgency", "escalate", "summary"],
}

Send it with response_format:

def classify_utterance(user_text: str) -> dict:
    response = requests.post(
        "https://llm-gateway.assemblyai.com/v1/chat/completions",
        headers={"authorization": ASSEMBLYAI_API_KEY},
        json={
            "model": "gpt-5.2",
            "messages": [
                {"role": "system", "content": "Classify the user's intent for a customer service workflow."},
                {"role": "user", "content": user_text},
            ],
            "response_format": {
                "type": "json_schema",
                "json_schema": {
                    "name": "intent_classification",
                    "schema": classification_schema,
                    "strict": True,
                },
            },
        },
        timeout=15,  # fail fast instead of stalling the voice turn
    ).json()

    return json.loads(response["choices"][0]["message"]["content"])

classification = classify_utterance("I want to cancel my subscription right now.")
# {"intent": "cancel", "urgency": "high", "escalate": True, "summary": "..."}

if classification["escalate"]:
    transfer_to_human()  # placeholder: replace with your own call-transfer logic

You get back a parsed dict you can route on directly. Pair this with streaming for the user-facing reply: classify the intent (structured), then stream a conversational acknowledgment based on the classification. The user hears a friendly sentence in under a second while your code routes the call in the background.
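
Even with a strict schema, a cheap local check before routing can be worthwhile, for example if the request returns an error payload or a routed model handles the schema differently than expected. That is a defensive assumption, not documented Gateway behavior. The sketch below covers only the schema features used in this tutorial (required keys, string enums, basic types). `validate_against_schema` is a hypothetical helper, not a full JSON Schema validator.

```python
def validate_against_schema(payload: dict, schema: dict) -> list:
    """Return a list of problems; an empty list means the payload looks valid."""
    type_map = {"string": str, "boolean": bool}
    problems = []
    for key in schema.get("required", []):
        if key not in payload:
            problems.append(f"missing: {key}")
    for key, rules in schema.get("properties", {}).items():
        if key not in payload:
            continue
        value = payload[key]
        expected = type_map.get(rules.get("type"))
        if expected and not isinstance(value, expected):
            problems.append(f"wrong type: {key}")
        elif "enum" in rules and value not in rules["enum"]:
            problems.append(f"not in enum: {key}")
    return problems

# A trimmed copy of the classification schema, for illustration only.
schema = {
    "type": "object",
    "properties": {
        "intent": {"type": "string", "enum": ["billing", "support", "sales", "cancel", "other"]},
        "escalate": {"type": "boolean"},
    },
    "required": ["intent", "escalate"],
}

good = validate_against_schema({"intent": "cancel", "escalate": True}, schema)
bad = validate_against_schema({"intent": "refund"}, schema)
```

If the list is non-empty, a reasonable fallback is to treat the turn as unclassified and continue with the normal spoken reply.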

Step 4: Wire it all together with streaming STT

The full pipeline looks like this—STT WebSocket on the inbound side, streamed LLM Gateway responses on the outbound side, with tool calls and structured outputs available when needed:

from assemblyai.streaming.v3 import (
    StreamingClient,
    StreamingClientOptions,
    StreamingParameters,
    StreamingEvents,
    BeginEvent,
    TurnEvent,
    StreamingError,
)

conversation_history = [
    {"role": "system", "content": "You are a helpful voice assistant."}
]

def on_turn(client: StreamingClient, event: TurnEvent):
    if not event.end_of_turn:
        return

    user_text = event.transcript
    print(f"\nUser: {user_text}")

    classification = classify_utterance(user_text)
    if classification["escalate"]:
        speak("Let me get a human on the line for you.")
        return

    for sentence in respond_with_tools(user_text, conversation_history):
        speak(sentence)

import assemblyai as aai

def main():
    client = StreamingClient(
        StreamingClientOptions(
            api_key=ASSEMBLYAI_API_KEY,
            api_host="streaming.assemblyai.com",
        )
    )
    client.on(StreamingEvents.Turn, on_turn)
    client.connect(
        StreamingParameters(speech_model="u3-rt-pro", sample_rate=16000)
    )

    try:
        client.stream(aai.extras.MicrophoneStream(sample_rate=16000))
    finally:
        client.disconnect(terminate=True)

if __name__ == "__main__":
    main()

Speak into your microphone, ask about an order ID, and watch the agent execute the tool call and stream the spoken reply back. The combination of streaming STT, streaming LLM, and tool calling produces the responsive voice experience users now expect.

When to use which technique

Pattern	Use it when
Streaming reply only	The user asked a question; you want a fast, conversational answer
Tool calling + streamed reply	The agent needs to act on real data (order lookup, scheduling, transfers)
Structured outputs	You need machine-readable output for routing, classification, or downstream logic
Structured + streamed combo	Classify the intent in JSON, then stream a conversational acknowledgment to the user
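
The combined pattern in the last row can be sketched with a worker thread: classification runs in the background while the reply is spoken on the main thread. This is a minimal sketch that assumes escalation does not need to block the first sentence. `handle_turn` and its callable parameters are hypothetical, so you can pass in `classify_utterance`, a sentence generator, and `speak` from the steps above.

```python
from concurrent.futures import ThreadPoolExecutor

def handle_turn(user_text, classify, reply_sentences, speak, on_classified):
    """Speak the streamed reply while classification runs on a worker thread."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(classify, user_text)    # starts right away
        for sentence in reply_sentences(user_text):  # streams on this thread
            speak(sentence)
        on_classified(future.result())               # blocks only if still running

# Stand-in callables so the sketch runs without network access:
spoken, routed = [], []
handle_turn(
    "I want to cancel my subscription.",
    classify=lambda text: {"intent": "cancel", "escalate": True},
    reply_sentences=lambda text: iter(["Sure, I can help with that.", "One moment."]),
    speak=spoken.append,
    on_classified=routed.append,
)
```

If the classification call raises, `future.result()` re-raises the exception on the main thread, so wrap it in error handling appropriate to your call flow.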

Skip the wiring with the Voice Agent API

That's streaming, tools, structured outputs, and an STT-LLM-TTS pipeline, all wired together by hand. If you're building a single voice agent and don't need to swap LLM providers per request, AssemblyAI's Voice Agent API bundles all of this behind one WebSocket. You set a system prompt, register tools, and get back streamed audio with built-in turn detection and barge-in. Same Universal-3 Pro Streaming foundation, same fallback patterns, no glue code.

The lower-level approach in this tutorial is the right call when you need maximum control—choosing different LLMs per request, applying custom retry logic, or running structured-output classification in parallel with the spoken reply. Both paths are first-class on AssemblyAI; pick the one that matches your constraint.

Streaming everything is the new baseline for voice. Tool calling and structured outputs are what turn a streaming chatbot into something that can actually do work. Build for both and your voice agent stops feeling like a demo.

Frequently asked questions

What does it mean to stream LLM responses in a voice pipeline?

Streaming LLM responses means receiving and processing the model's output token by token as it's generated, instead of waiting for the full response to complete. In a voice pipeline, streaming lets you forward each completed sentence to a text-to-speech engine the moment the LLM emits it—so the user hears the first sentence of the agent's reply while the LLM is still generating the second. This typically cuts perceived response time by 60–80% for any reply longer than two sentences.

How do I stream LLM responses through AssemblyAI's LLM Gateway?

Set stream: True in your chat/completions request and read the response as a Server-Sent Events (SSE) stream. Each chunk contains a partial token in the choices[0].delta.content field. Buffer tokens, watch for sentence-ending punctuation, and flush each completed sentence to your TTS engine as soon as it's ready. Streaming is supported on OpenAI models in LLM Gateway today.

How does tool calling work with the LLM Gateway?

Tool calling lets your voice agent invoke functions to access data or trigger actions—looking up an order, scheduling a callback, transferring to a human. Define your tools as JSON Schema in the tools array, and when the model decides to call one it returns a tool_calls field on the assistant message. You execute the tool, append the result to the conversation history, and call the Gateway again to let the model produce a spoken response that incorporates the tool output. The schema is OpenAI-compatible, so the same code works across Claude, GPT, and Gemini.

Can I get structured JSON outputs from the LLM Gateway for voice agents?

Yes—LLM Gateway supports structured outputs via JSON schema using the response_format parameter. This guarantees the model returns exactly the shape you specified, which is useful for intent classification, routing decisions, sentiment scoring, and any voice agent workflow that needs machine-readable output your downstream code can consume directly. A common voice pattern is to classify intent in JSON first, then stream a conversational acknowledgment back to the user while your code routes the call in the background.

What's the latency budget for a real-time voice agent using streamed LLM responses?

A well-tuned voice pipeline targets 600–900 ms from the moment the user stops speaking to the moment they hear the agent's first audio. That budget breaks down roughly as: 200–500 ms for STT finalization, 150–400 ms for LLM time-to-first-token, 200–400 ms for TTS time-to-first-audio, and 50–150 ms of network overhead. Streaming everything—STT transcripts, LLM tokens, TTS audio—is what makes hitting that budget achievable.

When should I use the Voice Agent API instead of wiring streaming STT and LLM Gateway separately?

Use the Voice Agent API when you're building a single voice agent and want one WebSocket that handles STT, LLM, TTS, turn detection, and tool calling out of the box. Use the lower-level streaming STT plus LLM Gateway approach when you need more control—choosing different LLMs per request, applying custom retry logic, or running structured-output classification in parallel with the spoken reply. Both options use the same Universal-3 Pro Streaming foundation, so accuracy is identical.

Does streaming work with tool calling and structured outputs?

Yes—streaming composes with both. With tool calling, the agent thinks for a moment, executes the tool, then streams the spoken result token by token. With structured outputs, you typically don't stream the JSON itself (you want the complete object before parsing) but you can stream a separate conversational acknowledgment to the user while the structured classification finalizes in parallel.

Build an AI voice agent for customer support that can look up orders

Mart Schweiger — Tue, 12 May 2026 17:59:18 +0000

Tier-1 customer support is mostly the same five conversations on repeat: where's my order, can I change my address, can I get a refund, when does this ship, can I talk to a human. They're predictable, they're high-volume, and they don't need a person—they need a voice agent that can actually look things up.

In this tutorial we'll build one with AssemblyAI's Voice Agent API: one WebSocket that handles speech understanding, LLM reasoning, voice generation, turn detection, and tool calling in a single connection. Total time to a working prototype: about an afternoon.

Why most support voice agents fail

Before we build, it's worth knowing where these things break. The pattern is almost always the same:

Customer says "my order ID is A-B-3-7-9-2"
STT mishears it as "a b 37 92" or "ABE 379 to"
The LLM calls get_order_status("ab3792") or worse, asks the customer to repeat
Customer hangs up

The agent didn't fail because the LLM was wrong. It failed because the speech-to-text layer couldn't capture the entity correctly. This is why entity accuracy on alphanumerics, emails, and phone numbers matters more than overall WER for support agents, and why we're building on Universal-3 Pro Streaming, which has a 16.7% mixed-entity error rate vs. 23-25% for competing models.

The second-most-common failure: dead air during tool calls. The customer asks a question, the agent calls a backend, and there's a 2-3 second silence while the lookup runs. The Voice Agent API solves this by speaking a natural transition phrase ("let me check that for you") while the tool runs—no dead air, no awkward pauses.

What you'll build

A Python voice support agent that handles three real workflows:

Order status lookup—customer says "where's my order?" then the agent asks for the ID, looks it up, and reads back status, ETA, and tracking number
Customer info verification—customer provides email or phone number, the agent looks up the account, and confirms identity before proceeding
Human escalation—customer asks for a person, or the agent gets stuck, and a graceful transfer happens with conversation context preserved

Stack:

AssemblyAI Voice Agent API (one WebSocket: STT + LLM + TTS)
Python 3.9+
A backend with order data—we'll mock it; replace with your real CRM or order management system

Setup

pip install websockets pyaudio python-dotenv

Create .env:

ASSEMBLYAI_API_KEY=your_key_here

The Voice Agent API uses a single endpoint: wss://agents.assemblyai.com/v1/ws. One key, one connection, no separate STT or TTS providers to wire in.

Step 1: Define the support tools

Tools are the agent's interface to your backend. The Voice Agent API uses standard JSON Schema, so anything you can describe with a schema, the agent can call.

For a support agent, you typically want four tools:

import json

TOOLS = [
    {
        "type": "function",
        "name": "get_order_status",
        "description": "Look up an order's current status, shipping ETA, and tracking number by order ID.",
        "parameters": {
            "type": "object",
            "properties": {
                "order_id": {
                    "type": "string",
                    "description": "The customer's order ID, e.g. ORD-12345 or 78231-ABC.",
                },
            },
            "required": ["order_id"],
        },
    },
    {
        "type": "function",
        "name": "lookup_account_by_email",
        "description": "Find a customer account using their email address.",
        "parameters": {
            "type": "object",
            "properties": {
                "email": {"type": "string", "description": "The customer's email address."},
            },
            "required": ["email"],
        },
    },
    {
        "type": "function",
        "name": "list_recent_orders",
        "description": "List the customer's most recent orders. Use after the account is verified.",
        "parameters": {
            "type": "object",
            "properties": {
                "account_id": {"type": "string"},
                "limit": {"type": "integer", "description": "Max number of orders to return.", "default": 5},
            },
            "required": ["account_id"],
        },
    },
    {
        "type": "function",
        "name": "transfer_to_human",
        "description": "Transfer the call to a human agent. Use when the customer asks, when you can't help, or when the issue is sensitive.",
        "parameters": {
            "type": "object",
            "properties": {
                "reason": {"type": "string", "description": "Short reason for the transfer."},
                "summary": {"type": "string", "description": "Brief summary of the conversation so far."},
            },
            "required": ["reason", "summary"],
        },
    },
]

Now implement the actual functions. Replace these stubs with calls to your real backend:

ORDERS_DB = {
    "ORD-12345": {"status": "shipped", "eta": "2026-05-09", "tracking": "1Z999AA10123456784"},
    "ORD-67890": {"status": "processing", "eta": "2026-05-12", "tracking": None},
}

ACCOUNTS_DB = {
    "jane@example.com": {"account_id": "ACC-001", "name": "Jane Doe"},
}

ACCOUNT_ORDERS = {
    "ACC-001": [
        {"order_id": "ORD-12345", "date": "2026-05-01", "total": "$84.99"},
        {"order_id": "ORD-12100", "date": "2026-04-22", "total": "$42.00"},
    ],
}

def run_tool(name: str, args: dict) -> dict:
    if name == "get_order_status":
        order = ORDERS_DB.get(args["order_id"].upper())
        if not order:
            return {"error": "order_not_found", "order_id": args["order_id"]}
        return order

    if name == "lookup_account_by_email":
        account = ACCOUNTS_DB.get(args["email"].lower())
        if not account:
            return {"error": "account_not_found"}
        return account

    if name == "list_recent_orders":
        orders = ACCOUNT_ORDERS.get(args["account_id"], [])
        return {"orders": orders[: args.get("limit", 5)]}

    if name == "transfer_to_human":
        return {"transferred": True, "queue": "support-tier-2"}

    return {"error": f"unknown_tool: {name}"}

The error-shape pattern matters. When get_order_status can't find an order, it returns a structured error rather than throwing—that gives the LLM the context it needs to apologize and ask the customer to verify the ID, instead of crashing the conversation.
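
Shaping successful results for speech can help in the same way. The mock data stores ETAs as ISO dates such as 2026-05-09, which a voice agent should not read out verbatim. One option, sketched below on the assumption that you control what the tool returns, is to add a spoken-form date next to the raw one. `format_spoken_date` and `ordinal` are hypothetical helpers, not part of the Voice Agent API.

```python
from datetime import date

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]

def ordinal(n: int) -> str:
    """Return the number with its English ordinal suffix, e.g. 2 -> '2nd'."""
    if 11 <= n % 100 <= 13:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"

def format_spoken_date(iso_date: str) -> str:
    """Turn an ISO date string into a phrase like 'Saturday, May 9th'."""
    d = date.fromisoformat(iso_date)
    return f"{DAYS[d.weekday()]}, {MONTHS[d.month - 1]} {ordinal(d.day)}"

# Example: enrich a tool result before returning it to the agent.
order = {"status": "shipped", "eta": "2026-05-09"}
order_for_agent = {**order, "eta_spoken": format_spoken_date(order["eta"])}
```

The model can still work from the raw `eta` field. The extra `eta_spoken` field just removes one place where it could misread or miscalculate the weekday.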

Step 2: Write the system prompt

The system prompt is where you encode the agent's behavior. For support, you want a few things every time:

Identity and tone
When to ask for verification before sharing details
When to use which tool
When to transfer to a human
Specific phrasing for transition moments (the "let me check that" line)

SYSTEM_PROMPT = """
You are Avery, a customer support agent for Acme Corp. Your goal is to help customers
quickly and accurately. You have access to tools that let you look up orders and accounts.

Behavior rules:
- Greet warmly and ask how you can help.
- For order questions, ask for the order ID first if the customer hasn't given it.
- If a customer gives an email address, use lookup_account_by_email to verify.
- Read order status, ETA, and tracking number clearly. Don't read raw timestamps —
  say dates naturally (e.g., "Friday, May 9th").
- When you need to call a tool, say a brief transition like "Let me check on that"
  or "One moment while I pull that up."
- If the customer asks for a human, sounds frustrated, or has a complex issue
  (refund disputes, damaged product, billing errors), use transfer_to_human and
  include a short summary.
- Never make up an order ID, status, or tracking number. If a tool returns an error,
  apologize, ask the customer to verify the ID, and try again.
- Keep replies short and conversational. This is a phone call, not an email.
"""
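
The prompt asks the model to say dates naturally, which leaves the weekday arithmetic to the LLM. One option is to precompute the spoken form in your tool result so the model only has to read it. This is a sketch, assuming a hypothetical spoken_date helper and an extra field (for example eta_spoken) that you would add to the order payload yourself. Weekday and month names come from the process locale, which is English by default.

```python
from datetime import date


def spoken_date(iso_date: str) -> str:
    """Turn '2026-05-09' into a phrase like 'Saturday, May 9th'."""
    d = date.fromisoformat(iso_date)
    # 11th, 12th, 13th are irregular; otherwise the suffix follows the last digit.
    if 11 <= d.day % 100 <= 13:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(d.day % 10, "th")
    return f"{d.strftime('%A, %B')} {d.day}{suffix}"


print(spoken_date("2026-05-09"))  # Saturday, May 9th
```

Returning both the raw eta and the spoken version keeps the tool result useful for logging while removing one thing the model can get wrong.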

Step 3: Connect to the Voice Agent API

Now the WebSocket connection. The pattern is:

Open wss://agents.assemblyai.com/v1/ws with your API key
Send session.update with the system prompt, tools, voice, and greeting
Wait for session.ready, then start streaming microphone audio
Handle incoming events—tool.call, reply.audio, transcript.user, reply.done

import asyncio
import base64
import json
import os

import pyaudio
import websockets

API_KEY = os.getenv("ASSEMBLYAI_API_KEY")
WS_URL = "wss://agents.assemblyai.com/v1/ws"
SAMPLE_RATE = 24000

async def run_agent():
    async with websockets.connect(
        WS_URL,
        additional_headers={"Authorization": f"Bearer {API_KEY}"},
    ) as ws:
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "system_prompt": SYSTEM_PROMPT,
                "greeting": "Hi, this is Avery from Acme support. How can I help?",
                "output": {"voice": "ivy"},
                "tools": TOOLS,
            },
        }))

        pa = pyaudio.PyAudio()
        mic = pa.open(format=pyaudio.paInt16, channels=1, rate=SAMPLE_RATE,
                      input=True, frames_per_buffer=1024)
        speaker = pa.open(format=pyaudio.paInt16, channels=1, rate=SAMPLE_RATE,
                          output=True)

        ready = asyncio.Event()
        pending_tools = []

        async def send_audio():
            await ready.wait()
            import base64
            while True:
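                # mic.read blocks until a chunk is captured; run it in a worker
                # thread (Python 3.9+) so incoming events keep being handled.
                audio = await asyncio.to_thread(mic.read, 1024, exception_on_overflow=False)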
                await ws.send(json.dumps({
                    "type": "input.audio",
                    "audio": base64.b64encode(audio).decode(),
                }))
                await asyncio.sleep(0)

        async def handle_messages():
            async for raw in ws:
                event = json.loads(raw)
                t = event.get("type")

                if t == "session.ready":
                    ready.set()
                    print("Agent ready. Start speaking.")

                elif t == "transcript.user":
                    print(f"\nUser: {event['text']}")

                elif t == "transcript.agent":
                    print(f"Agent: {event['text']}")

                elif t == "reply.audio":
                    import base64
                    speaker.write(base64.b64decode(event["data"]))

                elif t == "tool.call":
                    name = event["name"]
                    args = event.get("arguments", {})
                    print(f"  [tool] {name}({args})")
                    result = run_tool(name, args)
                    pending_tools.append({"call_id": event["call_id"], "result": result})

                elif t == "reply.done":
                    if event.get("status") == "interrupted":
                        pending_tools.clear()
                    elif pending_tools:
                        for tool in pending_tools:
                            await ws.send(json.dumps({
                                "type": "tool.result",
                                "call_id": tool["call_id"],
                                "result": json.dumps(tool["result"]),
                            }))
                        pending_tools.clear()

        await asyncio.gather(send_audio(), handle_messages())

if __name__ == "__main__":
    asyncio.run(run_agent())

A few details that the docs flag and you'd otherwise debug for an hour:

Don't send tool.result immediately when you receive tool.call. Accumulate results and send them inside the reply.done handler. Sending too early causes timing issues.
Discard pending tool results on interruption. If the user speaks while the agent is generating a transition phrase, you'll get reply.done with status: "interrupted"—clear the buffer and wait for the next turn.
Voice names are case-sensitive. Use lowercase: ivy, james, mia, winter, bella. An invalid voice returns session.error.
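
The first two rules above (buffer results, then flush or discard on reply.done) can be isolated into a small helper that is easy to unit test without a live connection. This is a sketch: ToolResultBuffer is a hypothetical name, and the message shape simply mirrors the tool.result payload used in the script above.

```python
import json


class ToolResultBuffer:
    """Holds tool results until the current reply finishes."""

    def __init__(self) -> None:
        self._pending = []

    def add(self, call_id: str, result: dict) -> None:
        # Called from the tool.call handler; nothing is sent yet.
        self._pending.append({"call_id": call_id, "result": result})

    def on_reply_done(self, event: dict) -> list:
        """Return the serialized tool.result messages to send for this event."""
        pending, self._pending = self._pending, []
        if event.get("status") == "interrupted":
            return []  # the user spoke over the agent: drop stale results
        return [
            json.dumps({
                "type": "tool.result",
                "call_id": item["call_id"],
                "result": json.dumps(item["result"]),
            })
            for item in pending
        ]
```

In the handler, the reply.done branch then reduces to sending each message returned by on_reply_done(event).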

Step 4: Test the three workflows

Run the script and walk through each support scenario. You should hear:

Workflow 1—Order lookup:

You: "Hi, I'm trying to check on order O-R-D 1-2-3-4-5"
Agent: "Sure, let me check on that... I see order ORD-12345. It shipped and is
        on its way, so you should have it by Saturday, May 9th. The tracking
        number is 1Z999AA10123456784."

Workflow 2—Email-based account lookup:

You: "I forgot my order ID. Can you look me up by email?"
Agent: "Of course. What's the email on the account?"
You: "It's jane at example dot com."
Agent: "One moment... Got it, you're Jane Doe. I see two recent orders:
        ORD-12345 from May 1st for $84.99, and ORD-12100 from April 22nd
        for $42.00. Which one are you asking about?"

Workflow 3—Human transfer:

You: "I just want to talk to a person."
Agent: "I understand. Let me get you over to a teammate now."
[tool.call: transfer_to_human({"reason": "user requested human", "summary": "..."})]

Speak the order ID with hesitation, mumbles, accents, and natural disfluencies—that's where Universal-3 Pro Streaming earns its keep. The agent should still extract the ID correctly because it's tuned for the alphanumeric tokens that voice agents act on.
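
Even with accurate transcription, the ID can reach your tool in several surface forms ("O-R-D 1-2-3-4-5", "ord 12345"). A small normalizer in front of the lookup makes the mock more forgiving. This sketch assumes IDs follow the ORD-digits shape used in the mock data; normalize_order_id is a hypothetical helper, not something the API provides.

```python
import re
from typing import Optional


def normalize_order_id(raw: str) -> Optional[str]:
    """Map spoken-style input like 'O-R-D 1-2-3-4-5' to 'ORD-12345'.

    Returns None when the text doesn't look like an order ID.
    """
    # Drop spaces, hyphens, and punctuation, then compare in upper case.
    compact = re.sub(r"[^A-Za-z0-9]", "", raw).upper()
    match = re.fullmatch(r"ORD(\d+)", compact)
    return f"ORD-{match.group(1)}" if match else None
```

When the helper returns None, returning the same order_not_found error shape lets the agent ask the customer to repeat the ID.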

Step 5: Take it to the phone

This works on your local machine through your microphone, but real customer support runs on phones. Twilio Media Streams is the standard bridge: your server accepts the inbound call from Twilio and opens a parallel connection to the Voice Agent API, forwarding audio in both directions.

The Voice Agent API supports audio/pcmu (G.711 u-law at 8 kHz) natively, which matches Twilio's codec exactly. No transcoding, no resampling. The Twilio integration guide walks through the full bridge in about 100 lines of TypeScript.
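
The official guide uses TypeScript, but the core of the bridge is just message translation, sketched here in Python to match the rest of this tutorial. It assumes the session has already been configured for audio/pcmu in both directions (that configuration field isn't shown in this article, so it is left out), and that Twilio frames follow the Media Streams JSON format with base64 payloads. The function names are hypothetical.

```python
import json
from typing import Optional


def twilio_to_agent(twilio_message: str) -> Optional[str]:
    """Convert a Twilio Media Streams 'media' frame to an input.audio message.

    Twilio's payload is already base64-encoded mu-law audio, so it is passed
    through unchanged. Non-media events (connected, start, stop) return None.
    """
    frame = json.loads(twilio_message)
    if frame.get("event") != "media":
        return None
    return json.dumps({"type": "input.audio", "audio": frame["media"]["payload"]})


def agent_to_twilio(agent_event: dict, stream_sid: str) -> Optional[str]:
    """Convert a reply.audio event into a Twilio outbound 'media' frame."""
    if agent_event.get("type") != "reply.audio":
        return None
    return json.dumps({
        "event": "media",
        "streamSid": stream_sid,
        "media": {"payload": agent_event["data"]},
    })
```

Because both sides speak base64 u-law, neither function decodes the audio; the server only relabels the envelope.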

What to harden before production

Three things you'll want to nail down before pointing this at real customers:

Replace the in-memory mocks with calls to your actual CRM or order management system. Add timeouts and error handling so a slow backend doesn't kill the conversation.
Log everything. Save user transcripts, tool calls, results, and the agent's responses tied to a session ID. Conversation logs are your debugging tool when something goes wrong on call #4,712. Conversation intelligence features like speaker diarization can help you analyze these logs at scale.
Tune turn detection for your acoustic environment. The defaults work for most use cases. For phone audio with background noise, you may want to raise min_end_of_turn_silence_ms slightly so the agent doesn't cut off thoughtful pauses.
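
For the first item, one way to keep a slow backend from stalling the call is to run the lookup in a worker thread with a deadline and fall back to a structured error. This is a sketch for Python 3.9+; call_with_timeout, the backend_timeout error code, and the slow_lookup stand-in are all illustrative.

```python
import asyncio
import time


async def call_with_timeout(func, *args, timeout_s: float = 3.0) -> dict:
    """Run a blocking backend call in a worker thread with a deadline.

    On timeout, return a structured error instead of raising, so the agent
    can apologize and offer to retry or transfer.
    """
    try:
        return await asyncio.wait_for(asyncio.to_thread(func, *args), timeout=timeout_s)
    except asyncio.TimeoutError:
        return {"error": "backend_timeout", "timeout_s": timeout_s}


def slow_lookup(order_id: str) -> dict:
    # Stand-in for a slow CRM call; for illustration only.
    time.sleep(0.2)
    return {"order_id": order_id, "status": "shipped"}
```

Inside the tool.call branch this would look like result = await call_with_timeout(run_tool, name, args). Note that the worker thread itself is not cancelled on timeout; only the wait is abandoned.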

Where to go from there

Frequently asked questions

How do I build an AI voice agent for customer support that can look up orders?

Build it on AssemblyAI's Voice Agent API, register a get_order_status function as a tool with JSON Schema, and connect to the WebSocket at wss://agents.assemblyai.com/v1/ws. The agent transcribes the customer's speech, decides when to call your function, executes it through your backend, and speaks the result back—all on a single connection. Most developers ship a working agent in an afternoon because there's no SDK to learn and no separate STT, LLM, or TTS providers to wire together.

Why does speech-to-text accuracy matter so much for support voice agents?

Support agents constantly need to capture alphanumeric tokens—order IDs, account numbers, email addresses, phone numbers—and a single transcription error breaks the workflow. If the STT layer mishears "ORD-12345" as "or 12 three 45," your get_order_status function gets a garbled ID and returns nothing. AssemblyAI's Voice Agent API is built on Universal-3 Pro Streaming, which has a "16.7% mixed-entity error rate vs. 23–25% for competing models"—that's the difference between tool calls that succeed and tool calls that silently fail.

How does tool calling work with the AssemblyAI Voice Agent API?

You register tools by passing an array of function definitions in session.tools on a session.update event. When the agent decides to call a tool, it emits a tool.call event with the function name and arguments. You execute the function and accumulate results, then send tool.result events inside your reply.done handler—not immediately on tool.call. While the tool runs, the agent speaks a brief transition phrase like "let me check that for you" so the conversation never goes silent.

Can I connect AssemblyAI's Voice Agent API to phone calls with Twilio?

Yes—the Voice Agent API supports audio/pcmu (G.711 u-law at 8 kHz) natively, which matches Twilio's codec exactly with no transcoding needed. You set up a server that accepts the inbound Twilio Media Streams call, opens a parallel WebSocket to the Voice Agent API, and forwards audio in both directions. The official Twilio integration guide walks through inbound and outbound calling in about 100 lines of TypeScript.

What's the best way to handle escalation to a human in a customer support voice agent?

Register a transfer_to_human tool that requires a reason and a short summary, and have the system prompt tell the agent to call it when the customer asks for a person, sounds frustrated, or raises a complex issue such as a refund dispute, damaged product, or billing error. The summary travels with the transfer so the human agent has context. In the tutorial's mock, the tool just returns the queue the call was routed to; in production, that function is where you trigger the actual handoff in your contact-center system.

How much does it cost to run a customer support voice agent on AssemblyAI?

The Voice Agent API is $4.50/hr flat—covering speech understanding, LLM reasoning, voice generation, turn detection, and tool calling all in one bill. There are no per-token surcharges, no concurrency caps, and no separate invoices for STT, LLM, and TTS providers. Pricing is billed by the minute on actual conversation duration, and a free tier is available for testing.
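
To turn the flat rate into a budget number, multiply conversation minutes by the per-minute share of $4.50. This is a rough estimator only: how partial minutes are rounded isn't stated here, so the sketch takes whole minutes as input, and estimate_cost is a hypothetical helper.

```python
from decimal import Decimal

HOURLY_RATE = Decimal("4.50")  # flat rate quoted above, USD per hour


def estimate_cost(total_minutes: int) -> Decimal:
    """Estimate spend for a number of conversation minutes at the flat rate."""
    return (HOURLY_RATE * total_minutes / 60).quantize(Decimal("0.01"))


# 1,000 calls averaging 6 minutes each is 6,000 minutes, or 100 hours.
print(estimate_cost(1000 * 6))  # 450.00
```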

Do voice agents built with AssemblyAI work with healthcare workflows?

AssemblyAI offers a BAA for HIPAA workloads and is SOC 2 Type 2, ISO 27001:2022, and PCI DSS v4.0 certified. For clinical use cases (medical front-office voice agents, healthcare contact centers), enable Medical Mode with domain="medical-v1" to improve transcription accuracy on medication names, procedures, conditions, and dosages. Do not point the agent at real PHI without a signed BAA in place.

Building a voice-powered e-commerce shopping assistant

Mart Schweiger — Tue, 12 May 2026 17:57:59 +0000

Voice shopping has crossed an inflection point. Search by typing is being replaced by search by saying—"show me waterproof hiking boots under $150 in size 10," "add the second one to my cart," "what's the return policy on these." For e-commerce teams, that's both an opportunity and a problem: the existing product search and checkout flow was designed for clicks and keystrokes, not natural language.

This tutorial walks through building a voice-powered shopping assistant that customers can actually talk to. By the end, you'll have a Python voice agent that handles four real shopping workflows—product search, add-to-cart, order tracking, and checkout assistance—all on top of a single WebSocket connection using AssemblyAI's Voice Agent API.

The same pattern works whether you're embedding voice into a mobile shopping app, an in-store kiosk, a smart speaker integration, or a phone-based ordering line.

Why voice e-commerce is different from voice support

If you've built a customer support Voice AI agent, the shopping use case looks similar—but the constraints are sharper:

Entity accuracy is everything. Sizes ("size ten and a half"), SKUs ("SKU 9-9-2-1-A"), prices ("under one fifty"), quantities ("get me three of those"). Mishear any of those and you've added the wrong item, the wrong size, or the wrong quantity to a cart someone is about to check out with.
Conversations are exploratory. Support calls have a clear job-to-be-done; shopping conversations meander. The customer browses, narrows, compares, asks about returns, gets distracted, comes back. The agent has to track all of that without losing context.
Stakes shift mid-conversation. "Tell me about this jacket" is low-stakes. "Charge my saved card for $284.50" is not. The agent needs to know when to ask for confirmation and when to just answer.
Accent and code-switching show up. Shoppers globally pronounce brand names, colors, and product categories differently. The agent needs to handle that gracefully.

The Voice Agent API addresses these directly: built on Universal-3 Pro Streaming for high entity accuracy (16.7% mixed-entity error rate vs. 23–25% for competitors), with mid-conversation system prompt updates so you can tighten or loosen the agent's behavior as the customer moves from browsing to buying.

What you'll build

A Python voice shopping assistant that handles four workflows:

Product search—"show me wireless headphones under $200" → searches your catalog → reads back top results
Cart management—"add the second one in black" → adds to cart → confirms
Order tracking—"where's my order from last week?" → looks up customer orders → reads back status
Checkout assistance—guides the user through review and confirmation, never executing payment without explicit verbal "yes"

Stack:

AssemblyAI Voice Agent API (one WebSocket: STT + LLM + TTS)
Python 3.9+
A product catalog and order DB—we'll mock both; replace with your real Shopify, Commerce Cloud, or BigCommerce backend

Setup

pip install websockets pyaudio python-dotenv

# .env
ASSEMBLYAI_API_KEY=your_key_here

Endpoint: wss://agents.assemblyai.com/v1/ws. One key, one connection—the same key works for all AssemblyAI products.

Step 1: Define the shopping tools

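The toolset shapes what your agent can do. Start with the four core shopping verbs (search, add to cart, view cart, and checkout) and grow from there; the list below adds product details, cart removal, and order tracking. The Voice Agent API supports tool calling natively, so each tool is defined as a JSON function schema.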

import json

TOOLS = [
    {
        "type": "function",
        "name": "search_products",
        "description": "Search the product catalog. Use whenever the customer is browsing or asking what's available.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Free-text query, e.g. 'waterproof hiking boots'"},
                "max_price": {"type": "number"},
                "size": {"type": "string"},
                "color": {"type": "string"},
                "limit": {"type": "integer", "default": 5},
            },
            "required": ["query"],
        },
    },
    {
        "type": "function",
        "name": "get_product_details",
        "description": "Get full details on a specific product, including return policy and stock.",
        "parameters": {
            "type": "object",
            "properties": {"product_id": {"type": "string"}},
            "required": ["product_id"],
        },
    },
    {
        "type": "function",
        "name": "add_to_cart",
        "description": "Add a product to the customer's cart. Confirm size, color, and quantity before calling.",
        "parameters": {
            "type": "object",
            "properties": {
                "product_id": {"type": "string"},
                "variant_id": {"type": "string", "description": "Specific size/color variant"},
                "quantity": {"type": "integer", "default": 1},
            },
            "required": ["product_id", "variant_id"],
        },
    },
    {
        "type": "function",
        "name": "view_cart",
        "description": "Read back the customer's current cart with subtotal.",
        "parameters": {"type": "object", "properties": {}, "required": []},
    },
    {
        "type": "function",
        "name": "remove_from_cart",
        "description": "Remove an item from the cart by line item ID.",
        "parameters": {
            "type": "object",
            "properties": {"line_item_id": {"type": "string"}},
            "required": ["line_item_id"],
        },
    },
    {
        "type": "function",
        "name": "checkout",
        "description": "Submit the order using the customer's saved payment and shipping. ONLY call after explicit verbal 'yes' to a clear confirmation prompt.",
        "parameters": {
            "type": "object",
            "properties": {
                "confirmation_phrase": {
                    "type": "string",
                    "description": "The exact phrase the customer said to confirm, e.g. 'yes, place the order'.",
                }
            },
            "required": ["confirmation_phrase"],
        },
    },
    {
        "type": "function",
        "name": "track_order",
        "description": "Look up the status of a customer order.",
        "parameters": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
    },
]

The confirmation_phrase parameter on checkout is the trick that prevents accidental orders. The system prompt tells the agent it can only call checkout if the customer literally says yes—and the parameter forces the agent to record what was said. Your backend can additionally enforce that only a list of accepted phrases ("yes", "place the order", "go ahead") triggers the actual payment.
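
A plain substring check on the phrase is easy to fool: "no, don't confirm" contains "confirm". Here is a sketch of a stricter backend validator, assuming an illustrative accepted list and a small negation list that you would tune for your own customers. is_clear_confirmation is a hypothetical helper.

```python
import re

ACCEPTED_PHRASES = ("yes", "place the order", "place it", "go ahead", "confirm", "buy it")
NEGATIONS = {"no", "not", "don't", "cancel", "wait", "stop"}


def is_clear_confirmation(phrase: str) -> bool:
    """Accept only whole-word matches, and reject anything containing a negation."""
    text = phrase.lower().replace("\u2019", "'")  # normalize curly apostrophes
    words = [w.strip("'") for w in re.findall(r"[a-z']+", text)]
    if any(word in NEGATIONS for word in words):
        return False
    # Pad with spaces so "yes" matches as a word but not inside "eyes".
    padded = " " + " ".join(words) + " "
    return any(f" {accepted} " in padded for accepted in ACCEPTED_PHRASES)
```

The design choice is to fail closed: anything the validator doesn't recognize sends the agent back to ask again, which costs a few seconds rather than an unwanted charge.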

Step 2: Implement the backend (mocked)

Replace these stubs with calls to your real catalog and order systems.

CATALOG = [
    {
        "product_id": "SKU-2201",
        "name": "Trail Runner 3 hiking boots",
        "price": 139.00,
        "category": "footwear",
        "tags": ["waterproof", "hiking"],
        "variants": [
            {"variant_id": "SKU-2201-BK-10", "size": "10", "color": "black", "stock": 4},
            {"variant_id": "SKU-2201-BK-11", "size": "11", "color": "black", "stock": 0},
            {"variant_id": "SKU-2201-BR-10", "size": "10", "color": "brown", "stock": 7},
        ],
        "return_policy": "30-day free returns",
    },
    {
        "product_id": "SKU-3104",
        "name": "Summit Pro waterproof boots",
        "price": 199.00,
        "category": "footwear",
        "tags": ["waterproof", "hiking", "premium"],
        "variants": [
            {"variant_id": "SKU-3104-BK-10", "size": "10", "color": "black", "stock": 2},
        ],
        "return_policy": "30-day free returns",
    },
]

CART = []  # In production, scope this per session
ORDERS = {
    "ORD-9981": {"status": "shipped", "eta": "2026-05-09", "tracking": "1Z999AA10123456784"},
}

def run_tool(name: str, args: dict) -> dict:
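    if name == "search_products":
        # Match when every query word appears in the name or tags; a whole-phrase
        # substring test would miss queries like "waterproof hiking boots".
        terms = args["query"].lower().split()
        max_price = args.get("max_price")
        results = []
        for p in CATALOG:
            haystack = (p["name"] + " " + " ".join(p["tags"])).lower()
            if all(term in haystack for term in terms) and (
                max_price is None or p["price"] <= max_price
            ):
                results.append(p)
        return {"results": results[: args.get("limit", 5)]}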

    if name == "get_product_details":
        for p in CATALOG:
            if p["product_id"] == args["product_id"]:
                return p
        return {"error": "product_not_found"}

    if name == "add_to_cart":
        for p in CATALOG:
            for v in p["variants"]:
                if v["variant_id"] == args["variant_id"]:
                    if v["stock"] < args.get("quantity", 1):
                        return {"error": "out_of_stock", "available": v["stock"]}
                    line = {
                        "line_item_id": f"LI-{len(CART) + 1}",
                        "product_id": p["product_id"],
                        "variant_id": v["variant_id"],
                        "name": p["name"],
                        "size": v["size"],
                        "color": v["color"],
                        "quantity": args.get("quantity", 1),
                        "price": p["price"],
                    }
                    CART.append(line)
                    return {"added": line, "cart_size": len(CART)}
        return {"error": "variant_not_found"}

    if name == "view_cart":
        subtotal = sum(item["price"] * item["quantity"] for item in CART)
        return {"items": CART, "subtotal": round(subtotal, 2)}

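    if name == "remove_from_cart":
        # Mutate the list in place: a `global CART` statement here would be a
        # SyntaxError because CART is already used earlier in this function.
        CART[:] = [item for item in CART if item["line_item_id"] != args["line_item_id"]]
        return {"removed": args["line_item_id"], "cart_size": len(CART)}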

    if name == "checkout":
        # "place it" is listed as a clear yes in the system prompt, so accept it here too.
        accepted = ["yes", "place the order", "place it", "go ahead", "confirm", "buy it"]
        if not any(phrase in args["confirmation_phrase"].lower() for phrase in accepted):
            return {"error": "confirmation_unclear", "phrase_received": args["confirmation_phrase"]}
        if not CART:
            return {"error": "cart_empty"}
        total = round(sum(i["price"] * i["quantity"] for i in CART), 2)
        return {"order_id": "ORD-9982", "total": total}

    if name == "track_order":
        order = ORDERS.get(args["order_id"].upper())
        return order or {"error": "order_not_found"}

    return {"error": f"unknown_tool: {name}"}
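
The mock keeps a single module-level cart and derives line item IDs from the cart length, so IDs can repeat after a removal and every caller shares one cart. Here is a sketch of a per-session store, assuming you have some session identifier of your own to key on. SessionCarts and session_id are hypothetical names.

```python
import itertools
from collections import defaultdict


class SessionCarts:
    """Keeps one cart per session and issues line item IDs that never repeat."""

    def __init__(self) -> None:
        self._carts = defaultdict(list)
        # One monotonically increasing counter per session.
        self._counters = defaultdict(lambda: itertools.count(1))

    def add(self, session_id: str, item: dict) -> dict:
        line = {"line_item_id": f"LI-{next(self._counters[session_id])}", **item}
        self._carts[session_id].append(line)
        return line

    def remove(self, session_id: str, line_item_id: str) -> bool:
        """Remove a line item; return False if it wasn't in the cart."""
        cart = self._carts[session_id]
        kept = [line for line in cart if line["line_item_id"] != line_item_id]
        removed = len(kept) != len(cart)
        cart[:] = kept
        return removed

    def items(self, session_id: str) -> list:
        return list(self._carts[session_id])
```

Returning a boolean from remove also lets the tool report line_item_not_found instead of claiming success for an ID that was never in the cart.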

Step 3: Write a shopping-aware system prompt

Shopping system prompts should encode three patterns: how to describe products on a phone (terse, scannable), how to gather variant info (size, color, quantity) before adding to cart, and how to confirm checkout.

SYSTEM_PROMPT = """
You are Riley, a friendly voice shopping assistant for Trailgear, an outdoor retailer.

Behavior rules:

PRODUCT SEARCH
- When customers ask about products, call search_products with a clean query.
- Read back top 2-3 results conversationally. Don't list more than 3 unless asked.
- Format prices naturally: "one hundred thirty-nine dollars" not "139.00".
- Mention only the most relevant detail per product (price + key feature). Save the rest for follow-ups.

VARIANT SELECTION
- Before adding to cart, confirm size, color, and quantity. Never assume.
- If a variant is out of stock, say so immediately and offer the closest alternative.
- Read sizes naturally: "size ten" not "size 10".

CART MANAGEMENT
- After adding, briefly confirm what was added and the new cart size.
- If the customer asks "what's in my cart," call view_cart and read it back with subtotal.

CHECKOUT
- Before calling checkout, summarize the cart and explicitly ask: "Should I place the order?"
- ONLY call checkout if the customer responds with a clear yes (e.g., "yes," "place it," "go ahead," "confirm").
- If the response is ambiguous, ask again. Do not interpret "sure I think so" as confirmation.
- After checkout succeeds, read back the order ID slowly so the customer can write it down.

ORDER TRACKING
- For order status questions, ask for the order ID.
- When reading a tracking number, slow down and group digits in pairs.

GENERAL
- Keep replies short and conversational. This is voice, not chat.
- When you call a tool, say a brief transition like "Let me look that up."
- Never invent products, prices, or stock — if the catalog doesn't have it, say so.
"""

The "explicit yes" pattern is what makes this safe to point at production payments. The agent's prompt forbids it from calling checkout on ambiguous responses, and the backend independently validates the confirmation phrase. Belt and suspenders.

Step 4: Wire the WebSocket

This is essentially the same WebSocket loop as a support agent—the difference is in the tools and prompt, not the protocol. If you've already built a voice agent with function calling, this will look familiar.

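import asyncio
import base64
import os

import pyaudio
import websockets
from dotenv import load_dotenv

load_dotenv()  # read ASSEMBLYAI_API_KEY from the .env file created in Setup
API_KEY = os.getenv("ASSEMBLYAI_API_KEY")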
WS_URL = "wss://agents.assemblyai.com/v1/ws"
SAMPLE_RATE = 24000

async def run_assistant():
    async with websockets.connect(
        WS_URL, additional_headers={"Authorization": f"Bearer {API_KEY}"},
    ) as ws:
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "system_prompt": SYSTEM_PROMPT,
                "greeting": "Hi, this is Riley from Trailgear. What can I help you find today?",
                "output": {"voice": "mia"},
                "tools": TOOLS,
            },
        }))

        pa = pyaudio.PyAudio()
        mic = pa.open(format=pyaudio.paInt16, channels=1, rate=SAMPLE_RATE,
                      input=True, frames_per_buffer=1024)
        speaker = pa.open(format=pyaudio.paInt16, channels=1, rate=SAMPLE_RATE,
                          output=True)

        ready = asyncio.Event()
        pending_tools = []

        async def send_audio():
            await ready.wait()
            while True:
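                # mic.read blocks until a chunk is captured; run it in a worker
                # thread (Python 3.9+) so incoming events keep being handled.
                data = await asyncio.to_thread(mic.read, 1024, exception_on_overflow=False)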
                await ws.send(json.dumps({
                    "type": "input.audio",
                    "audio": base64.b64encode(data).decode(),
                }))
                await asyncio.sleep(0)

        async def handle_messages():
            async for raw in ws:
                event = json.loads(raw)
                t = event.get("type")

                if t == "session.ready":
                    ready.set()
                    print("Riley is ready.")
                elif t == "transcript.user":
                    print(f"\nCustomer: {event['text']}")
                elif t == "transcript.agent":
                    print(f"Riley: {event['text']}")
                elif t == "reply.audio":
                    speaker.write(base64.b64decode(event["data"]))
                elif t == "tool.call":
                    print(f"  [tool] {event['name']}({event.get('arguments', {})})")
                    result = run_tool(event["name"], event.get("arguments", {}))
                    pending_tools.append({"call_id": event["call_id"], "result": result})
                elif t == "reply.done":
                    if event.get("status") == "interrupted":
                        pending_tools.clear()
                    elif pending_tools:
                        for tool in pending_tools:
                            await ws.send(json.dumps({
                                "type": "tool.result",
                                "call_id": tool["call_id"],
                                "result": json.dumps(tool["result"]),
                            }))
                        pending_tools.clear()

        await asyncio.gather(send_audio(), handle_messages())

if __name__ == "__main__":
    asyncio.run(run_assistant())

Step 5: Test the four workflows

Run the script and walk through each shopping flow:

Search:

You: "I'm looking for waterproof hiking boots under one fifty."
Riley: "Let me check our catalog... I've got the Trail Runner 3 at $139,
        which is waterproof and comes in black or brown. That's the only
        match under $150. Want to hear more about it?"

Add to cart:

You: "The Trail Runner threes — size ten, in black."
Riley: "Got it, the Trail Runner 3 in black, size ten. How many?"
You: "Just one."
Riley: "One moment... Added. That's one item in your cart."

Checkout:

You: "Okay let's check out."
Riley: "Sure. You have one Trail Runner 3 in black, size ten, for $139.
        Should I place the order?"
You: "Yes, place it."
Riley: "Order placed. Your order ID is O-R-D 9-9-8-2."

Track order:

You: "Where's my order from last week — ORD 9-9-8-1?"
Riley: "Let me check... ORD-9981 has shipped and should arrive Saturday,
        May 9th. Tracking is 1Z 99 9A A1 01 23 45 67 84."

The tracking number readback is intentional—grouping digits in pairs is a common voice pattern that makes long alphanumerics easier to write down.
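
If you'd rather not rely on the model to do the grouping, you can return a pre-grouped string alongside the raw tracking number. This is a sketch with a hypothetical group_in_pairs helper.

```python
def group_in_pairs(code: str) -> str:
    """Split an alphanumeric code into space-separated pairs for readback."""
    # Ignore hyphens or spaces already present in the code.
    compact = "".join(ch for ch in code if ch.isalnum())
    return " ".join(compact[i:i + 2] for i in range(0, len(compact), 2))


print(group_in_pairs("1Z999AA10123456784"))  # 1Z 99 9A A1 01 23 45 67 84
```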

Where this gets harder in production

Two patterns to plan for once the basic loop works:

Personalization. Authenticated shoppers expect the agent to know their saved address, recent purchases, and size preferences. Add a get_customer_profile() tool gated on session auth. Use the result in the system prompt via mid-conversation session.update so the agent personalizes without re-asking.
Multi-turn refinement. "Show me hiking boots" → "in waterproof" → "size ten only" → "under $150." Each refinement should narrow the same result set rather than triggering a fresh search. Pass a session_filters object as a tool parameter and have the agent accumulate filters across turns.
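
For multi-turn refinement, the accumulation step can be as simple as a dictionary merge that you apply each time the agent passes new filters. This sketch assumes a session_filters dict with illustrative keys (max_price, size, color) and products shaped like the CATALOG mock. merge_filters and apply_filters are hypothetical helpers.

```python
def merge_filters(current: dict, update: dict) -> dict:
    """Return the accumulated filters; a None value clears that filter."""
    merged = {**current, **update}
    return {key: value for key, value in merged.items() if value is not None}


def apply_filters(catalog: list, filters: dict) -> list:
    """Return names of products with at least one variant matching the filters."""
    results = []
    for product in catalog:
        if "max_price" in filters and product["price"] > filters["max_price"]:
            continue
        variants = [
            v for v in product["variants"]
            if filters.get("size") in (None, v["size"])
            and filters.get("color") in (None, v["color"])
        ]
        if variants:
            results.append(product["name"])
    return results
```

Each turn calls merge_filters on the stored filters before searching, so "size ten only" followed by "under $150" narrows the same result set instead of starting over.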

Where to take it from here

The same architecture extends to:

In-store kiosks for hands-free product search
Phone-based ordering lines for restaurants, takeout, and reorders (Twilio Media Streams + the Voice Agent API Twilio integration)
Mobile shopping apps with a press-and-hold voice button
Smart speaker integrations that hand off to your agent when the user wants to shop

What stays the same across all of those: one WebSocket, one system prompt, one tool registry. The voice agent is the same regardless of the front-end channel.

Voice shopping isn't replacing search bars or product pages—it's running alongside them, picking up the conversational moments those interfaces can't handle. Build for the conversational moments and the rest of your funnel benefits from it. For teams building AI-powered customer service workflows, the same voice agent architecture handles both pre-sale shopping and post-sale support.

Frequently asked questions

How do I build a voice-powered shopping assistant for e-commerce?

Build it on AssemblyAI's Voice Agent API and register the four core shopping verbs as tools: search_products, add_to_cart, view_cart, and checkout. The agent transcribes the customer's speech, calls your catalog and cart functions, and speaks the result back—all on a single WebSocket. Most developers have a working voice shopping assistant running the same day, with no SDK to install and no separate STT, LLM, or TTS providers to manage.

Can a voice shopping assistant handle product variants like size, color, and quantity?

Yes—define the variant fields as parameters on your add_to_cart tool (e.g., variant_id, quantity) and instruct the agent in the system prompt to confirm size, color, and quantity before calling the function. The Voice Agent API is built on Universal-3 Pro Streaming, which has industry-leading accuracy on alphanumeric tokens like sizes, SKUs, and quantities—that's what makes "size ten and a half" reliably parse as 10.5 instead of 10 or 1010.

How do I prevent accidental orders in a voice checkout flow?

Use a two-layer pattern: the system prompt instructs the agent to only call the checkout tool after an explicit verbal "yes" to a clear confirmation prompt, and the checkout function itself accepts a confirmation_phrase parameter that your backend independently validates against an accepted list ("yes," "place the order," "go ahead," "confirm"). This belt-and-suspenders design ensures ambiguous responses like "sure I think so" never trigger a real charge.

What channels can I deploy a voice shopping assistant on?

The same Voice Agent API connection powers in-app voice (mobile or web with a press-and-hold button), in-store kiosks for hands-free product search, phone-based ordering lines via Twilio Media Streams, and smart speaker integrations that hand off to your agent for shopping. The system prompt and tool registry stay the same across channels—only the front-end audio path changes.

How does the AssemblyAI Voice Agent API compare to Vapi or Retell for e-commerce?

The Voice Agent API is infrastructure rather than a platform—it gives you a standard JSON WebSocket with full control over conversation design, tool integrations, and agent behavior, so your voice shopping experience can feel uniquely yours instead of like every other agent built on a no-code platform. Vapi and Retell are higher-level platforms that work well for non-technical configuration but constrain custom integrations and agent personality. For e-commerce teams that already have engineering capacity, the Voice Agent API is typically a better fit.

How do I personalize a voice shopping assistant for authenticated customers?

Add a get_customer_profile tool that returns the customer's saved address, payment, recent purchases, and size preferences, gated on session auth. The Voice Agent API supports mid-conversation session.update events, so you can update the system prompt with the customer's context after they authenticate without dropping the connection. The agent can then personalize recommendations, default to known sizes, and skip questions like "what's your shipping address?"
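
As a rough illustration, the update message might be built like this. The session.update envelope and the system_prompt field come from the connection code in this tutorial; whether a mid-conversation update may carry only system_prompt is an assumption, and the profile keys and build_profile_update helper are hypothetical.

```python
import json


def build_profile_update(base_prompt: str, profile: dict) -> str:
    """Build a session.update message that appends customer context to the prompt."""
    context_lines = [f"- {key.replace('_', ' ')}: {value}" for key, value in profile.items()]
    prompt = base_prompt.rstrip() + "\n\nKNOWN CUSTOMER CONTEXT\n" + "\n".join(context_lines)
    # Assumption: sending only system_prompt leaves tools and voice unchanged.
    return json.dumps({"type": "session.update", "session": {"system_prompt": prompt}})
```

Keep the context to what the agent needs to say out loud (sizes, city, recent items) and leave payment details on the backend.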

How much does it cost to run a voice shopping assistant on AssemblyAI?

The Voice Agent API is $4.50/hr flat-rate, covering STT, LLM, voice generation, turn detection, and tool calling on a single bill. There are no per-token surcharges, no concurrency caps, and no separate invoices for the STT, LLM, and TTS layers—pricing is billed by the minute on actual conversation duration. A free tier with $50 in starter credits is available for testing.