When a user finishes a sentence in a voice conversation, they expect to hear the agent start replying within roughly a second. Anything longer feels broken. The fastest way to hit that target isn't a faster LLM—it's not waiting for the LLM to finish before you start speaking.
Streaming the LLM response, sentence by sentence, into a TTS engine is the trick that turns a 4-second response time into a sub-second one. And once you're streaming, you can layer on tool calling for real-world actions and structured outputs for predictable downstream code—all without giving up that latency budget.
This tutorial walks through how to build that pipeline using AssemblyAI's LLM Gateway and Universal-3 Pro Streaming. By the end, you'll have a Python voice pipeline that:
- Streams microphone audio into AssemblyAI for live transcription
- Streams the LLM response token-by-token through LLM Gateway
- Calls tools mid-conversation to look up data or trigger actions
- Returns structured JSON when the workflow needs predictable output
- Hands each completed sentence to TTS as it arrives
## Why streaming matters more in voice than in chat
In a chat UI, streaming is a nice-to-have—you see the response appear word by word instead of all at once. In a voice agent, it's the difference between conversational and broken.
The math is simple. End-to-end voice latency is roughly:
```
  STT finalization          200-500 ms
+ LLM time-to-first-token   150-400 ms
+ TTS time-to-first-audio   200-400 ms
+ network overhead           50-150 ms
= 600-1500 ms before the user hears anything
```
If you wait for the full LLM response before sending text to TTS, add another 1-3 seconds on top of that. Users notice. The conversation breaks. They start over.
If you stream—flushing each completed sentence to TTS as soon as the LLM emits it—the user hears the first sentence while the LLM is still generating the second. End-to-end latency stays inside the 600-900 ms range that feels conversational.
## What you'll build
A Python pipeline that handles three voice agent patterns:
- Streamed conversational replies—the user asks a question; the agent's voice starts within ~1 second and flows naturally
- Tool calling—the user says "what's my order status?"; the agent calls `get_order_status(order_id)` and speaks the result
- Structured outputs—the agent returns a JSON object matching a schema (e.g., `{intent, urgency, escalate}`), which your code consumes directly without parsing freeform text
Stack:
- AssemblyAI Universal-3 Pro Streaming (speech-to-text)
- AssemblyAI LLM Gateway (streaming chat completions, tools, structured outputs)
- A TTS engine of your choice (we'll use a placeholder—same pattern works with any streaming TTS)
- Python 3.9+
## Setup
```bash
pip install assemblyai requests python-dotenv pyaudio
```
Create `.env`:

```
ASSEMBLYAI_API_KEY=your_key_here
```
The same AssemblyAI API key authenticates both the streaming STT WebSocket and the LLM Gateway endpoint.
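An optional sanity check before opening any sockets can save a confusing failure later. This sketch just verifies the `.env` above actually loaded:

```python
import os
from dotenv import load_dotenv

load_dotenv()
if not os.getenv("ASSEMBLYAI_API_KEY"):
    raise SystemExit("ASSEMBLYAI_API_KEY is not set; check your .env file.")
```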
## Step 1: Stream tokens from LLM Gateway
LLM Gateway supports OpenAI-style streaming on OpenAI models. Set `stream: True` in the request and read the response as a Server-Sent Events (SSE) stream. Each chunk contains a partial token; you stitch them together as they arrive.
The key trick for voice: don't wait for the full response. Buffer tokens, watch for sentence boundaries (`.`, `!`, `?`), and flush each completed sentence to TTS the instant it's ready.
```python
import os
import json

import requests
from dotenv import load_dotenv

load_dotenv()
ASSEMBLYAI_API_KEY = os.getenv("ASSEMBLYAI_API_KEY")


def stream_llm_response(user_text: str):
    """
    Stream the LLM response. Yield each completed sentence as it's generated.
    """
    response = requests.post(
        "https://llm-gateway.assemblyai.com/v1/chat/completions",
        headers={"authorization": ASSEMBLYAI_API_KEY},
        json={
            "model": "gpt-5.2",
            "messages": [
                {"role": "system", "content": "You are a friendly voice assistant. Keep replies short."},
                {"role": "user", "content": user_text},
            ],
            "stream": True,
            "max_tokens": 300,
        },
        stream=True,
        timeout=15,
    )
    response.raise_for_status()  # fail fast on auth or request errors

    buffer = ""
    sentence_endings = (".", "!", "?")

    for line in response.iter_lines():
        if not line:
            continue
        line = line.decode("utf-8")
        if not line.startswith("data: "):
            continue
        data = line[6:]
        if data == "[DONE]":
            # Flush whatever is left as the final (possibly unpunctuated) sentence.
            if buffer.strip():
                yield buffer.strip()
            return
        chunk = json.loads(data)
        if not chunk.get("choices"):
            continue  # skip keep-alive or usage-only chunks
        delta = chunk["choices"][0].get("delta", {}).get("content") or ""
        if not delta:
            continue
        buffer += delta
        # Flush every completed sentence; keep the unfinished tail in the buffer.
        while any(p in buffer for p in sentence_endings):
            split_idx = max(buffer.rfind(p) for p in sentence_endings)
            sentence = buffer[: split_idx + 1].strip()
            buffer = buffer[split_idx + 1 :]
            if sentence:
                yield sentence
```
The generator yields each completed sentence as it's ready. Your TTS engine consumes these one at a time:
```python
def speak(sentence: str):
    """Send a sentence to your TTS engine. Replace with your provider's API."""
    print(f" [TTS] {sentence}")
    # tts_client.stream(sentence)


def handle_final_transcript(user_text: str):
    print(f"User: {user_text}")
    for sentence in stream_llm_response(user_text):
        speak(sentence)
```
This single change—yielding sentences as they arrive instead of waiting for the full reply—typically cuts perceived response time by 60-80% for any reply longer than two sentences.
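To see what that buys you on your own stack, here's a rough benchmark sketch. `time_to_first_sentence` is a helper invented for this post, and the numbers will vary with model, prompt, and network:

```python
import time

def time_to_first_sentence(user_text: str) -> float:
    """Seconds until the first speakable sentence is ready."""
    gen = stream_llm_response(user_text)
    start = time.perf_counter()
    first = next(gen, None)  # block until the first sentence is flushed
    elapsed = time.perf_counter() - start
    gen.close()  # discard the rest of the stream; this is just a benchmark
    print(f"First sentence after {elapsed:.2f}s: {first!r}")
    return elapsed

time_to_first_sentence("Give me a two-paragraph history of the telephone.")
```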
## Step 2: Add tool calling
Voice agents become useful the moment they can do something—look up an order, check inventory, schedule a callback, transfer to a human. LLM Gateway supports OpenAI-compatible tool calling across every supported model (Claude, OpenAI, Gemini), so you write the code once and it works no matter which provider you route to.
Define your tools:
```python
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_order_status",
            "description": "Look up the status of a customer order by order ID.",
            "parameters": {
                "type": "object",
                "properties": {
                    "order_id": {
                        "type": "string",
                        "description": "The order ID, e.g. ORD-12345",
                    }
                },
                "required": ["order_id"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "schedule_callback",
            "description": "Schedule a callback with a sales rep.",
            "parameters": {
                "type": "object",
                "properties": {
                    "phone_number": {"type": "string"},
                    "preferred_time": {"type": "string"},
                },
                "required": ["phone_number", "preferred_time"],
            },
        },
    },
]
```
Implement the actual functions:
```python
def get_order_status(order_id: str) -> dict:
    # Stub: replace with a real lookup against your order system.
    return {"order_id": order_id, "status": "shipped", "eta": "2026-05-09"}


def schedule_callback(phone_number: str, preferred_time: str) -> dict:
    # Stub: replace with a real scheduling call.
    return {"confirmation": "CB-9982", "phone": phone_number, "time": preferred_time}


TOOL_REGISTRY = {
    "get_order_status": get_order_status,
    "schedule_callback": schedule_callback,
}
```
Now extend the LLM call to handle tool requests. The Gateway returns a `tool_calls` field on the assistant message; you execute each tool, append the result to the conversation history, and call again to let the model produce its spoken response:
```python
def stream_llm_response_with_history(history: list, model: str = "gpt-5.2"):
    """Stream a follow-up reply using the existing conversation history."""
    response = requests.post(
        "https://llm-gateway.assemblyai.com/v1/chat/completions",
        headers={"authorization": ASSEMBLYAI_API_KEY},
        json={"model": model, "messages": history, "stream": True, "max_tokens": 300},
        stream=True,
        timeout=15,
    )
    response.raise_for_status()

    buffer = ""
    sentence_endings = (".", "!", "?")

    for line in response.iter_lines():
        if not line:
            continue
        line = line.decode("utf-8")
        if not line.startswith("data: "):
            continue
        data = line[6:]
        if data == "[DONE]":
            if buffer.strip():
                yield buffer.strip()
            return
        chunk = json.loads(data)
        if not chunk.get("choices"):
            continue
        delta = chunk["choices"][0].get("delta", {}).get("content") or ""
        if not delta:
            continue
        buffer += delta
        while any(p in buffer for p in sentence_endings):
            split_idx = max(buffer.rfind(p) for p in sentence_endings)
            sentence = buffer[: split_idx + 1].strip()
            buffer = buffer[split_idx + 1 :]
            if sentence:
                yield sentence


def respond_with_tools(user_text: str, history: list):
    history.append({"role": "user", "content": user_text})

    # First call is non-streaming so we get the complete tool_calls payload at once.
    response = requests.post(
        "https://llm-gateway.assemblyai.com/v1/chat/completions",
        headers={"authorization": ASSEMBLYAI_API_KEY},
        json={
            "model": "claude-sonnet-4-6",
            "messages": history,
            "tools": tools,
            "max_tokens": 500,
        },
        timeout=30,
    ).json()

    message = response["choices"][0]["message"]
    history.append(message)

    if message.get("tool_calls"):
        for tool_call in message["tool_calls"]:
            fn_name = tool_call["function"]["name"]
            fn_args = json.loads(tool_call["function"]["arguments"])
            result = TOOL_REGISTRY[fn_name](**fn_args)
            history.append({
                "role": "tool",
                "tool_call_id": tool_call["id"],
                "content": json.dumps(result),
            })
        # Second call streams the spoken reply that incorporates the tool results.
        return stream_llm_response_with_history(history, model="gpt-5.2")

    # No tool call: wrap the single reply so callers can iterate either way.
    def _yield_once():
        yield message["content"]

    return _yield_once()
```
`stream_llm_response_with_history` is the same streaming function from Step 1, except it sends the full conversation history (which now includes the tool result) so the model can speak the answer naturally.
The clean part: tool calling and streaming compose. The model thinks for a moment ("let me check that for you"), executes the tool, and then streams the spoken result token by token—exactly the conversational rhythm users expect.
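Here's a minimal driver for that flow, assuming the functions above. The order ID is made up, and the exact phrasing of the reply depends on the model:

```python
history = [
    {"role": "system", "content": "You are a friendly voice assistant. Keep replies short."}
]

# The model should call get_order_status("ORD-12345"), receive the tool result,
# then stream the spoken answer sentence by sentence.
for sentence in respond_with_tools("What's the status of order ORD-12345?", history):
    speak(sentence)
```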
A note for entity-heavy use cases: if your tool parameters include order IDs, phone numbers, or email addresses, your speech-to-text accuracy on those tokens is what determines whether tool calls succeed. Universal-3 Pro Streaming has a roughly 16.7% mixed-entity error rate vs. 23-25% for competing models—that's the difference between `ORD-12345` and "or 12 three 45" getting passed to your function.
## Step 3: Use structured outputs for predictable JSON
Sometimes you don't want a spoken reply—you want machine-readable output your downstream code can act on. Routing decisions, intent classification, sentiment scoring, escalation flags. LLM Gateway supports structured outputs via JSON schema, which guarantees the model returns exactly the shape you specified.
Define the schema:
```python
classification_schema = {
    "type": "object",
    "properties": {
        "intent": {
            "type": "string",
            "enum": ["billing", "support", "sales", "cancel", "other"],
        },
        "urgency": {
            "type": "string",
            "enum": ["low", "medium", "high"],
        },
        "escalate": {"type": "boolean"},
        "summary": {"type": "string"},
    },
    "required": ["intent", "urgency", "escalate", "summary"],
}
```
Send it with `response_format`:

```python
def classify_utterance(user_text: str) -> dict:
    response = requests.post(
        "https://llm-gateway.assemblyai.com/v1/chat/completions",
        headers={"authorization": ASSEMBLYAI_API_KEY},
        json={
            "model": "gpt-5.2",
            "messages": [
                {"role": "system", "content": "Classify the user's intent for a customer service workflow."},
                {"role": "user", "content": user_text},
            ],
            "response_format": {
                "type": "json_schema",
                "json_schema": {
                    "name": "intent_classification",
                    "schema": classification_schema,
                    "strict": True,
                },
            },
        },
        timeout=15,
    ).json()
    return json.loads(response["choices"][0]["message"]["content"])


classification = classify_utterance("I want to cancel my subscription right now.")
# {"intent": "cancel", "urgency": "high", "escalate": True, "summary": "..."}

if classification["escalate"]:
    transfer_to_human()  # placeholder for your escalation handler
```
You get back a parsed dict you can route on directly. Pair this with streaming for the user-facing reply: classify the intent (structured), then stream a conversational acknowledgment based on the classification. The user hears a friendly sentence in under a second while your code routes the call in the background.
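Here's a sketch of that combo. `route_call` is a placeholder invented for this example, and the background thread stands in for whatever routing mechanism you actually use:

```python
import threading

def route_call(classification: dict) -> None:
    # Placeholder: push the call to the right queue based on intent and urgency.
    print(f" [ROUTER] intent={classification['intent']} urgency={classification['urgency']}")

def acknowledge_and_route(user_text: str) -> None:
    # Structured classification first: one fast, machine-readable call.
    classification = classify_utterance(user_text)
    # Route in the background while the user hears the streamed acknowledgment.
    threading.Thread(target=route_call, args=(classification,), daemon=True).start()
    for sentence in stream_llm_response(user_text):
        speak(sentence)
```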
## Step 4: Wire it all together with streaming STT
The full pipeline looks like this—STT WebSocket on the inbound side, streamed LLM Gateway responses on the outbound side, with tool calls and structured outputs available when needed:
```python
import assemblyai as aai
from assemblyai.streaming.v3 import (
    StreamingClient,
    StreamingClientOptions,
    StreamingParameters,
    StreamingEvents,
    BeginEvent,
    TurnEvent,
    StreamingError,
)

conversation_history = [
    {"role": "system", "content": "You are a helpful voice assistant."}
]


def on_begin(client: StreamingClient, event: BeginEvent):
    print(f"Session started: {event.id}")


def on_error(client: StreamingClient, error: StreamingError):
    print(f"Streaming error: {error}")


def on_turn(client: StreamingClient, event: TurnEvent):
    if not event.end_of_turn:
        return

    user_text = event.transcript
    print(f"\nUser: {user_text}")

    # Structured classification first: decide whether to escalate before replying.
    classification = classify_utterance(user_text)
    if classification["escalate"]:
        speak("Let me get a human on the line for you.")
        return

    for sentence in respond_with_tools(user_text, conversation_history):
        speak(sentence)


def main():
    client = StreamingClient(
        StreamingClientOptions(
            api_key=ASSEMBLYAI_API_KEY,
            api_host="streaming.assemblyai.com",
        )
    )

    client.on(StreamingEvents.Begin, on_begin)
    client.on(StreamingEvents.Turn, on_turn)
    client.on(StreamingEvents.Error, on_error)

    client.connect(
        StreamingParameters(speech_model="u3-rt-pro", sample_rate=16000)
    )
    try:
        client.stream(aai.extras.MicrophoneStream(sample_rate=16000))
    finally:
        client.disconnect(terminate=True)


if __name__ == "__main__":
    main()
```
Speak into your microphone, ask about an order ID, and watch the agent execute the tool call and stream the spoken reply back. The combination of streaming STT, streaming LLM, and tool calling produces the responsive voice experience users now expect.
## When to use which technique
| Pattern | Use it when |
|---|---|
| Streaming reply only | The user asked a question; you want a fast, conversational answer |
| Tool calling + streamed reply | The agent needs to act on real data (order lookup, scheduling, transfers) |
| Structured outputs | You need machine-readable output for routing, classification, or downstream logic |
| Structured + streamed combo | Classify the intent in JSON, then stream a conversational acknowledgment to the user |
## Skip the wiring with the Voice Agent API
If you're building a single voice agent and don't need to swap LLM providers per request, AssemblyAI's Voice Agent API bundles everything in this tutorial (streaming, tools, structured outputs, the full STT-LLM-TTS pipeline) behind one WebSocket. You set a system prompt, register tools, and get back streamed audio with built-in turn detection and barge-in. Same Universal-3 Pro Streaming foundation, same fallback patterns, no glue code.
The lower-level approach in this tutorial is the right call when you need maximum control—choosing different LLMs per request, applying custom retry logic, or running structured-output classification in parallel with the spoken reply. Both paths are first-class on AssemblyAI; pick the one that matches your constraint.
Streaming everything is the new baseline for voice. Tool calling and structured outputs are what turn a streaming chatbot into something that can actually do work. Build for both and your voice agent stops feeling like a demo.
## Frequently asked questions
### What does it mean to stream LLM responses in a voice pipeline?
Streaming LLM responses means receiving and processing the model's output token by token as it's generated, instead of waiting for the full response to complete. In a voice pipeline, streaming lets you forward each completed sentence to a text-to-speech engine the moment the LLM emits it—so the user hears the first sentence of the agent's reply while the LLM is still generating the second. This typically cuts perceived response time by 60–80% for any reply longer than two sentences.
### How do I stream LLM responses through AssemblyAI's LLM Gateway?
Set `stream: True` in your chat/completions request and read the response as a Server-Sent Events (SSE) stream. Each chunk contains a partial token in the `choices[0].delta.content` field. Buffer tokens, watch for sentence-ending punctuation, and flush each completed sentence to your TTS engine as soon as it's ready. Streaming is supported on OpenAI models in LLM Gateway today.
### How does tool calling work with the LLM Gateway?
Tool calling lets your voice agent invoke functions to access data or trigger actions—looking up an order, scheduling a callback, transferring to a human. Define your tools as JSON Schema in the `tools` array, and when the model decides to call one it returns a `tool_calls` field on the assistant message. You execute the tool, append the result to the conversation history, and call the Gateway again to let the model produce a spoken response that incorporates the tool output. The schema is OpenAI-compatible, so the same code works across Claude, GPT, and Gemini.
### Can I get structured JSON outputs from the LLM Gateway for voice agents?
Yes—LLM Gateway supports structured outputs via JSON schema using the `response_format` parameter. This guarantees the model returns exactly the shape you specified, which is useful for intent classification, routing decisions, sentiment scoring, and any voice agent workflow that needs machine-readable output your downstream code can consume directly. A common voice pattern is to classify intent in JSON first, then stream a conversational acknowledgment back to the user while your code routes the call in the background.
### What's the latency budget for a real-time voice agent using streamed LLM responses?
A well-tuned voice pipeline targets 600–900 ms from the moment the user stops speaking to the moment they hear the agent's first audio. That budget breaks down roughly as: 200–500 ms for STT finalization, 150–400 ms for LLM time-to-first-token, 200–400 ms for TTS time-to-first-audio, and 50–150 ms of network overhead. Streaming everything—STT transcripts, LLM tokens, TTS audio—is what makes hitting that budget achievable.
### When should I use the Voice Agent API instead of wiring streaming STT and LLM Gateway separately?
Use the Voice Agent API when you're building a single voice agent and want one WebSocket that handles STT, LLM, TTS, turn detection, and tool calling out of the box. Use the lower-level streaming STT plus LLM Gateway approach when you need more control—choosing different LLMs per request, applying custom retry logic, or running structured-output classification in parallel with the spoken reply. Both options use the same Universal-3 Pro Streaming foundation, so accuracy is identical.
### Does streaming work with tool calling and structured outputs?
Yes—streaming composes with both. With tool calling, the agent thinks for a moment, executes the tool, then streams the spoken result token by token. With structured outputs, you typically don't stream the JSON itself (you want the complete object before parsing) but you can stream a separate conversational acknowledgment to the user while the structured classification finalizes in parallel.