Mart Schweiger

Posted on Jun 15 • Originally published at assemblyai.com

What is speech-to-speech for voice agents?

#ai #voiceassistant #speechtotext #python

You've called a business and heard "press 1 for billing, press 2 for support" more times than you can count. That's IVR—interactive voice response—and it's been the standard for decades. But speech to speech flips the entire model. Instead of navigating a phone tree, you just talk. The agent listens, understands what you said, figures out what to do, and talks back. It's what makes a voice agent feel like a real conversation instead of a glorified menu system.

The shift matters because users don't want to adapt to machines. They want machines to adapt to them. And spoken language is the most natural interface humans have. Speech-to-speech voice agents meet people where they already are—talking.

So what's actually happening under the hood? How do you build one? And what are the tradeoffs you need to understand before shipping to production? Let's break it down.

What speech-to-speech actually means

Speech to speech is exactly what it sounds like: audio in, audio out. A user speaks to the agent, and the agent speaks back. No typing, no screen, no buttons. The entire interaction happens through spoken language.

That's the user experience. Under the hood, the system has to do three things: understand the spoken input, reason about a response, and produce spoken output. How it accomplishes those three steps is where things get interesting.

There are two fundamental approaches. The first is the cascaded architecture—also called chained STT-LLM-TTS—where three separate models each handle one step. Speech-to-text converts audio to text, an LLM generates a text response, and text-to-speech converts that response back to audio. The second is the end-to-end approach, where a single multimodal model processes audio directly and outputs audio without an intermediate text step.

Both can build a speech-to-speech voice agent. But they have very different characteristics in production, and understanding those differences is critical before you pick an architecture.

Cascaded architecture (STT to LLM to TTS)—the production standard

The cascaded architecture is how most production voice agents work today. If you're assembling a voice agent from separate STT, LLM, and TTS components, this is the pattern you'll follow.

How it works

Three specialized models work in sequence. A dedicated speech-to-text model converts the user's audio into a text transcript. That transcript goes to an LLM—GPT-4o, Claude, Llama, whatever you choose—which generates a text response. Then a text-to-speech model converts that response into audio that gets played back to the user.

Each model does one thing and does it well. The STT model is optimized for accurate transcription. The LLM is optimized for reasoning and generation. The TTS model is optimized for natural-sounding speech. You get best-in-class performance at every stage because each component is purpose-built.
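
As a rough sketch of that sequence, the three stages can be modeled as plain functions composed in order. The `fake_stt`, `fake_llm`, and `fake_tts` functions below are hypothetical stand-ins, not real provider clients; in a real pipeline each would wrap a network call.

```python
from typing import Callable

# Hypothetical stand-ins for real STT, LLM, and TTS clients.
def fake_stt(audio: bytes) -> str:
    """Pretend to transcribe: the 'audio' here is just UTF-8 text."""
    return audio.decode("utf-8")

def fake_llm(transcript: str) -> str:
    """Pretend to reason: echo the request back as a reply."""
    return f"You said: {transcript}"

def fake_tts(text: str) -> bytes:
    """Pretend to synthesize: encode the reply text as bytes."""
    return text.encode("utf-8")

def run_turn(
    audio: bytes,
    stt: Callable[[bytes], str] = fake_stt,
    llm: Callable[[str], str] = fake_llm,
    tts: Callable[[str], bytes] = fake_tts,
) -> bytes:
    """Run one conversational turn: audio in, audio out."""
    transcript = stt(audio)   # stage 1: speech to text
    reply = llm(transcript)   # stage 2: text to text
    return tts(reply)         # stage 3: text to speech

print(run_turn(b"hello"))
```

Because each stage is passed in as an argument, any one of them can be replaced without touching the other two.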

Why it dominates production

The cascaded approach wins in production for several reasons. First, observability. Because there's a text transcript between the STT and LLM stages, you can see exactly what the agent heard and exactly what it decided to say. When something goes wrong in a customer call, you can pinpoint whether the issue was a transcription error, a bad LLM response, or a TTS glitch. That kind of debugging is invaluable when you're running a contact center handling thousands of calls a day.
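
One way to capture that observability is to record the intermediate text and per-stage timing for every turn. This is a minimal sketch; `TurnTrace` and `traced_turn` are illustrative names, and the stage callables are whatever STT, LLM, and TTS wrappers you already have.

```python
import time
from dataclasses import dataclass, field

@dataclass
class TurnTrace:
    """What the agent heard, what it decided to say, and how long each stage took."""
    transcript: str = ""
    reply: str = ""
    timings_ms: dict = field(default_factory=dict)

def traced_turn(audio, stt, llm, tts):
    """Run one turn and return (audio_out, trace) so the turn can be audited later."""
    trace = TurnTrace()

    start = time.perf_counter()
    trace.transcript = stt(audio)
    trace.timings_ms["stt"] = (time.perf_counter() - start) * 1000

    start = time.perf_counter()
    trace.reply = llm(trace.transcript)
    trace.timings_ms["llm"] = (time.perf_counter() - start) * 1000

    start = time.perf_counter()
    audio_out = tts(trace.reply)
    trace.timings_ms["tts"] = (time.perf_counter() - start) * 1000

    return audio_out, trace
```

With a trace per turn, a bad call can be attributed to a wrong transcript, a wrong reply, or a slow stage by reading the record instead of replaying audio.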

Second, flexibility. You can swap any component independently. Want to try a different LLM? Change it without touching your STT or TTS. Want to upgrade your speech recognition? Swap that layer without retraining everything else. This modularity means you're never locked into a single vendor's entire stack.

Third, accuracy on entities. Names, account numbers, addresses, product codes—these are the things that matter most in business conversations. Dedicated STT models with specialized vocabularies and entity recognition consistently outperform end-to-end models on these critical details. When a customer says their account number, you need to get it right the first time.

The orchestration challenge

Chaining three models together introduces complexity that a single model doesn't have. You need to handle:

Turn detection—knowing when the user has finished speaking so you can start generating a response. Jump in too early and you'll cut them off. Wait too long and the silence feels awkward.
Barge-in handling—what happens when the user starts talking while the agent is still speaking? You need to detect this, stop playback, and start processing the new input.
Latency management—keeping the total response time fast enough that the conversation feels natural.
Error recovery—what happens when the STT returns gibberish, or the LLM hallucinates, or the TTS fails mid-sentence?

Understanding the full voice agent architecture helps you anticipate these challenges before they become problems in production.
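
Turn-taking and barge-in are easier to reason about as a small state machine. The sketch below is one possible shape, not how any particular framework implements it; the class, state, and method names are all assumptions, and the events would come from your own voice activity detection and playback code.

```python
from enum import Enum

class AgentState(Enum):
    LISTENING = "listening"
    THINKING = "thinking"
    SPEAKING = "speaking"

class TurnController:
    """Tracks who has the floor and counts playback cancellations."""

    def __init__(self) -> None:
        self.state = AgentState.LISTENING
        self.cancelled_playbacks = 0

    def on_user_turn_ended(self) -> None:
        # Turn detection decided the user is done: start generating a reply.
        if self.state is AgentState.LISTENING:
            self.state = AgentState.THINKING

    def on_reply_audio_started(self) -> None:
        if self.state is AgentState.THINKING:
            self.state = AgentState.SPEAKING

    def on_reply_audio_finished(self) -> None:
        if self.state is AgentState.SPEAKING:
            self.state = AgentState.LISTENING

    def on_user_speech_started(self) -> None:
        # Barge-in: the user spoke over the agent, so stop playback.
        if self.state is AgentState.SPEAKING:
            self.cancelled_playbacks += 1
        # In every case the user now has the floor.
        self.state = AgentState.LISTENING
```

In a real agent, the barge-in branch is where you would stop the audio player and discard any queued TTS output.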

The latency budget

Every millisecond counts in a voice conversation. Here's a realistic breakdown of where time goes in a cascaded pipeline:

STT: 200–500ms (depends on utterance length and streaming implementation)
LLM: 150–400ms (time to first token, then streaming)
TTS: 200–400ms (time to first audio chunk)
Network overhead: 50–150ms (round trips between services)

Added end to end with no overlap, those ranges sum to 600–1,450ms. With proper streaming the stages overlap, and a realistic total is 600–900ms. That's well within the range that feels conversational: most people won't notice a pause under one second.
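
For reference, here is the arithmetic on the four ranges above when every stage waits for the previous one to finish. The numbers are copied from the breakdown; the function name is illustrative.

```python
# Per-stage latency ranges in milliseconds, copied from the breakdown above.
BUDGET_MS = {
    "stt": (200, 500),
    "llm": (150, 400),
    "tts": (200, 400),
    "network": (50, 150),
}

def sequential_bounds(budget: dict) -> tuple:
    """Sum the low and high ends, assuming no overlap between stages."""
    best = sum(low for low, _ in budget.values())
    worst = sum(high for _, high in budget.values())
    return best, worst

best_ms, worst_ms = sequential_bounds(BUDGET_MS)
print(f"No overlap: {best_ms}-{worst_ms} ms")
```

The upper bound is well past one second, which is why overlapping the stages through streaming matters so much.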

The key optimization is to stream everything. Don't wait for the full STT transcript before sending to the LLM. Don't wait for the full LLM response before starting TTS; hand it over sentence by sentence. Each stage should start processing the moment it has enough data, not when the previous stage is completely done. This is the difference between a 600ms response and a 2-second response.
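
The sentence-by-sentence handoff from LLM to TTS can be sketched as a generator that buffers streamed tokens and emits each sentence as soon as it is complete. This is a simplified illustration: the regex splits on any `.`, `!`, or `?` followed by whitespace, so abbreviations such as "Dr. Smith" would be split incorrectly, and the function name is an assumption.

```python
import re
from typing import Iterable, Iterator

# A sentence ends at ., !, or ? followed by whitespace.
_SENTENCE_BREAK = re.compile(r"(?<=[.!?])\s+")

def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as they appear in a token stream."""
    buffer = ""
    for token in tokens:
        buffer += token
        parts = _SENTENCE_BREAK.split(buffer)
        # Everything except the last part is a finished sentence.
        for sentence in parts[:-1]:
            if sentence:
                yield sentence
        buffer = parts[-1]
    # Flush whatever is left when the stream ends.
    if buffer.strip():
        yield buffer.strip()

llm_tokens = ["Sure", ".", " I", " can", " help", ".", " What", " day", "?"]
for sentence in sentences_from_tokens(llm_tokens):
    print(sentence)  # each sentence could be sent to TTS right here
```

The first sentence is available for synthesis while the model is still producing the rest of the reply.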

Build voice agents with best-in-class STT

AssemblyAI's Voice Agent API and Universal-3 Pro Streaming give you the accuracy and speed voice agents demand. Start building today.

End-to-end speech-to-speech models

The alternative to cascading is having a single model handle the entire pipeline. These are multimodal models—like GPT-4o's voice mode—that take audio as input and produce audio as output without an explicit text intermediate step.

The promise

End-to-end models are compelling in theory. One model means potentially lower latency since there's no inter-model communication overhead. They can also preserve vocal nuances—tone, emphasis, emotion—that might get lost when converting to text and back. And the architecture is simpler: one model instead of three, fewer integration points, less orchestration code.

The reality

In practice, end-to-end models have significant limitations for production voice agents. Accuracy on entities—names, numbers, codes—is measurably worse than dedicated STT models. When your voice agent is booking appointments or looking up accounts, that gap matters.

Observability drops dramatically. Without a text transcript between stages, debugging becomes much harder. When a call goes wrong, you can't easily see whether the model misheard the user or generated a bad response. It's a black box.

Control is limited too. Want to add a custom vocabulary for your industry terms? Want to apply content filters to the LLM output? Want to use a different voice? With cascaded architecture, these are straightforward changes. With end-to-end models, you're at the mercy of what the model supports.
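
To make the control point concrete: in a cascaded pipeline the LLM reply is plain text, so a filter can sit between the LLM and TTS stages. The rule below (mask any run of eight or more digits) is a made-up policy for illustration only.

```python
import re

# Hypothetical policy: never read back long digit runs such as account numbers.
LONG_DIGIT_RUN = re.compile(r"\b\d{8,}\b")

def filter_reply(text: str) -> str:
    """Mask long digit runs in the LLM reply before it is sent to TTS."""
    return LONG_DIGIT_RUN.sub("[redacted]", text)

print(filter_reply("Your account 1234567890 is active."))
```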

The current state: end-to-end speech-to-speech is promising and improving fast. But for most production use cases—especially in industries like healthcare, finance, and customer service where accuracy and auditability are non-negotiable—the cascaded architecture remains the right choice. If you're evaluating the broader voice AI stack for building agents, understanding this tradeoff is essential.

How to build a speech-to-speech voice agent with AssemblyAI

AssemblyAI gives you two paths depending on how much control you want. Both use Universal-3 Pro Streaming—the same best-in-class speech recognition engine—so your accuracy is the same either way. The difference is how much of the pipeline you want to manage yourself.

Option 1: Voice Agent API—the fast path

The Voice Agent API is the fastest way to build a speech-to-speech voice agent. It's a single WebSocket connection that handles STT, LLM, and TTS together. You stream audio in, you get audio back. That's it.

The API is built on Universal-3 Pro Streaming for the STT layer, so you get the same transcription accuracy you'd get building your own pipeline. It includes tool calling (so your agent can look up data, book appointments, transfer calls), turn detection, and session resumption if a connection drops.

Pricing is $4.50/hr flat—no per-component billing, no surprise charges for LLM tokens or TTS characters.

Here's what the WebSocket connection looks like:

const wsUrl = new URL("wss://agents.assemblyai.com/v1/ws");
wsUrl.searchParams.set("token", token);
const ws = new WebSocket(wsUrl);

ws.addEventListener("open", () => {
  ws.send(JSON.stringify({
    type: "session.update",
    session: {
      system_prompt: "You are a helpful voice assistant.",
      greeting: "Hi! How can I help you today?",
      output: { voice: "ivy" },
    },
  }));
});

You configure the agent's personality through the system prompt, set a greeting so it speaks first, choose a voice, and start streaming audio. The API handles turn detection, barge-in, latency optimization, and error recovery. You focus on the logic; it handles the plumbing.
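
If your backend is in Python, the same configuration message can be built with the standard library. This sketch only constructs the JSON payload shown above and does not open a connection; the helper name is an assumption, and the field names are taken from the JavaScript example rather than from a separate API reference.

```python
import json

def build_session_update(system_prompt: str, greeting: str, voice: str) -> str:
    """Serialize a session.update message with the fields used in the example above."""
    message = {
        "type": "session.update",
        "session": {
            "system_prompt": system_prompt,
            "greeting": greeting,
            "output": {"voice": voice},
        },
    }
    return json.dumps(message)

payload = build_session_update(
    system_prompt="You are a helpful voice assistant.",
    greeting="Hi! How can I help you today?",
    voice="ivy",
)
print(payload)
```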

For a more detailed walkthrough, check out the guide on building a phone-based voice agent.

Try the Voice Agent API in minutes

See how speech-to-speech works with AssemblyAI's Voice Agent API. Stream audio in, get audio back—one WebSocket, zero orchestration headaches.


Option 2: Build your own cascading pipeline

If you want full control over every component—your choice of LLM, your choice of TTS, your own orchestration logic—you can use Universal-3 Pro Streaming as the STT layer in your own pipeline. At $0.45/hr, it's the speech recognition backbone, and you bring everything else.
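
A quick back-of-the-envelope comparison of the two hourly rates quoted in this article. Keep in mind the $0.45/hr figure covers STT only; LLM and TTS costs in a custom pipeline depend on the providers you choose and are not included here.

```python
VOICE_AGENT_API_PER_HOUR = 4.50   # flat rate covering STT, LLM, and TTS
STREAMING_STT_PER_HOUR = 0.45     # STT only; LLM and TTS are billed separately

def usage_cost(hours: float, rate_per_hour: float) -> float:
    """Cost in dollars for a given number of audio hours."""
    return round(hours * rate_per_hour, 2)

hours = 1000
print(f"Voice Agent API: ${usage_cost(hours, VOICE_AGENT_API_PER_HOUR):,.2f}")
print(f"Streaming STT only: ${usage_cost(hours, STREAMING_STT_PER_HOUR):,.2f}")
```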

This approach works well with orchestration frameworks like LiveKit, Pipecat, and Vapi. You get AssemblyAI's transcription accuracy paired with whatever LLM and TTS providers make sense for your use case. If you're weighing your options for the text-to-speech layer, our comparison of TTS APIs can help.

Here's how you connect to Universal-3 Pro Streaming in Python:

from assemblyai.streaming.v3 import (
    StreamingClient, StreamingClientOptions,
    StreamingEvents, StreamingParameters,
    BeginEvent, TurnEvent, TerminationEvent,
)

client = StreamingClient(
    StreamingClientOptions(
        api_key="YOUR_ASSEMBLYAI_KEY",
        api_host="streaming.assemblyai.com",
    )
)

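def on_turn(client: StreamingClient, event: TurnEvent) -> None:
    # Turn events also arrive with partial transcripts; act only on the final one
    if event.end_of_turn:
        # Send to LLM, then TTS
        print(f"Final: {event.transcript}")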

client.on(StreamingEvents.Turn, on_turn)
client.connect(StreamingParameters(
    speech_model="u3-rt-pro",
    sample_rate=16000,
))

The on_turn callback fires on every turn update, and the end_of_turn check picks out the final transcript once the user has finished speaking. From there, you send the transcript to your LLM, get the response, pipe it through your TTS, and stream the audio back to the user. You own the full pipeline.
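
One possible way to structure that handoff, assuming you'd rather not run slow LLM and TTS calls inside the callback itself, is to push final transcripts onto a queue and process them in a worker thread. Everything here is a sketch: `enqueue_final_turn` and `respond` are hypothetical names, and the events are simulated with `SimpleNamespace` instead of coming from the SDK.

```python
import queue
import threading
from types import SimpleNamespace

final_transcripts = queue.Queue()
replies = []

def enqueue_final_turn(client, event) -> None:
    """Callback body: keep it fast and only forward finished turns."""
    if event.end_of_turn and event.transcript:
        final_transcripts.put(event.transcript)

def respond(transcript: str) -> str:
    """Hypothetical placeholder for the LLM call followed by TTS."""
    return f"(spoken reply to: {transcript})"

def worker() -> None:
    while True:
        transcript = final_transcripts.get()
        if transcript is None:  # sentinel: shut down
            break
        replies.append(respond(transcript))

thread = threading.Thread(target=worker, daemon=True)
thread.start()

# Simulated events: one partial update, then the final turn.
enqueue_final_turn(None, SimpleNamespace(end_of_turn=False, transcript="Book a"))
enqueue_final_turn(None, SimpleNamespace(end_of_turn=True, transcript="Book a table for two."))

final_transcripts.put(None)
thread.join()
print(replies)
```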

When choosing an STT API for voice agents, accuracy and latency are the two metrics that matter. Universal-3 Pro Streaming consistently leads on both, which is why teams building custom pipelines use it as their foundation.

Performance requirements that matter

Building a speech-to-speech voice agent isn't just about getting the architecture right. Performance makes or breaks the user experience. Here's what to target.

Total response latency: under 1 second. Research consistently shows that conversational turn-taking feels natural when the pause is under 1 second. Go above 1.5 seconds and users start to feel like the agent is broken. AssemblyAI's Voice Agent API delivers roughly 1 second end-to-end.

Streaming architecture is non-negotiable. If you're waiting for each stage to fully complete before starting the next one, your latency will be unacceptable. Every component in your pipeline—STT, LLM, and TTS—needs to stream. Partial results from STT feed into the LLM prompt. Partial LLM output (sentence by sentence) feeds into TTS. The first audio chunk reaches the user while the LLM is still generating the rest of the response.

Turn detection accuracy above 95%. Nothing frustrates users more than an agent that constantly interrupts them or waits five seconds after they stop talking. Good turn detection uses a combination of silence duration, prosodic cues, and semantic completeness to decide when the user is done.
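
A toy version of that decision, combining silence duration with a crude semantic-completeness check (prosody is left out). The thresholds and the word list are invented for illustration; production systems typically use trained models for this.

```python
# Words that rarely end a finished thought (illustrative, far from complete).
INCOMPLETE_ENDINGS = {"and", "but", "or", "so", "because", "um", "uh", "the", "a", "to"}

def looks_complete(transcript: str) -> bool:
    """Rough semantic check: does the utterance end on a dangling word?"""
    words = transcript.lower().rstrip(".?! ").split()
    return bool(words) and words[-1] not in INCOMPLETE_ENDINGS

def end_of_turn(transcript: str, silence_ms: int,
                short_ms: int = 400, long_ms: int = 1200) -> bool:
    """Decide whether the user has finished speaking."""
    if not transcript.strip():
        return False  # nothing was said yet
    if silence_ms >= long_ms:
        return True   # a long pause ends the turn regardless of wording
    # A short pause only counts if the sentence sounds finished.
    return silence_ms >= short_ms and looks_complete(transcript)
```

The two thresholds express the tradeoff directly: the short one keeps responses quick after a finished sentence, and the long one avoids cutting off a user who pauses mid-thought.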

Word error rate under 10% for your domain. General-purpose speech recognition might give you 12–15% WER, but for AI voice agents handling real business conversations, you need domain-adapted models. Names, product codes, and industry jargon are where generic models fall apart.
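
Word error rate is the word-level edit distance (substitutions, deletions, and insertions) divided by the number of words in the reference transcript. Here is a minimal implementation for checking your own recordings; it lowercases and splits on whitespace, which is a simplification compared with the text normalization used in formal benchmarks.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level Levenshtein distance / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")

    # prev[j] = edits needed to turn the reference prefix so far into hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)

wer = word_error_rate(
    "my account number is four two one",
    "my account number is for two one",
)
print(f"WER: {wer:.1%}")
```

One wrong word in a seven-word utterance is already above the 10% target, and in this example the wrong word is part of the account number.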

The LLM Gateway pattern can also help you optimize the LLM layer—routing requests to the fastest or cheapest model depending on query complexity.
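
A minimal sketch of that routing idea, using a keyword and length heuristic to decide which model handles a query. The model names, keywords, and threshold are placeholders, not recommendations, and a real gateway would likely use richer signals.

```python
import re

# Hypothetical signals that a query needs more reasoning.
REASONING_WORDS = {"why", "compare", "explain"}

def pick_model(query: str,
               fast_model: str = "small-fast-model",
               strong_model: str = "large-capable-model",
               max_fast_words: int = 12) -> str:
    """Route short, simple queries to the fast model and the rest to the strong one."""
    words = re.findall(r"[a-z']+", query.lower())
    needs_reasoning = bool(REASONING_WORDS.intersection(words))
    if needs_reasoning or len(words) > max_fast_words:
        return strong_model
    return fast_model

print(pick_model("What are your opening hours?"))
print(pick_model("Explain why my bill went up this month."))
```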

Get the best STT for voice agents

Universal-3 Pro Streaming delivers the accuracy and speed production voice agents require. Start with a free API key.

Picking the right path

The best API for building a speech-to-speech voice agent in 2026 depends on where you are in your journey. If you want to ship fast and don't need deep customization of every pipeline component, the Voice Agent API gets you there in hours, not weeks. One WebSocket, flat pricing, production-grade accuracy out of the box.

If you need full control—custom LLMs, specific TTS voices, proprietary orchestration logic, or integration with an existing framework—Universal-3 Pro Streaming gives you the best STT foundation to build on. Pair it with your own LLM and TTS choices and build a voice agent with a chained STT-LLM-TTS architecture that's tailored exactly to your requirements.

Either way, speech to speech is the foundation of every voice agent worth building. The cascaded architecture gives you the accuracy, observability, and control that production demands. End-to-end models will keep improving, but right now, the cascaded approach is how serious voice agents get built.

Start with the playground to see it in action. Then grab an API key and start building.

Frequently asked questions

What is speech to speech in voice agents?

Speech to speech is the core interaction pattern of a voice agent: the user speaks, the agent processes spoken language, and the agent speaks back. Audio in, audio out. It replaces traditional IVR phone trees with natural conversation. Under the hood, it typically uses a cascaded architecture where speech-to-text, an LLM, and text-to-speech work together in sequence. Less commonly, a single end-to-end model handles everything.

How do you build a speech-to-speech voice agent?

There are two main approaches. The fastest is using a managed API like AssemblyAI's Voice Agent API, which handles STT, LLM, and TTS through a single WebSocket—you stream audio in and get audio back. The other approach is building your own cascading pipeline by connecting a streaming STT service (like Universal-3 Pro Streaming), an LLM, and a TTS provider using an orchestration framework like LiveKit or Pipecat.

What is the cascading architecture for voice agents?

The cascading (or chained) architecture processes speech through three dedicated models in sequence: speech-to-text converts audio to text, an LLM generates a text response, and text-to-speech converts that response back to audio. It's the dominant architecture for production voice agents because it offers best-in-class accuracy at each stage, full observability through text transcripts, and the flexibility to swap any component independently.

What latency is acceptable for speech-to-speech voice agents?

The target is under 1 second total response time from when the user stops speaking to when they hear the first word of the agent's response. Research shows conversational pauses feel natural below 1 second. Above 1.5 seconds, users perceive the agent as slow or broken. Achieving this requires streaming at every stage—don't wait for one model to finish before starting the next. With proper streaming, a cascaded pipeline can consistently hit 600–900ms.

What is the best API for building a speech-to-speech voice agent?

For a fully managed solution, AssemblyAI's Voice Agent API offers a single WebSocket that handles the entire STT-LLM-TTS pipeline at $4.50/hr flat. It's built on Universal-3 Pro Streaming for best-in-class transcription accuracy. For teams that want to control each component, Universal-3 Pro Streaming at $0.45/hr provides the STT foundation, and you bring your own LLM and TTS. The right choice depends on whether you prioritize speed to market or full pipeline control.

Can speech-to-speech voice agents handle interruptions?

Yes—barge-in handling is a critical feature. When a user starts speaking while the agent is talking, the system needs to detect the interruption, stop the agent's audio playback, and start processing the new input. This requires real-time voice activity detection running continuously, not just during the agent's listening phase. AssemblyAI's Voice Agent API handles barge-in automatically. If you're building your own pipeline, you'll need to implement this in your orchestration layer.
