Three architectural approaches exist for building voice agents:
- Chained pipelines using separate speech recognition, language processing, and synthesis components
- Speech-to-Speech (Half-Cascade) that processes native audio input, uses text-based language reasoning, and generates speech output
- Native Audio models that reason directly in audio space within a single neural network
Each makes different tradeoffs between flexibility, latency, cost, production readiness, and audio quality preservation.
Understanding Speech-to-Speech Voice Agent Architecture
What Is Speech-to-Speech?
Speech-to-speech voice agents process audio with minimal delay – 200-300 milliseconds from user speech to agent response.
Two approaches exist. Half-cascade systems combine native audio input processing with text-based language model reasoning and speech synthesis output; native audio models handle everything within a single neural network that reasons directly in audio space.
Both encode incoming sound into vectors capturing linguistic content, tone, and emotion. They begin generating responses while the user is speaking or immediately after. Native audio maintains more audio information throughout processing, while half-cascade systems balance modularity with lower latency than chained architectures.
Chained vs. Speech-to-Speech Voice Agent Architecture
Chained pipelines follow a sequential flow: Voice → STT → LLM → TTS → Voice. Each component waits for the previous one to finish before processing. Speech-to-speech architectures stream input and output concurrently across the stack, reducing perceived delay in scenarios that involve rapid turn-taking or mid-utterance interactions.
| Aspect | Chained Voice Agent | Speech-to-Speech Voice Agent |
|---|---|---|
| STT Processing | Can stream partial transcripts, but waits for end-of-utterance to finalize | Continuously streams partial transcripts as user speaks |
| LLM Behavior | Waits for complete STT output before processing | Begins processing from partial input while user is still speaking |
| TTS Synthesis | Can stream audio chunks, but starts after LLM generates first chunks (TTFT) | Starts speaking immediately as first tokens are generated, fully streaming |
| Latency | Higher due to sequential handoffs between components | Lower – concurrent streaming across all components |
| Flexibility | High – easy to swap out STT, TTS, and LLM independently | Less flexible – components must support tight integration and real-time coordination |
| Risks / Challenges | Requires careful orchestration between components to minimize latency | Significantly higher cost (~10x chained pipeline); requires stream orchestration to avoid mishearing |
| User Experience | Structured and clear, but less dynamic; noticeable pauses between turns | Agent can begin replying before user finishes speaking; maintains emotional tone through audio processing |
| Best Use Cases | All use cases, especially when cost control and flexibility are priorities | Best when ultra-low latency is critical and budget allows (AI concierges, premium live support) |
| Technical Requirements | Moderate – most providers offer PaaS solutions; focus on linking components and fallback strategy | Moderate – cloud APIs handle infrastructure; high only if self-hosting open-source models |
Core Architectures for Voice AI Agents
Three fundamental architectural approaches exist for building voice AI agents. Each has distinct trade-offs in latency, flexibility, and naturalness:
1. Chained Pipeline (Cascaded STT→LLM→TTS Architecture)
Schema: Voice → STT → LLM → TTS → Voice
How it works:
The system converts speech to text, processes it through a language model, and turns it back into audio.
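A minimal sketch of the chained turn loop, shown below; `transcribe`, `generate_reply`, and `synthesize` are stubs standing in for whichever STT, LLM, and TTS providers you wire in.

```python
# Illustrative chained-pipeline turn loop. The three stubs stand in for
# real provider calls (e.g. Deepgram/Gladia for STT, GPT-4.1/Gemini for
# the LLM, Cartesia/ElevenLabs for TTS); swap them for real SDK calls.

def transcribe(audio: bytes) -> str:
    return "placeholder transcript"          # STT provider goes here

def generate_reply(history: list[dict]) -> str:
    return "placeholder reply"               # LLM provider goes here

def synthesize(text: str) -> bytes:
    return b"\x00\x00"                       # TTS provider goes here

def handle_turn(audio_in: bytes, history: list[dict]) -> bytes:
    text = transcribe(audio_in)                              # audio -> text
    history.append({"role": "user", "content": text})
    reply = generate_reply(history)                          # text -> reply
    history.append({"role": "assistant", "content": reply})
    return synthesize(reply)                                 # reply -> audio

# Each stage blocks on the previous one, which is exactly where the
# sequential-handoff latency described above comes from.
```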
Pros:
- Easy to build and debug;
- Works well with existing LLM APIs;
- Reliable and predictable;
- High flexibility – easy to swap out STT, TTS, and LLM independently.
Cons:
- High latency since each component waits for the previous one to complete;
- Loses tone and emotion when converting to text;
- Less natural feel, limited interruptibility.
Example implementations: Deepgram STT + GPT-4.1 + Cartesia TTS, Gladia STT + Gemini 2.5 Flash + ElevenLabs TTS
2. Speech-to-Speech (Half-Cascade Architecture)
Schema: Voice → Audio Encoder → Text-based LLM → TTS → Voice
How it works:
The model processes audio input directly through an encoder, uses a text-based language model to reason and respond, then generates speech via synthesis. This combines native audio input with text-based reasoning and speech output.
Google and OpenAI use this half-cascade architecture, balancing speed, performance, and reliability. This works well for production use and tool integration.
Pros:
- Lower latency with streaming capability;
- Retains tone and prosody cues;
- Natural conversational flow;
- More interruptible than chained pipeline.
Cons:
- Still has a separate LLM reasoning layer (text-native);
- TTS quality is lower than specialized TTS models (e.g., ElevenLabs, Cartesia) – voice sounds less natural and expressive;
- Less flexible than fully modular approach.
Example systems: Google Gemini Live 2.5 Flash, OpenAI Realtime API (gpt-realtime), Ultravox
3. Native Audio Model (End-to-End Speech-to-Speech AI)
Schema: Voice → Unified Model → Voice
How it works:
A single model listens, reasons, and speaks – all within one neural network. It encodes audio into latent vectors that capture meaning, emotion, and acoustic context, then directly generates output audio from those same representations.
Pros:
- Very low latency (true real-time);
- Maintains emotional tone and voice consistency;
- Most natural conversational quality;
- Supports full-duplex with natural interruptions.
Cons:
- Hard to train and control;
- Opaque reasoning (no clear text layer);
- Needs huge, high-quality audio datasets;
- Limited flexibility for voice customization.
Example systems: Gemini 2.5 Flash Native Audio, VITA-Audio, SALMONN-Omni, Moshi by Kyutai Labs
Gemini 2.5 Flash Native Audio (gemini-2.5-flash-native-audio-preview) provides true native audio processing – reasoning and generating speech natively in audio space. It includes affective (emotion-aware) dialogue, proactive audio capabilities, and "thinking" features. This represents Google's experimental approach to end-to-end audio reasoning without text intermediation.
Available Speech-to-Speech Models & Platforms
Commercial APIs and open-source projects provide speech-to-speech voice agents in 2025:
Leading Proprietary Platforms
OpenAI and Google offer three leading speech-to-speech voice models – two generally available and one in experimental preview:
| Feature | OpenAI Realtime API (gpt-realtime) | Google Gemini Live 2.5 Flash | Google Gemini 2.5 Flash Native Audio |
|---|---|---|---|
| Architecture Type | Half-Cascade (Speech-to-Speech) | Half-Cascade (Speech-to-Speech) | Native Audio (End-to-End) |
| Provider | OpenAI (also via Azure) | Google / DeepMind | Google / DeepMind |
| Model Type | Multimodal LLM with realtime audio streaming support | Multimodal flash LLM optimized for speed and interactivity | Native audio-to-audio model with affective dialogue and proactive audio capabilities |
| Latency – Time to First Token | ~280 ms | ~280 ms | ~200-250 ms (experimental) |
| Audio Input | Streaming audio via WebRTC + WebSocket API | Streaming audio via Multimodal Live API (likely gRPC-based) | Native audio streaming via Multimodal Live API |
| Token Generation Speed | ~70–100 tokens/second | ~155–160 tokens/second | N/A (generates audio directly, not token-based) |
| Hosting / Access | Cloud only (OpenAI API / Azure OpenAI Service) | Cloud only (Google AI Studio / Vertex AI) | Cloud only (Google AI Studio / Vertex AI) – Preview only |
| Developer Integration | Open-source reference stack with LiveKit + WebRTC + OpenAI Streaming API | Access via Google's Vertex AI or AI Studio; endpoint: gemini-2.5-flash-live-001 | Access via Google AI Studio; endpoint: gemini-2.5-flash-native-audio-preview |
| Multimodal Capabilities | Yes – audio input, speech output; also supports vision | Yes – audio, video, text input; supports images and rolling context in conversation | Yes – native audio reasoning with emotion awareness, "thinking" mode, proactive audio |
| Throughput Capacity | ~800K tokens/min, ~1,000 req/min (Azure OpenAI, realtime mode) | N/A (not publicly specified, but optimized for high concurrency and streaming) | N/A (experimental preview, not intended for production scale) |
| Production Readiness | Generally Available | Generally Available | Experimental Preview only – not production-ready |
Open-Source Alternatives
Two open-source projects offer alternatives to proprietary models:
| Feature | Ultravox (by Fixie.ai) | Moshi (by Kyutai Labs) |
|---|---|---|
| Architecture Type | Half-Cascade (Speech-to-Speech) | Native Audio (End-to-End) |
| Model Type | Multimodal LLM (audio + text encoder, outputs text) | Audio-to-audio LLM (integrated STT and TTS – speech in, speech out) |
| Architecture | Voice → LLM → Text (planned speech output in future versions) | Voice → LLM → Voice (fully integrated speech-to-speech pipeline) |
| Streaming Support | Streaming text output with low latency | Full-duplex streaming (supports overlap and interruption) |
| Time to First Token (TTFT) | ~190 ms (on smaller variant) | ~160 ms |
| Token Generation Speed | ~200+ tokens/sec | Not token-based; generates speech waveform directly |
| Base Models | Built on open LLMs (e.g., LLaMA 3 – 8B / 70B) | Proprietary foundation model trained by Kyutai |
| Audio Processing | Projects audio into same token space as text using custom audio encoder | End-to-end audio encoder and decoder (neural codec pipeline) |
| Output Type | Text (for now), with plans for speech token output | Audio (neural codec speech) |
| Hosting / Deployment | Self-hostable; requires GPU infra, especially for 70B variant | Self-hostable (heavy); public demo available at moshi.chat |
| Open-Source Status | Fully open: model weights, architecture, and code available on GitHub | Fully open: code and demos available; weights provided (early stage) |
| Extensibility | Can plug in any open-weight LLM; attach custom audio projector | Closed model structure for now; focused on turnkey audio-agent use |
| Use Case Fit | Voice-enabled bots with real-time understanding, using custom TTS for output | Full voice agents with natural interruptions and direct speech response |
Integration Frameworks and Tools
Integration frameworks include Pipecat (vendor-agnostic voice agent framework used by Daily.co), LiveKit (WebRTC streaming, used by OpenAI), and FastRTC (Python streaming audio). For comprehensive platform comparisons including deployment options and integration approaches, see the voice agent platform guide. Developers can assemble open-source speech recognition (Vosk, NeMo) and TTS (VITS, FastSpeech) components into speech-to-speech agents without using end-to-end models.
Performance Metrics That Matter
Three metrics determine voice agent performance: speed (time to first token), accuracy (word error rate), and processing efficiency (real-time factor).
Time to First Token (TTFT)
Time to First Token (TTFT) measures latency from end-of-user-speech to start-of-agent-speech. Current models achieve TTFT in the 200-300 millisecond range: Google's Gemini Flash logs ~280 ms, OpenAI's GPT-4o realtime ~250-300 ms. Human response latencies in conversation average around 200 ms.
Network latency affects cloud API measurements, so real-world TTFT runs higher than lab values. Published TTFT figures are often measured in controlled settings rather than end-to-end, so treat vendor numbers as lower bounds.
Lower TTFT is better, though extremely low values may indicate the model responds before fully processing user intent.
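A minimal measurement sketch, assuming you can timestamp two events in your own pipeline: when VAD/ASR declares end-of-utterance, and when the first agent audio chunk arrives.

```python
import time

class TTFTMeter:
    """Measures end-of-user-speech to first agent audio, in milliseconds."""

    def __init__(self) -> None:
        self.end_of_speech: float | None = None

    def on_end_of_utterance(self) -> None:
        # Call when your VAD/ASR finalizes the user's turn.
        self.end_of_speech = time.monotonic()

    def on_first_agent_audio(self) -> float:
        # Call when the first agent audio chunk is received from the API.
        assert self.end_of_speech is not None, "end of utterance not marked"
        return (time.monotonic() - self.end_of_speech) * 1000.0
```

Measured this way, TTFT includes network round trips, so it will typically exceed vendor-published figures.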
Word Error Rate (WER)
Word Error Rate (WER) measures the percentage of words incorrectly recognized in the transcript. Lower WER means more accurate transcription. Meta AI's research on streaming LLM-based ASR achieved ~3.0% WER on Librispeech test-clean (~7.4% on test-other) in real-time mode, approaching offline model accuracy.
Recognition errors can lead the LLM astray. Cloud providers publish WER on benchmarks, but real-world WER runs higher. Real-time agents may correct some ASR errors via context, though lower baseline WER remains preferable.
Domain adaptation through custom vocabulary or fine-tuning helps with specialized terminology.
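For reference, a self-contained WER calculation over whitespace-separated words (production evaluation pipelines usually apply text normalization such as lowercasing and punctuation removal first):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count,
    computed as word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("turn on the kitchen lights",
                      "turn on the kitten lights"))  # 0.2 -> 20% WER
```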
Real-Time Factor (RTF)
Real-Time Factor (RTF) measures processing speed relative to input duration. RTF < 1.0 means the system processes faster than real time. Each component has its own RTF: STT engines typically process at 0.2× real time, LLMs generate at 50+ tokens/sec, modern TTS synthesizes at RTF 0.1 or better (10 seconds of speech generated in 1 second).
Systems must maintain RTF < 1 under load to prevent latency accumulation. Smaller models often achieve better RTF at the cost of language quality, making token generation speed a determining factor for ultra-low latency requirements.
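A minimal RTF check with illustrative numbers:

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF < 1.0 means the component processes faster than real time."""
    return processing_seconds / audio_seconds

# Illustrative: 2 s of compute to transcribe 10 s of audio -> RTF 0.2,
# so the STT stage comfortably keeps up with a live stream.
print(real_time_factor(2.0, 10.0))  # 0.2

# Monitor per-component RTF under production load; if any stage drifts
# toward 1.0, latency starts accumulating turn over turn.
```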
Cost Analysis and Scalability for Speech-to-Speech Voice Agents
Speech-to-speech voice agent costs break down into five categories: cloud API usage, self-hosting compute, scalability limits, bandwidth, and enterprise overhead.
| Cost Category | Description | Examples / Benchmarks | Key Considerations |
|---|---|---|---|
| Usage-Based Pricing (Cloud APIs) | Pay-per-token/minute for any architecture (STT, LLM, TTS, or integrated multimodal) | OpenAI Realtime: ~$0.30/min (baseline), increases significantly with turns; Gemini Live: ~$0.22/min (baseline), increases with turns; Gemini Native Audio: ~$0.50/min with typical conversation turns (experimental); Chained pipeline: ~$0.15/min (no context accumulation) | Speech-to-speech models accumulate context, which dramatically increases costs with conversation turns; chained pipelines maintain consistent per-minute pricing |
| Compute Costs (Self-Hosting) | Run open-source models like Ultravox/Moshi on your own infra | Hosting Ultravox 70B may need A100/H100 GPU per concurrent session; GPU costs: ~$2–$3/hr (cloud) | Lower marginal cost at scale; Requires infra & DevOps team; Harder to spin up instantly |
| Scalability / Rate Limits | Limits on concurrent sessions, tokens per minute, request rate | OpenAI GPT-4o: 800K tokens/min, 1K requests/min; Enterprise: up to 30M tokens/min | Watch for WebSocket caps or long-lived session constraints; Request enterprise quotas if needed |
| Bandwidth Overhead | Cost of streaming audio data over network | ~8–64 kbps per stream; Telephony codecs (e.g. G.711 vs G.729) can affect costs | Minor cost per stream, but adds up at scale; Ensure egress limits aren't exceeded in cloud setups |
| Enterprise Overhead | SLAs, premium support, custom deployments, fallback systems | Regional/on-prem hosting; Redundancy systems (e.g. backup STT or fallback bots) | Adds reliability and control; Contractual/licensing complexity increases total cost of ownership |
Understanding Speech-to-Speech Pricing
Speech-to-speech models like GPT-4o Realtime and Gemini 2.5 Flash Live have different cost structures than chained STT/TTS pipelines. For detailed provider comparisons with latency benchmarks and accuracy metrics, see the complete STT and TTS selection guide. Three factors drive higher costs:
- Proprietary multimodal infrastructure – These models require specialized neural architectures that process audio natively, maintaining acoustic features throughout the pipeline rather than collapsing to text
- Cloud-only deployment – No self-hosting option means paying for enterprise-grade streaming infrastructure, low-latency global endpoints, and WebRTC/gRPC orchestration
- Advanced real-time capabilities – Support for interruptions, emotional tone preservation, and sub-300ms latency requires substantial compute resources per session
Real-world cost reports from OpenAI's developer community:
- $3 spent on "a few short test conversations" in the playground (simple questions like bedtime stories)
- $10 consumed during weekend integration testing, leading developers to call the API "unusable at the moment" due to cost
- Costs increase per minute as conversations get longer – in a 15-minute session, one developer reported $5.28 for audio input vs $0.65 for output. This happens because tokens accumulate in the context window, and the model re-charges for all previous tokens on each turn, making longer conversations disproportionately more expensive
User-reported costs differ from official per-minute estimates because actual costs depend on conversation length (context accumulation), system prompt size (larger prompts = more tokens per turn), and conversation complexity (more back-and-forth = more context to maintain). A 5-minute conversation might cost $0.30/min, while a 30-minute conversation could cost $1.50/min or more due to accumulated context.
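A rough back-of-the-envelope model of this effect, shown below. The per-token prices are placeholders (not current list prices), and it assumes the full prior context is re-billed on every turn, matching the behavior reported above but ignoring any context caching or cached-token discounts a provider may offer.

```python
# Illustrative cost model for context accumulation in a speech-to-speech session.
# Prices and token counts are placeholder assumptions, not published rates.
AUDIO_IN_PER_1K_TOKENS = 0.10    # hypothetical $ per 1K input tokens
AUDIO_OUT_PER_1K_TOKENS = 0.20   # hypothetical $ per 1K output tokens

def session_cost(turns: int, tokens_per_user_turn: int = 300,
                 tokens_per_reply: int = 300,
                 system_prompt_tokens: int = 500) -> float:
    total = 0.0
    context = system_prompt_tokens
    for _ in range(turns):
        context += tokens_per_user_turn
        total += context / 1000 * AUDIO_IN_PER_1K_TOKENS        # whole context billed as input
        total += tokens_per_reply / 1000 * AUDIO_OUT_PER_1K_TOKENS
        context += tokens_per_reply                             # reply joins the context
    return total

for turns in (5, 15, 30):
    print(f"{turns} turns: ${session_cost(turns):.2f}")
# Input cost grows roughly quadratically with turn count, which is why
# per-minute cost climbs as conversations get longer.
```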
Native Audio Models (Moshi, VITA-Audio) are early-stage and experimental. While they promise the lowest latency and most natural interactions, they:
- Are mostly research projects, not production-ready
- Require significant GPU resources for self-hosting (A100/H100 class)
- Lack the ecosystem support, tooling, and reliability of commercial offerings
- Offer limited voice customization and control compared to modular approaches
Match Cost Strategy to Deployment Scale
Early-stage projects with low volume benefit from cloud APIs: fast setup, predictable pricing, pay-per-use. As usage grows, self-hosting economics may improve, particularly when requiring tight control, data locality, or custom model tuning.
Enterprise scale depends on reliability, rate limits, support agreements, and long-term flexibility – not just price per minute. Total cost of ownership (TCO) includes processing minutes, bandwidth, DevOps effort, redundancy, and support.
Cost calculation for specific scenarios: average conversation length (minutes) × conversations per day × days per month × per-minute price = monthly cost. For example, 3-minute conversations × 500 per day × 30 days × $0.30/min ≈ $13,500/month. Compare against self-hosting infrastructure investment. Monitor usage limits and enterprise tier requirements.
Technical Implementation Challenges in Speech-to-Speech Voice Agent Deployment
Deploying to production requires integrating streaming, connecting to telephony, handling noise, and orchestrating streams.
Streaming Integration (WebRTC, WebSockets, etc.)
Low latency requires appropriate streaming mechanisms. Three options: WebRTC, WebSockets, and streaming HTTP/gRPC.
WebRTC
Web Real-Time Communication is the standard for low-latency audio/video streaming in browsers and mobile apps. Uses UDP for fast transmission and handles packet loss gracefully. Both OpenAI and Google use WebRTC for client-side audio capture and playback.
Browser and mobile app interactions use WebRTC to send microphone audio to the server. The WebRTC stack includes Acoustic Echo Cancellation (AEC), noise reduction, and automatic gain control (AGC). Libraries like LiveKit, mediasoup, or Twilio provide WebRTC integration.
WebSockets and gRPC
Server-side connections between application servers and AI services use persistent bidirectional connections. OpenAI's voice API uses WebSockets – client sends audio chunks and receives tokens continuously. Google's API uses gRPC streaming over HTTP/2.
Both provide continuous streams rather than discrete HTTP requests. Implementation requires proper binary audio frame handling and maintaining open connections for conversation duration.
Audio Encoding
Audio format choice depends on API requirements. PCM raw audio is simple but bulky. Opus codec (used by WebRTC) provides high quality at low bitrate, though not all APIs accept Opus packets. Some APIs accept WAV or FLAC frames.
Compressed codecs save bandwidth for mobile users. Phone calls use G.711 µ-law 8kHz, requiring transcoding to 16kHz linear PCM for most ASR systems (Whisper, DeepSpeech).
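A minimal transcoding sketch using the standard-library audioop module (deprecated since Python 3.11 and removed in 3.13, where a drop-in package such as audioop-lts or a DSP library is needed instead):

```python
import audioop  # stdlib through Python 3.12; removed in 3.13

def pstn_to_asr_pcm(ulaw_8k: bytes, state=None):
    """Convert G.711 u-law 8 kHz telephony audio to 16-bit linear PCM at 16 kHz."""
    pcm_8k = audioop.ulaw2lin(ulaw_8k, 2)                              # u-law -> 16-bit PCM
    pcm_16k, state = audioop.ratecv(pcm_8k, 2, 1, 8000, 16000, state)  # resample 8 kHz -> 16 kHz
    return pcm_16k, state

# Keep `state` across chunks of a streamed call so the resampler stays
# continuous at chunk boundaries; feed the 16 kHz PCM to the ASR.
```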
Latency Tuning
Streaming systems use buffers to smooth network variation. WebRTC jitter buffers trade smooth audio for added delay. Default WebRTC parameters suffice for most deployments.
WebSocket implementations send data immediately (20ms audio frame every 20ms) without batching. Most WebSocket libraries disable Nagle's algorithm by default to avoid delaying small packets.
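A minimal pacing sketch using the third-party websockets library; the endpoint URL and raw-bytes message format are placeholders, since real APIs wrap audio frames in their own JSON or binary envelope.

```python
import asyncio
import websockets  # third-party: pip install websockets

FRAME_MS = 20
SAMPLE_RATE = 16000
BYTES_PER_FRAME = SAMPLE_RATE * 2 * FRAME_MS // 1000  # 16-bit mono PCM

async def stream_microphone(pcm_frames, uri="wss://example.invalid/realtime"):
    """Send one 20 ms PCM frame every 20 ms instead of batching.

    `pcm_frames` is assumed to be an async iterator yielding
    BYTES_PER_FRAME-sized chunks from your capture layer.
    """
    async with websockets.connect(uri) as ws:
        async for frame in pcm_frames:
            await ws.send(frame)                   # push the frame immediately
            # Pace at real-time rate; skip this sleep if the capture layer
            # already delivers frames on a real-time clock.
            await asyncio.sleep(FRAME_MS / 1000)
```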
Handling Network Issues
WebRTC handles packet loss through loss concealment, filling missing audio chunks with plausible noise. WebSockets run over TCP, so lost packets are retransmitted (adding delay) rather than concealed; ASR systems handle the resulting minor gaps reasonably well on decent networks.
Output packet loss can cause audio blips. Some systems use redundant packets or forward error correction on unreliable networks.
Many implementations combine approaches: WebRTC from client to relay server, then WebSocket from server to AI API. OpenAI's example follows this pattern. WebRTC handles unpredictable client networks while WebSocket simplifies AI model interfacing.
Telephony Integration (8 kHz and PSTN)
Call Quality Challenges
Phone deployments reveal quality issues absent in web-based implementations. Standard PSTN uses 8 kHz audio (G.711 codec), which severely degrades both speech recognition accuracy and TTS naturalness compared to 16 kHz+ web audio.
Most high-quality ASR models (including GPT-4o Realtime, Gemini Live, Whisper) train primarily on 16 kHz audio, so 8 kHz telephony input reduces their accuracy significantly.
Provider Support
Twilio's standard codecs operate at 8 kHz with limited support for higher-quality audio streaming needed for AI models. Telnyx offers native 16 kHz support via G.722 wideband codec through their owned infrastructure, but requires more expertise to configure properly.
Speech-to-speech models (GPT-4o Realtime, Gemini Live) optimized for high-quality web audio don't perform as well over standard PSTN. Their latency and integration benefits disappear over phone while premium pricing remains. This makes chained STT/LLM/TTS pipelines with telephony-optimized components often more reliable and cost-effective for phone-based deployments.
SIP and VoIP Integration
Telephony integration uses services like Twilio, Nexmo, or on-premises SIP systems. These provide audio via WebSocket (Twilio streams 8k PCM in real time) or media servers. Architecture must ingest these streams and connect to the AI pipeline.
DTMF and Control
Telephony providers detect DTMF tones (touch-tone input) out-of-band to avoid confusing ASR. Twilio sends webhook events for DTMF. Speech-to-speech voice agents minimize DTMF menus but users may still attempt touch-tone input.
Telephony Latency
Phone networks add 100-200ms fixed latency. Processing pipelines should minimize additional overhead. Hosting AI services near telephony ingress points reduces roundtrip latency.
Human Agent Handoffs
Human agent handoffs benefit from passing conversation context. AI conversations that escalate after collecting information should provide transcribed summaries to avoid user repetition.
Handling Background Noise & Voice Variability
Noise Suppression
Noise suppression algorithms applied before ASR improve recognition accuracy. ML models like RNNoise remove background noise (keyboard sounds, fans) in real time. Picovoice's Koala demonstrates intelligibility improvements.
Tradeoff: slightly distorts voice and consumes extra CPU.
Microphone Differences
Audio quality varies across headsets, speakerphones, and car Bluetooth systems (frequency response, echo). Echo cancellation prevents the agent's voice from being picked up by the microphone. WebRTC's AEC handles most cases.
Telephone scenarios rely on network echo cancellers or require adaptive echo cancelers in the pipeline.
VAD and Barge-In
Voice Activity Detection (VAD) distinguishes speech from noise. Noisy conditions cause false positives/negatives. Combining VAD with ASR confidence improves accuracy. Treat silence as end-of-utterance only when ASR confirms finality.
Continue assuming speech while ASR generates transcribed words. End turn after 500ms silence. Barge-in requires monitoring microphone during agent speech to stop TTS when user interrupts.
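A minimal endpointing and barge-in sketch, assuming your VAD supplies an is_speech flag per audio frame and your streaming ASR a finality flag:

```python
import time

END_OF_TURN_SILENCE_S = 0.5   # end the user's turn after ~500 ms of silence

class TurnTracker:
    """Combines VAD and ASR signals to decide end-of-turn and barge-in.

    `is_speech` comes from your VAD, `asr_final` from your streaming ASR;
    both are assumed hooks into whatever components you deploy.
    """

    def __init__(self) -> None:
        self.last_speech_ts = time.monotonic()
        self.agent_speaking = False   # set True while TTS audio is playing

    def on_audio_frame(self, is_speech: bool, asr_final: bool) -> str | None:
        now = time.monotonic()
        if is_speech:
            self.last_speech_ts = now
            if self.agent_speaking:
                return "barge_in"       # stop TTS playback immediately
        elif asr_final and now - self.last_speech_ts > END_OF_TURN_SILENCE_S:
            return "end_of_turn"        # hand the finalized utterance to the LLM
        return None
```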
Accents and Languages
Diverse user bases require testing across accents and dialects. Cloud ASRs support accent/locale specifications for improved accuracy. Open models benefit from fine-tuning on accented data.
Bilingual support requires models supporting multiple languages (Google, OpenAI). Multi-language detection works through auto-detection or routing to language-specific models.
Stream Management and Orchestration
Continuous conversation streams require managing concurrent input/output and conversation state.
Half-Duplex vs Full-Duplex
Most systems use half-duplex with barge-in – users can interrupt agents, but agents don't interrupt users except for short backchannel utterances ("uh-huh", "I see"). Backchannel implementation requires detecting pauses and generating quick responses without disrupting ASR.
Prompt Management
Persistent conversation state requires maintaining rolling prompts for the LLM. APIs with persistent sessions handle this up to context limits. Manual implementations append each utterance and reply.
Long conversations require summarizing older content to stay within context windows. Important user-provided facts need re-injection into prompts as needed.
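A minimal rolling-history sketch; `summarize` is a hypothetical callable (for example, a cheap LLM call) that condenses older turns into one string.

```python
MAX_HISTORY_TURNS = 20   # threshold before folding old turns into a summary

def update_history(history: list[dict], role: str, content: str,
                   summarize=None) -> list[dict]:
    """Append a turn; when the window grows too long, replace the oldest
    turns with a single summary message so the prompt stays within the
    model's context limit."""
    history.append({"role": role, "content": content})
    if summarize is not None and len(history) > MAX_HISTORY_TURNS:
        old, recent = history[:-10], history[-10:]
        summary = {"role": "system",
                   "content": "Summary of earlier conversation: " + summarize(old)}
        history = [summary] + recent
    return history
```

Facts the user states early (names, account numbers, preferences) are worth extracting separately and re-injecting into the prompt, since summaries can drop them.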
Ensuring Required Steps
Flows requiring specific actions (identity verification, mandatory questions) benefit from checkpoints. Teams can implement checkpoints through LLM prompt instructions or external state machines.
Some systems prevent sending queries to LLM until prerequisite steps complete, or override LLM responses that skip required actions. This combines rule-based flow with AI – trusting AI for understanding and generation while enforcing action sequences.
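A minimal external state-machine sketch for enforcing prerequisite steps; the step names and gating policy are illustrative.

```python
REQUIRED_STEPS = ["verify_identity", "confirm_account"]  # illustrative step names

class FlowGuard:
    """Tracks which required steps have completed and gates the LLM call.

    Steps are marked done by your tool-call or verification handlers;
    this class only decides whether the flow may proceed.
    """

    def __init__(self) -> None:
        self.completed: set[str] = set()

    def mark_done(self, step: str) -> None:
        self.completed.add(step)

    def next_required(self) -> str | None:
        for step in REQUIRED_STEPS:
            if step not in self.completed:
                return step
        return None

# Before forwarding a user query to the LLM:
guard = FlowGuard()
if (step := guard.next_required()) is not None:
    # Override the turn: prompt for the missing step instead of answering.
    print(f"Before we continue, we need to {step.replace('_', ' ')}.")
```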
Speech-to-speech agent orchestration requires managing concurrent input/output streams. Best practices and libraries exist for common patterns. Testing should include scenarios like users interrupting agent speech to verify barge-in logic stops TTS promptly.
Conclusion
Speech-to-speech voice agents reduce latency to 200-300ms, approaching human response times. Proprietary platforms (OpenAI GPT-4o, Google Gemini 2.5 Flash) and open-source options (Ultravox, Moshi) are production-ready or nearing maturity.
Architecture choice depends on deployment environment and constraints:
- Chained Pipeline – Voice → STT → LLM → TTS → Voice – provides maximum flexibility and reliability but higher latency
- Speech-to-Speech (Half-Cascade) – Voice → Audio Encoder → Text-based LLM → TTS → Voice – balances performance with production readiness, but at significantly higher cost
- Native Audio – Voice → Unified Model → Voice – offers the lowest latency and most natural interactions, but remains experimental and not production-ready
Implementation factors:
- Performance requirements: TTFT, WER, and RTF targets for the use case
- Cost structure: Cloud APIs vs. self-hosting economics at expected scale
- Technical complexity: Streaming integration, telephony connectivity, noise handling
- Deployment environment: Phone systems (8kHz PSTN) vs. web-based (16kHz+ audio)
System design includes audio streaming, orchestration, testing, and optimization for specific constraints. Cloud APIs enable rapid prototyping. Production deployment requires testing with real user patterns and audio conditions.