Three architectural approaches exist for building voice agents:
- Chained pipelines using separate speech recognition, language processing, and synthesis components
- Speech-to-Speech (Half-Cascade) that processes native audio input, uses text-based language reasoning, and generates speech output
- Native Audio models that reason directly in audio space within a single neural network
Each makes different tradeoffs between flexibility, latency, cost, production readiness, and audio quality preservation.
Understanding Speech-to-Speech Voice Agent Architecture
What Is Speech-to-Speech?
Speech-to-speech voice agents process audio with minimal delay – 200-300 milliseconds from user speech to agent response.
Two approaches exist. Half-cascade systems combine native audio input processing with text-based language model reasoning and speech synthesis output; native audio models handle everything within a single neural network that reasons directly in audio space.
Both encode incoming sound into vectors capturing linguistic content, tone, and emotion. They begin generating responses while the user is speaking or immediately after. Native audio maintains more audio information throughout processing, while half-cascade systems balance modularity with lower latency than chained architectures.
Chained vs. Speech-to-Speech Voice Agent Architecture
Chained pipelines follow a sequential flow: Voice → STT → LLM → TTS → Voice. Each component waits for the previous one to finish before processing. Speech-to-speech architectures stream input and output concurrently across the stack, reducing perceived delay in scenarios that involve rapid turn-taking or mid-utterance interactions.
| Aspect | Chained Voice Agent | Speech-to-Speech Voice Agent |
|---|---|---|
| STT Processing | Can stream partial transcripts, but waits for end-of-utterance to finalize | Continuously streams partial transcripts as user speaks |
| LLM Behavior | Waits for complete STT output before processing | Begins processing from partial input while user is still speaking |
| TTS Synthesis | Can stream audio chunks, but starts after LLM generates first chunks (TTFT) | Starts speaking immediately as first tokens are generated, fully streaming |
| Latency | Higher due to sequential handoffs between components | Lower – concurrent streaming across all components |
| Flexibility | High – easy to swap out STT, TTS, and LLM independently | Less flexible – components must support tight integration and real-time coordination |
| Risks / Challenges | Requires careful orchestration between components to minimize latency | Significantly higher cost (~10x chained pipeline); requires stream orchestration to avoid mishearing |
| User Experience | Structured and clear, but less dynamic; noticeable pauses between turns | Agent can begin replying before user finishes speaking; maintains emotional tone through audio processing |
| Best Use Cases | All use cases, especially when cost control and flexibility are priorities | Best when ultra-low latency is critical and budget allows (AI concierges, premium live support) |
| Technical Requirements | Moderate – most providers offer PaaS solutions; focus on linking components and fallback strategy | Moderate – cloud APIs handle infrastructure; high only if self-hosting open-source models |
Core Architectures for Voice AI Agents
Three fundamental architectural approaches exist for building voice AI agents. Each has distinct trade-offs in latency, flexibility, and naturalness:
1. Chained Pipeline (Cascaded STT→LLM→TTS Architecture)
Schema: Voice → STT → LLM → TTS → Voice
How it works:
The system converts speech to text, processes it through a language model, and turns it back into audio.
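A minimal sketch of the chained turn loop, shown below; `transcribe`, `generate_reply`, and `synthesize` are stubs standing in for whichever STT, LLM, and TTS providers you wire in.

```python
# Illustrative chained-pipeline turn loop. The three stubs stand in for
# real provider calls (e.g. Deepgram/Gladia for STT, GPT-4.1/Gemini for
# the LLM, Cartesia/ElevenLabs for TTS); swap them for real SDK calls.

def transcribe(audio: bytes) -> str:
    return "placeholder transcript"          # STT provider goes here

def generate_reply(history: list[dict]) -> str:
    return "placeholder reply"               # LLM provider goes here

def synthesize(text: str) -> bytes:
    return b"\x00\x00"                       # TTS provider goes here

def handle_turn(audio_in: bytes, history: list[dict]) -> bytes:
    text = transcribe(audio_in)                              # audio -> text
    history.append({"role": "user", "content": text})
    reply = generate_reply(history)                          # text -> reply
    history.append({"role": "assistant", "content": reply})
    return synthesize(reply)                                 # reply -> audio

# Each stage blocks on the previous one, which is exactly where the
# sequential-handoff latency described above comes from.
```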
Pros:
- Easy to build and debug;
- Works well with existing LLM APIs;
- Reliable and predictable;
- High flexibility – easy to swap out STT, TTS, and LLM independently.
Cons:
- High latency since each component waits for the previous one to complete;
- Loses tone and emotion when converting to text;
- Less natural feel, limited interruptibility.
Example implementations: Deepgram STT + GPT-4.1 + Cartesia TTS, Gladia STT + Gemini 2.5 Flash + ElevenLabs TTS
2. Speech-to-Speech (Half-Cascade Architecture)
Schema: Voice → Audio Encoder → Text-based LLM → TTS → Voice
How it works:
The model processes audio input directly through an encoder, uses a text-based language model to reason and respond, then generates speech via synthesis. This combines native audio input with text-based reasoning and speech output.
Google and OpenAI use this half-cascade architecture, balancing speed, performance, and reliability. This works well for production use and tool integration.
Pros:
- Lower latency with streaming capability;
- Retains tone and prosody cues;
- Natural conversational flow;
- More interruptible than chained pipeline.
Cons:
- Still has a separate LLM reasoning layer (text-native);
- TTS quality is lower than specialized TTS models (e.g., ElevenLabs, Cartesia) – voice sounds less natural and expressive;
- Less flexible than fully modular approach.
Example systems: Google Gemini Live 2.5 Flash, OpenAI Realtime API (gpt-realtime), Ultravox
3. Native Audio Model (End-to-End Speech-to-Speech AI)
Schema: Voice → Unified Model → Voice
How it works:
A single model listens, reasons, and speaks – all within one neural network. It encodes audio into latent vectors that capture meaning, emotion, and acoustic context, then directly generates output audio from those same representations.
Pros:
- Very low latency (true real-time);
- Maintains emotional tone and voice consistency;
- Most natural conversational quality;
- Supports full-duplex with natural interruptions.
Cons:
- Hard to train and control;
- Opaque reasoning (no clear text layer);
- Needs huge, high-quality audio datasets;
- Limited flexibility for voice customization.
Example systems: Gemini 2.5 Flash Native Audio, VITA-Audio, SALMONN-Omni, Moshi by Kyutai Labs
Gemini 2.5 Flash Native Audio (gemini-2.5-flash-native-audio-preview) provides true native audio processing – reasoning and generating speech natively in audio space. It includes affective (emotion-aware) dialogue, proactive audio capabilities, and "thinking" features. This represents Google's experimental approach to end-to-end audio reasoning without text intermediation.
Available Speech-to-Speech Models & Platforms
Commercial APIs and open-source projects provide speech-to-speech voice agents in 2025:
Leading Proprietary Platforms
OpenAI and Google offer three leading speech-to-speech voice models – two generally available and one in experimental preview:
| Feature | OpenAI Realtime API (gpt-realtime) | Google Gemini Live 2.5 Flash | Google Gemini 2.5 Flash Native Audio |
|---|---|---|---|
| Architecture Type | Half-Cascade (Speech-to-Speech) | Half-Cascade (Speech-to-Speech) | Native Audio (End-to-End) |
| Provider | OpenAI (also via Azure) | Google / DeepMind | Google / DeepMind |
| Model Type | Multimodal LLM with realtime audio streaming support | Multimodal flash LLM optimized for speed and interactivity | Native audio-to-audio model with affective dialogue and proactive audio capabilities |
| Latency – Time to First Token | ~280 ms | ~280 ms | ~200-250 ms (experimental) |
| Audio Input | Streaming audio via WebRTC + WebSocket API | Streaming audio via Multimodal Live API (likely gRPC-based) | Native audio streaming via Multimodal Live API |
| Token Generation Speed | ~70–100 tokens/second | ~155–160 tokens/second | N/A (generates audio directly, not token-based) |
| Hosting / Access | Cloud only (OpenAI API / Azure OpenAI Service) | Cloud only (Google AI Studio / Vertex AI) | Cloud only (Google AI Studio / Vertex AI) – Preview only |
| Developer Integration | Open-source reference stack with LiveKit + WebRTC + OpenAI Streaming API | Access via Google's Vertex AI or AI Studio; endpoint: gemini-2.5-flash-live-001 | Access via Google AI Studio; endpoint: gemini-2.5-flash-native-audio-preview |
| Multimodal Capabilities | Yes – audio input, speech output; also supports vision | Yes – audio, video, text input; supports images and rolling context in conversation | Yes – native audio reasoning with emotion awareness, "thinking" mode, proactive audio |
| Throughput Capacity | ~800K tokens/min, ~1,000 req/min (Azure OpenAI, realtime mode) | N/A (not publicly specified, but optimized for high concurrency and streaming) | N/A (experimental preview, not intended for production scale) |
| Production Readiness | Generally Available | Generally Available | Experimental Preview only – not production-ready |
Open-Source Alternatives
Two open-source projects offer alternatives to proprietary models:
| Feature | Ultravox (by Fixie.ai) | Moshi (by Kyutai Labs) |
|---|---|---|
| Architecture Type | Half-Cascade (Speech-to-Speech) | Native Audio (End-to-End) |
| Model Type | Multimodal LLM (audio + text encoder, outputs text) | Audio-to-audio LLM (integrated STT and TTS – speech in, speech out) |
| Architecture | Voice → LLM → Text (planned speech output in future versions) | Voice → LLM → Voice (fully integrated speech-to-speech pipeline) |
| Streaming Support | Streaming text output with low latency | Full-duplex streaming (supports overlap and interruption) |
| Time to First Token (TTFT) | ~190 ms (on smaller variant) | ~160 ms |
| Token Generation Speed | ~200+ tokens/sec | Not token-based; generates speech waveform directly |
| Base Models | Built on open LLMs (e.g., LLaMA 3 – 8B / 70B) | Proprietary foundation model trained by Kyutai |
| Audio Processing | Projects audio into same token space as text using custom audio encoder | End-to-end audio encoder and decoder (neural codec pipeline) |
| Output Type | Text (for now), with plans for speech token output | Audio (neural codec speech) |
| Hosting / Deployment | Self-hostable; requires GPU infra, especially for 70B variant | Self-hostable (heavy); public demo available at moshi.chat |
| Open-Source Status | Fully open: model weights, architecture, and code available on GitHub | Fully open: code and demos available; weights provided (early stage) |
| Extensibility | Can plug in any open-weight LLM; attach custom audio projector | Closed model structure for now; focused on turnkey audio-agent use |
| Use Case Fit | Voice-enabled bots with real-time understanding, using custom TTS for output | Full voice agents with natural interruptions and direct speech response |
Integration Frameworks and Tools
Integration frameworks include Pipecat (vendor-agnostic voice agent framework used by Daily.co), LiveKit (WebRTC streaming, used by OpenAI), and FastRTC (Python streaming audio). For comprehensive platform comparisons including deployment options and integration approaches, see the voice agent platform guide. Developers can assemble open-source speech recognition (Vosk, NeMo) and TTS (VITS, FastSpeech) components into speech-to-speech agents without using end-to-end models.
Performance Metrics That Matter
Three metrics determine voice agent performance: speed (time to first token), accuracy (word error rate), and processing efficiency (real-time factor).
Time to First Token (TTFT)
Time to First Token (TTFT) measures latency from end-of-user-speech to start-of-agent-speech. Current models achieve TTFT in the 200-300 millisecond range: Google's Gemini Flash logs ~280 ms, OpenAI's GPT-4o realtime ~250-300 ms. Human response latencies in conversation average around 200 ms.
Network latency affects cloud API measurements, so real-world TTFT runs higher than lab values. Published TTFT figures are often measured in controlled settings rather than end-to-end, so treat vendor numbers as lower bounds.
Lower TTFT is better, though extremely low values may indicate the model responds before fully processing user intent.
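A minimal measurement sketch, assuming you can timestamp two events in your own pipeline: when VAD/ASR declares end-of-utterance, and when the first agent audio chunk arrives.

```python
import time

class TTFTMeter:
    """Measures end-of-user-speech to first agent audio, in milliseconds."""

    def __init__(self) -> None:
        self.end_of_speech: float | None = None

    def on_end_of_utterance(self) -> None:
        # Call when your VAD/ASR finalizes the user's turn.
        self.end_of_speech = time.monotonic()

    def on_first_agent_audio(self) -> float:
        # Call when the first agent audio chunk is received from the API.
        assert self.end_of_speech is not None, "end of utterance not marked"
        return (time.monotonic() - self.end_of_speech) * 1000.0
```

Measured this way, TTFT includes network round trips, so it will typically exceed vendor-published figures.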
Word Error Rate (WER)
Word Error Rate (WER) measures the percentage of words incorrectly recognized in the transcript. Lower WER means more accurate transcription. Meta AI's research on streaming LLM-based ASR achieved ~3.0% WER on Librispeech test-clean (~7.4% on test-other) in real-time mode, approaching offline model accuracy.
Recognition errors can lead the LLM astray. Cloud providers publish WER on benchmarks, but real-world WER runs higher. Real-time agents may correct some ASR errors via context, though lower baseline WER remains preferable.
Domain adaptation through custom vocabulary or fine-tuning helps with specialized terminology.
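For reference, a self-contained WER calculation over whitespace-separated words (production evaluation pipelines usually apply text normalization such as lowercasing and punctuation removal first):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count,
    computed as word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("turn on the kitchen lights",
                      "turn on the kitten lights"))  # 0.2 -> 20% WER
```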
Real-Time Factor (RTF)
Real-Time Factor (RTF) measures processing speed relative to input duration. RTF < 1.0 means the system processes faster than real time. Each component has its own RTF: STT engines typically process at 0.2× real time, LLMs generate at 50+ tokens/sec, modern TTS synthesizes at RTF 0.1 or better (10 seconds of speech generated in 1 second).
Systems must maintain RTF < 1 under load to prevent latency accumulation. Smaller models often achieve better RTF at the cost of language quality, making token generation speed a determining factor for ultra-low latency requirements.
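A minimal RTF check with illustrative numbers:

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF < 1.0 means the component processes faster than real time."""
    return processing_seconds / audio_seconds

# Illustrative: 2 s of compute to transcribe 10 s of audio -> RTF 0.2,
# so the STT stage comfortably keeps up with a live stream.
print(real_time_factor(2.0, 10.0))  # 0.2

# Monitor per-component RTF under production load; if any stage drifts
# toward 1.0, latency starts accumulating turn over turn.
```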
Cost Analysis and Scalability for Speech-to-Speech Voice Agents
Speech-to-speech voice agent costs break down into five categories: cloud API usage, self-hosting compute, scalability limits, bandwidth, and enterprise overhead.
| Cost Category | Description | Examples / Benchmarks | Key Considerations |
|---|---|---|---|
| Usage-Based Pricing (Cloud APIs) | Pay-per-token/minute for any architecture (STT, LLM, TTS, or integrated multimodal) | OpenAI Realtime: ~$0.30/min (baseline), increases significantly with turns; Gemini Live: ~$0.22/min (baseline), increases with turns; Gemini Native Audio: ~$0.50/min with typical conversation turns (experimental); Chained pipeline: ~$0.15/min (no context accumulation) | Speech-to-speech models accumulate context, which dramatically increases costs with conversation turns; chained pipelines maintain consistent per-minute pricing |
| Compute Costs (Self-Hosting) | Run open-source models like Ultravox/Moshi on your own infra | Hosting Ultravox 70B may need A100/H100 GPU per concurrent session; GPU costs: ~$2–$3/hr (cloud) | Lower marginal cost at scale; Requires infra & DevOps team; Harder to spin up instantly |
| Scalability / Rate Limits | Limits on concurrent sessions, tokens per minute, request rate | OpenAI GPT-4o: 800K tokens/min, 1K requests/min; Enterprise: up to 30M tokens/min | Watch for WebSocket caps or long-lived session constraints; Request enterprise quotas if needed |
| Bandwidth Overhead | Cost of streaming audio data over network | ~8–64 kbps per stream; Telephony codecs (e.g. G.711 vs G.729) can affect costs | Minor cost per stream, but adds up at scale; Ensure egress limits aren't exceeded in cloud setups |
| Enterprise Overhead | SLAs, premium support, custom deployments, fallback systems | Regional/on-prem hosting; Redundancy systems (e.g. backup STT or fallback bots) | Adds reliability and control; Contractual/licensing complexity increases total cost of ownership |
Understanding Speech-to-Speech Pricing
Speech-to-speech models like GPT-4o Realtime and Gemini 2.5 Flash Live have different cost structures than chained STT/TTS pipelines. For detailed provider comparisons with latency benchmarks and accuracy metrics, see the complete STT and TTS selection guide. Three factors drive higher costs:
- Proprietary multimodal infrastructure – These models require specialized neural architectures that process audio natively, maintaining acoustic features throughout the pipeline rather than collapsing to text
- Cloud-only deployment – No self-hosting option means paying for enterprise-grade streaming infrastructure, low-latency global endpoints, and WebRTC/gRPC orchestration
- Advanced real-time capabilities – Support for interruptions, emotional tone preservation, and sub-300ms latency requires substantial compute resources per session
Real-world cost reports from OpenAI's developer community:
- $3 spent on "a few short test conversations" in the playground (simple questions like bedtime stories)
- $10 consumed during weekend integration testing, leading developers to call the API "unusable at the moment" due to cost
- Costs increase per minute as conversations get longer – in a 15-minute session, one developer reported $5.28 for audio input vs $0.65 for output. This happens because tokens accumulate in the context window, and the model re-charges for all previous tokens on each turn, making longer conversations disproportionately more expensive
User-reported costs differ from official per-minute estimates because actual costs depend on conversation length (context accumulation), system prompt size (larger prompts = more tokens per turn), and conversation complexity (more back-and-forth = more context to maintain). A 5-minute conversation might cost $0.30/min, while a 30-minute conversation could cost $1.50/min or more due to accumulated context.
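A rough back-of-the-envelope model of this effect, shown below. The per-token prices are placeholders (not current list prices), and it assumes the full prior context is re-billed on every turn, matching the behavior reported above but ignoring any context caching or cached-token discounts a provider may offer.

```python
# Illustrative cost model for context accumulation in a speech-to-speech session.
# Prices and token counts are placeholder assumptions, not published rates.
AUDIO_IN_PER_1K_TOKENS = 0.10    # hypothetical $ per 1K input tokens
AUDIO_OUT_PER_1K_TOKENS = 0.20   # hypothetical $ per 1K output tokens

def session_cost(turns: int, tokens_per_user_turn: int = 300,
                 tokens_per_reply: int = 300,
                 system_prompt_tokens: int = 500) -> float:
    total = 0.0
    context = system_prompt_tokens
    for _ in range(turns):
        context += tokens_per_user_turn
        total += context / 1000 * AUDIO_IN_PER_1K_TOKENS        # whole context billed as input
        total += tokens_per_reply / 1000 * AUDIO_OUT_PER_1K_TOKENS
        context += tokens_per_reply                             # reply joins the context
    return total

for turns in (5, 15, 30):
    print(f"{turns} turns: ${session_cost(turns):.2f}")
# Input cost grows roughly quadratically with turn count, which is why
# per-minute cost climbs as conversations get longer.
```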
Native Audio Models (Moshi, VITA-Audio) are early-stage and experimental. While they promise the lowest latency and most natural interactions, they:
- Are mostly research projects, not production-ready
- Require significant GPU resources for self-hosting (A100/H100 class)
- Lack the ecosystem support, tooling, and reliability of commercial offerings
- Offer limited voice customization and control compared to modular approaches
Match Cost Strategy to Deployment Scale
Early-stage projects with low volume benefit from cloud APIs: fast setup, predictable pricing, pay-per-use. As usage grows, self-hosting economics may improve, particularly when requiring tight control, data locality, or custom model tuning.
Enterprise scale depends on reliability, rate limits, support agreements, and long-term flexibility – not just price per minute. Total cost of ownership (TCO) includes processing minutes, bandwidth, DevOps effort, redundancy, and support.
Cost calculation for specific scenarios: average conversation length (minutes) × conversations per day × days per month × per-minute price = monthly cost. For example, 3-minute conversations × 500 per day × 30 days × $0.30/min ≈ $13,500/month. Compare against self-hosting infrastructure investment. Monitor usage limits and enterprise tier requirements.
Technical Implementation Challenges in Speech-to-Speech Voice Agent Deployment
Deploying to production requires integrating streaming, connecting to telephony, handling noise, and orchestrating streams.
Streaming Integration (WebRTC, WebSockets, etc.)
Low latency requires appropriate streaming mechanisms. Three options: WebRTC, WebSockets, and streaming HTTP/gRPC.
WebRTC
Web Real-Time Communication is the standard for low-latency audio/video streaming in browsers and mobile apps. Uses UDP for fast transmission and handles packet loss gracefully. Both OpenAI and Google use WebRTC for client-side audio capture and playback.
Browser and mobile app interactions use WebRTC to send microphone audio to the server. The WebRTC stack includes Acoustic Echo Cancellation (AEC), noise reduction, and automatic gain control (AGC). Libraries like LiveKit, mediasoup, or Twilio provide WebRTC integration.
WebSockets and gRPC
Server-side connections between application servers and AI services use persistent bidirectional connections. OpenAI's voice API uses WebSockets – client sends audio chunks and receives tokens continuously. Google's API uses gRPC streaming over HTTP/2.
Both provide continuous streams rather than discrete HTTP requests. Implementation requires proper binary audio frame handling and maintaining open connections for conversation duration.
Audio Encoding
Audio format choice depends on API requirements. PCM raw audio is simple but bulky. Opus codec (used by WebRTC) provides high quality at low bitrate, though not all APIs accept Opus packets. Some APIs accept WAV or FLAC frames.
Compressed codecs save bandwidth for mobile users. Phone calls use G.711 µ-law 8kHz, requiring transcoding to 16kHz linear PCM for most ASR systems (Whisper, DeepSpeech).
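A minimal transcoding sketch using the standard-library audioop module (deprecated since Python 3.11 and removed in 3.13, where a drop-in package such as audioop-lts or a DSP library is needed instead):

```python
import audioop  # stdlib through Python 3.12; removed in 3.13

def pstn_to_asr_pcm(ulaw_8k: bytes, state=None):
    """Convert G.711 u-law 8 kHz telephony audio to 16-bit linear PCM at 16 kHz."""
    pcm_8k = audioop.ulaw2lin(ulaw_8k, 2)                              # u-law -> 16-bit PCM
    pcm_16k, state = audioop.ratecv(pcm_8k, 2, 1, 8000, 16000, state)  # resample 8 kHz -> 16 kHz
    return pcm_16k, state

# Keep `state` across chunks of a streamed call so the resampler stays
# continuous at chunk boundaries; feed the 16 kHz PCM to the ASR.
```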
Latency Tuning
Streaming systems use buffers to smooth network variation. WebRTC jitter buffers trade smooth audio for added delay. Default WebRTC parameters suffice for most deployments.
WebSocket implementations send data immediately (20ms audio frame every 20ms) without batching. Most WebSocket libraries disable Nagle's algorithm by default to avoid delaying small packets.
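A minimal pacing sketch using the third-party websockets library; the endpoint URL and raw-bytes message format are placeholders, since real APIs wrap audio frames in their own JSON or binary envelope.

```python
import asyncio
import websockets  # third-party: pip install websockets

FRAME_MS = 20
SAMPLE_RATE = 16000
BYTES_PER_FRAME = SAMPLE_RATE * 2 * FRAME_MS // 1000  # 16-bit mono PCM

async def stream_microphone(pcm_frames, uri="wss://example.invalid/realtime"):
    """Send one 20 ms PCM frame every 20 ms instead of batching.

    `pcm_frames` is assumed to be an async iterator yielding
    BYTES_PER_FRAME-sized chunks from your capture layer.
    """
    async with websockets.connect(uri) as ws:
        async for frame in pcm_frames:
            await ws.send(frame)                   # push the frame immediately
            # Pace at real-time rate; skip this sleep if the capture layer
            # already delivers frames on a real-time clock.
            await asyncio.sleep(FRAME_MS / 1000)
```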
Handling Network Issues
WebRTC handles packet loss through loss concealment, filling missing audio chunks with plausible noise. WebSockets run over TCP, so lost packets are retransmitted (adding delay) rather than concealed; ASR systems handle the resulting minor gaps reasonably well on decent networks.
Output packet loss can cause audio blips. Some systems use redundant packets or forward error correction on unreliable networks.
Many implementations combine approaches: WebRTC from client to relay server, then WebSocket from server to AI API. OpenAI's example follows this pattern. WebRTC handles unpredictable client networks while WebSocket simplifies AI model interfacing.
Telephony Integration (8 kHz and PSTN)
Call Quality Challenges
Phone deployments reveal quality issues absent in web-based implementations. Standard PSTN uses 8 kHz audio (G.711 codec), which severely degrades both speech recognition accuracy and TTS naturalness compared to 16 kHz+ web audio.
Most high-quality ASR models (including GPT-4o Realtime, Gemini Live, Whisper) train primarily on 16 kHz audio, so 8 kHz telephony input reduces their accuracy significantly.
Provider Support
Twilio's standard codecs operate at 8 kHz with limited support for higher-quality audio streaming needed for AI models. Telnyx offers native 16 kHz support via G.722 wideband codec through their owned infrastructure, but requires more expertise to configure properly.
Speech-to-speech models (GPT-4o Realtime, Gemini Live) optimized for high-quality web audio don't perform as well over standard PSTN. Their latency and integration benefits disappear over phone while premium pricing remains. This makes chained STT/LLM/TTS pipelines with telephony-optimized components often more reliable and cost-effective for phone-based deployments.
SIP and VoIP Integration
Telephony integration uses services like Twilio, Nexmo, or on-premises SIP systems. These provide audio via WebSocket (Twilio streams 8k PCM in real time) or media servers. Architecture must ingest these streams and connect to the AI pipeline.
DTMF and Control
Telephony providers detect DTMF tones (touch-tone input) out-of-band to avoid confusing ASR. Twilio sends webhook events for DTMF. Speech-to-speech voice agents minimize DTMF menus but users may still attempt touch-tone input.
Telephony Latency
Phone networks add 100-200ms fixed latency. Processing pipelines should minimize additional overhead. Hosting AI services near telephony ingress points reduces roundtrip latency.
Human Agent Handoffs
Human agent handoffs benefit from passing conversation context. AI conversations that escalate after collecting information should provide transcribed summaries to avoid user repetition.
Handling Background Noise & Voice Variability
Noise Suppression
Noise suppression algorithms applied before ASR improve recognition accuracy. ML models like RNNoise remove background noise (keyboard sounds, fans) in real time. Picovoice's Koala demonstrates intelligibility improvements.
Tradeoff: slightly distorts voice and consumes extra CPU.
Microphone Differences
Audio quality varies across headsets, speakerphones, and car Bluetooth systems (frequency response, echo). Echo cancellation prevents the agent's voice from being picked up by the microphone. WebRTC's AEC handles most cases.
Telephone scenarios rely on network echo cancellers or require adaptive echo cancelers in the pipeline.
VAD and Barge-In
Voice Activity Detection (VAD) distinguishes speech from noise. Noisy conditions cause false positives/negatives. Combining VAD with ASR confidence improves accuracy. Treat silence as end-of-utterance only when ASR confirms finality.
Continue assuming speech while ASR generates transcribed words. End turn after 500ms silence. Barge-in requires monitoring microphone during agent speech to stop TTS when user interrupts.
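A minimal endpointing and barge-in sketch, assuming your VAD supplies an is_speech flag per audio frame and your streaming ASR a finality flag:

```python
import time

END_OF_TURN_SILENCE_S = 0.5   # end the user's turn after ~500 ms of silence

class TurnTracker:
    """Combines VAD and ASR signals to decide end-of-turn and barge-in.

    `is_speech` comes from your VAD, `asr_final` from your streaming ASR;
    both are assumed hooks into whatever components you deploy.
    """

    def __init__(self) -> None:
        self.last_speech_ts = time.monotonic()
        self.agent_speaking = False   # set True while TTS audio is playing

    def on_audio_frame(self, is_speech: bool, asr_final: bool) -> str | None:
        now = time.monotonic()
        if is_speech:
            self.last_speech_ts = now
            if self.agent_speaking:
                return "barge_in"       # stop TTS playback immediately
        elif asr_final and now - self.last_speech_ts > END_OF_TURN_SILENCE_S:
            return "end_of_turn"        # hand the finalized utterance to the LLM
        return None
```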
Accents and Languages
Diverse user bases require testing across accents and dialects. Cloud ASRs support accent/locale specifications for improved accuracy. Open models benefit from fine-tuning on accented data.
Bilingual support requires models supporting multiple languages (Google, OpenAI). Multi-language detection works through auto-detection or routing to language-specific models.
Stream Management and Orchestration
Continuous conversation streams require managing concurrent input/output and conversation state.
Half-Duplex vs Full-Duplex
Most systems use half-duplex with barge-in – users can interrupt agents, but agents don't interrupt users except for short backchannel utterances ("uh-huh", "I see"). Backchannel implementation requires detecting pauses and generating quick responses without disrupting ASR.
Prompt Management
Persistent conversation state requires maintaining rolling prompts for the LLM. APIs with persistent sessions handle this up to context limits. Manual implementations append each utterance and reply.
Long conversations require summarizing older content to stay within context windows. Important user-provided facts need re-injection into prompts as needed.
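A minimal rolling-history sketch; `summarize` is a hypothetical callable (for example, a cheap LLM call) that condenses older turns into one string.

```python
MAX_HISTORY_TURNS = 20   # threshold before folding old turns into a summary

def update_history(history: list[dict], role: str, content: str,
                   summarize=None) -> list[dict]:
    """Append a turn; when the window grows too long, replace the oldest
    turns with a single summary message so the prompt stays within the
    model's context limit."""
    history.append({"role": role, "content": content})
    if summarize is not None and len(history) > MAX_HISTORY_TURNS:
        old, recent = history[:-10], history[-10:]
        summary = {"role": "system",
                   "content": "Summary of earlier conversation: " + summarize(old)}
        history = [summary] + recent
    return history
```

Facts the user states early (names, account numbers, preferences) are worth extracting separately and re-injecting into the prompt, since summaries can drop them.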
Ensuring Required Steps
Flows requiring specific actions (identity verification, mandatory questions) benefit from checkpoints. Teams can implement checkpoints through LLM prompt instructions or external state machines.
Some systems prevent sending queries to LLM until prerequisite steps complete, or override LLM responses that skip required actions. This combines rule-based flow with AI – trusting AI for understanding and generation while enforcing action sequences.
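A minimal external state-machine sketch for enforcing prerequisite steps; the step names and gating policy are illustrative.

```python
REQUIRED_STEPS = ["verify_identity", "confirm_account"]  # illustrative step names

class FlowGuard:
    """Tracks which required steps have completed and gates the LLM call.

    Steps are marked done by your tool-call or verification handlers;
    this class only decides whether the flow may proceed.
    """

    def __init__(self) -> None:
        self.completed: set[str] = set()

    def mark_done(self, step: str) -> None:
        self.completed.add(step)

    def next_required(self) -> str | None:
        for step in REQUIRED_STEPS:
            if step not in self.completed:
                return step
        return None

# Before forwarding a user query to the LLM:
guard = FlowGuard()
if (step := guard.next_required()) is not None:
    # Override the turn: prompt for the missing step instead of answering.
    print(f"Before we continue, we need to {step.replace('_', ' ')}.")
```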
Speech-to-speech agent orchestration requires managing concurrent input/output streams. Best practices and libraries exist for common patterns. Testing should include scenarios like users interrupting agent speech to verify barge-in logic stops TTS promptly.
Conclusion
Speech-to-speech voice agents reduce latency to 200-300ms, approaching human response times. Proprietary platforms (OpenAI GPT-4o, Google Gemini 2.5 Flash) and open-source options (Ultravox, Moshi) are production-ready or nearing maturity.
Architecture choice depends on deployment environment and constraints:
- Chained Pipeline – Voice → STT → LLM → TTS → Voice – provides maximum flexibility and reliability but higher latency
- Speech-to-Speech (Half-Cascade) – Voice → Audio Encoder → Text-based LLM → TTS → Voice – balances performance with production readiness, but at significantly higher cost
- Native Audio – Voice → Unified Model → Voice – offers the lowest latency and most natural interactions, but remains experimental and not production-ready
Implementation factors:
- Performance requirements: TTFT, WER, and RTF targets for the use case
- Cost structure: Cloud APIs vs. self-hosting economics at expected scale
- Technical complexity: Streaming integration, telephony connectivity, noise handling
- Deployment environment: Phone systems (8kHz PSTN) vs. web-based (16kHz+ audio)
System design includes audio streaming, orchestration, testing, and optimization for specific constraints. Cloud APIs enable rapid prototyping. Production deployment requires testing with real user patterns and audio conditions.