The Discovery That Changed Everything
When I first started building voice agents, I used VAPI and LiveKit - two powerful platforms that seemed like the obvious choices for real-time voice interactions. I spent considerable time integrating Speech-to-Text (STT) providers, configuring Text-to-Speech (TTS) services, and connecting them to LLMs. It was complex, but it worked.
Then I discovered Google ADK and something remarkable happened: I built a fully functional voice agent without any STT or TTS providers. Just the LLM. This discovery led me down a rabbit hole of understanding that fundamentally changed how I think about voice agent architecture.
In this blog post, I'll share what I learned about native audio models versus traditional STT/TTS pipelines, and help you understand when to use each approach.
My Initial Experience: VAPI and LiveKit
The Traditional Pipeline
When building voice agents with VAPI and LiveKit, I had to manage a complete pipeline:
User Audio → STT Provider → Text → LLM → Text → TTS Provider → Audio → User
What I had to configure:
- STT Provider (Speech-to-Text)
  - Options: Deepgram, AssemblyAI, Google Speech-to-Text, Azure Speech
  - Configuration: API keys, language settings, accuracy tuning
  - Cost: Per-minute pricing
- LLM (Large Language Model)
  - Options: GPT-4, Claude, Llama, etc.
  - Configuration: API keys, model selection, prompt engineering
  - Cost: Per-token pricing
- TTS Provider (Text-to-Speech)
  - Options: ElevenLabs, Azure TTS, Google TTS, etc.
  - Configuration: Voice selection, language settings, quality settings
  - Cost: Per-character or per-minute pricing
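To make those moving parts concrete, here's a minimal sketch of a single conversational turn through a pipeline like this. It uses OpenAI's separate STT, chat, and TTS endpoints purely as stand-ins for whichever providers you pick (Deepgram, Claude, ElevenLabs, and so on); the point is that every turn crosses three different APIs:
# One turn of a traditional voice pipeline: STT -> LLM -> TTS.
# OpenAI's separate endpoints are used here as stand-ins for any providers.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def handle_turn(audio_path: str) -> bytes:
    # 1. STT: audio -> text
    with open(audio_path, "rb") as f:
        transcript = client.audio.transcriptions.create(
            model="whisper-1",
            file=f,
        ).text

    # 2. LLM: text -> text
    reply_text = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a helpful voice assistant."},
            {"role": "user", "content": transcript},
        ],
    ).choices[0].message.content

    # 3. TTS: text -> audio
    speech = client.audio.speech.create(
        model="tts-1",
        voice="alloy",
        input=reply_text,
    )
    return speech.content  # audio bytes to play back to the user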
The Challenges:
- Complexity: Managing three separate services and their integrations
- Latency: Multiple conversion steps added delay
- Cost: Paying for three different services
- Language Support: Had to configure STT/TTS for each language
- Error Handling: Failures could occur at any point in the pipeline
- Synchronization: Ensuring audio/text alignment across services
It worked, but it felt like I was building a Rube Goldberg machine when I just wanted a voice conversation.
The ADK Revelation: Native Audio Models
Discovering Native Audio
When I started using Google ADK with Gemini models, I noticed something unusual: there was no STT or TTS configuration. I could send raw audio directly to the model and receive audio back. How was this possible?
The answer: Native Audio Models.
What Are Native Audio Models?
Native audio models are AI models that understand and generate audio directly, without converting to text first. They process raw PCM audio end-to-end, similar to how humans process speech.
Models that support native audio:
- Google Gemini: gemini-2.0-flash-live-001, gemini-live-2.5-flash-preview-native-audio-09-2025
- OpenAI: GPT-4o Realtime API, GPT-4o-Audio-Preview
The Simplified Architecture
With ADK and native audio models, the pipeline becomes dramatically simpler:
User Audio → Native Audio Model → User Audio
What I configured:
- LLM Only (with native audio support)
  - Model: Gemini Live API models
  - Configuration: Just the model and API key
  - Cost: Single pricing model
That's it. No STT. No TTS. Just the model.
Code Example: ADK Native Audio
Here's how simple it is to send audio with ADK:
import base64

from google.adk.agents import Agent
from google.adk.streaming import start_agent_session, RunConfig, StreamingMode
from google.genai import types

# Create an agent backed by a native audio model
root_agent = Agent(
    name="voice_agent",
    model="gemini-2.0-flash-live-001",  # Native audio model
    description="A voice assistant",
    instruction="Have natural conversations with users",
)

# Configure bidirectional streaming with audio in and audio out
run_config = RunConfig(
    streaming_mode=StreamingMode.BIDI,
    response_modalities=["AUDIO"],  # Direct audio modality
    output_audio_transcription=types.AudioTranscriptionConfig(),  # Optional: also get a text transcript
)

live_events, live_request_queue = await start_agent_session(
    agent=root_agent,
    run_config=run_config,
)

# Send raw PCM audio directly (audio_data is base64-encoded PCM from the client)
decoded_data = base64.b64decode(audio_data)
live_request_queue.send_realtime(
    types.Blob(data=decoded_data, mime_type="audio/pcm")
)

# Receive raw PCM audio directly
async for event in live_events:
    if event.audio_data:
        audio_data = event.audio_data.data
        # Play audio directly - no TTS conversion needed!
Key Points:
- Direct Audio Processing: Raw PCM audio goes directly to the model
- No Conversion: The model understands audio natively
- Multi-language: Automatic support (model handles it internally)
- Lower Latency: Fewer conversion steps = faster responses
- Simpler Code: Less integration complexity
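In a web app, the audio_data in the snippet above typically arrives from the browser over a WebSocket. Here's a simplified sketch of that server-side wiring, assuming a FastAPI endpoint and a JSON message carrying base64-encoded PCM chunks (both are illustrative choices, not ADK requirements):
# root_agent, run_config, and start_agent_session are the same objects
# as in the snippet above.
import base64
import json

from fastapi import FastAPI, WebSocket
from google.genai import types

app = FastAPI()

@app.websocket("/ws")
async def voice_socket(websocket: WebSocket):
    await websocket.accept()

    # One live session per connected user
    live_events, live_request_queue = await start_agent_session(
        agent=root_agent,
        run_config=run_config,
    )

    # Upstream: browser microphone -> agent.
    # The downstream loop that relays event audio back to the browser is
    # omitted; in practice you'd run both directions concurrently
    # (e.g. with asyncio.gather).
    async for message in websocket.iter_text():
        chunk = json.loads(message)  # assumed shape: {"mime_type": "...", "data": "<base64>"}
        if chunk["mime_type"] == "audio/pcm":
            live_request_queue.send_realtime(
                types.Blob(
                    data=base64.b64decode(chunk["data"]),
                    mime_type="audio/pcm",
                )
            )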
Understanding the Fundamental Difference
Why Traditional LLMs Need STT/TTS
Most LLMs (GPT-4, Claude, Llama, etc.) are text-only models. They understand text, not audio. This is why you need:
- STT: Convert audio → text (so the LLM can understand)
- LLM: Process text → text (the actual intelligence)
- TTS: Convert text → audio (so the user can hear)
Why Native Audio Models Don't Need STT/TTS
Native audio models are trained on audio directly. They:
- Understand audio without text conversion
- Generate audio without text conversion
- Process multiple languages automatically (they understand audio patterns, not just text)
Think of it like the difference between:
- Traditional: Audio → Text → AI → Text → Audio (like using a translator)
- Native: Audio → AI → Audio (like talking directly)
Architecture Comparison
Traditional Pipeline (VAPI/LiveKit)
┌──────────────┐
│    Client    │
│ (Microphone) │
└──────────────┘
        │ Audio Stream
        ▼
┌──────────────┐
│   LiveKit    │
│    Server    │
└──────────────┘
        │ Audio Stream
        ▼
┌──────────────┐
│ STT Provider │
│(Speech→Text) │
└──────────────┘
        │ Text
        ▼
┌──────────────┐
│     LLM      │
│ (Text-only)  │
└──────────────┘
        │ Text Response
        ▼
┌──────────────┐
│ TTS Provider │
│(Text→Speech) │
└──────────────┘
        │ Audio Stream
        ▼
┌──────────────┐
│   LiveKit    │
│    Server    │
└──────────────┘
        │ Audio Stream
        ▼
┌──────────────┐
│    Client    │
│  (Speaker)   │
└──────────────┘
Components Required:
- STT Provider ✅
- LLM ✅
- TTS Provider ✅
- Transport Layer (LiveKit/VAPI) ✅
Native Audio Pipeline (ADK)
┌──────────────┐
│   Browser    │
│ (Microphone) │
└──────────────┘
        │ Raw PCM Audio
        ▼
┌──────────────┐
│  ADK Agent   │
│ Native Audio │
│Model (Gemini)│
└──────────────┘
        │ Raw PCM Audio
        ▼
┌──────────────┐
│   Browser    │
│  (Speaker)   │
└──────────────┘
Components Required:
- Native Audio Model ✅
- Transport Layer (optional, can be direct) ✅
Comparison Table
| Aspect | ADK Native Audio | LiveKit/VAPI + Traditional LLM |
|---|---|---|
| Audio Input | Direct PCM to model | Audio → STT → Text → LLM |
| Audio Output | Direct PCM from model | LLM → Text → TTS → Audio |
| STT Provider | ❌ Not needed | ✅ Required |
| TTS Provider | ❌ Not needed | ✅ Required |
| LLM | ✅ Required (native audio) | ✅ Required (text-only) |
| Multi-language | ✅ Automatic (model handles it) | ⚠️ Must configure STT/TTS per language |
| Latency | Lower (direct processing) | Higher (multiple conversion steps) |
| Complexity | Simpler (single model) | More complex (3+ services) |
| Cost | Single pricing model | Multiple pricing models (STT + LLM + TTS) |
| Setup Time | Minutes | Hours (integration work) |
| Error Points | 1 (model) | 3+ (STT, LLM, TTS) |
When to Use What: Decision Guide
After building agents with both approaches, here's my guide for choosing the right architecture:
Use Native Audio Models (ADK/Gemini/OpenAI Realtime) When:
1. Simplicity & Speed to Market
- ✅ You want the fastest path to a working voice agent
- ✅ You prefer fewer moving parts
- ✅ You want less maintenance overhead
My Experience: I built a working ADK voice agent in hours vs. days with VAPI/LiveKit.
2. Low Latency & Natural Conversations
- ✅ You need the most natural, low-latency conversations
- ✅ You want end-to-end audio processing
- ✅ You value natural speech patterns and prosody
My Experience: ADK conversations felt more natural with lower latency.
3. Web/App-Based Voice Agents
- ✅ Your voice agent is for web or mobile apps
- ✅ You don't need telephony (phone calls)
- ✅ You're building 1-on-1 conversations
My Experience: Perfect for web applications and mobile apps.
4. Multi-Language Support
- ✅ You need automatic multi-language support
- ✅ You don't want to configure STT/TTS for each language
- ✅ You want the model to handle language detection
My Experience: I could switch languages mid-conversation without any configuration.
5. Cost Simplicity
- ✅ You prefer a single pricing model
- ✅ You want predictable costs
- ✅ You don't need to optimize per-component costs
My Experience: Easier to budget with a single cost structure.
Use VAPI/LiveKit When:
1. Telephony & PSTN Integration (VAPI)
- ✅ You need to make/receive phone calls (PSTN, SIP)
- ✅ You need phone number management
- ✅ You need call routing and analytics
Example Use Cases:
- Customer service hotlines
- Appointment scheduling via phone
- Outbound sales calls
- IVR (Interactive Voice Response) systems
Why Not Native Audio: ADK/Gemini focus on web/app audio, not telephony infrastructure.
2. Video Conferencing & Multi-Participant (LiveKit)
- ✅ You need video calls or screen sharing
- ✅ You need multiple participants in the same session
- ✅ You need spatial audio for group calls
Example Use Cases:
- Video customer support
- Virtual meetings with AI participants
- Educational platforms with video
- Collaborative workspaces
Why Not Native Audio: Native audio models focus on 1-on-1 voice conversations, not video or multi-party.
3. Best-of-Breed Components
- ✅ You want to use the best STT, LLM, and TTS for each component
- ✅ You want to mix and match providers
- ✅ You need vendor independence
Example Configuration:
STT: Deepgram (best accuracy for your domain)
LLM: Claude 3.7 (best reasoning for complex queries)
TTS: ElevenLabs (best voice quality and naturalness)
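For comparison, here's roughly what that mix-and-match wiring looks like with LiveKit Agents' plugin model. Treat this as a sketch: it assumes the deepgram, anthropic, elevenlabs, and silero plugin packages, and exact class names and parameters can differ between versions:
# Each pipeline stage is a separately chosen (and separately billed) provider.
from livekit.agents import AgentSession
from livekit.plugins import anthropic, deepgram, elevenlabs, silero

session = AgentSession(
    vad=silero.VAD.load(),                                 # voice activity detection
    stt=deepgram.STT(),                                    # speech-to-text
    llm=anthropic.LLM(model="claude-3-7-sonnet-latest"),   # text-only LLM
    tts=elevenlabs.TTS(),                                  # text-to-speech
)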
Why Not Native Audio: The audio handling is baked into the model, so you can't swap in a different STT, LLM, or TTS for individual stages.
4. Enterprise Integrations & Workflows (VAPI)
- ✅ You need deep integrations with CRMs, databases, business systems
- ✅ You need workflow automation
- ✅ You need data synchronization
Example Use Cases:
- Sales calls that update CRM automatically
- Support calls that create tickets
- Appointment calls that update calendars
Why Not Native Audio: ADK focuses on agent logic, not business system integrations.
5. Self-Hosting & Compliance (LiveKit)
- ✅ You need to host infrastructure yourself
- ✅ You need to meet data residency requirements (HIPAA, GDPR)
- ✅ You need full control over data and infrastructure
Example Use Cases:
- Healthcare applications (HIPAA compliance)
- Financial services (data residency)
- High-volume applications (cost optimization)
Why Not Native Audio: ADK/Gemini are cloud services (less control over infrastructure).
6. Advanced Audio Processing & Control
- ✅ You need fine-grained control over audio processing pipeline
- ✅ You need custom Voice Activity Detection (VAD)
- ✅ You need advanced echo cancellation and noise suppression
Why Not Native Audio: Native models handle audio processing internally (less control).
7. Cost Optimization Through Provider Selection
- ✅ You need to optimize costs by choosing cheaper providers for specific components
- ✅ You want to leverage volume discounts from different providers
- ✅ You need component-level cost tracking
Why Not Native Audio: Single provider pricing (may not be optimal for all use cases).
Decision Matrix
| Requirement | Use VAPI/LiveKit | Use Native Audio (ADK/Gemini/OpenAI) |
|---|---|---|
| Phone Calls (PSTN) | ✅ VAPI | ❌ |
| Video Conferencing | ✅ LiveKit | ❌ |
| Multi-Participant | ✅ LiveKit | ❌ |
| Best-of-Breed Components | ✅ | ❌ |
| Enterprise Integrations | ✅ VAPI | ❌ |
| Self-Hosting | ✅ LiveKit | ❌ |
| Fastest Development | ❌ | ✅ |
| Lowest Latency | ❌ | ✅ |
| Most Natural Speech | ❌ | ✅ |
| Web/App Voice Agents | ⚠️ Possible | ✅ Ideal |
| Cost Optimization | ✅ | ⚠️ |
| Compliance/Data Control | ✅ LiveKit | ⚠️ |
| Multi-Language (Auto) | ⚠️ | ✅ |
| Simplicity | ❌ | ✅ |
Hybrid Approach: Best of Both Worlds
You can actually combine both approaches:
Use LiveKit for Infrastructure + Native Audio for Processing
Client → LiveKit (WebRTC) → ADK Agent (Gemini Native Audio) → LiveKit → Client
This gives you:
- ✅ LiveKit's infrastructure capabilities (video, multi-party)
- ✅ Native audio's natural conversations (no STT/TTS)
- ✅ Best of both worlds
When to Use Hybrid:
- You need LiveKit's video/multi-party features BUT want native audio's simplicity
- You want WebRTC transport BUT don't want STT/TTS complexity
- You need infrastructure flexibility BUT prefer natural conversations
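As a concrete sketch, LiveKit Agents ships a Gemini Live integration, so the LLM slot of a session can be filled with a native audio model and the separate STT/TTS slots simply disappear. This assumes the livekit-plugins-google package; class paths and parameter names may differ across versions:
# LiveKit handles WebRTC transport, rooms, and participants;
# the Gemini Live model handles audio understanding and generation directly.
from livekit.agents import AgentSession
from livekit.plugins import google

session = AgentSession(
    llm=google.beta.realtime.RealtimeModel(
        model="gemini-2.0-flash-live-001",
        voice="Puck",
    ),
    # no stt= and no tts= - the native audio model covers both directions
)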
Provider Comparison: Native Audio Support
Google Gemini (via ADK)
✅ Full Native Audio Support
- Models: gemini-2.0-flash-live-001, gemini-live-2.5-flash-preview-native-audio-09-2025
- Capability: True end-to-end audio processing
- Architecture: Single model handles audio input → understanding → generation → audio output
- No STT/TTS needed: The model processes raw PCM audio directly
- Multi-language: Automatic support (model understands audio natively)
My Experience: This is what I used, and it worked flawlessly without any STT/TTS configuration.
OpenAI
⚠️ Native Audio Support (with Ecosystem Complexity)
- Models: GPT-4o Realtime API, GPT-4o-Audio-Preview
- Capability: GPT-4o has full native audio support (similar to Gemini)
- Architecture: Realtime API provides true native speech-to-speech interactions
- Complexity: OpenAI also offers separate STT/TTS models (gpt-4o-transcribe, gpt-4o-mini-tts), which can be confusing
- Multi-language: Supported natively through the Realtime API
Key Difference: OpenAI offers both native audio AND separate STT/TTS models, while Gemini focuses on native audio only.
Anthropic Claude
❌ No Native Audio API
- Models: Claude 3.5 Sonnet, Claude 3.7 Sonnet (text-based)
- Capability: Text-only models
- Architecture: Requires third-party integrations for audio
- STT/TTS Required: Yes, must use external providers
- Multi-language: Depends on third-party STT/TTS providers
Note: Anthropic does not provide native audio APIs. Any voice interactions with Claude require external STT/TTS services.
My Key Learnings
1. Not All LLMs Are Created Equal
The biggest revelation was understanding that some models are native audio models while others are text-only models. This fundamental difference determines your entire architecture.
2. Simplicity Wins for Most Use Cases
For web/app voice agents, native audio models provide:
- Faster development
- Lower latency
- More natural conversations
- Simpler architecture
Unless you have specific requirements (telephony, video, enterprise integrations), native audio is usually the better choice.
3. The Trade-offs Are Real
VAPI/LiveKit provide valuable features that native audio models don't:
- Telephony infrastructure
- Video conferencing
- Best-of-breed component selection
- Enterprise integrations
But these come with complexity and cost.
4. Hybrid Approaches Are Possible
You don't have to choose one or the other. You can use LiveKit for infrastructure and native audio models for processing, getting the best of both worlds.
5. Language Support Is a Game-Changer
Native audio models handle multiple languages automatically, without configuration. This is a huge advantage over traditional STT/TTS pipelines where you need to configure each language separately.
Real-World Examples from My Projects
Project 1: Custom Audio Streaming App (ADK)
What I Built: A web application for real-time voice conversations with Google Search integration.
Architecture: ADK with Gemini native audio model
Why ADK:
- Web-based application
- Needed fast development
- Wanted natural conversations
- Multi-language support without configuration
Result: Built in hours, worked flawlessly, no STT/TTS needed.
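For reference, adding Google Search to an ADK agent like this is typically just one extra line, via ADK's built-in google_search tool (a sketch; the instruction text is illustrative):
from google.adk.agents import Agent
from google.adk.tools import google_search

root_agent = Agent(
    name="voice_agent",
    model="gemini-2.0-flash-live-001",   # native audio model
    description="A voice assistant that can search the web",
    instruction="Answer questions conversationally, using Google Search when needed.",
    tools=[google_search],               # built-in grounding tool
)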
Project 2: Voice Bot with LiveKit (Hypothetical)
What I Would Build: A video customer support system with multiple participants.
Architecture: LiveKit + STT + LLM + TTS
Why LiveKit:
- Need video conferencing
- Multiple participants
- Need screen sharing
Why Not ADK: ADK doesn't support video or multi-participant sessions.
Summary: When to Choose What
Choose Native Audio (ADK/Gemini/OpenAI) if:
- ✅ You want the simplest solution
- ✅ You need the lowest latency
- ✅ You want the most natural conversations
- ✅ You're building web/app voice agents
- ✅ You want fastest time to market
- ✅ You need automatic multi-language support
- ✅ You prefer single pricing model
Choose VAPI/LiveKit if:
- ✅ You need telephony (phone calls)
- ✅ You need video conferencing
- ✅ You need multi-participant sessions
- ✅ You need enterprise integrations
- ✅ You need to self-host
- ✅ You want best-of-breed components
- ✅ You need fine-grained audio control
- ✅ You need component-level cost optimization
Consider Hybrid if:
- ✅ You need LiveKit's infrastructure (video, multi-party) BUT want native audio's natural conversations
Conclusion
My journey from VAPI/LiveKit to ADK taught me that not all voice agent architectures are created equal. Native audio models represent a fundamental shift in how we think about voice AI - from complex pipelines to direct audio processing.
The key insight: If you're building web/app voice agents and don't need telephony or video, native audio models (like Gemini via ADK) are usually the better choice. They're simpler, faster, and more natural.
But if you need telephony, video, enterprise integrations, or best-of-breed components, VAPI/LiveKit with traditional STT/TTS pipelines are still the right choice.
The important thing is understanding the trade-offs and choosing the right architecture for your specific use case.
Resources
- Google ADK Documentation - Complete ADK guide
- Gemini Live API Documentation - Understanding Live API capabilities
- My GitHub Repository - Code examples for both approaches
- Building Custom Audio Streaming Apps with ADK - My previous blog post
- VAPI Documentation - VAPI platform guide
- LiveKit Documentation - LiveKit platform guide