# ADK vs LiveKit vs VAPI: My Journey from STT/TTS Complexity to Native Audio Simplicity

## The Discovery That Changed Everything

When I first started building voice agents, I used VAPI and LiveKit, two powerful platforms that seemed like the obvious choices for real-time voice interactions. I spent considerable time integrating Speech-to-Text (STT) providers, configuring Text-to-Speech (TTS) services, and connecting them to LLMs. It was complex, but it worked.

Then I discovered Google ADK and something remarkable happened: I built a fully functional voice agent without any STT or TTS providers. Just the LLM. This discovery sent me down a rabbit hole that fundamentally changed how I think about voice agent architecture.

In this blog post, I'll share what I learned about native audio models versus traditional STT/TTS pipelines, and help you understand when to use each approach.


## My Initial Experience: VAPI and LiveKit

### The Traditional Pipeline

When building voice agents with VAPI and LiveKit, I had to manage a complete pipeline:

```
User Audio → STT Provider → Text → LLM → Text → TTS Provider → Audio → User
```

What I had to configure (a code sketch of the full pipeline follows this list):

  1. STT Provider (Speech-to-Text)

    • Options: Deepgram, AssemblyAI, Google Speech-to-Text, Azure Speech
    • Configuration: API keys, language settings, accuracy tuning
    • Cost: Per-minute pricing
  2. LLM (Large Language Model)

    • Options: GPT-4, Claude, Llama, etc.
    • Configuration: API keys, model selection, prompt engineering
    • Cost: Per-token pricing
  3. TTS Provider (Text-to-Speech)

    • Options: ElevenLabs, Azure TTS, Google TTS, etc.
    • Configuration: Voice selection, language settings, quality settings
    • Cost: Per-character or per-minute pricing
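
To make those moving parts concrete, here is a minimal sketch of one conversational turn through such a pipeline. The three client objects are hypothetical placeholders, not any specific SDK:

```python
# One conversational turn through a traditional STT -> LLM -> TTS pipeline.
# The three clients are hypothetical placeholders for real provider SDKs
# (e.g. Deepgram for STT, GPT-4 for the LLM, ElevenLabs for TTS).
def handle_turn(user_audio: bytes, stt_client, llm_client, tts_client) -> bytes:
    text_in = stt_client.transcribe(user_audio)   # 1. Audio -> text (per-minute billing)
    reply = llm_client.complete(text_in)          # 2. Text -> text (per-token billing)
    audio_out = tts_client.synthesize(reply)      # 3. Text -> audio (per-character billing)
    return audio_out                              # Three hops, three bills, three failure points
```

Every hop is a separate network call with its own latency, pricing, and failure mode.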

The Challenges:

  • Complexity: Managing three separate services and their integrations
  • Latency: Multiple conversion steps added delay
  • Cost: Paying for three different services
  • Language Support: Had to configure STT/TTS for each language
  • Error Handling: Failures could occur at any point in the pipeline
  • Synchronization: Ensuring audio/text alignment across services

It worked, but it felt like I was building a Rube Goldberg machine when I just wanted a voice conversation.


## The ADK Revelation: Native Audio Models

### Discovering Native Audio

When I started using Google ADK with Gemini models, I noticed something unusual: there was no STT or TTS configuration. I could send raw audio directly to the model and receive audio back. How was this possible?

The answer: Native Audio Models.

### What Are Native Audio Models?

Native audio models are AI models that understand and generate audio directly, without converting to text first. They process raw PCM audio end-to-end, similar to how humans process speech.

Models that support native audio:

  • Google Gemini: gemini-2.0-flash-live-001, gemini-live-2.5-flash-preview-native-audio-09-2025
  • OpenAI: GPT-4o Realtime API, GPT-4o-Audio-Preview

### The Simplified Architecture

With ADK and native audio models, the pipeline becomes dramatically simpler:

```
User Audio → Native Audio Model → User Audio
```

What I configured:

  1. LLM Only (with native audio support)
    • Model: Gemini Live API models
    • Configuration: Just the model and API key
    • Cost: Single pricing model

That's it. No STT. No TTS. Just the model.

### Code Example: ADK Native Audio

Here's how simple it is to send audio with ADK:

```python
import base64

from google.adk.agents import Agent, LiveRequestQueue
from google.adk.agents.run_config import RunConfig, StreamingMode
from google.adk.runners import InMemoryRunner
from google.genai import types

# Create an agent backed by a native audio model
root_agent = Agent(
    name="voice_agent",
    model="gemini-2.0-flash-live-001",  # Native audio model
    description="A voice assistant",
    instruction="Have natural conversations with users",
)

# Configure bidirectional streaming with a direct audio modality
run_config = RunConfig(
    streaming_mode=StreamingMode.BIDI,
    response_modalities=["AUDIO"],  # Direct audio modality
    output_audio_transcription=types.AudioTranscriptionConfig(),  # Optional: text transcript
)

# Start a live session through the runner
# (create_session is async in recent ADK versions)
runner = InMemoryRunner(agent=root_agent, app_name="voice_app")
session = await runner.session_service.create_session(
    app_name="voice_app", user_id="user_1"
)
live_request_queue = LiveRequestQueue()
live_events = runner.run_live(
    session=session,
    live_request_queue=live_request_queue,
    run_config=run_config,
)

# Send raw PCM audio directly (audio_data is a base64 chunk from the client)
decoded_data = base64.b64decode(audio_data)
live_request_queue.send_realtime(
    types.Blob(data=decoded_data, mime_type="audio/pcm")
)

# Receive raw PCM audio directly from the model's events
async for event in live_events:
    part = event.content and event.content.parts and event.content.parts[0]
    if part and part.inline_data and part.inline_data.mime_type.startswith("audio/pcm"):
        audio_chunk = part.inline_data.data
        # Play audio directly - no TTS conversion needed!
```

Key Points:

  • Direct Audio Processing: Raw PCM audio goes directly to the model (see the capture sketch after this list)
  • No Conversion: The model understands audio natively
  • Multi-language: Automatic support (model handles it internally)
  • Lower Latency: Fewer conversion steps = faster responses
  • Simpler Code: Less integration complexity
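
Where does that raw PCM come from? On the client side you capture microphone audio at the rate the model expects. Here's an illustrative snippet using the third-party `sounddevice` package (an assumption; any capture library that produces 16-bit PCM works):

```python
# Capture a chunk of raw PCM from the microphone and base64-encode it,
# matching the audio_data consumed in the example above.
# Assumes `pip install sounddevice`; Gemini Live expects 16 kHz, 16-bit mono input.
import base64
import sounddevice as sd

SAMPLE_RATE = 16000  # 16 kHz, mono, 16-bit PCM

def record_chunk(seconds: float = 0.2) -> str:
    frames = sd.rec(int(seconds * SAMPLE_RATE), samplerate=SAMPLE_RATE,
                    channels=1, dtype="int16")
    sd.wait()  # Block until the chunk is fully recorded
    return base64.b64encode(frames.tobytes()).decode()
```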

## Understanding the Fundamental Difference

### Why Traditional LLMs Need STT/TTS

Most LLMs (GPT-4, Claude, Llama, etc.) are text-only models. They understand text, not audio. This is why you need:

  1. STT: Convert audio → text (so the LLM can understand)
  2. LLM: Process text → text (the actual intelligence)
  3. TTS: Convert text → audio (so the user can hear)

### Why Native Audio Models Don't Need STT/TTS

Native audio models are trained on audio directly. They:

  • Understand audio without text conversion
  • Generate audio without text conversion
  • Process multiple languages automatically (they understand audio patterns, not just text)

Think of it like the difference between:

  • Traditional: Audio → Text → AI → Text → Audio (like using a translator)
  • Native: Audio → AI → Audio (like talking directly)

## Architecture Comparison

### Traditional Pipeline (VAPI/LiveKit)

```
┌──────────────┐        Audio Stream        ┌──────────────┐
│    Client    │ ─────────────────────────> │   LiveKit    │
│ (Microphone) │                            │    Server    │
└──────────────┘                            └──────────────┘
                                                    │
                                                    │ Audio Stream
                                                    ▼
                                            ┌──────────────────┐
                                            │   STT Provider   │
                                            │ (Speech-to-Text) │
                                            └──────────────────┘
                                                    │
                                                    │ Text
                                                    ▼
                                            ┌──────────────────┐
                                            │       LLM        │
                                            │   (Text-only)    │
                                            └──────────────────┘
                                                    │
                                                    │ Text Response
                                                    ▼
                                            ┌──────────────────┐
                                            │   TTS Provider   │
                                            │ (Text-to-Speech) │
                                            └──────────────────┘
                                                    │
                                                    │ Audio Stream
                                                    ▼
┌──────────────┐        Audio Stream        ┌──────────────┐
│    Client    │ <───────────────────────── │   LiveKit    │
│  (Speaker)   │                            │    Server    │
└──────────────┘                            └──────────────┘
```

Components Required:

  • STT Provider ✅
  • LLM ✅
  • TTS Provider ✅
  • Transport Layer (LiveKit/VAPI) ✅

### Native Audio Pipeline (ADK)

```
┌──────────────┐       Raw PCM Audio        ┌───────────────┐
│   Browser    │ ─────────────────────────> │   ADK Agent   │
│ (Microphone) │                            │               │
└──────────────┘                            │  Native Audio │
                                            │     Model     │
┌──────────────┐       Raw PCM Audio        │   (Gemini)    │
│   Browser    │ <───────────────────────── │               │
│  (Speaker)   │                            └───────────────┘
└──────────────┘
```

Components Required:

  • Native Audio Model ✅
  • Transport Layer (optional, can be direct) ✅

## Comparison Table

| Aspect | ADK Native Audio | LiveKit/VAPI + Traditional LLM |
| --- | --- | --- |
| Audio Input | Direct PCM to model | Audio → STT → Text → LLM |
| Audio Output | Direct PCM from model | LLM → Text → TTS → Audio |
| STT Provider | ❌ Not needed | ✅ Required |
| TTS Provider | ❌ Not needed | ✅ Required |
| LLM | ✅ Required (native audio) | ✅ Required (text-only) |
| Multi-language | ✅ Automatic (model handles it) | ⚠️ Must configure STT/TTS per language |
| Latency | Lower (direct processing) | Higher (multiple conversion steps) |
| Complexity | Simpler (single model) | More complex (3+ services) |
| Cost | Single pricing model | Multiple pricing models (STT + LLM + TTS) |
| Setup Time | Minutes | Hours (integration work) |
| Error Points | 1 (model) | 3+ (STT, LLM, TTS) |

## When to Use What: Decision Guide

After building agents with both approaches, here's my guide for choosing the right architecture:

### Use Native Audio Models (ADK/Gemini/OpenAI Realtime) When:

#### 1. Simplicity & Speed to Market

  • ✅ You want the fastest path to a working voice agent
  • ✅ You prefer fewer moving parts
  • ✅ You want less maintenance overhead

My Experience: I built a working ADK voice agent in hours vs. days with VAPI/LiveKit.

#### 2. Low Latency & Natural Conversations

  • ✅ You need the most natural, low-latency conversations
  • ✅ You want end-to-end audio processing
  • ✅ You value natural speech patterns and prosody

My Experience: ADK conversations felt more natural with lower latency.

#### 3. Web/App-Based Voice Agents

  • ✅ Your voice agent is for web or mobile apps
  • ✅ You don't need telephony (phone calls)
  • ✅ You're building 1-on-1 conversations

My Experience: Perfect for web applications and mobile apps.

#### 4. Multi-Language Support

  • ✅ You need automatic multi-language support
  • ✅ You don't want to configure STT/TTS for each language
  • ✅ You want the model to handle language detection

My Experience: I could switch languages mid-conversation without any configuration.

#### 5. Cost Simplicity

  • ✅ You prefer a single pricing model
  • ✅ You want predictable costs
  • ✅ You don't need to optimize per-component costs

My Experience: Easier to budget with a single cost structure.


### Use VAPI/LiveKit When:

#### 1. Telephony & PSTN Integration (VAPI)

  • ✅ You need to make/receive phone calls (PSTN, SIP)
  • ✅ You need phone number management
  • ✅ You need call routing and analytics

Example Use Cases:

  • Customer service hotlines
  • Appointment scheduling via phone
  • Outbound sales calls
  • IVR (Interactive Voice Response) systems

Why Not Native Audio: ADK/Gemini focus on web/app audio, not telephony infrastructure.

#### 2. Video Conferencing & Multi-Participant (LiveKit)

  • ✅ You need video calls or screen sharing
  • ✅ You need multiple participants in the same session
  • ✅ You need spatial audio for group calls

Example Use Cases:

  • Video customer support
  • Virtual meetings with AI participants
  • Educational platforms with video
  • Collaborative workspaces

Why Not Native Audio: Native audio models focus on 1-on-1 voice conversations, not video or multi-party.

#### 3. Best-of-Breed Components

  • ✅ You want to use the best STT, LLM, and TTS for each component
  • ✅ You want to mix and match providers
  • ✅ You need vendor independence

Example Configuration:

```
STT: Deepgram (best accuracy for your domain)
LLM: Claude 3.7 (best reasoning for complex queries)
TTS: ElevenLabs (best voice quality and naturalness)
```
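
For illustration, here is roughly what that mix-and-match looks like with the LiveKit Agents Python SDK. This is a sketch under assumptions: it presumes the `livekit-agents` package plus its Deepgram, Anthropic, and ElevenLabs plugins are installed, and plugin class names and defaults can differ between SDK versions.

```python
# Best-of-breed pipeline with LiveKit Agents (a sketch; class names assumed
# from the livekit-agents SDK and its provider plugins).
from livekit import agents
from livekit.agents import Agent, AgentSession
from livekit.plugins import anthropic, deepgram, elevenlabs

async def entrypoint(ctx: agents.JobContext):
    session = AgentSession(
        stt=deepgram.STT(),      # Swap any component independently
        llm=anthropic.LLM(),     # e.g. a Claude model for reasoning
        tts=elevenlabs.TTS(),    # e.g. ElevenLabs for voice quality
    )
    await session.start(
        room=ctx.room,
        agent=Agent(instructions="You are a helpful voice assistant."),
    )
    await ctx.connect()

if __name__ == "__main__":
    agents.cli.run_app(agents.WorkerOptions(entrypoint_fnc=entrypoint))
```

Each `stt=`, `llm=`, and `tts=` slot is an independent choice, which is exactly the vendor flexibility this section is about.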

Why Not Native Audio: With a native audio model, the model is the entire pipeline, so you can't swap in a different STT or TTS component.

#### 4. Enterprise Integrations & Workflows (VAPI)

  • ✅ You need deep integrations with CRMs, databases, business systems
  • ✅ You need workflow automation
  • ✅ You need data synchronization

Example Use Cases:

  • Sales calls that update CRM automatically
  • Support calls that create tickets
  • Appointment calls that update calendars

Why Not Native Audio: ADK focuses on agent logic, not business system integrations.

#### 5. Self-Hosting & Compliance (LiveKit)

  • ✅ You need to host infrastructure yourself
  • ✅ You need to meet data residency requirements (HIPAA, GDPR)
  • ✅ You need full control over data and infrastructure

Example Use Cases:

  • Healthcare applications (HIPAA compliance)
  • Financial services (data residency)
  • High-volume applications (cost optimization)

Why Not Native Audio: ADK/Gemini are cloud services (less control over infrastructure).

#### 6. Advanced Audio Processing & Control

  • ✅ You need fine-grained control over audio processing pipeline
  • ✅ You need custom Voice Activity Detection (VAD)
  • ✅ You need advanced echo cancellation and noise suppression

Why Not Native Audio: Native models handle audio processing internally (less control).

#### 7. Cost Optimization Through Provider Selection

  • ✅ You need to optimize costs by choosing cheaper providers for specific components
  • ✅ You want to leverage volume discounts from different providers
  • ✅ You need component-level cost tracking

Why Not Native Audio: Single provider pricing (may not be optimal for all use cases).


## Decision Matrix

| Requirement | Use VAPI/LiveKit | Use Native Audio (ADK/Gemini/OpenAI) |
| --- | --- | --- |
| Phone Calls (PSTN) | ✅ VAPI | ❌ |
| Video Conferencing | ✅ LiveKit | ❌ |
| Multi-Participant | ✅ LiveKit | ❌ |
| Best-of-Breed Components | ✅ | ❌ |
| Enterprise Integrations | ✅ VAPI | ⚠️ |
| Self-Hosting | ✅ LiveKit | ❌ |
| Fastest Development | ⚠️ | ✅ |
| Lowest Latency | ⚠️ | ✅ |
| Most Natural Speech | ⚠️ | ✅ |
| Web/App Voice Agents | ⚠️ Possible | ✅ Ideal |
| Cost Optimization | ✅ | ⚠️ |
| Compliance/Data Control | ✅ LiveKit | ⚠️ |
| Multi-Language (Auto) | ⚠️ | ✅ |
| Simplicity | ⚠️ | ✅ |

## Hybrid Approach: Best of Both Worlds

You can actually combine both approaches:

### Use LiveKit for Infrastructure + Native Audio for Processing

```
Client → LiveKit (WebRTC) → ADK Agent (Gemini Native Audio) → LiveKit → Client
```

This gives you:

  • ✅ LiveKit's infrastructure capabilities (video, multi-party)
  • ✅ Native audio's natural conversations (no STT/TTS)
  • ✅ Best of both worlds
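
Concretely, here is a hedged sketch of the hybrid using the LiveKit Agents SDK with its Google plugin, where a single realtime model fills the slot that the STT/LLM/TTS trio occupied above. The class path `google.beta.realtime.RealtimeModel` and its parameters are assumptions that may vary by plugin version:

```python
# Hybrid: LiveKit provides WebRTC transport and rooms, while Gemini's
# native audio model replaces the whole STT/LLM/TTS stack (a sketch).
from livekit import agents
from livekit.agents import Agent, AgentSession
from livekit.plugins import google

async def entrypoint(ctx: agents.JobContext):
    session = AgentSession(
        # One realtime model instead of separate stt= / llm= / tts=
        llm=google.beta.realtime.RealtimeModel(
            model="gemini-2.0-flash-live-001",
            voice="Puck",
        ),
    )
    await session.start(
        room=ctx.room,
        agent=Agent(instructions="Have natural conversations with users."),
    )
    await ctx.connect()
```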

When to Use Hybrid:

  • You need LiveKit's video/multi-party features BUT want native audio's simplicity
  • You want WebRTC transport BUT don't want STT/TTS complexity
  • You need infrastructure flexibility BUT prefer natural conversations

## Provider Comparison: Native Audio Support

### Google Gemini (via ADK)

✅ Full Native Audio Support

  • Models: gemini-2.0-flash-live-001, gemini-live-2.5-flash-preview-native-audio-09-2025
  • Capability: True end-to-end audio processing
  • Architecture: Single model handles audio input → understanding → generation → audio output
  • No STT/TTS needed: The model processes raw PCM audio directly
  • Multi-language: Automatic support (model understands audio natively)

My Experience: This is what I used, and it worked flawlessly without any STT/TTS configuration.

### OpenAI

⚠️ Native Audio Support (with Ecosystem Complexity)

  • Models: GPT-4o Realtime API, GPT-4o-Audio-Preview
  • Capability: GPT-4o has full native audio support (similar to Gemini)
  • Architecture: Realtime API provides true native speech-to-speech interactions
  • Complexity: OpenAI also offers separate STT/TTS models (gpt-4o-transcribe, gpt-4o-mini-tts), which can be confusing
  • Multi-language: Supported natively through Realtime API

Key Difference: OpenAI offers both native audio AND separate STT/TTS models, while Gemini focuses on native audio only.
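
For comparison, here is a minimal sketch of talking to the Realtime API directly over WebSocket. The event names follow OpenAI's published Realtime protocol; the `websockets` dependency and the single-chunk flow are illustrative (real apps stream continuously), and header passing differs across `websockets` versions.

```python
# Minimal GPT-4o Realtime sketch: base64 PCM in, base64 PCM out.
# Assumes `pip install websockets` and an OPENAI_API_KEY env var.
import base64, json, os
import websockets

async def one_turn(pcm_chunk: bytes) -> bytes:
    url = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"
    headers = {
        "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
        "OpenAI-Beta": "realtime=v1",
    }
    async with websockets.connect(url, extra_headers=headers) as ws:
        # Append raw audio to the input buffer, then request a response
        await ws.send(json.dumps({
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(pcm_chunk).decode(),
        }))
        await ws.send(json.dumps({"type": "input_audio_buffer.commit"}))
        await ws.send(json.dumps({"type": "response.create"}))

        audio_out = b""
        async for raw in ws:
            event = json.loads(raw)
            if event["type"] == "response.audio.delta":
                audio_out += base64.b64decode(event["delta"])  # Raw PCM back
            elif event["type"] == "response.done":
                break
        return audio_out
```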

### Anthropic Claude

❌ No Native Audio API

  • Models: Claude 3.5 Sonnet, Claude 3.7 Sonnet (text-based)
  • Capability: Text-only models
  • Architecture: Requires third-party integrations for audio
  • STT/TTS Required: Yes, must use external providers
  • Multi-language: Depends on third-party STT/TTS providers

Note: Anthropic does not provide native audio APIs. Any voice interactions with Claude require external STT/TTS services.


## My Key Learnings

### 1. Not All LLMs Are Created Equal

The biggest revelation was understanding that some models are native audio models while others are text-only models. This fundamental difference determines your entire architecture.

### 2. Simplicity Wins for Most Use Cases

For web/app voice agents, native audio models provide:

  • Faster development
  • Lower latency
  • More natural conversations
  • Simpler architecture

Unless you have specific requirements (telephony, video, enterprise integrations), native audio is usually the better choice.

### 3. The Trade-offs Are Real

VAPI/LiveKit provide valuable features that native audio models don't:

  • Telephony infrastructure
  • Video conferencing
  • Best-of-breed component selection
  • Enterprise integrations

But these come with complexity and cost.

### 4. Hybrid Approaches Are Possible

You don't have to choose one or the other. You can use LiveKit for infrastructure and native audio models for processing, getting the best of both worlds.

### 5. Language Support Is a Game-Changer

Native audio models handle multiple languages automatically, without configuration. This is a huge advantage over traditional STT/TTS pipelines where you need to configure each language separately.


## Real-World Examples from My Projects

### Project 1: Custom Audio Streaming App (ADK)

What I Built: A web application for real-time voice conversations with Google Search integration.

Architecture: ADK with Gemini native audio model

Why ADK:

  • Web-based application
  • Needed fast development
  • Wanted natural conversations
  • Multi-language support without configuration

Result: Built in hours, worked flawlessly, no STT/TTS needed.
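
The serving pattern behind this kind of project is worth sketching: a WebSocket endpoint that relays base64 PCM between the browser and the live session. This assumes the runner setup from the earlier code example is wrapped in a `start_agent_session(user_id)` helper; the endpoint path and message shape here are illustrative, loosely following the ADK custom audio streaming sample.

```python
# WebSocket bridge between the browser and the ADK live session (a sketch;
# assumes the earlier runner setup is wrapped in start_agent_session()).
import asyncio, base64, json
from fastapi import FastAPI, WebSocket
from google.genai import types

app = FastAPI()

@app.websocket("/ws/{user_id}")
async def websocket_endpoint(websocket: WebSocket, user_id: str):
    await websocket.accept()
    live_events, live_request_queue = await start_agent_session(user_id)

    async def upstream():
        # Browser -> model: forward base64-encoded PCM chunks as they arrive
        while True:
            message = json.loads(await websocket.receive_text())
            live_request_queue.send_realtime(
                types.Blob(data=base64.b64decode(message["data"]),
                           mime_type="audio/pcm")
            )

    async def downstream():
        # Model -> browser: forward generated audio back as base64
        async for event in live_events:
            if event.content and event.content.parts:
                part = event.content.parts[0]
                if part.inline_data:
                    await websocket.send_text(json.dumps({
                        "mime_type": "audio/pcm",
                        "data": base64.b64encode(part.inline_data.data).decode(),
                    }))

    # Run both directions concurrently for full-duplex audio
    await asyncio.gather(upstream(), downstream())
```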

### Project 2: Voice Bot with LiveKit (Hypothetical)

What I Would Build: A video customer support system with multiple participants.

Architecture: LiveKit + STT + LLM + TTS

Why LiveKit:

  • Need video conferencing
  • Multiple participants
  • Need screen sharing

Why Not ADK: ADK doesn't support video or multi-participant sessions.


## Summary: When to Choose What

### Choose Native Audio (ADK/Gemini/OpenAI) if:

  • ✅ You want the simplest solution
  • ✅ You need the lowest latency
  • ✅ You want the most natural conversations
  • ✅ You're building web/app voice agents
  • ✅ You want fastest time to market
  • ✅ You need automatic multi-language support
  • ✅ You prefer single pricing model

### Choose VAPI/LiveKit if:

  • ✅ You need telephony (phone calls)
  • ✅ You need video conferencing
  • ✅ You need multi-participant sessions
  • ✅ You need enterprise integrations
  • ✅ You need to self-host
  • ✅ You want best-of-breed components
  • ✅ You need fine-grained audio control
  • ✅ You need component-level cost optimization

### Consider Hybrid if:

  • ✅ You need LiveKit's infrastructure (video, multi-party) BUT want native audio's natural conversations

## Conclusion

My journey from VAPI/LiveKit to ADK taught me that not all voice agent architectures are created equal. Native audio models represent a fundamental shift in how we think about voice AI: from complex pipelines to direct audio processing.

The key insight: If you're building web/app voice agents and don't need telephony or video, native audio models (like Gemini via ADK) are usually the better choice. They're simpler, faster, and more natural.

But if you need telephony, video, enterprise integrations, or best-of-breed components, VAPI/LiveKit with traditional STT/TTS pipelines are still the right choice.

The important thing is understanding the trade-offs and choosing the right architecture for your specific use case.

