The tl;dr for the Busy Dev
I built a production-ready voice AI agent that went from 5+ seconds of latency to sub-second responses through 8 systematic optimization phases. The journey wasn't just about code—it was about understanding where bottlenecks hide and how simple changes can have massive impact.
The Stack:
- LiveKit Agents SDK - Real-time WebRTC infrastructure
- OpenAI - STT (Whisper → GPT-4o-mini-transcribe) & LLM (GPT-4o → GPT-4o-mini)
- ElevenLabs - Text-to-Speech synthesis
- Python 3.11 - Implementation language
The Results:
- 🚀 7x faster - Total latency: 5.5s → 0.7s (best case)
- ⚡ 3-8x LLM improvement - TTFT: 4.7s → 0.4s
- 💨 98% STT improvement - Subsequent transcripts: 2.1s → 0.026s (near-instant!)
- 💰 10x cost reduction - Switched from GPT-4o to GPT-4o-mini
- 🧠 Context management - Automatic pruning prevents unbounded growth
- 🔧 MCP integration - Voice agent can now execute document operations via voice commands
The Key Insight: Optimization is iterative. Each fix reveals the next bottleneck. Start with metrics, optimize based on data, and don't be afraid to make "obvious" changes—they often have the biggest impact.
The Challenge: Building a Voice Agent That Doesn't Feel Like a Robot
I was building a voice AI agent for a project, and the initial results were... disappointing. Users would ask a question, wait 5+ seconds, and get a response that felt sluggish and robotic. The agent technically worked, but it didn't feel natural.
The Human Baseline:
Research shows that in human conversations, the average response time is 236 milliseconds after your conversation partner finishes speaking, with a standard deviation of ~520 milliseconds. In other words, most natural human responses land within roughly 750 ms of the other person finishing (one standard deviation above the mean).
The Goal: Build a production-ready voice agent that:
- Understands natural speech in real-time
- Generates intelligent responses quickly
- Speaks back with natural-sounding voice
- Handles interruptions gracefully
- Provides metrics for continuous optimization
- Feels natural - ideally within one standard deviation of human response times
The Reality Check: My initial implementation had:
- 5+ second total latency (20x slower than human average!)
- 4.7 second LLM response time (the primary bottleneck)
- 2+ second STT processing (batch processing, not streaming)
- No visibility into where time was being spent
I needed to cut latency by at least 5x to make it feel natural. This wasn't going to be a simple fix—it required systematic optimization across every component.
The Benchmark: According to voice agent research, the theoretical best case for a voice agent pipeline is around 540 milliseconds—just within one standard deviation of human expectations. That became my target.
The Architecture: Pipeline vs. End-to-End
Before diving into optimizations, I had to make a fundamental architectural decision: Pipeline approach vs. Speech-to-Speech models.
The Choice: Pipeline approach (STT → LLM → TTS)
Why?
- Fine-grained control - Optimize each component independently
- Flexibility - Swap models/providers for each stage
- Debugging - Inspect intermediate outputs (transcriptions, LLM responses)
- Cost optimization - Use different models based on requirements
- Production readiness - Better suited for real-world applications
- Granular trade-offs - Don't have to make global trade-offs (can optimize STT for accuracy, LLM for speed, TTS for quality)
Trade-off: More complexity, but worth it for production use cases.
Key Insight from Voice Agent Architecture: The pipeline approach allows you to allocate your latency budget strategically. For example:
- Restaurant bookings: Prioritize LLM reasoning (spend more latency on LLM)
- Medical triage: Prioritize STT accuracy (spend more latency on STT)
This flexibility is critical for real-world applications where different use cases have different priorities.
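As a toy illustration of what "allocating a latency budget" can look like in practice, here is a hedged sketch; the numbers and structure are invented for the example, not taken from my agent:

```python
# Hypothetical per-use-case latency budgets (seconds) for a ~1s end-to-end target.
LATENCY_BUDGETS = {
    "restaurant_booking": {"stt": 0.20, "llm": 0.60, "tts": 0.20},  # spend the budget on LLM reasoning
    "medical_triage":     {"stt": 0.50, "llm": 0.30, "tts": 0.20},  # spend the budget on STT accuracy
}
```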
Phase 1: The Initial Implementation (The Baseline)
Initial Stack:
- STT: OpenAI Whisper-1 (batch processing)
- LLM: GPT-4o (high quality, but slow)
- TTS: ElevenLabs (excellent quality)
- VAD: Silero (lightweight, open-source)
- Infrastructure: LiveKit Cloud (WebRTC for real-time communication)
Why LiveKit?
LiveKit provides a globally distributed mesh network that reduces network latency by 20-50% compared to direct peer-to-peer connections. It's the same infrastructure OpenAI uses for ChatGPT's Advanced Voice Mode. The WebRTC-based architecture enables:
- Real-time network measurement and pacing
- Automatic audio compression (97% reduction in data size)
- Automatic packet timestamping
- Persistent, stateful connections (essential for conversational agents)
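To make the baseline concrete, this is roughly how the initial pipeline was wired with the LiveKit Agents SDK. Treat it as a sketch: the `Agent`/`AgentSession` names follow the 1.x API, and exact constructor arguments vary between SDK and plugin versions.

```python
from livekit import agents
from livekit.agents import Agent, AgentSession
from livekit.plugins import openai, elevenlabs, silero

async def entrypoint(ctx: agents.JobContext):
    await ctx.connect()
    session = AgentSession(
        vad=silero.VAD.load(),              # Silero VAD for speech detection
        stt=openai.STT(model="whisper-1"),  # baseline: batch-style transcription
        llm=openai.LLM(model="gpt-4o"),     # high quality, but slow time-to-first-token
        tts=elevenlabs.TTS(),               # ElevenLabs synthesis
    )
    await session.start(
        room=ctx.room,
        agent=Agent(instructions="You are a helpful voice assistant."),
    )

if __name__ == "__main__":
    agents.cli.run_app(agents.WorkerOptions(entrypoint_fnc=entrypoint))
```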
Initial Performance:
- Total Latency: 3.9-5.5 seconds (15-20x slower than human average!)
- LLM TTFT: 1.0-4.7 seconds (50-85% of total latency) ⚠️
- STT Duration: 0.5-2.5 seconds (30-40% of latency)
- TTS TTFB: 0.2-0.3 seconds (excellent, not a bottleneck)
- VAD: ~20ms (minimal, necessary for accuracy)
The Problem: LLM was the primary bottleneck, accounting for 50-85% of total latency. A single slow LLM response (4.7s) could make the entire interaction feel broken. I was 20x slower than human response times—completely unacceptable for a natural conversation.
Phase 2: The "Obvious" Fix That Changed Everything
The Discovery: I was using GPT-4o for every response, but most conversations didn't need that level of capability. GPT-4o-mini provides 80% of the quality at 10% of the cost—and it's 3-8x faster.
The Change:
```python
# Before
llm = openai.LLM(model="gpt-4o")

# After
llm = openai.LLM(model="gpt-4o-mini")
```
The Results:
- ✅ LLM TTFT: 1.0-4.7s → 0.36-0.59s (3-8x faster!)
- ✅ Tokens/sec: 4.5-17.7 → 11.3-32.3 (2-4x faster!)
- ✅ Total Latency: 2.3-3.0s (1.6-2x faster!)
- ✅ Cost: 10x reduction
- ✅ Consistency: Much more predictable performance
The Lesson: Sometimes the "obvious" fix is the most impactful. Don't optimize prematurely—measure first, then optimize based on data.
Phase 3: Unlocking Real-Time STT Streaming
The Problem: After optimizing the LLM, STT became the new bottleneck (60-70% of latency). The agent was processing entire audio clips before returning transcripts—no streaming.
The Discovery: OpenAI's STT supports real-time streaming with use_realtime=True, but I wasn't using it.
The Change:
```python
# Before
stt = openai.STT(model="whisper-1")

# After
stt = openai.STT(
    model="whisper-1",
    use_realtime=True,  # Enable real-time streaming
)
```
The Results:
- ✅ STT Latency: 1.6-2.1s → 0.026s-2.04s
- First transcript: 1.5-2.0s (connection overhead)
- Subsequent: 0.026s-0.07s (98% faster! Near-instant!)
- ✅ Total Latency: 2.3-3.0s → 1.1-3.5s (avg ~2.0s, 20-30% faster!)
- ✅ Best Case: ~0.7s total latency achieved
- ✅ User Experience: Real-time transcription, partial results, interruption handling
The Insight: One parameter change (use_realtime=True) delivered a 98% improvement for subsequent transcripts. This is why metrics matter—without visibility, I wouldn't have known STT was the bottleneck.
Phase 4: System Prompt Optimization (The Hidden Cost)
The Discovery: My system prompt was 50-190 tokens. That's not just cost—it's latency. Every token in the prompt adds processing time.
The Optimization:
- Removed verbose instructions
- Focused on essential behavior
- Reduced from 50-190 tokens to 30 tokens (60-70% reduction)
The Results:
- ✅ Prompt Tokens: 50-190 → 30 (60-70% reduction)
- ✅ LLM Processing: Faster due to smaller prompt size
- ✅ Cost: Reduced prompt token costs
- ✅ Quality: Maintained response quality
The Lesson: Every token counts. Optimize prompts not just for clarity, but for speed and cost.
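For reference, the trimmed prompt was in the spirit of the snippet below; the exact wording here is illustrative, not my production prompt:

```python
# Illustrative ~30-token system prompt for a voice agent.
INSTRUCTIONS = (
    "You are a friendly voice assistant. "
    "Keep answers short, conversational, and easy to say aloud. "
    "Ask one brief clarifying question if a request is unclear."
)
```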
Phase 5: STT Model Optimization (The Realtime Model)
The Discovery: whisper-1 is great for accuracy, but gpt-4o-mini-transcribe is optimized for real-time performance.
The Change:
```python
# Before
stt = openai.STT(model="whisper-1", use_realtime=True)

# After
stt = openai.STT(
    model="gpt-4o-mini-transcribe",  # Realtime-optimized
    language="en",                   # Explicit language (removes auto-detection overhead)
    use_realtime=True,
)
```
The Results:
- ✅ First Transcript: 1.318s → 0.824s (37% improvement!)
- ✅ Subsequent Transcripts: Maintained 0.010-0.036s (near-instant)
- ✅ Language Detection: Removed overhead by explicitly setting language
The Insight: Model selection matters. A model optimized for real-time use can deliver significant improvements even when the previous model was already "fast enough."
Phase 6 & 7: Context Management (Preventing Unbounded Growth)
The Problem: As conversations get longer, context grows. Without management, you hit token limits, latency increases, and costs explode.
The Solution: Implemented automatic context pruning and summarization:
- Sliding Window: Keep recent 10 messages
- Summarization: Summarize middle messages (10-20) into a concise summary
- Pruning: Drop very old messages (30+)
The Implementation:
```python
class ContextManager:
    def __init__(self):
        self.recent_window = 10        # keep the 10 most recent messages verbatim
        self.middle_window = 20        # messages 10-20 are candidates for summarization
        self.summarize_threshold = 15  # trigger summarization at 15 messages

    async def summarize_old_messages(self, llm, messages):
        # Use the LLM to create a concise summary that preserves key
        # information while reducing tokens.
        ...
```
The Results:
- ✅ Context Growth: 25→227 tokens over 4 turns → Managed at 800-900 tokens (40% reduction projected)
- ✅ Pruning Triggered: At 16 messages, 6 messages summarized into 280-character summary
- ✅ Latency Impact: Zero negative impact - summarization runs asynchronously
- ✅ Quality: Preserved context quality through intelligent summarization
The Achievement: Maintained sub-1.0s latency despite context growth, preventing unbounded expansion that would have killed performance.
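For the curious, the windowing policy boils down to something like the sketch below, reusing the `ContextManager` fields from the class above. It is simplified: the helper name is illustrative, and in the real agent the summarization step runs asynchronously so it never blocks a response.

```python
async def prune_context(manager: ContextManager, llm, messages: list) -> list:
    # Below the threshold, leave the history untouched.
    if len(messages) <= manager.summarize_threshold:
        return messages

    recent = messages[-manager.recent_window:]  # keep the newest messages verbatim
    older = messages[:-manager.recent_window]   # everything else is compressible
    middle = older[-(manager.middle_window - manager.recent_window):]
    # Messages older than the middle window are dropped entirely.

    summary = await manager.summarize_old_messages(llm, middle)
    return [{"role": "system", "content": f"Conversation summary: {summary}"}, *recent]
```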
Phase 8: MCP Integration (Beyond Conversation)
The Goal: Connect the voice agent to an MCP (Model Context Protocol) server to enable document operations via voice commands.
The Challenge: Long-running tool executions (e.g., document analysis taking 60+ seconds) could cause:
- STT WebSocket timeouts
- LiveKit watchdog killing "unresponsive" processes
The Solution:
- Heartbeat Mechanism: Periodic logging every 5 seconds to keep process alive
- Async API Calls: Run blocking Anthropic API calls in thread pool executor
- Enhanced Error Handling: Classify STT timeouts as expected/non-fatal for long operations
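The pattern is simple enough to sketch; the function below shows the shape of it. Names are illustrative, and the real implementation hooks into the agent's logging and tool-call plumbing.

```python
import asyncio
import logging

logger = logging.getLogger("voice-agent")

async def run_blocking_tool(func, *args, heartbeat_secs: float = 5.0):
    # Run a blocking call (e.g. a synchronous API client) in the default thread
    # pool so the event loop stays responsive, and log a heartbeat while it runs
    # so watchdogs never see the process as hung.
    loop = asyncio.get_running_loop()
    future = loop.run_in_executor(None, func, *args)
    while True:
        done, _ = await asyncio.wait({future}, timeout=heartbeat_secs)
        if done:
            return future.result()
        logger.info("Long-running tool still executing (heartbeat)")
```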
The Results:
- ✅ MCP Tools Available: 6 tools (read, edit, analyze, compare, search documents)
- ✅ Tool Execution: Successfully executed 57-second document analysis
- ✅ Process Stability: No more "unresponsive" kills
- ✅ Latency Impact: Zero negative impact on conversation flow
The Achievement: Voice agent can now execute complex document operations via voice commands, expanding capabilities beyond simple conversation.
The Performance Evolution: By the Numbers
Before Optimization
| Stage | Average Time | % of Total | Bottleneck? |
|---|---|---|---|
| VAD | ~20ms | <1% | No |
| STT | 1.6s | 30-40% | Secondary |
| LLM | 2.5s | 50-85% | YES - Primary |
| TTS | 0.3s | 5-10% | No |
| Total | ~5.5s | 100% | |
After All Optimizations
| Stage | Average Time | % of Total | Bottleneck? |
|---|---|---|---|
| VAD | ~20ms | <1% | No |
| STT | 0.5s | 30-40% | ✅ Optimized |
| LLM | 0.70s | 40-50% | ✅ Balanced |
| TTS | 0.33s | 15-20% | No |
| Total | ~0.9-1.2s | 100% | ✅ 7x faster! |
The Unforeseen Advantage: Metrics-Driven Optimization
A major advantage I hadn't fully anticipated was how comprehensive metrics would transform the optimization process. By tracking every stage of the pipeline, I could:
Identify Bottlenecks Instantly: When LLM was slow, metrics showed it immediately. When STT became the bottleneck, metrics revealed it.
Measure Impact Precisely: Every optimization had quantifiable results. "3-8x faster" isn't marketing—it's data.
Catch Regressions Early: When context management was added, metrics confirmed zero latency impact.
Enable Data-Driven Decisions: Instead of guessing, I optimized based on actual performance data.
The Metrics I Tracked:
- LLM: TTFT (Time to First Token), tokens/sec, prompt tokens, completion tokens
- STT: Duration, audio duration, streaming status, transcript delay
- TTS: TTFB (Time to First Byte), duration, streaming status
- Context: Size, growth rate, pruning events, summarization triggers
- EOU: End of utterance delay, transcription delay
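In the LiveKit Agents SDK these numbers arrive as metrics events on the session. A hedged sketch of the wiring, using the 1.x event and helper names as I understand them:

```python
from livekit.agents import AgentSession, MetricsCollectedEvent, metrics

def attach_metrics_logging(session: AgentSession) -> metrics.UsageCollector:
    usage = metrics.UsageCollector()

    @session.on("metrics_collected")
    def _on_metrics(ev: MetricsCollectedEvent):
        metrics.log_metrics(ev.metrics)  # logs LLM TTFT, STT duration, TTS TTFB, EOU delay, etc.
        usage.collect(ev.metrics)        # accumulates usage for end-of-session cost reporting

    return usage
```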
Learning from Course: The course emphasized that TTFT (Time to First Token) is the critical metric for LLM optimization—it defines how long users wait before anything starts happening. Similarly, TTFB (Time to First Byte) for TTS determines perceived responsiveness. Focusing on these two metrics led to the biggest improvements.
The Result: Every optimization was measurable, every improvement was quantifiable, and every decision was data-driven.
Key Learnings: What I Wish I Knew Earlier
1. Optimization is Iterative
Each fix reveals the next bottleneck. LLM optimization revealed STT as the bottleneck. STT optimization revealed context management needs. This is normal—embrace it.
Learning from Course: The pipeline architecture means each component can be optimized independently. When you fix one bottleneck, the next one becomes visible. This is the nature of systematic optimization.
2. Simple Changes Have Massive Impact
- One parameter (use_realtime=True): 98% STT improvement
- One model switch (gpt-4o → gpt-4o-mini): 3-8x LLM improvement
- One prompt optimization: 60-70% token reduction
Learning from Course: The course emphasized that Time to First Token (TTFT) is the critical metric for LLM optimization. Focusing on this single metric led to the biggest improvements.
3. Metrics are Essential
You can't optimize what you don't measure. Comprehensive metrics enabled every optimization decision.
Learning from Course: The course taught me to track:
- LLM: TTFT (Time to First Token) - the critical metric
- TTS: TTFB (Time to First Byte) - perceived responsiveness
- STT: Streaming status, transcript delays
- EOU: End of utterance delays
These metrics became my optimization compass.
4. Context Management is Critical
Without pruning/summarization, context grows unbounded, latency increases, and costs explode. This is a production requirement, not a nice-to-have.
Learning from Course: The course highlighted that context management is essential for maintaining performance in long conversations. LiveKit's Agents SDK handles this automatically, but understanding the mechanism helped me optimize it further.
5. Model Selection Matters
A model optimized for real-time (gpt-4o-mini-transcribe) can deliver 37% improvement over a general-purpose model (whisper-1), even when both are "fast enough."
Learning from Course: The course emphasized that different models have different latency/quality/cost profiles. Choosing the right model for your use case is critical.
6. Async Operations are Your Friend
Long-running operations (document analysis, summarization) should never block the conversation. Use async patterns, thread pools, and heartbeat mechanisms.
7. Streaming is Non-Negotiable
Learning from Course: The course taught me that streaming at every stage is essential:
- STT should transcribe continuously (not wait for complete audio)
- LLM should stream tokens as they're generated
- TTS should synthesize as text arrives
This parallel processing reduces overall latency dramatically.
8. Turn Detection is Complex but Critical
Learning from Course: Turn detection uses a hybrid approach:
- Acoustic (VAD): Detects presence/absence of speech
- Semantic (Transformer): Analyzes meaning to identify turn boundaries
This prevents premature turn-taking and enables natural conversation flow.
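In practice this hybrid shows up as two separate knobs on the session: the VAD plugin and a semantic turn-detector model. A sketch of how they combine; plugin and parameter names reflect my understanding of the 1.x SDK and may differ in your version:

```python
from livekit.agents import AgentSession
from livekit.plugins import silero
from livekit.plugins.turn_detector.english import EnglishModel  # semantic end-of-turn model

session = AgentSession(
    vad=silero.VAD.load(),          # acoustic: is anyone speaking right now?
    turn_detection=EnglishModel(),  # semantic: does the utterance sound finished?
    min_endpointing_delay=0.5,      # wait at least 0.5s of silence before committing a turn
    # ... stt / llm / tts as configured elsewhere
)
```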
9. Interruption Handling is Essential
Learning from Course: Users will interrupt. The agent must:
- Detect interruptions via VAD
- Flush the entire pipeline (LLM, TTS, playback)
- Immediately prepare for new input
This makes conversations feel natural, not robotic.
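Continuing the session sketch from the previous point, interruption behavior is a couple of session-level knobs plus a manual escape hatch; again, the parameter names are my best understanding of the 1.x API:

```python
session = AgentSession(
    # ... vad / stt / llm / tts / turn_detection ...
    allow_interruptions=True,       # let the user barge in over agent speech
    min_interruption_duration=0.5,  # require ~0.5s of speech before treating it as an interruption
)

# The pipeline can also be flushed programmatically, e.g. when a UI "stop" control is pressed:
session.interrupt()
```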
10. Human Response Times are the Benchmark
Learning from Course: Human average response time is 236ms with ~520ms standard deviation, and the theoretical best case for a voice agent pipeline is ~540ms, just within one standard deviation. That window became my target, and my best-case runs (~0.7s) land just inside it.
The Final Architecture: Production-Ready
Current Stack:
- STT: gpt-4o-mini-transcribe with use_realtime=True and language="en"
- LLM: gpt-4o-mini with an optimized 30-token system prompt
- TTS: ElevenLabs with streaming enabled
- Context Management: Automatic pruning and summarization
- MCP Integration: 6 tools for document operations
- Metrics: Comprehensive real-time tracking
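Putting those choices together, the session wiring ends up looking roughly like the snippet below (same caveat as the earlier sketches: constructor details depend on SDK and plugin versions):

```python
session = AgentSession(
    vad=silero.VAD.load(),
    stt=openai.STT(
        model="gpt-4o-mini-transcribe",  # realtime-optimized transcription
        language="en",                   # skip language auto-detection
        use_realtime=True,               # stream partial transcripts
    ),
    llm=openai.LLM(model="gpt-4o-mini"),  # fast TTFT, ~10x cheaper than gpt-4o
    tts=elevenlabs.TTS(),                 # streaming synthesis
)
```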
Current Performance:
- ✅ LLM TTFT: 0.375-1.628s (avg: 0.699s) - Excellent
- ✅ TTS TTFB: 0.280-0.405s (avg: 0.327s) - Excellent
- ✅ STT First Transcript: 0.824s - Good
- ✅ STT Subsequent: 0.010-0.036s - Near-instant
- ✅ Total Latency: 0.9-1.2s for typical interactions
- ✅ Best Case: ~0.7s total latency
Industry Benchmarks:
- ✅ TTFT Target: < 1s → Achieved (avg 0.699s)
- ✅ TTFB Target: < 0.5s → Exceeded (avg 0.327s)
- ✅ Total Latency Target: < 2s → Achieved (avg ~1.0s)
Takeaways: Why This Journey Matters
For anyone building voice AI agents, here's what I learned:
Start with Metrics: You can't optimize what you don't measure. Instrument everything from day one.
Optimize Iteratively: Each fix reveals the next bottleneck. This is normal—embrace the iterative process.
Simple Changes, Big Impact: Don't overthink it. Sometimes the "obvious" fix (model switch, one parameter) delivers the biggest improvement.
Context Management is Non-Negotiable: Without pruning/summarization, your agent will slow down and cost more as conversations get longer.
Real-Time Streaming is Essential: Batch processing feels slow. Real-time streaming feels natural.
Model Selection Matters: Choose models optimized for your use case (realtime, cost, quality).
The Bottom Line: Building a production-ready voice AI agent isn't just about code—it's about understanding the pipeline, measuring performance, and optimizing systematically. Through 8 phases of optimization, I achieved a 7x latency reduction and 10x cost reduction while maintaining quality.
The journey from 5 seconds to 0.7 seconds wasn't magic—it was methodical optimization, comprehensive metrics, and data-driven decisions.
Course Inspiration: This journey was heavily inspired by the DeepLearning.AI Voice Agents course, which taught me:
Human Response Time Benchmarks: The course revealed that human average response time is 236ms with ~520ms standard deviation, giving me a clear target. The theoretical best-case for voice agents is ~540ms—just within one standard deviation.
Pipeline Architecture: The course emphasized the pipeline approach (STT → LLM → TTS) over end-to-end models, enabling granular optimizations. This architecture allowed me to optimize each component independently.
Critical Metrics: The course taught me to focus on TTFT (Time to First Token) for LLM and TTFB (Time to First Byte) for TTS—these became my optimization compass.
Streaming is Essential: The course highlighted that streaming at every stage (STT, LLM, TTS) is non-negotiable for low latency. This led to my 98% STT improvement.
Turn Detection Complexity: The course explained the hybrid approach (VAD + semantic processing) for turn detection, which helped me understand why certain optimizations worked.
WebRTC & LiveKit: The course introduced me to WebRTC and LiveKit's infrastructure, which reduced network latency by 20-50% compared to direct connections.
Context Management: The course emphasized that context management is critical for maintaining performance in long conversations, inspiring my Phase 6 & 7 optimizations.
The course's structured approach to understanding voice agent architecture, latency optimization, and metrics collection provided the foundation for this entire optimization journey.
What's Next?
The agent is production-ready, but optimization never ends. Here's what I'm planning next:
Immediate Next Steps
- Fine-tune STT Turn Detection
  - Optimize VAD and semantic turn detection thresholds
  - Reduce false positives/negatives
  - Improve interruption detection accuracy
- Response Caching
  - Cache common queries and responses
  - Reduce redundant LLM calls
  - Further latency and cost improvements
- Multi-Language Support
  - Expand beyond English
  - Optimize STT for multiple languages
  - Handle language switching mid-conversation
Medium-Term Improvements
- Self-Hosting for Lower Latency
  - Consider self-hosting the LLM or moving to faster inference providers (Groq, Cerebras, TogetherAI)
  - Reduce API call latency
  - Full control over infrastructure
- Advanced Context Management
  - Implement relevance scoring for context selection
  - Add semantic search for context retrieval
  - RAG (Retrieval-Augmented Generation) integration
- Web Client Integration
  - Build a web client with the LiveKit SDK
  - Unified chat history (voice + text)
  - Seamless experience across platforms
Long-Term Vision
- Custom Voice Cloning
  - Train custom voices for specific use cases
  - Brand consistency
  - Personalized experiences
- Advanced Tool Integration
  - Expand MCP tool capabilities
  - Add more document operations
  - Integrate with external APIs
- Performance Monitoring Dashboard
  - Real-time metrics visualization
  - Alerting for performance degradation
  - A/B testing framework for optimizations
I'll keep y'all posted on the progress! 🚀
Follow along as I continue optimizing and expanding the voice agent's capabilities. The journey from 5 seconds to 0.7 seconds was just the beginning—there's always room for improvement, and I'm excited to see how far we can push the boundaries of real-time voice AI.
Questions? Drop a comment below—I'd love to hear about your voice AI optimization journey!