Building Three Voice Agents: Architecture, Latency Optimization, and Real-World Learnings
TL;DR: I built three distinct voice systems (Raspberry Pi assistant, Twilio phone agent, Alexa skill) optimizing for different constraints. End-to-end latencies range from 0.5s to 3s. The biggest wins came from understanding what actually matters in each context, not blindly chasing speed.
Introduction: Why I Built Three Voice Systems
Most teams pick one approach to voice AI. I picked three — each solving a different problem.
I needed:
- A local smart home voice assistant (Raspberry Pi, fast and resource-efficient)
- A phone-based voice agent (Twilio inbound, sub-1s responsiveness)
- An Alexa skill (leverage existing smart speaker ecosystem)
Rather than force-fit a single solution, I built each from first principles, optimizing for its specific constraints. The result: three systems with wildly different architectures, latencies, and trade-offs — and surprising learnings about what "good voice" actually means.
System 1: Jarvis — Raspberry Pi Voice Assistant
The Challenge
Home automation voice assistants have unique constraints. Network connectivity is spotty. Power is limited. Latency doesn't matter if the user isn't waiting (wake word + think time is expected). But voice quality matters — you're listening through your home's audio system.
Architecture
Jarvis listens locally on a Raspberry Pi 4 (2GB RAM, no GPU). The pipeline is straightforward:
- Wake word detection: Porcupine SDK with built-in "jarvis" keyword (~100ms)
- Audio capture: Raw PCM via ALSA microphone (20-40ms frames)
- Speech-to-text: ElevenLabs Scribe v2 API (600-1000ms)
- LLM inference: Groq llama-4-scout-17b (direct API, 300-500ms) + Claude Haiku via gateway fallback for complex reasoning
- Text-to-speech: ElevenLabs (Benedict voice, warm British butler) + Smallest.ai Lightning (Chetan voice, Indian pronunciation clarity) (300-400ms)
- Audio playback: ALSA speaker output (200ms)
TTFR (time to first audio response): 1–2 seconds (streaming TTS)
Total response time: 3–5 seconds
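The shape of this pipeline, and where TTFR is actually measured, can be sketched with stub stages. These stand in for Porcupine, ALSA capture, the Scribe API, Groq, and streaming TTS; the function names and canned values are illustrative, not Jarvis's real code.

```python
import time

# Stub stages standing in for the real wake-word / STT / LLM / TTS calls.
def capture_utterance() -> bytes:
    return b"\x00\x00" * 320          # 20 ms of 16 kHz mono silence

def transcribe(pcm: bytes) -> str:
    return "turn on the living room lights"

def think(text: str) -> str:
    return "Certainly, sir. Lights on."

def speak_streaming(text: str) -> float:
    # With streaming TTS, playback begins on the first audio chunk,
    # so TTFR is clocked here rather than after full synthesis.
    return time.monotonic()

def handle_turn() -> float:
    start = time.monotonic()
    reply = think(transcribe(capture_utterance()))
    first_audio_at = speak_streaming(reply)
    return first_audio_at - start     # TTFR in seconds

print(f"TTFR: {handle_turn() * 1000:.1f} ms")
```

The point of the sketch: TTFR ends at the *first* audio chunk, which is why streaming TTS (discussed later in the optimization journey) matters so much.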
Key Design Decisions
Why ElevenLabs STT, not local?
I initially tried Whisper on the Pi. The ARM binary caused SIGILL crashes (illegal instruction — faster_whisper doesn't ship ARM SIMD). Cloud STT was the workaround. Trade-off: ~600ms latency for reliability.
Why Groq LLM (primary) + Gateway Fallback?
The Pi can't run local LLMs reliably (memory constraints). Groq is the primary LLM for fast inference. For complex questions that need deeper reasoning, Jarvis escalates to Claude Haiku via the gateway. This two-tier approach prioritizes speed while maintaining quality for harder questions.
Why two TTS providers?
ElevenLabs gives Benedict his warm British butler tone — polished and natural. But it's 1.1-1.9s per call and butchers Indian names and context. Smallest.ai Lightning (Chetan) runs at 300-400ms and handles Indian pronunciation correctly. In an Indian household where the assistant says "Sundar" and plays things on "Hotstar," people need to actually understand what's being said. So Jarvis uses both: Benedict for character, Chetan for clarity when Indian context matters.
Why Chetan specifically?
I tested several Smallest.ai voices. Chetan (smooth Indian male) felt the most natural in-home — warm without being overly formal. The key: he pronounces Indian names, movie titles, and place names correctly. When Jarvis says "playing Vikram Vedha on Hotstar," everyone in the room understands it. That's the bar.
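A minimal version of the two-voice routing might look like the following. The keyword set and voice labels mirror the setup described above but are assumptions for illustration, not Jarvis's actual routing logic.

```python
# Hypothetical context keywords; the real trigger list would be broader.
INDIAN_CONTEXT = {"hotstar", "vikram", "vedha", "sundar", "chennai", "diwali"}

def choose_voice(text: str) -> str:
    """Route a response to the TTS voice that will pronounce it correctly."""
    words = {w.strip(".,!?").lower() for w in text.split()}
    if words & INDIAN_CONTEXT:
        return "chetan"    # Smallest.ai Lightning: fast, correct Indian pronunciation
    return "benedict"      # ElevenLabs: warm butler tone for everything else

print(choose_voice("Playing Vikram Vedha on Hotstar, sir."))  # chetan
print(choose_voice("The weather today is quite pleasant."))   # benedict
```

In practice the trigger could also come from the LLM itself (a metadata flag in the response), but a keyword check is a cheap first pass.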
Latency Breakdown
| Stage | Time |
|---|---|
| Wake word detection | 100ms |
| Audio capture + transmission | 40ms |
| STT (Scribe v2) | 600-1000ms |
| LLM (Groq Scout direct) | 300-500ms |
| TTS (Smallest.ai) | 300-400ms |
| Audio playback setup | 200ms |
| TTFR (first audio) | 1–2s |
| Total response | 3–5s |
The Personality Layer
Jarvis has a defined persona: Alfred the butler. The system prompt enforces formal British courtesy, refers to the user as "sir," and maintains a professional tone. This isn't fluff — it shapes how users perceive the system. A warm butler taking time to think feels intentional. A robot hesitating feels broken.
System 2: Riya Voice Call — Phone-Based Agent
The Challenge
Inbound phone calls have radically different constraints than local assistants. Users expect sub-second response times (human conversation baseline is ~200-500ms). The audio codec is μ-law 8kHz (tiny bandwidth). Barge-in (interrupting) is critical. And you can't afford failures — a dropped call is a bad experience.
Architecture
Riya runs on a DigitalOcean VPS (2 vCPU, 8GB RAM), handling inbound calls via Twilio Media Streams.
- Inbound audio: Twilio Media Streams (μ-law 8kHz, ~20ms frames, ~100ms network latency)
- Turn detection: Silero VAD (Voice Activity Detection, ~20ms inference)
- Speech-to-text: Groq Whisper (batch, silence-triggered, 150-320ms)
- LLM inference:
  - Primary: Groq llama-4-scout-17b (300-500ms)
  - Fallback 1: Groq llama-3.3-70b-versatile (500-800ms)
  - Fallback 2: Groq llama-3.1-8b-instant (200-400ms, different TPD pool)
- Text-to-speech: Smallest.ai Lightning with Pooja voice, batched by sentence (200-500ms per sentence, then sent to Twilio)
- Audio encode: PCM → μ-law 8kHz (~50ms)
- Outbound audio: Twilio Media Streams (~100ms network)
Total end-to-end: 0.5–1.1 seconds
Key Design Decisions
Why Scout LLM?
Scout is Groq's latest model, optimized for latency. A time-to-first-token of ~80ms is critical in voice: the moment the first LLM token arrives, the entire pipeline can begin (TTS starts streaming immediately). I measured TTFT across providers, and Scout beat every OpenAI model I tried.
Why a three-tier fallback chain?
Groq's rate limits (TPD = tokens per day) are aggressive. During heavy voice call usage, the 70b-versatile hits its limit first. By chaining Scout (primary) → 70b (fallback 1) → 8b-instant (fallback 2), I can gracefully degrade without dropping calls. Each tier taps a different TPD pool.
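The chain itself is simple control flow. Here is a sketch with a stubbed completion call; the `RateLimited` exception and `call` parameter stand in for the real Groq SDK's error handling, and the `max_tokens=60` cap matches the output limit described below in this section.

```python
class RateLimited(Exception):
    """Stand-in for the SDK's rate-limit error."""

FALLBACK_CHAIN = [
    "llama-4-scout-17b",        # primary: lowest TTFT
    "llama-3.3-70b-versatile",  # fallback 1: separate TPD pool
    "llama-3.1-8b-instant",     # fallback 2: another pool, always available
]

def complete(prompt: str, call) -> str:
    last_err = None
    for model in FALLBACK_CHAIN:
        try:
            return call(model=model, prompt=prompt, max_tokens=60)
        except RateLimited as err:
            last_err = err          # this pool is exhausted; try the next tier
    raise last_err                  # all three pools exhausted: genuinely down

# Demo: primary is rate-limited, fallback 1 answers.
def flaky(model, prompt, max_tokens):
    if model == "llama-4-scout-17b":
        raise RateLimited(model)
    return f"{model}: ok"

print(complete("hi", flaky))  # llama-3.3-70b-versatile: ok
```

The caller never sees which tier answered, which is exactly the point: degradation is invisible to the person on the phone.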
Why μ-law optimization matters:
Voice calls compress audio to 8kHz μ-law — about 1/20th the bandwidth of full-fidelity speech. Most TTS services output CD-quality audio that gets mangled by the codec. I chose Smallest.ai primarily for Indian language pronunciation — it handles Indian names, places, and context naturally. Pooja's voice profile (crisp articulation, minimal breathiness) also survives μ-law compression well. It doesn't sound as natural as ElevenLabs in isolation, but it pronounces things correctly through telephony codecs, and that matters more in practice.
Why not streaming TTS on Riya?
I initially tried streaming TTS (like Jarvis), but μ-law compression fragments mid-stream audio. When TTS streams raw PCM chunks and Twilio compresses them live, the codec boundaries create audible chunking at random points. Solution: batch TTS by sentence boundaries. Each complete sentence gets compressed as a unit, then plays smoothly without artifacts. Adds ~100-200ms per sentence (wait for completion before sending), but the audio quality is worth it.
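Sentence batching reduces to one small piece of logic: accumulate streamed LLM tokens and emit only complete sentences, so each one can be synthesized and compressed as a unit. A sketch, with a deliberately simple sentence-boundary regex (the real system would need to handle abbreviations and edge cases):

```python
import re

SENTENCE_END = re.compile(r'(?<=[.!?])\s+')

def sentences_from_stream(tokens):
    """Yield complete sentences as LLM tokens stream in."""
    buf = ""
    for tok in tokens:
        buf += tok
        parts = SENTENCE_END.split(buf)
        for sentence in parts[:-1]:   # everything but the last fragment is complete
            yield sentence
        buf = parts[-1]               # keep the unfinished tail
    if buf.strip():
        yield buf.strip()

stream = ["Your appointment ", "is at 4 p.m. ", "See you ", "then!"]
print(list(sentences_from_stream(stream)))
# ['Your appointment is at 4 p.m.', 'See you then!']
```

Each yielded sentence goes to TTS, gets μ-law-encoded whole, and is sent to Twilio, so codec boundaries always fall at sentence boundaries.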
Why cap LLM output at 60 tokens?
This is a hidden latency lever. By limiting max_tokens=60, the LLM produces 1-2 short sentences max per turn. This means the "first sentence" is often the only sentence — so TTFR ≈ total response time. Longer responses would improve answer quality but destroy conversational pacing. For voice, brevity is speed.
Why Silero VAD, not cloud?
Cloud VAD services add 100-300ms of latency. Silero runs locally (~20ms). I trade perfect accuracy for speed — some false positives on pauses, but the user can interrupt, and that covers most cases.
Latency Breakdown
| Stage | Measured Time |
|---|---|
| Inbound (Twilio) | ~100ms |
| VAD (Silero, inline) | ~20ms |
| STT (Groq Whisper, batch) | 150-320ms |
| LLM first sentence (Scout) | 85-500ms |
| TTS first sentence (Smallest.ai) | 200-500ms |
| Audio encode + outbound | ~150ms |
| Measured TTFR | 0.5–1.1s |
Why is the total this low? Two reasons:
STT is faster than expected. Groq Whisper transcription takes only 150-320ms for typical utterances (measured from production logs — not the 400-600ms I initially estimated from Whisper benchmarks).
LLM streaming + sentence-level TTS. Scout's TTFT (time-to-first-token) is ~80ms. The first complete sentence arrives in 85-500ms depending on complexity. TTS starts immediately on that first sentence — it doesn't wait for the full LLM response.
The critical path is: STT (~200ms) + LLM first sentence (~100ms) + TTS (~300ms) + network (~100ms) ≈ 700ms — which matches our measured 500-1100ms range.
The Codec Realization
The biggest learning: codec choice determines voice design AND architecture. In full-fidelity systems, warmth and dynamic range matter. In μ-law 8kHz, clarity and articulation dominate. Pooja's professional tone survives the compression because she enunciates precisely. Deepika (warm, young) sounded muffled and lifeless through the same pipeline. This isn't a bug in the TTS — it's a fundamental property of the compression.
Also: streaming TTS doesn't work well with μ-law. Raw PCM chunks get fragmented by the codec, causing audible artifacts. Solution: batch TTS by sentence boundaries so each complete audio unit compresses cleanly. This adds ~100-200ms wait per sentence but maintains audio quality.
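To make the compression concrete, here is the standard G.711 μ-law encoding of a single 16-bit PCM sample down to 8 bits, i.e. a 2:1 byte reduction on top of the 8kHz sample rate. This is the textbook algorithm for illustration, not Riya's production encoder (which happens inside the Twilio pipeline).

```python
BIAS, CLIP = 0x84, 32635  # standard G.711 constants

def linear_to_ulaw(sample: int) -> int:
    """Encode one signed 16-bit PCM sample as an 8-bit mu-law byte."""
    sign = 0x80 if sample < 0 else 0
    magnitude = min(abs(sample), CLIP) + BIAS
    exponent, mask = 7, 0x4000
    while exponent > 0 and not magnitude & mask:  # find the top set bit
        exponent -= 1
        mask >>= 1
    mantissa = (magnitude >> (exponent + 3)) & 0x0F
    return ~(sign | (exponent << 4) | mantissa) & 0xFF  # G.711 inverts the bits

assert linear_to_ulaw(0) == 0xFF       # digital silence
assert linear_to_ulaw(32767) == 0x80   # positive full-scale clip
```

The logarithmic mantissa/exponent encoding is why μ-law preserves quiet speech detail but flattens dynamic range, which is exactly why "warm" voices suffer and crisp articulation survives.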
Barge-In Implementation
When the user interrupts:
- Silero VAD detects speech onset
- Cancel flag is set — LLM generation stops, in-flight TTS requests are abandoned
- Output audio queue is drained (discard pending audio)
- Twilio media stream is cleared (suppress queued audio on Twilio's side)
All of this happens atomically via async/await. The latency from "user starts speaking" to "agent goes silent" is ~50-100ms — fast enough to feel natural.
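The cancellation path can be sketched with asyncio primitives. This is an illustrative reconstruction under assumptions: `clear_twilio_stream` stands in for sending Twilio's `clear` message over the media-stream websocket, and the generation loops are assumed to check the shared event.

```python
import asyncio

async def barge_in(cancel: asyncio.Event,
                   audio_q: asyncio.Queue,
                   clear_twilio_stream) -> None:
    cancel.set()                      # generation/TTS loops watch this flag and stop
    while not audio_q.empty():        # discard audio already synthesized locally
        audio_q.get_nowait()
    await clear_twilio_stream()       # suppress audio queued on Twilio's side

async def demo():
    cancel, q = asyncio.Event(), asyncio.Queue()
    for chunk in (b"aa", b"bb", b"cc"):   # pretend these are pending audio frames
        q.put_nowait(chunk)
    cleared = []
    async def clear(): cleared.append(True)
    await barge_in(cancel, q, clear)
    return cancel.is_set(), q.empty(), bool(cleared)

print(asyncio.run(demo()))  # (True, True, True)
```

Because everything runs on one event loop, the three steps happen without interleaved audio sends, which is what "atomically" means here.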
System 3: Agent Tina — Alexa Skill
The Challenge
Alexa skills live in a walled garden. You don't control STT, NLU, or TTS — they're handled by Amazon's servers. Your job is to provide the intelligence layer. The constraint is strict: you get a text string (what the user said), you return text (what to say), and Alexa handles the rest.
Architecture
Tina is deployed via OpenClaw's custom gateway endpoint (invoked by Alexa) with persistent session context via ACP harness (OpenClaw's Anthropic-based coding environment).
- Voice input: Alexa ASR (proprietary, ~500-800ms)
- Intent detection: Alexa NLU (built-in slots/intents, ~100-200ms)
- Request parsing: Gateway endpoint receives Alexa intent
- LLM inference: Claude Haiku via OpenClaw gateway (400-800ms)
- Tool invocation: 10 integrated tools (web search via SearXNG, weather, etc.)
- Response generation: Claude crafts natural language response
- Voice output: Alexa TTS (~300-500ms)
Total end-to-end: 1.5–2.5 seconds
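The text-in/text-out contract is easiest to see as a minimal handler. The response envelope follows Alexa's documented JSON format; the request parsing is simplified, the slot name `query` is an assumption, and `handle_with_llm` stands in for the Claude Haiku call through the gateway.

```python
def handle_with_llm(utterance: str) -> str:
    # Stand-in for the gateway -> Claude Haiku round trip.
    return f"Here's what I found about: {utterance}"

def handle_alexa_request(event: dict) -> dict:
    utterance = (event["request"]["intent"]
                 ["slots"]["query"]["value"])   # slot name is hypothetical
    return {
        "version": "1.0",
        "response": {
            "outputSpeech": {"type": "PlainText",
                             "text": handle_with_llm(utterance)},
            "shouldEndSession": True,
        },
    }

event = {"request": {"intent": {"slots": {"query": {"value": "today's weather"}}}}}
print(handle_alexa_request(event)["response"]["outputSpeech"]["text"])
```

Everything outside this function (ASR, NLU, TTS) belongs to Amazon; the intelligence layer only ever touches strings and JSON.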
Key Design Decisions
Why Claude Haiku, not Groq?
Alexa skills are single-turn by default (user speaks once, agent responds). Haiku is optimized for quality in low-latency contexts. Groq's smaller models (8b) work but felt less natural. I chose quality over speed here because Alexa's front-end latency is already ~1s.
Why OpenClaw gateway?
Tina is built entirely on the OpenClaw gateway endpoint (no Lambda required). This gives us:
- Unified secret management (keys stay on VPS)
- Request logging and observability
- Easy model switching (Haiku → Opus if needed)
- Rate limiting and fallback handling at the gateway level
Why 10 tools instead of 3?
Tina started with web search only. I added weather, calendar, news, device control. Each tool adds latency if invoked, but Haiku is smart about tool selection — it only calls tools when needed. For most queries, Haiku answers directly (~400ms). Only tool-heavy queries hit the full 1.5-2.5s range.
Latency Breakdown
| Stage | Time |
|---|---|
| Alexa ASR | 500-800ms |
| Intent detection (NLU) | 100-200ms |
| LLM (Claude Haiku) | 400-800ms |
| Tool invocation (if needed) | 0-500ms |
| Alexa TTS | 300-500ms |
| Total | 1.5–2.5s |
The Ecosystem Play
Tina's design trades raw latency for ecosystem integration. Users can invoke Tina from any Alexa device, use voice commands naturally ("Alexa, ask Tina..."), and get context-aware responses. The 1.5-2.5s latency is acceptable because Alexa's native latency already dominates the budget. What matters is intelligence and naturalness.
Comparative Analysis: Three Systems, Three Choices
| Metric | Jarvis (Pi) | Riya (Phone) | Tina (Alexa) |
|---|---|---|---|
| TTFR (first audio) | 1–2s | 0.5–1.1s | 1.5–2.5s |
| STT provider | ElevenLabs | Groq Whisper | Alexa ASR |
| LLM | Groq Scout (+ fallback) | Groq Scout (+ fallbacks) | Claude Haiku |
| TTS | ElevenLabs (Benedict) + Smallest.ai (Chetan) | Smallest.ai (Pooja) | Alexa TTS |
| Network | Tailscale LAN | Twilio → VPS | Alexa → OpenClaw Gateway |
| Use case | Local smart home | Inbound + outbound calls | Alexa devices |
| Optimization focus | Quality over speed | Speed + resilience | Intelligence + integration |
Key insight: There is no "best" tech stack. Each system optimizes for its constraints. Jarvis's 1-2s TTFR is actually quite responsive for smart home voice — users hear feedback within a moment, even though the full response finishes over the next few seconds. Riya needs sub-1s TTFR because phone calls demand immediate responsiveness. Tina prioritizes integration over ultra-low latency because Alexa's ecosystem is the main value.
Learnings: What Actually Matters in Voice AI
1. Codec Design AND Language Shape Voice Choice
I learned this the hard way. Deepika sounds beautiful in high-fidelity environments (Jarvis testing) but lifeless through μ-law compression (phone calls). Pooja was designed for telephony codecs — she enunciates clearly, minimizes breathiness, survives bandwidth reduction. But there's another dimension: language and pronunciation. ElevenLabs sounds more natural overall, but Smallest.ai handles Indian names, places, and context correctly. In an Indian household or calling Indian phone numbers, pronunciation accuracy beats vocal polish.
Actionable: If you're building a voice system, match TTS voice to your codec AND your language context. Full bandwidth? Choose warm, dynamic voices. Telephony? Choose crisp, articulate voices. Non-English context? Choose a provider that actually handles your language.
2. TTFR (Time To First Response) — What Users Actually Feel
What matters is TTFR: the time from when the user stops speaking to when they hear the agent's voice.
TTFR = STT latency + TTFT (time-to-first-token) + TTS streaming + network
In Riya's pipeline:
- STT: 150-320ms (Groq Whisper batch)
- TTFT: ~80ms (Scout), first sentence: 85-500ms
- TTS first sentence: 200-500ms
- Network: ~100ms
- Measured TTFR: ~500–1100ms (sub-1s for most calls)
User perception:
- Below 1s = responsive, natural
- 1-2s = acceptable, slight pause noticeable
- Above 2s = slow, conversation feels broken
Jarvis's 1-2s TTFR (first audio) feels natural for smart home voice — the butler starts speaking quickly, and the full response finishes as the user listens. Give Riya's sub-1s snap responses to a smart home assistant, though, and they would feel jarring: an assistant that is supposed to "think" shouldn't answer instantly. Context matters.
Why TTFT matters internally: The moment the first LLM token arrives, you can stream to TTS. Every millisecond of TTFT directly reduces TTFR.
TTFT benchmarks across providers:
- Groq Scout: ~80ms TTFT
- Groq 70b: ~200ms TTFT
- OpenAI GPT-4o-mini: ~300-400ms TTFT
- Claude Haiku: ~200-300ms TTFT
Groq's infrastructure dominates for latency-sensitive workloads because Scout has the lowest TTFT.
Actionable: Optimize for TTFR (user perception), but understand that TTFT is the internal lever. Low TTFT = low TTFR = feels responsive.
3. Fallback Chains Are Underrated
Riya's three-tier fallback system has saved us dozens of times. When Groq's 70b TPD limit hits (usually around 3-5 concurrent calls), I transparently fall back to 8b-instant in a different pool. Users don't notice. Calls don't drop.
Most voice platforms (Vapi, ElevenLabs SDKs) don't ship with intelligent fallbacks. This was custom engineering, but it's paid dividends in production reliability.
Actionable: Build fallback chains into your voice systems. Different model sizes access different rate-limit pools. Graceful degradation beats silent failures.
4. Geography Matters — A Real Constraint
Nick Tikhonov's blog post showed that deploying orchestration in EU cut latency in half (1.7s → 0.8s). I'm US-deployed, which adds ~100-200ms per service hop. With Riya handling inbound calls from India, geography is becoming a real constraint.
The challenge: Groq and ElevenLabs don't have Indian datacenters. Any optimization would require deploying orchestration closer to India (Singapore, Hong Kong) and accepting the latency to US-based LLM/TTS services. This is a trade-off worth revisiting as call volume increases.
Actionable: Geography compounds across service hops. If you're serving a region far from your service providers, recognize the latency tax — it's baked in until providers expand regionally.
5. Voice Personality Is Not an Afterthought
Jarvis's butler persona ("sir," formal tone, professional cadence) makes the latency feel intentional rather than broken. Riya's professional Pooja voice survives μ-law compression better than warm alternatives. Tina's spy/undercover vibe makes tool use feel like intelligence gathering, not API calling.
None of this is accidental. The system prompts, voice choices, and interaction patterns all reinforce the intended character.
Actionable: Define your voice system's personality early. It shapes technical choices (voice codec, LLM temperature, response style) downstream.
6. Barge-In (Interruption) Changes Everything
Jarvis doesn't support barge-in (local audio hardware can't capture while playing). Riya supports it flawlessly (parallel audio receive/process/send). Tina delegates to Alexa (which handles it natively).
Barge-in is the difference between "voice interface" and "conversation." It requires:
- Non-blocking turn detection
- Atomic cancellation of in-flight generation/TTS/audio
- Sub-100ms latency on the barge-in path
Most voice platforms underinvest here. I didn't, and it's why Riya feels natural.
Actionable: If you want natural voice, barge-in is non-negotiable. Plan for it architecturally.
Jarvis Optimization Journey: 7s → 1-2s TTFR
When I first built Jarvis, it took 7 seconds from wake word to first audio. Now it's 1-2 seconds. Here's what moved the needle:
What Didn't Help (Red Herrings)
- Switching from Claude Haiku to Groq Scout improved TTFT (~200ms → ~80ms), but LLM was never the main bottleneck
- Optimizing the gateway didn't matter (not in the critical path for direct Groq)
- Local models on Pi (didn't work, ARM architecture issues)
What Actually Helped (3 Big Wins + 1 Assist)
1. Adding Smallest.ai Lightning (1.1-1.9s → 0.3-0.4s for Indian context)
   - ElevenLabs Benedict: warm but slow (1.1-1.9s) and mispronounces Indian words
   - Added Smallest.ai Chetan: fast (300-400ms) and handles Indian pronunciation
   - Cut latency by 60% for Indian-context responses
   - Single biggest win
2. STT Provider Matters (cross-system learning)
   - Jarvis uses ElevenLabs Scribe v2: 600-1000ms per transcription
   - Riya uses Groq Whisper: 150-320ms (batch, but Groq's inference speed makes batch fast enough)
   - Jarvis hasn't switched yet — but this cross-system comparison showed that provider choice matters as much as streaming vs batch
3. Streaming TTS Instead of Batch
   - Before: wait for the full TTS file, then play
   - After: stream audio as it's generated; the user hears the first words in 0.1-0.5s
   - Enabled TTFR optimization (first audio vs total response)
4. Direct Groq API (Not Gateway)
   - Groq Scout TTFT: ~80ms, first sentence in ~100ms (vs the gateway, which adds network hops)
   - Keeps primary queries fast; the gateway handles only complex escalations
Result: 7s → 1-2s TTFR (5x improvement). The user hears Jarvis responding within 2 seconds, even though the full response finishes over 3-5 seconds.
Real-World Wins: Proof It Works
Architecture and latency metrics are nice, but what matters is whether these systems actually work for real tasks. Here are the things I've successfully done with these voice agents:
Jarvis (Pi - Smart Home):
- Made my home lights go disco mode
- Asked to play a good South Indian crime thriller on my TV
- Handles smart home commands naturally
Riya (Phone - Outbound Calls):
- Called my dad to let him know I'd be home late
- Successfully booked a real salon appointment
- Placed outbound calls with multi-turn conversations
- Relayed information accurately (times, confirmations)
- Handled interruptions naturally (barge-in)
Tina (Alexa - Ecosystem Integration):
- Weather queries, news briefings
- Smart home control via voice
- Tool invocation for web searches
These aren't benchmarks. They're actual uses. The systems work because the architecture is right — not because of raw speed, but because of thoughtful trade-offs.
What I'd Do Differently
1. Riya: Use Deepgram Flux Instead of Silero VAD
Deepgram Flux (semantic turn detection) would eliminate false positives and add ~200ms latency improvement. I haven't switched because Silero is working well and Flux requires API changes. But if I were rebuilding, that's the move.
2. Tina: Custom Prompt Caching
Alexa calls tend to be similar (weather, news, smart home control). Caching the system prompt across calls could save 100-200ms. Claude's prompt caching feature makes this feasible.
Metrics That Matter (And Don't)
Metrics that matter:
- Time to First Response (TTFR) — user-perceived latency (stop speaking → hear voice)
- Time-to-first-token (TTFT) — the lever you pull to improve TTFR
- Barge-in latency — can user interrupt without talking over agent?
- Voice codec survival — does voice sound natural through compression?
- Fallback chain activation rate — does system degrade gracefully?
- User perception of naturalness — does it feel like conversation or robot?
Metrics that don't (yet):
- Raw end-to-end latency (below ~1s for phone, below ~5s for home assistants, it's psychology not physics)
- Model size (quality matters, size doesn't)
- Number of tools (intelligence matters, tool count doesn't)
What made real calls like the salon booking work:
- Sub-1s typical latency: Conversation feels natural, no awkward pauses (typical range: 490-1090ms)
- Barge-in: The caller interrupts ("Can you repeat that?") and Riya doesn't talk over them
- Phone number handling: Hardcoded regex recognizes phone numbers, relays digit-by-digit clearly
- Context awareness: Riya remembers the 4 p.m. time, restates it in confirmation
- Fallback reliability: All Groq calls succeeded (no model fallbacks needed)
This is what sub-1s latency enables — actual useful work, not just impressive benchmarks.
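The phone-number handling mentioned above is worth a concrete sketch: TTS engines tend to read long digit runs as one huge number, so the fix is to space the digits out before synthesis. The 10-digit pattern is an assumption (Indian mobile numbers); Riya's actual regex may differ.

```python
import re

PHONE = re.compile(r'\b(\d{10})\b')   # assumed pattern for 10-digit numbers

def relay_digits(text: str) -> str:
    """Rewrite phone numbers so TTS reads them digit by digit."""
    return PHONE.sub(lambda m: " ".join(m.group(1)), text)

print(relay_digits("You can reach me at 9876543210."))
# You can reach me at 9 8 7 6 5 4 3 2 1 0.
```

The same trick applies to confirmation codes and OTPs: anything the listener needs to write down should be relayed one character at a time.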
Conclusion
I built three voice systems because one couldn't satisfy three different use cases. The learnings are system-agnostic:
- Understand your constraints. Local audio? Telephony? Ecosystem integration? Choose tech accordingly.
- Optimize the bottleneck. For voice, that's TTFT + barge-in. Not overall latency.
- Personality + reliability > raw speed. Users prefer a slightly slower, predictable system to a faster, flaky one.
- Fallback chains save you. Graceful degradation is worth the engineering cost.
- Voice is an orchestration problem. The magic is in how you wire components together, not the components themselves.
If you're building voice AI, these systems aren't blueprints to copy — they're case studies in making trade-offs. Your constraints will be different. Your choices should be too.
Have similar projects? Share in the comments — I'd love to hear how your voice stacks compare.
This blog was co-authored by Riya, my personal AI assistant.