Most voice AI tutorials end at "call the ElevenLabs API."
That's not a platform. That's a demo that breaks the moment ElevenLabs changes pricing.
I spent 30 days building Mithivoices, an open-source TTS/STT platform with 19+ neural voices, 8 languages (including Hindi, Malayalam, Marathi), and 442ms end-to-end latency. The key design decision: swap between Piper, ElevenLabs, OpenAI, or Coqui with one config change and no code changes.
This is the full architecture. Copy what you need.
Table of Contents
- The Real Problem With Cloud-Only Voice AI
- What We're Building
- Step 1: The Abstraction Layer
- Step 2: The Config System
- Step 3: Real-Time Audio Engine
- Step 4: LangGraph for Stateful Memory
- Step 5: How We Hit 442ms
- Project Structure
- Running It
- What to Build Next
- Three Things I'd Tell Myself
The Real Problem With Cloud-Only Voice AI {#the-real-problem}
Every production voice system hits the same wall eventually:
| Problem | What breaks |
|---|---|
| ElevenLabs raises prices 3x | Your unit economics collapse overnight |
| OpenAI changes the Whisper endpoint | Your STT pipeline goes down |
| You need Hindi or Malayalam support | Your English-first provider fails you |
| Client wants offline deployment | Cloud-first architecture is unusable |
The fix isn't to pick the "best" provider. It's to build an abstraction layer that makes the provider irrelevant.
What We're Building {#what-were-building}
A provider-agnostic voice AI platform:
- TTS: Piper TTS locally by default → swap to ElevenLabs/OpenAI via config
- STT: Whisper STT locally by default → swap to cloud Whisper/Deepgram via config
- Orchestration: FastAPI + Redis WebSocket engine for real-time bidirectional audio
- Memory: LangGraph stateful agents — conversational context persists across turns
- Target: Sub-500ms end-to-end latency (we hit 442ms)
Step 1: The Abstraction Layer (Most Important Part) {#step-1-abstraction-layer}
Build this first. Everything else plugs into it.
# voice/base.py
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Optional

@dataclass
class TTSConfig:
    provider: str  # "piper" | "elevenlabs" | "openai" | "coqui"
    voice_id: str
    language: str
    speed: float = 1.0

@dataclass
class STTConfig:
    provider: str  # "whisper_local" | "whisper_api" | "deepgram"
    model_size: str = "base"
    language: Optional[str] = None

class TTSProvider(ABC):
    @abstractmethod
    async def synthesize(self, text: str, config: TTSConfig) -> bytes:
        """Returns raw audio bytes (WAV format)"""
        pass

class STTProvider(ABC):
    @abstractmethod
    async def transcribe(self, audio_bytes: bytes, config: STTConfig) -> str:
        """Returns transcribed text"""
        pass
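One payoff of the interface is that a provider can be faked for tests. The following is a minimal sketch, not part of the original project: `SilentTTSProvider` and `speak` are hypothetical names, and the base classes are repeated only so the block runs on its own. It produces a valid WAV of silence, which lets the WebSocket and agent layers be exercised without models or API keys.

```python
import asyncio
import io
import wave
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class TTSConfig:
    provider: str
    voice_id: str
    language: str
    speed: float = 1.0


class TTSProvider(ABC):
    @abstractmethod
    async def synthesize(self, text: str, config: TTSConfig) -> bytes:
        """Returns raw audio bytes (WAV format)"""


class SilentTTSProvider(TTSProvider):
    """Hypothetical test double: returns silence whose length scales with the text."""

    def __init__(self, sample_rate: int = 16000, ms_per_char: int = 50):
        self.sample_rate = sample_rate
        self.ms_per_char = ms_per_char

    async def synthesize(self, text: str, config: TTSConfig) -> bytes:
        duration_ms = len(text) * self.ms_per_char / config.speed
        n_frames = int(self.sample_rate * duration_ms / 1000)
        buf = io.BytesIO()
        with wave.open(buf, "wb") as wav:
            wav.setnchannels(1)       # mono
            wav.setsampwidth(2)       # 16-bit PCM
            wav.setframerate(self.sample_rate)
            wav.writeframes(b"\x00\x00" * n_frames)
        return buf.getvalue()


async def speak(provider: TTSProvider, text: str) -> bytes:
    # Application code depends only on the interface, never on a concrete provider
    config = TTSConfig(provider="silent", voice_id="none", language="en_US")
    return await provider.synthesize(text, config)


audio = asyncio.run(speak(SilentTTSProvider(), "hello"))
```

Any real provider can be passed to `speak` in place of the stub, which is the property the rest of the architecture relies on.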
Now implement Piper as your local default:
# voice/providers/piper_tts.py
import asyncio
import os
import tempfile

from voice.base import TTSProvider, TTSConfig

class PiperTTSProvider(TTSProvider):
    def __init__(self, model_dir: str = "./models/tts"):
        self.model_dir = model_dir

    async def synthesize(self, text: str, config: TTSConfig) -> bytes:
        model_path = f"{self.model_dir}/{config.language}/{config.voice_id}.onnx"
        # Piper writes to a file, so reserve a temp path and delete it afterwards
        with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp:
            output_path = tmp.name
        try:
            cmd = [
                "piper",
                "--model", model_path,
                "--output_file", output_path,
                "--sentence_silence", "0.1"
            ]
            proc = await asyncio.create_subprocess_exec(
                *cmd,
                stdin=asyncio.subprocess.PIPE,
                stdout=asyncio.subprocess.PIPE,
                stderr=asyncio.subprocess.PIPE
            )
            # Piper reads the text to synthesize from stdin
            _, stderr = await proc.communicate(input=text.encode())
            if proc.returncode != 0:
                # Surface model/path errors instead of returning an empty file
                raise RuntimeError(f"piper failed: {stderr.decode(errors='replace')}")
            with open(output_path, "rb") as f:
                return f.read()
        finally:
            os.unlink(output_path)
And ElevenLabs as a drop-in cloud swap — same interface, zero code change in your app:
# voice/providers/elevenlabs_tts.py
import httpx

from voice.base import TTSProvider, TTSConfig

class ElevenLabsTTSProvider(TTSProvider):
    def __init__(self, api_key: str):
        self.api_key = api_key
        self.base_url = "https://api.elevenlabs.io/v1"

    async def synthesize(self, text: str, config: TTSConfig) -> bytes:
        async with httpx.AsyncClient() as client:
            response = await client.post(
                f"{self.base_url}/text-to-speech/{config.voice_id}",
                headers={"xi-api-key": self.api_key},
                json={
                    "text": text,
                    "model_id": "eleven_multilingual_v2",
                    "voice_settings": {"speed": config.speed}
                }
            )
            # Fail loudly on 4xx/5xx instead of returning an error body as "audio"
            response.raise_for_status()
            # Note: this endpoint returns MP3 by default, not WAV; request a
            # different output format or convert if callers expect WAV
            return response.content
Step 2: The Config System (One File to Rule Them All) {#step-2-config-system}
# config.yml: the only file you edit to swap providers
tts:
  provider: piper          # swap to: elevenlabs | openai | coqui
  voice_id: en_US-lessac-medium
  language: en_US
  speed: 1.0

stt:
  provider: whisper_local  # swap to: whisper_api | deepgram
  model_size: base         # tiny | base | small | medium | large
  language: null           # null = auto-detect

elevenlabs:
  api_key: ${ELEVENLABS_API_KEY}

# 8 language packs supported
languages:
  - en_US
  - hi_IN   # Hindi
  - ml_IN   # Malayalam
  - mr_IN   # Marathi
  - ta_IN   # Tamil
  - te_IN   # Telugu
  - bn_IN   # Bengali
  - gu_IN   # Gujarati
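One detail worth checking in this config: `yaml.safe_load` does not expand `${ELEVENLABS_API_KEY}` on its own, so unless the repo's loader handles it, the key arrives as the literal placeholder string. A minimal sketch of one way to resolve placeholders after parsing follows; `expand_env` and `find_unresolved` are hypothetical helper names, not functions from the project.

```python
import os
from typing import Any, List


def expand_env(value: Any) -> Any:
    """Recursively replace ${VAR} placeholders in a parsed config with environment values."""
    if isinstance(value, str):
        # expandvars leaves the placeholder untouched when the variable is not set
        return os.path.expandvars(value)
    if isinstance(value, dict):
        return {k: expand_env(v) for k, v in value.items()}
    if isinstance(value, list):
        return [expand_env(v) for v in value]
    return value


def find_unresolved(value: Any, path: str = "") -> List[str]:
    """Return the config paths that still contain a ${...} placeholder."""
    if isinstance(value, str):
        return [path] if "${" in value else []
    if isinstance(value, dict):
        found = []
        for key, item in value.items():
            child = f"{path}.{key}" if path else str(key)
            found.extend(find_unresolved(item, child))
        return found
    if isinstance(value, list):
        found = []
        for index, item in enumerate(value):
            found.extend(find_unresolved(item, f"{path}[{index}]"))
        return found
    return []
```

Calling `expand_env` on the result of `yaml.safe_load` and then failing fast when `find_unresolved` is non-empty surfaces a missing key at startup rather than as a 401 on the first request.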
Provider factory — loads the right class based on config:
# voice/factory.py
import yaml

from voice.providers.piper_tts import PiperTTSProvider
from voice.providers.elevenlabs_tts import ElevenLabsTTSProvider
from voice.providers.openai_tts import OpenAITTSProvider  # implement same interface

def get_tts_provider(config: dict):
    provider = config["tts"]["provider"]
    if provider == "piper":
        return PiperTTSProvider(model_dir="./models/tts")
    elif provider == "elevenlabs":
        return ElevenLabsTTSProvider(api_key=config["elevenlabs"]["api_key"])
    elif provider == "openai":
        return OpenAITTSProvider(api_key=config["openai"]["api_key"])
    else:
        raise ValueError(f"Unknown TTS provider: {provider}")
Each provider (openai_tts.py, coqui_tts.py, etc.) is its own file implementing the same TTSProvider interface. Full implementations are in the repo under backend/.
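As the provider list grows, the if/elif chain becomes the one place every new provider has to touch. A registry is one alternative. This is a sketch under assumptions, not the project's factory: `register_tts` and `TTS_REGISTRY` are hypothetical names, and the two `Fake...` classes stand in for the real provider classes so the block runs on its own.

```python
from typing import Callable, Dict

# Maps a provider name from config.yml to a function that builds the provider
TTS_REGISTRY: Dict[str, Callable[[dict], object]] = {}


def register_tts(name: str):
    """Decorator that registers a builder function under a provider name."""
    def decorator(builder: Callable[[dict], object]):
        TTS_REGISTRY[name] = builder
        return builder
    return decorator


class FakePiper:
    def __init__(self, model_dir: str):
        self.model_dir = model_dir


class FakeElevenLabs:
    def __init__(self, api_key: str):
        self.api_key = api_key


@register_tts("piper")
def _build_piper(config: dict):
    return FakePiper(model_dir="./models/tts")


@register_tts("elevenlabs")
def _build_elevenlabs(config: dict):
    return FakeElevenLabs(api_key=config["elevenlabs"]["api_key"])


def get_tts_provider(config: dict):
    name = config["tts"]["provider"]
    try:
        builder = TTS_REGISTRY[name]
    except KeyError:
        known = ", ".join(sorted(TTS_REGISTRY))
        raise ValueError(f"Unknown TTS provider: {name} (known: {known})") from None
    return builder(config)
```

With this shape, adding a provider means adding one decorated builder next to its class; `get_tts_provider` itself never changes.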
Step 3: Real-Time Audio Engine (FastAPI + Redis) {#step-3-audio-engine}
This is where latency lives or dies. WebSocket for bidirectional streaming, Redis to buffer between handler and pipeline.
# backend/app/main.py
import json
import time
import uuid

import redis.asyncio as aioredis
import yaml
from fastapi import FastAPI, WebSocket, WebSocketDisconnect

from agents.conversation import build_conversation_graph
from voice.base import STTConfig, TTSConfig
from voice.factory import get_tts_provider, get_stt_provider

app = FastAPI()

with open("config.yml") as f:
    config = yaml.safe_load(f)

tts = get_tts_provider(config)
stt = get_stt_provider(config)
# Providers expect the dataclasses from voice/base.py, not raw dicts
tts_config = TTSConfig(**config["tts"])
stt_config = STTConfig(**config["stt"])
redis_client = aioredis.Redis(host="localhost", port=6379, decode_responses=False)
agent_graph = build_conversation_graph(redis_client)

@app.websocket("/voice")
async def voice_endpoint(websocket: WebSocket):
    await websocket.accept()
    # id(websocket) can be reused after a disconnect; a UUID keeps threads distinct
    session_id = str(uuid.uuid4())
    try:
        while True:
            audio_data = await websocket.receive_bytes()
            t_start = time.perf_counter()
            # STT → LangGraph agent → TTS
            transcript = await stt.transcribe(audio_data, stt_config)
            result = await agent_graph.ainvoke(
                {"messages": [{"role": "user", "content": transcript}],
                 "session_id": session_id,
                 "last_action": ""},
                config={"configurable": {"thread_id": session_id}}
            )
            response_text = result["messages"][-1]["content"]
            audio_response = await tts.synthesize(response_text, tts_config)
            latency_ms = (time.perf_counter() - t_start) * 1000
            await websocket.send_bytes(audio_response)
            await websocket.send_text(json.dumps({
                "transcript": transcript,
                "latency_ms": round(latency_ms, 1)
            }))
    except WebSocketDisconnect:
        pass
Why Redis and not in-memory queues? I tried in-memory first. Under any real load, frames dropped. Redis adds ~3ms of latency and makes the system production-stable under concurrent sessions.
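The listing above does not show the buffering itself, so here is a minimal sketch of what a Redis-backed frame queue between the handler and the pipeline could look like. It is an assumption about the shape, not the project's code: the `audio:{session_id}` key format and both function names are hypothetical. The functions only need a client with async `rpush` and `blpop`, which `redis.asyncio.Redis` provides; `InMemoryClient` is a stand-in so the sketch runs without a Redis server.

```python
import asyncio
from typing import List, Optional


def frame_key(session_id: str) -> str:
    # Hypothetical key layout: one Redis list per session
    return f"audio:{session_id}"


async def enqueue_frame(client, session_id: str, frame: bytes) -> None:
    """WebSocket handler side: push the frame and return immediately."""
    await client.rpush(frame_key(session_id), frame)


async def dequeue_frame(client, session_id: str, timeout: int = 1) -> Optional[bytes]:
    """Pipeline side: wait for the next frame, or return None on timeout."""
    item = await client.blpop([frame_key(session_id)], timeout=timeout)
    if item is None:
        return None
    _key, frame = item  # blpop returns a (key, value) pair
    return frame


class InMemoryClient:
    """Stand-in exposing the same two methods as the Redis client."""

    def __init__(self) -> None:
        self.lists = {}

    async def rpush(self, key, *values):
        self.lists.setdefault(key, []).extend(values)
        return len(self.lists[key])

    async def blpop(self, keys, timeout=0):
        for key in keys:
            if self.lists.get(key):
                return key, self.lists[key].pop(0)
        return None  # a real client would block for up to `timeout` seconds first


async def demo() -> List[Optional[bytes]]:
    client = InMemoryClient()
    await enqueue_frame(client, "s1", b"frame-1")
    await enqueue_frame(client, "s1", b"frame-2")
    return [await dequeue_frame(client, "s1") for _ in range(3)]


frames = asyncio.run(demo())
```

Frames come back in arrival order, and the handler never waits on STT or TTS before accepting the next frame, which is the decoupling the paragraph above describes.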
Step 4: LangGraph for Stateful Memory {#step-4-langgraph}
Without this, every turn is a fresh conversation. With this, the agent remembers context across the entire session.
# agents/conversation.py
import operator
from typing import Annotated, List, Literal, TypedDict

from langgraph.graph import StateGraph, END
# Import path depends on the installed checkpointer package; adjust if yours differs
# (the langgraph-checkpoint-redis package, for example, uses langgraph.checkpoint.redis.aio)
from langgraph.checkpoint.aioredis import AsyncRedisSaver

class ConversationState(TypedDict):
    # operator.add appends each turn's messages to the checkpointed history
    # instead of overwriting it with the latest input
    messages: Annotated[List[dict], operator.add]
    session_id: str
    last_action: str

async def understand_intent(state: ConversationState) -> dict:
    # Classify the user's intent: respond, or act (e.g. call an API)
    last_msg = state["messages"][-1]["content"].lower()
    action = "act" if any(w in last_msg for w in ["book", "schedule", "find", "search"]) else "respond"
    return {"last_action": action}

async def generate_response(state: ConversationState) -> dict:
    # Your LLM call here (Groq recommended for lowest latency)
    # groq_client.chat.completions.create(...)
    reply = {"role": "assistant", "content": "Response from LLM"}
    return {"messages": [reply]}  # return only the new message; the reducer appends it

async def execute_action(state: ConversationState) -> dict:
    # Call external APIs, query a DB, book appointments, etc.
    result = {"role": "tool", "content": "Action completed"}
    return {"messages": [result]}

def route_intent(state: ConversationState) -> Literal["respond", "act"]:
    return state["last_action"] if state["last_action"] in ("respond", "act") else "respond"

def build_conversation_graph(redis_client):
    checkpointer = AsyncRedisSaver(redis_client)  # memory persists in Redis
    graph = StateGraph(ConversationState)
    graph.add_node("understand", understand_intent)
    graph.add_node("respond", generate_response)
    graph.add_node("act", execute_action)
    graph.add_conditional_edges("understand", route_intent,
                                {"respond": "respond", "act": "act"})
    graph.add_edge("act", "respond")
    graph.add_edge("respond", END)
    graph.set_entry_point("understand")
    return graph.compile(checkpointer=checkpointer)
The agent can step out of conversation (book an appointment, query a database) and step back in with the result — all while keeping full context via Redis checkpointing.
Step 5: How We Hit 442ms {#step-5-442ms}
Here's the full latency breakdown:
| Stage | Time | How |
|---|---|---|
| Audio receive + Redis buffer | ~10ms | Avoid disk I/O entirely |
| Whisper STT (local) | ~150ms | base model, not medium |
| LLM response via Groq | ~180ms | Groq LPU — fastest available inference |
| Piper TTS synthesis | ~80ms | ONNX model + GPU-accelerated ops |
| WebSocket send | ~22ms | Compressed chunks, not raw WAV |
| Total | ~442ms | Sum of the five stages |
Two decisions that mattered most:
1. Use Groq for LLM inference. Their LPU hardware is 10-20x faster than standard GPU inference for this workload. The difference between Groq and a typical cloud LLM endpoint is ~300ms on average. That's the gap between feeling robotic and feeling conversational.
2. base Whisper model, not medium. Accuracy drops ~4% but latency drops ~110ms. For real-time conversation, that trade is correct 100% of the time. You can always run medium for post-processing transcripts offline.
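To reproduce a breakdown like the one in the table, each stage needs its own timer rather than a single end-to-end number. The following is a small sketch of a per-stage timer; `StageTimer` and the stage names are hypothetical, and `BUDGET_MS` simply restates the figures from the table.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator

# Stage budget in milliseconds, taken from the latency table
BUDGET_MS = {"receive": 10, "stt": 150, "llm": 180, "tts": 80, "send": 22}


class StageTimer:
    """Accumulates wall-clock time per named pipeline stage."""

    def __init__(self) -> None:
        self.stages_ms: Dict[str, float] = {}

    @contextmanager
    def stage(self, name: str) -> Iterator[None]:
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed = (time.perf_counter() - start) * 1000
            self.stages_ms[name] = self.stages_ms.get(name, 0.0) + elapsed

    @property
    def total_ms(self) -> float:
        return sum(self.stages_ms.values())

    def slowest(self) -> str:
        return max(self.stages_ms, key=self.stages_ms.get)

    def over_budget(self, budget: Dict[str, float]) -> Dict[str, float]:
        """Return each stage that exceeded its budget, with the overshoot in ms."""
        return {
            name: ms - budget[name]
            for name, ms in self.stages_ms.items()
            if name in budget and ms > budget[name]
        }
```

In the WebSocket loop this would wrap each call, for example `with timer.stage("stt"): ...`, and `timer.stages_ms` could be sent alongside `latency_ms` so regressions point at a specific stage.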
Project Structure {#project-structure}
ai-voice-platform/
├── backend/
│ ├── app/
│ │ └── main.py # FastAPI + WebSocket engine
│ ├── tts.py # Piper TTS integration
│ └── llm/ # LLM support
├── frontend/ # React + Vite UI
├── models/
│ └── tts/ # Piper ONNX voice models (download separately)
├── voice_assets/
├── docs/
│ ├── PRD.md
│ └── TRD.md
├── requirements.txt
├── download_models.py # Downloads voice models from Hugging Face
├── start_backend.bat # Windows: backend only
└── start_all.bat # Windows: full stack
Running It {#running-it}
# Clone
git clone https://github.com/mithivoices/ai-voice-platform
cd ai-voice-platform
# Python dependencies
pip install -r requirements.txt
# Frontend dependencies
cd frontend && npm install && cd ..
# Download voice models (~570MB — NOT included in repo)
python download_models.py
# Windows — start everything:
start_all.bat
# Linux/Mac — run in two terminals:
# Terminal 1:
python -m uvicorn backend.app.main:app --port 8000
# Terminal 2:
cd frontend && npm run dev
Frontend: http://localhost:5173 · API: http://localhost:8000
Available endpoints:
| Endpoint | Method | What it does |
|---|---|---|
| /health | GET | Server status |
| /api/voices | GET | List available voices |
| /api/languages | GET | List supported languages |
| /api/tts/generate | POST | Generate audio |
| /voice | WebSocket | Real-time bidirectional audio |
What to Build Next {#what-to-build-next}
Once you have it running:
- Add a Hindi voice agent: set language: hi_IN in config; the Hindi model downloads automatically
- Offline deployment: Piper + Whisper run locally, no internet required. Runs on Raspberry Pi 5.
- Streaming TTS: stream audio chunks as they're synthesized instead of waiting for the full response
- Add an "act" node: wire the LangGraph agent to a calendar API, database, or booking system
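For the streaming TTS item, a simple first step is sentence-level streaming: split the reply into sentences and send each one's audio as soon as it is ready. This is coarser than true chunk-level streaming inside the synthesizer, and the sketch below is only an illustration. `split_sentences` and `stream_tts` are hypothetical names, and `synthesize` is assumed to be any async callable that turns text into audio bytes.

```python
import asyncio
import re
from typing import AsyncIterator, Awaitable, Callable, List

# Split after sentence-ending punctuation; "।" is the danda used in Hindi and Marathi
_SENTENCE_END = re.compile(r"(?<=[.!?।])\s+")


def split_sentences(text: str) -> List[str]:
    return [part.strip() for part in _SENTENCE_END.split(text) if part.strip()]


async def stream_tts(
    text: str,
    synthesize: Callable[[str], Awaitable[bytes]],
) -> AsyncIterator[bytes]:
    """Yield audio one sentence at a time so playback can start early."""
    for sentence in split_sentences(text):
        yield await synthesize(sentence)


async def demo() -> List[bytes]:
    async def fake_synthesize(sentence: str) -> bytes:
        return sentence.encode()  # stand-in for a real provider call

    return [chunk async for chunk in stream_tts("One. Two.", fake_synthesize)]


chunks = asyncio.run(demo())
```

The splitter is naive (abbreviations such as "Dr." will trigger a split), so treat it as a starting point. The gain is that time-to-first-audio depends on the first sentence rather than on the whole reply.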
Three Things I'd Tell Myself Before Starting {#three-things}
Abstract first, implement second. I built Piper directly into the WebSocket handler on my first attempt. Ripping it out to add the abstraction layer cost 3 days. Build the interface before any implementation.
Latency compounds. Every 50ms saved in one stage doesn't just save 50ms — it changes the feel of the entire conversation. The difference between 600ms and 442ms is the difference between robotic and real.
Redis is non-negotiable for audio pipelines. In-memory queues work fine in testing. Under concurrent sessions, they drop frames. Redis adds ~3ms and makes the system production-stable.
Full source, PRD, and TRD docs at github.com/mithivoices/ai-voice-platform.
If you build something with it (an IVR system, a Hindi voice assistant, an offline kiosk), drop it in the comments. I want to see what people actually ship with it.
What's the hardest latency bottleneck you've hit in a voice AI pipeline?
This article was written with AI assistance and reviewed for technical accuracy by Aryan Panwar, who built and shipped Mithivoices.
📌 Full case study and project breakdown at aryanpanwar.in