Aryan Panwar

Posted on Jun 10 • Originally published at aryanpanwar.in

I built a voice AI platform that hits 442ms latency. Here's the full architecture.

#ai #python #opensource #tutorial

Most voice AI tutorials end at "call the ElevenLabs API."

That's not a platform. That's a demo that breaks the moment ElevenLabs changes pricing.

I spent 30 days building Mithivoices, an open-source TTS/STT platform with 19+ neural voices, 8 languages (including Hindi, Malayalam, and Marathi), and 442ms end-to-end latency. The key design decision: swap between Piper, ElevenLabs, OpenAI, or Coqui with one config change and no changes to application code.

This is the full architecture. Copy what you need.

The Real Problem With Cloud-Only Voice AI
What We're Building
Step 1: The Abstraction Layer
Step 2: The Config System
Step 3: Real-Time Audio Engine
Step 4: LangGraph for Stateful Memory
Step 5: How We Hit 442ms
Project Structure
Running It
What to Build Next
Three Things I'd Tell Myself

The Real Problem With Cloud-Only Voice AI {#the-real-problem}

Every production voice system hits the same wall eventually:

| Problem | What breaks |
|---|---|
| ElevenLabs raises prices 3x | Your unit economics collapse overnight |
| OpenAI changes the Whisper endpoint | Your STT pipeline goes down |
| You need Hindi or Malayalam support | Your English-first provider fails you |
| Client wants offline deployment | Cloud-first architecture is unusable |

The fix isn't to pick the "best" provider. It's to build an abstraction layer that makes the provider irrelevant.

What We're Building {#what-were-building}

A provider-agnostic voice AI platform:

TTS: Piper TTS locally by default → swap to ElevenLabs/OpenAI via config
STT: Whisper STT locally by default → swap to cloud Whisper/Deepgram via config
Orchestration: FastAPI + Redis WebSocket engine for real-time bidirectional audio
Memory: LangGraph stateful agents — conversational context persists across turns
Target: Sub-500ms end-to-end latency (we hit 442ms)

Step 1: The Abstraction Layer (Most Important Part) {#step-1-abstraction-layer}

Build this first. Everything else plugs into it.

# voice/base.py
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Optional

@dataclass
class TTSConfig:
    provider: str  # "piper" | "elevenlabs" | "openai" | "coqui"
    voice_id: str
    language: str
    speed: float = 1.0

@dataclass
class STTConfig:
    provider: str  # "whisper_local" | "whisper_api" | "deepgram"
    model_size: str = "base"
    language: Optional[str] = None

class TTSProvider(ABC):
    @abstractmethod
    async def synthesize(self, text: str, config: TTSConfig) -> bytes:
        """Returns raw audio bytes (WAV format)"""
        pass

class STTProvider(ABC):
    @abstractmethod
    async def transcribe(self, audio_bytes: bytes, config: STTConfig) -> str:
        """Returns transcribed text"""
        pass

Now implement Piper as your local default:

# voice/providers/piper_tts.py
import asyncio
import tempfile
import os
from voice.base import TTSProvider, TTSConfig

class PiperTTSProvider(TTSProvider):
    def __init__(self, model_dir: str = "./models/tts"):
        self.model_dir = model_dir

    async def synthesize(self, text: str, config: TTSConfig) -> bytes:
        model_path = f"{self.model_dir}/{config.language}/{config.voice_id}.onnx"

        with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp:
            output_path = tmp.name

        try:
            cmd = [
                "piper",
                "--model", model_path,
                "--output_file", output_path,
                "--sentence_silence", "0.1"
            ]
            proc = await asyncio.create_subprocess_exec(
                *cmd,
                stdin=asyncio.subprocess.PIPE,
                stdout=asyncio.subprocess.PIPE,
                stderr=asyncio.subprocess.PIPE
            )
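            _, stderr = await proc.communicate(input=text.encode("utf-8"))
            # Fail loudly if piper exits with an error instead of returning an empty file
            if proc.returncode != 0:
                raise RuntimeError(f"piper failed: {stderr.decode(errors='replace')}")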

            with open(output_path, "rb") as f:
                return f.read()
        finally:
            os.unlink(output_path)

And ElevenLabs as a drop-in cloud swap — same interface, zero code change in your app:

# voice/providers/elevenlabs_tts.py
import httpx
from voice.base import TTSProvider, TTSConfig

class ElevenLabsTTSProvider(TTSProvider):
    def __init__(self, api_key: str):
        self.api_key = api_key
        self.base_url = "https://api.elevenlabs.io/v1"

    async def synthesize(self, text: str, config: TTSConfig) -> bytes:
        async with httpx.AsyncClient() as client:
            response = await client.post(
                f"{self.base_url}/text-to-speech/{config.voice_id}",
                headers={"xi-api-key": self.api_key},
                json={
                    "text": text,
                    "model_id": "eleven_multilingual_v2",
                    "voice_settings": {"speed": config.speed}
                }
            )
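            # Surface 4xx/5xx responses instead of returning an error body as audio
            response.raise_for_status()
            # Note: this endpoint returns MP3 by default, so decode it or request
            # another output format if callers rely on the WAV contract in base.py
            return response.content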
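
Because every provider implements the same `synthesize` coroutine, a test double can stand in for Piper or ElevenLabs in unit tests. The sketch below is illustrative only: `SilenceTTSProvider` and `speak` are hypothetical names, not part of the repo, and the interface is redeclared so the snippet runs standalone with just the standard library.

```python
import asyncio
import io
import wave
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class TTSConfig:
    provider: str
    voice_id: str
    language: str
    speed: float = 1.0


class TTSProvider(ABC):
    @abstractmethod
    async def synthesize(self, text: str, config: TTSConfig) -> bytes:
        """Returns raw audio bytes (WAV format)"""


class SilenceTTSProvider(TTSProvider):
    """Hypothetical test double: returns silent WAV audio sized to the text."""

    def __init__(self, sample_rate: int = 16000, ms_per_char: int = 50):
        self.sample_rate = sample_rate
        self.ms_per_char = ms_per_char

    async def synthesize(self, text: str, config: TTSConfig) -> bytes:
        # Faster speech (speed > 1) yields proportionally fewer frames.
        n_frames = int(
            len(text) * self.ms_per_char * self.sample_rate / (1000 * config.speed)
        )
        buf = io.BytesIO()
        with wave.open(buf, "wb") as wav:
            wav.setnchannels(1)   # mono
            wav.setsampwidth(2)   # 16-bit samples
            wav.setframerate(self.sample_rate)
            wav.writeframes(b"\x00\x00" * n_frames)
        return buf.getvalue()


async def speak(provider: TTSProvider, text: str, config: TTSConfig) -> bytes:
    # Application code depends only on the abstract interface.
    return await provider.synthesize(text, config)
```

Anything that accepts a `TTSProvider` can be exercised this way without downloading a voice model or spending API credits.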

Step 2: The Config System (One File to Rule Them All) {#step-2-config-system}

# config.yml — only file you edit to swap providers

tts:
  provider: piper           # swap to: elevenlabs | openai | coqui
  voice_id: en_US-lessac-medium
  language: en_US
  speed: 1.0

stt:
  provider: whisper_local   # swap to: whisper_api | deepgram
  model_size: base          # tiny | base | small | medium | large
  language: null            # null = auto-detect

elevenlabs:
  api_key: ${ELEVENLABS_API_KEY}

# 8 language packs supported
languages:
  - en_US
  - hi_IN    # Hindi
  - ml_IN    # Malayalam
  - mr_IN    # Marathi
  - ta_IN    # Tamil
  - te_IN    # Telugu
  - bn_IN    # Bengali
  - gu_IN    # Gujarati
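
One caveat with the `${ELEVENLABS_API_KEY}` placeholder: `yaml.safe_load` returns it as a literal string and does not read the environment. Here is a minimal sketch of one way to resolve it after loading, assuming the `${VAR}` syntax shown above. `expand_env` is a hypothetical helper, not something from the repo, and `raw_config` stands in for the parsed YAML.

```python
import os
import re

_ENV_PATTERN = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def expand_env(value):
    """Recursively replace ${VAR} placeholders in a parsed config."""
    if isinstance(value, dict):
        return {k: expand_env(v) for k, v in value.items()}
    if isinstance(value, list):
        return [expand_env(v) for v in value]
    if isinstance(value, str):
        def _lookup(match):
            name = match.group(1)
            if name not in os.environ:
                raise KeyError(f"Environment variable {name} is not set")
            return os.environ[name]
        return _ENV_PATTERN.sub(_lookup, value)
    return value  # numbers, booleans, and None pass through unchanged


# Stand-in for the dict that yaml.safe_load would return for config.yml
raw_config = {
    "tts": {"provider": "piper", "speed": 1.0},
    "elevenlabs": {"api_key": "${ELEVENLABS_API_KEY}"},
}
```

Calling `expand_env` on the loaded config once at startup means a missing key fails early rather than on the first cloud request.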

Provider factory — loads the right class based on config:

# voice/factory.py
import yaml
from voice.providers.piper_tts import PiperTTSProvider
from voice.providers.elevenlabs_tts import ElevenLabsTTSProvider
from voice.providers.openai_tts import OpenAITTSProvider  # implement same interface

def get_tts_provider(config: dict):
    provider = config["tts"]["provider"]
    if provider == "piper":
        return PiperTTSProvider(model_dir="./models/tts")
    elif provider == "elevenlabs":
        return ElevenLabsTTSProvider(api_key=config["elevenlabs"]["api_key"])
    elif provider == "openai":
        return OpenAITTSProvider(api_key=config["openai"]["api_key"])
    else:
        raise ValueError(f"Unknown TTS provider: {provider}")

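Each provider (openai_tts.py, coqui_tts.py, etc.) is its own file implementing the same TTSProvider interface, and each one you enable needs its own branch in the factory (the version above wires up three; coqui is not shown). The STT side follows the same pattern: a get_stt_provider function in the same module, which main.py imports in Step 3. Full implementations are in the repo under backend/.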
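
As the provider list grows, the `if`/`elif` chain can be replaced by a lookup table that serves both TTS and STT. The following is a sketch of that idea, not the repo's implementation: `PROVIDERS`, `register`, `build_provider`, and the `EchoProvider` stand-in are all hypothetical names.

```python
from typing import Callable, Dict, Tuple

# Maps (kind, name) -> a builder that takes the full config dict
PROVIDERS: Dict[Tuple[str, str], Callable[[dict], object]] = {}


def register(kind: str, name: str):
    """Decorator that registers a builder under a key such as ("tts", "piper")."""
    def decorator(builder: Callable[[dict], object]):
        PROVIDERS[(kind, name)] = builder
        return builder
    return decorator


def build_provider(kind: str, config: dict):
    """Look up config[kind]["provider"] and build the matching provider."""
    name = config[kind]["provider"]
    try:
        builder = PROVIDERS[(kind, name)]
    except KeyError:
        known = sorted(n for k, n in PROVIDERS if k == kind)
        raise ValueError(
            f"Unknown {kind} provider: {name!r} (known: {known})"
        ) from None
    return builder(config)


class EchoProvider:
    """Stand-in for a real provider class such as PiperTTSProvider."""
    def __init__(self, label: str):
        self.label = label


@register("tts", "piper")
def _build_piper(config: dict):
    return EchoProvider("piper")


@register("stt", "whisper_local")
def _build_whisper(config: dict):
    return EchoProvider(f"whisper-{config['stt']['model_size']}")
```

With this shape, adding a provider means adding one decorated builder next to its class, and the error message lists the valid names when a config value is mistyped.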

Step 3: Real-Time Audio Engine (FastAPI + Redis) {#step-3-audio-engine}

This is where latency lives or dies. WebSocket for bidirectional streaming, Redis to buffer between handler and pipeline.

# backend/app/main.py
import json
import time
import uuid
import redis.asyncio as aioredis
import yaml
from fastapi import FastAPI, WebSocket, WebSocketDisconnect
from voice.base import TTSConfig, STTConfig
from voice.factory import get_tts_provider, get_stt_provider
from agents.conversation import build_conversation_graph

app = FastAPI()

with open("config.yml") as f:
    config = yaml.safe_load(f)

# The providers expect the dataclasses from voice/base.py, not raw dicts
tts_config = TTSConfig(**config["tts"])
stt_config = STTConfig(**config["stt"])

tts = get_tts_provider(config)
stt = get_stt_provider(config)
redis_client = aioredis.Redis(host="localhost", port=6379, decode_responses=False)
agent_graph = build_conversation_graph(redis_client)

@app.websocket("/voice")
async def voice_endpoint(websocket: WebSocket):
    await websocket.accept()
    # id(websocket) can be reused once a connection closes; a UUID keeps sessions separate
    session_id = uuid.uuid4().hex

    try:
        while True:
            audio_data = await websocket.receive_bytes()
            t_start = time.perf_counter()

            # STT → LangGraph agent → TTS
            transcript = await stt.transcribe(audio_data, stt_config)
            result = await agent_graph.ainvoke(
                {"messages": [{"role": "user", "content": transcript}],
                 "session_id": session_id,
                 "last_action": ""},
                config={"configurable": {"thread_id": session_id}}
            )
            response_text = result["messages"][-1]["content"]
            audio_response = await tts.synthesize(response_text, tts_config)

            latency_ms = (time.perf_counter() - t_start) * 1000

            await websocket.send_bytes(audio_response)
            await websocket.send_text(json.dumps({
                "transcript": transcript,
                "latency_ms": round(latency_ms, 1)
            }))

    except WebSocketDisconnect:
        pass

Why Redis and not in-memory queues? I tried in-memory first. Under any real load, frames dropped. Redis adds ~3ms of latency and makes the system production-stable under concurrent sessions.
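
The handler above does not show the buffering itself, so here is one way a Redis-backed frame buffer could look. This is a sketch under assumptions: the key layout and the `push_frame`/`pop_frame` helpers are hypothetical, and `client` is expected to behave like a `redis.asyncio.Redis` instance (only its `rpush` and `blpop` commands are used).

```python
from typing import Optional


def frame_key(session_id: str) -> str:
    # Hypothetical key layout: one Redis list per WebSocket session
    return f"audio:{session_id}:frames"


async def push_frame(client, session_id: str, frame: bytes) -> None:
    """Producer side: the WebSocket handler appends each incoming frame."""
    await client.rpush(frame_key(session_id), frame)


async def pop_frame(client, session_id: str, timeout: int = 1) -> Optional[bytes]:
    """Consumer side: wait for the next frame, or return None on timeout."""
    item = await client.blpop(frame_key(session_id), timeout=timeout)
    if item is None:  # timed out with nothing queued
        return None
    _key, frame = item  # BLPOP returns a (key, value) pair
    return frame
```

Pairing `rpush` with `blpop` gives first-in, first-out ordering per session, and the frames outlive a crashed worker because they live in Redis rather than in process memory.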

Step 4: LangGraph for Stateful Memory {#step-4-langgraph}

Without this, every turn is a fresh conversation. With this, the agent remembers context across the entire session.

# agents/conversation.py
import operator
from typing import Annotated, List, Literal, TypedDict

from langgraph.graph import StateGraph, END
# Provided by the langgraph-checkpoint-redis package; verify the path for your installed version
from langgraph.checkpoint.redis.aio import AsyncRedisSaver

class ConversationState(TypedDict):
    # operator.add appends each turn's messages to the checkpointed history;
    # without a reducer, every new input would overwrite the previous turns
    messages: Annotated[List[dict], operator.add]
    session_id: str
    last_action: str

async def understand_intent(state: ConversationState) -> dict:
    # Classify the user's intent: respond or act (e.g. call an API)
    last_msg = state["messages"][-1]["content"].lower()
    action = "act" if any(w in last_msg for w in ["book", "schedule", "find", "search"]) else "respond"
    return {"last_action": action}

async def generate_response(state: ConversationState) -> dict:
    # Your LLM call here (Groq recommended for lowest latency)
    # groq_client.chat.completions.create(...)
    reply = {"role": "assistant", "content": "Response from LLM"}
    return {"messages": [reply]}  # only the new message; the reducer appends it

async def execute_action(state: ConversationState) -> dict:
    # Call external APIs, query a DB, book appointments, etc.
    result = {"role": "tool", "content": "Action completed"}
    return {"messages": [result]}

def route_intent(state: ConversationState) -> Literal["respond", "act"]:
    return state["last_action"] if state["last_action"] in ("respond", "act") else "respond"

def build_conversation_graph(redis_client):
    # Memory persists in Redis. Depending on the saver version, you may need to
    # run `await checkpointer.asetup()` once at startup before the first turn.
    checkpointer = AsyncRedisSaver(redis_client=redis_client)
    graph = StateGraph(ConversationState)

    graph.add_node("understand", understand_intent)
    graph.add_node("respond", generate_response)
    graph.add_node("act", execute_action)

    graph.add_conditional_edges("understand", route_intent,
                                {"respond": "respond", "act": "act"})
    graph.add_edge("act", "respond")
    graph.add_edge("respond", END)
    graph.set_entry_point("understand")

    return graph.compile(checkpointer=checkpointer)

The agent can step out of conversation (book an appointment, query a database) and step back in with the result — all while keeping full context via Redis checkpointing.
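
One thing to watch in `understand_intent`: the substring check also fires on words that merely contain a keyword (for example, "book" inside "Facebook"). Here is a small sketch of a stricter variant using word boundaries. `classify_intent` and `ACTION_WORDS` are illustrative names, and the keyword list is the one from the node above.

```python
import re
from typing import Literal

ACTION_WORDS = ("book", "schedule", "find", "search")
# \b anchors each keyword to word boundaries so "facebook" does not match "book"
_ACTION_RE = re.compile(r"\b(" + "|".join(ACTION_WORDS) + r")\b", re.IGNORECASE)


def classify_intent(text: str) -> Literal["respond", "act"]:
    """Return "act" when the utterance contains an action keyword as a whole word."""
    return "act" if _ACTION_RE.search(text) else "respond"
```

The trade-off is that inflected forms such as "booking" no longer match, so either extend the keyword list or hand classification to the LLM once keyword rules stop scaling.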

Step 5: How We Hit 442ms {#step-5-442ms}

Here's the full latency breakdown:

| Stage | Time | How |
|---|---|---|
| Audio receive + Redis buffer | ~10ms | Avoid disk I/O entirely |
| Whisper STT (local) | ~150ms | `base` model, not `medium` |
| LLM response via Groq | ~180ms | Groq LPU, fastest available inference |
| Piper TTS synthesis | ~80ms | ONNX model + GPU-accelerated ops |
| WebSocket send | ~22ms | Compressed chunks, not raw WAV |
| Total | ~442ms | |
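
A quick check of the arithmetic against the 500ms target, using only the stage figures from the breakdown above (the dictionary keys are just labels chosen for this snippet):

```python
# Per-stage latencies in milliseconds, copied from the breakdown above
STAGES_MS = {
    "audio_receive_redis_buffer": 10,
    "whisper_stt_local": 150,
    "llm_groq": 180,
    "piper_tts": 80,
    "websocket_send": 22,
}
TARGET_MS = 500

total_ms = sum(STAGES_MS.values())
headroom_ms = TARGET_MS - total_ms
# Share of the total spent in each stage, as a percentage
shares = {name: round(100 * ms / total_ms, 1) for name, ms in STAGES_MS.items()}
slowest = max(shares, key=shares.get)

print(f"total={total_ms}ms headroom={headroom_ms}ms slowest={slowest}")
```

STT and the LLM call together account for 330 of the 442ms, roughly three quarters of the budget, which is why those two stages are where tuning pays off most.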

Two decisions that mattered most:

1. Use Groq for LLM inference. Their LPU hardware is 10-20x faster than standard GPU inference for this workload. The difference between Groq and a typical cloud LLM endpoint is ~300ms on average. That's the gap between feeling robotic and feeling conversational.

2. Use the `base` Whisper model, not `medium`. Accuracy drops ~4% but latency drops ~110ms. For real-time conversation, that trade is worth making every time. You can always run `medium` offline to post-process transcripts.
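
To reproduce a breakdown like this on your own hardware, each stage can be timed separately instead of measuring only the end-to-end total. Here is a minimal sketch using `time.perf_counter`. `StageTimer` is a hypothetical helper, not part of the repo, and the `time.sleep` calls stand in for the real STT and TTS awaits.

```python
import time
from contextlib import contextmanager
from typing import Dict


class StageTimer:
    """Collects wall-clock time per named pipeline stage, in milliseconds."""

    def __init__(self) -> None:
        self.stages_ms: Dict[str, float] = {}

    @contextmanager
    def stage(self, name: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Accumulate so a stage entered twice reports its combined time
            elapsed = (time.perf_counter() - start) * 1000
            self.stages_ms[name] = self.stages_ms.get(name, 0.0) + elapsed

    @property
    def total_ms(self) -> float:
        return sum(self.stages_ms.values())


timer = StageTimer()
with timer.stage("stt"):
    time.sleep(0.01)    # stand-in for: await stt.transcribe(...)
with timer.stage("tts"):
    time.sleep(0.005)   # stand-in for: await tts.synthesize(...)
```

A plain `with` block works around `await` calls inside an async handler too, so the same timer can wrap each step of the WebSocket loop and its `stages_ms` dict can be sent back alongside `latency_ms`.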

Project Structure {#project-structure}

ai-voice-platform/
├── backend/
│   ├── app/
│   │   └── main.py           # FastAPI + WebSocket engine
│   ├── tts.py                # Piper TTS integration
│   └── llm/                  # LLM support
├── frontend/                 # React + Vite UI
├── models/
│   └── tts/                  # Piper ONNX voice models (download separately)
├── voice_assets/
├── docs/
│   ├── PRD.md
│   └── TRD.md
├── requirements.txt
├── download_models.py        # Downloads voice models from Hugging Face
├── start_backend.bat         # Windows: backend only
└── start_all.bat             # Windows: full stack

Running It {#running-it}

# Clone
git clone https://github.com/mithivoices/ai-voice-platform
cd ai-voice-platform

# Python dependencies
pip install -r requirements.txt

# Frontend dependencies
cd frontend && npm install && cd ..

# Download voice models (~570MB — NOT included in repo)
python download_models.py

# Windows — start everything:
start_all.bat

# Linux/Mac — run in two terminals:
# Terminal 1:
python -m uvicorn backend.app.main:app --port 8000
# Terminal 2:
cd frontend && npm run dev

Frontend: http://localhost:5173 · API: http://localhost:8000

Available endpoints:

| Endpoint | Method | What it does |
|---|---|---|
| `/health` | GET | Server status |
| `/api/voices` | GET | List available voices |
| `/api/languages` | GET | List supported languages |
| `/api/tts/generate` | POST | Generate audio |
| `/voice` | WebSocket | Real-time bidirectional audio |

What to Build Next {#what-to-build-next}

Once you have it running:

Add a Hindi voice agent — change language: hi_IN in config, Hindi model downloads automatically
Offline deployment — Piper + Whisper run locally, no internet required. Runs on Raspberry Pi 5.
Streaming TTS — stream audio chunks as they're synthesized instead of waiting for full response
Add an "act" node — wire the LangGraph agent to a calendar API, database, or booking system
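
For the streaming TTS item, one common approach is to split the reply into sentences and synthesize them one at a time, so the first audio chunk can go out while later sentences are still being generated. Here is a sketch of that idea, assuming a provider-style `synthesize` coroutine. `split_sentences`, `stream_tts`, `collect`, and the fake synthesizer are illustrative names.

```python
import asyncio
import re
from typing import AsyncIterator, Awaitable, Callable, List


def split_sentences(text: str) -> List[str]:
    """Split after ., !, ? or the Devanagari danda when followed by whitespace."""
    parts = re.split(r"(?<=[.!?\u0964])\s+", text.strip())
    return [p for p in parts if p]


async def stream_tts(
    text: str, synthesize: Callable[[str], Awaitable[bytes]]
) -> AsyncIterator[bytes]:
    """Yield one audio chunk per sentence as soon as it is synthesized."""
    for sentence in split_sentences(text):
        yield await synthesize(sentence)


async def fake_synthesize(sentence: str) -> bytes:
    # Stand-in for a real call such as tts.synthesize(sentence, tts_config)
    await asyncio.sleep(0)
    return sentence.encode("utf-8")


async def collect(text: str) -> List[bytes]:
    return [chunk async for chunk in stream_tts(text, fake_synthesize)]
```

In the WebSocket handler this would become an `async for` loop that sends each chunk as it arrives. Note that a provider returning complete WAV files will produce one header per chunk, so the client has to play them as separate clips or the server has to strip the headers.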

Three Things I'd Tell Myself Before Starting {#three-things}

Abstract first, implement second. I built Piper directly into the WebSocket handler on my first attempt. Ripping it out to add the abstraction layer cost 3 days. Build the interface before any implementation.

Latency compounds. Every 50ms saved in one stage doesn't just save 50ms — it changes the feel of the entire conversation. The difference between 600ms and 442ms is the difference between robotic and real.

Redis is non-negotiable for audio pipelines. In-memory queues work fine in testing. Under concurrent sessions, they drop frames. Redis adds ~3ms and makes the system production-stable.

Full source, PRD, and TRD docs at github.com/mithivoices/ai-voice-platform.

If you build something with it (an IVR system, a Hindi voice assistant, an offline kiosk), drop it in the comments. I want to see what people actually ship with it.

What's the hardest latency bottleneck you've hit in a voice AI pipeline?

This article was written with AI assistance and reviewed for technical accuracy by Aryan Panwar, who built and shipped Mithivoices.

📌 Full case study and project breakdown at aryanpanwar.in
