We built a production voice AI platform that handles inbound calls for businesses — answering phones, booking appointments, qualifying leads, and pushing structured data into CRMs. Not a demo. Not a weekend hack. A multi-tenant platform serving real customers who get angry when calls drop.
This is what we learned.
## The Problem with Existing Platforms
The hosted voice AI platforms — Retell, Vapi, Bland, and others — solve a real bootstrapping problem. You can get a voice agent on a phone number in an afternoon. But the moment you need production-grade control, the walls close in.
Per-minute pricing at $0.07–0.15/min eats your margins alive when you're building a SaaS on top. You're locked into their prompt formats, their latency characteristics, their integration limitations. When something breaks at 2am, you're filing a support ticket instead of reading a stack trace.
We wanted three things: full control over the voice pipeline latency, the ability to plug into any CRM without waiting on someone else's roadmap, and unit economics that let us build a real business on top. So we built the platform ourselves.
## Architecture Overview
The system separates into two layers that communicate over internal APIs:
```
┌─────────────────────────────────────────────────────┐
│                    ORCHESTRATOR                     │
│  (Billing, Auth, CRM Adapters, Client Management)   │
│         Node.js / Express / MySQL / Stripe          │
└──────────────────────┬──────────────────────────────┘
                       │ REST + Webhooks
┌──────────────────────▼──────────────────────────────┐
│                    VOICE ENGINE                     │
│                                                     │
│  ┌─────────┐    ┌─────────┐    ┌─────────┐          │
│  │   SIP   │    │ WebRTC  │    │  PSTN   │          │
│  │  Trunk  │◄──►│ Gateway │    │ Bridge  │          │
│  └────┬────┘    └────┬────┘    └────┬────┘          │
│       │              │              │               │
│       └──────────────┼──────────────┘               │
│                      ▼                              │
│  ┌──────────────────────────────────────────┐       │
│  │            STREAMING PIPELINE            │       │
│  │                                          │       │
│  │  Audio In ──► STT ──► LLM ──► TTS ──►    │       │
│  │  Audio Out                               │       │
│  │                                          │       │
│  │  [Barge-in detector]  [Buffer manager]   │       │
│  └──────────────────────────────────────────┘       │
│                                                     │
│            Python / asyncio / SIP stack             │
└─────────────────────────────────────────────────────┘
```
The key architectural decision was this separation. The voice engine knows nothing about billing, CRM integrations, or client management. It handles calls, streams audio, manages the STT-LLM-TTS pipeline, and fires webhooks when things happen. The orchestrator handles everything else.
This separation means the engine can evolve independently. We can swap TTS providers, change STT models, or rearchitect the audio pipeline without touching billing code. It also means the orchestrator — a conventional Node.js app — handles all the "normal SaaS" concerns without being coupled to real-time audio processing.
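To make the boundary concrete, here is a sketch of the kind of event the engine emits when a call ends. The field names and URL are illustrative, not our actual schema; the point is that the payload carries call facts only, and the orchestrator owns everything downstream.

```python
import json

# Hypothetical shape of a call-completed webhook from the voice engine.
# The engine knows nothing about billing or CRMs — it just reports
# what happened; the orchestrator decides what to do with it.
event = {
    "type": "call.completed",
    "call_id": "c_123",
    "tenant_id": "t_456",
    "duration_sec": 184,
    "transcript_url": "https://engine.internal/transcripts/c_123",
}
payload = json.dumps(event)
```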
SIP trunking handles PSTN calls. WebRTC handles browser-based calls from our portal. Both feed into the same streaming pipeline.
## Latency Optimisation: Where the Milliseconds Go
Voice AI has one unforgiving constraint: if the pause between a caller finishing their sentence and the AI starting its response exceeds about 800ms, the conversation feels broken. People start saying "hello?" or talking over the agent. Our target was sub-600ms end-to-end.
Here's where the time budget goes:
### STT: Endpoint Detection Matters More Than Accuracy
We use streaming transcription — audio chunks flow to the STT provider continuously, and partial transcripts come back in real time. The critical tuning parameter isn't model accuracy. It's utterance_end_ms: how long the system waits after the caller stops speaking before it finalises the transcript and triggers the LLM.
Set it too low and you cut people off mid-sentence. Set it too high and you add hundreds of milliseconds of dead air. We settled on tuning this per-agent based on use case. A receptionist agent for a plumbing company gets a longer window than a booking confirmation flow.
Cost: ~$0.008/min. Latency contribution: 50–150ms depending on endpoint detection settings.
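The logic behind the parameter can be sketched in a few lines. Most streaming STT providers expose this as a connection setting rather than client code, so treat the following as an illustration of the mechanism, with illustrative threshold values rather than our production settings:

```python
class EndpointDetector:
    """Decide when an utterance has ended, based on trailing silence.

    Minimal sketch: track when speech was last heard, and finalise
    once the silence window exceeds utterance_end_ms.
    """

    def __init__(self, utterance_end_ms):
        self.utterance_end_ms = utterance_end_ms
        self.last_speech_at = None  # seconds, monotonic clock

    def on_partial_transcript(self, now):
        # Any new partial transcript means the caller is still speaking.
        self.last_speech_at = now

    def utterance_ended(self, now):
        # Finalise the transcript (and trigger the LLM) only after
        # the silence threshold has elapsed.
        if self.last_speech_at is None:
            return False
        return (now - self.last_speech_at) * 1000 >= self.utterance_end_ms

# Per-agent tuning: a receptionist gets a longer window than a
# yes/no confirmation flow (example values, not ours).
receptionist = EndpointDetector(utterance_end_ms=900)
confirmation = EndpointDetector(utterance_end_ms=400)
```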
### LLM: Time to First Token Is Everything
For voice, you don't care about tokens-per-second throughput. You care about TTFT — time to first token. That's the delay between sending your prompt and receiving the first token of the response. Everything after that streams.
We benchmarked extensively:
| Provider | TTFT (p50) | Notes |
|---|---|---|
| Groq (LPU) | ~350ms | Consistently fast, limited model selection |
| GPU-based inference (various) | 500–800ms | Varies wildly by load |
| Cerebras | ~200ms | Incredible raw speed, but US-only regions |
Cerebras looked phenomenal on paper. Then we measured from Sydney. The 150ms+ round-trip to US datacenters ate the advantage entirely. Geography matters when you're counting milliseconds.
We run Groq for the primary path with GPU-based fallback. Cost: ~$0.002/min.
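Measuring TTFT is straightforward if the provider SDK exposes the completion as an async iterator of text chunks (OpenAI-style streaming, which Groq also offers). A sketch of the benchmark harness, with a stand-in stream where the real provider call would go:

```python
import asyncio
import time

async def measure_ttft(token_stream):
    """Return (ttft_seconds, full_text) for an async token stream."""
    start = time.perf_counter()
    ttft = None
    parts = []
    async for token in token_stream:
        if ttft is None:
            # First token has arrived — this is the number that matters.
            ttft = time.perf_counter() - start
        parts.append(token)
    return ttft, "".join(parts)

# Usage with a stand-in stream; a real benchmark would iterate the
# provider's streaming completion instead.
async def fake_stream():
    await asyncio.sleep(0.05)  # simulated TTFT
    yield "Hello"
    yield ", world."

ttft, text = asyncio.run(measure_ttft(fake_stream()))
```

Run this from the same region as your voice engine — as the Cerebras example shows, a benchmark from the wrong geography measures the wrong thing.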
### TTS: The Streaming Trick
We evaluated eight TTS engines. The results were decisive:
| Engine | Latency to first audio | Quality | Notes |
|---|---|---|---|
| Commercial API A | ~100ms | Excellent | Streaming, good voice selection |
| Commercial API B | ~90ms | Good | Fast, limited voices |
| Self-hosted (GPU) | ~330ms | Fair | RunPod, required warm instances |
| Self-hosted (CPU) | ~2000ms | Fair | Unusable for real-time |
The insight that cut our end-to-end latency by 30%: start TTS on the first sentence boundary while the LLM is still generating. Don't wait for the complete response.
```python
# Pseudocode for the streaming pipeline
SENTENCE_ENDINGS = (".", "!", "?")

def ends_with_sentence_boundary(text):
    # Naive boundary check — production code also guards against
    # abbreviations ("Dr.", "St.") and decimal numbers.
    return text.rstrip().endswith(SENTENCE_ENDINGS)

async def handle_llm_stream(llm_response_stream, tts_engine):
    sentence_buffer = ""
    async for token in llm_response_stream:
        sentence_buffer += token
        if ends_with_sentence_boundary(sentence_buffer):
            # Fire TTS immediately on the completed sentence —
            # don't wait for the rest of the LLM response.
            await tts_engine.synthesize_streaming(sentence_buffer)
            sentence_buffer = ""
    # Flush any remaining partial sentence
    if sentence_buffer:
        await tts_engine.synthesize_streaming(sentence_buffer)
```
This means the caller hears the first sentence of the response while the LLM is still generating sentence two. The perceived latency drops dramatically.
### Barge-In: The Cancellable Buffer
When a caller interrupts, you need to stop playback immediately. This means the TTS output buffer must be cancellable — you can't just pipe audio to the SIP channel and forget about it. We maintain a reference to the current playback stream and clear it the moment the STT detects new speech during agent output.
Get this wrong and the agent talks over the caller. Get it right and the conversation feels natural.
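The cancellable-buffer idea maps naturally onto asyncio task cancellation. A minimal sketch, assuming a `send` callable that writes an audio frame to the telephony channel (the class and function names here are illustrative, not our engine's API):

```python
import asyncio

class PlaybackManager:
    """Holds a reference to the current TTS playback so it can be
    cancelled the instant the caller starts speaking again."""

    def __init__(self):
        self._current = None

    def start(self, coro):
        # Replace any in-flight playback with the new one.
        self.cancel()
        self._current = asyncio.ensure_future(coro)
        return self._current

    def cancel(self):
        # Called by the barge-in detector when STT reports new
        # speech while the agent is still talking.
        if self._current is not None and not self._current.done():
            self._current.cancel()
        self._current = None

async def play_audio(chunks, send):
    try:
        for chunk in chunks:
            send(chunk)
            await asyncio.sleep(0.02)  # pace playback at ~20ms frames
    except asyncio.CancelledError:
        # A real engine would also flush the SIP channel's
        # output buffer here before re-raising.
        raise
```

The important property is that cancellation stops frames mid-stream rather than letting an already-queued buffer drain into the call.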
Total pipeline latency (p50): ~500ms from end of caller speech to first audio response.
## CRM Integration: The Adapter Pattern
A voice AI agent that can't push data into your CRM is a parlour trick. The call ends, and then what? Someone reads a transcript and manually creates a job? That's not automation.
We built a provider-agnostic adapter layer:
```python
# Pseudocode — the adapter interface
from abc import ABC, abstractmethod

class CRMAdapter(ABC):
    @abstractmethod
    async def find_customer(self, phone, name) -> "Customer": ...

    @abstractmethod
    async def create_lead(self, lead_data) -> "Lead": ...

    @abstractmethod
    async def create_appointment(self, slot, customer) -> "Appointment": ...

    @abstractmethod
    async def create_job(self, description, customer) -> "Job": ...

# Each CRM gets one implementation file
class ServiceM8Adapter(CRMAdapter): ...
class FergusAdapter(CRMAdapter): ...
class XeroAdapter(CRMAdapter): ...
```
Post-call, the LLM runs a structured analysis pass over the transcript: extract caller intent, job type, urgency, contact details, preferred appointment times. That structured output feeds directly into whichever CRM the client has connected.
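A sketch of what that structured output might look like on the consuming side. The field names here are illustrative, not our exact schema, and a production pipeline would validate the model's JSON and retry on malformed output:

```python
import json
from dataclasses import dataclass
from typing import Optional

@dataclass
class CallAnalysis:
    """Fields the post-call analysis prompt asks the LLM to extract."""
    intent: str                    # e.g. "book_job", "quote_request", "other"
    job_type: Optional[str]
    urgency: str                   # "low" | "normal" | "emergency"
    caller_name: Optional[str]
    caller_phone: Optional[str]
    preferred_times: list

def parse_analysis(llm_json):
    # The analysis prompt instructs the model to answer in strict JSON.
    data = json.loads(llm_json)
    return CallAnalysis(
        intent=data.get("intent", "other"),
        job_type=data.get("job_type"),
        urgency=data.get("urgency", "normal"),
        caller_name=data.get("caller_name"),
        caller_phone=data.get("caller_phone"),
        preferred_times=data.get("preferred_times", []),
    )
```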
Adding a new CRM integration means writing one adapter file. No database schema changes, no new API routes, no changes to the voice engine. The orchestrator resolves the correct adapter at runtime based on the client's configuration.
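Runtime resolution reduces to a registry keyed by the client's stored configuration. A self-contained sketch (with a stub base class and stub adapter so it runs standalone — the real adapters implement the full interface above):

```python
import asyncio

class CRMAdapterBase:
    async def create_lead(self, lead_data):
        raise NotImplementedError

class ServiceM8Adapter(CRMAdapterBase):
    async def create_lead(self, lead_data):
        # Stub: a real adapter would call the ServiceM8 API here.
        return {"provider": "servicem8", **lead_data}

# One registry entry per adapter file; keys come from client config.
ADAPTERS = {
    "servicem8": ServiceM8Adapter,
    # "fergus": FergusAdapter, "xero": XeroAdapter, ...
}

def resolve_adapter(client_config):
    # The orchestrator looks up the client's record and instantiates
    # the matching adapter — no branching anywhere else in the code.
    return ADAPTERS[client_config["crm_provider"]]()
```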
This sounds simple. It took us three attempts to get the abstraction right (more on that below).
## Multi-Tenancy and White-Labelling
The platform serves multiple businesses, each with their own:
- Phone numbers and SIP configuration
- Agent personalities, system prompts, and voice selection
- CRM connections and integration credentials
- Billing profiles and usage tiers
On top of that, we support resellers who white-label the entire platform — custom domains, custom branding, their own billing relationship with their clients. OAuth callback flows for third-party integrations (Google Calendar, CRM providers) route through a single callback domain regardless of white-label configuration. The tenant context is encoded in the OAuth state parameter and resolved on callback.
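One way to carry tenant context through the `state` parameter is to sign it server-side so the callback handler can trust what comes back. A minimal sketch under that assumption — the payload format and secret handling here are illustrative, not our exact implementation:

```python
import base64
import hashlib
import hmac
import json

SECRET = b"server-side-secret"  # stand-in; load from config in practice

def encode_state(tenant_id, nonce):
    """Pack tenant context into the OAuth state parameter, HMAC-signed."""
    payload = json.dumps({"t": tenant_id, "n": nonce}).encode()
    sig = hmac.new(SECRET, payload, hashlib.sha256).hexdigest().encode()
    return base64.urlsafe_b64encode(payload + b"." + sig).decode()

def decode_state(state):
    """Verify the signature, then recover the tenant context."""
    raw = base64.urlsafe_b64decode(state.encode())
    payload, sig = raw.rsplit(b".", 1)  # hex digest contains no "."
    expected = hmac.new(SECRET, payload, hashlib.sha256).hexdigest().encode()
    if not hmac.compare_digest(sig, expected):
        raise ValueError("state signature mismatch")
    return json.loads(payload)
```

The single callback domain then resolves the white-label tenant from the decoded state rather than from the request's hostname.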
The key constraint: tenant isolation must be absolute. One client's call data, transcripts, and CRM credentials must never leak to another. This is table-stakes for B2B SaaS but easy to mess up when you're also managing real-time audio streams and concurrent call sessions.
## What We Got Wrong
### The Self-Hosted TTS Detour
We spent weeks trying to own the TTS layer. The reasoning was sound: TTS is the single most expensive component per-minute, and if we could self-host, we'd control both cost and quality.
We fine-tuned StyleTTS2 on Australian English speech data sourced from public domain audiobooks. We ran inference on RunPod GPU instances. We deployed to Google Cloud Run with L4 GPUs for auto-scaling.
The result: latency was 3x worse than commercial APIs, and voice quality from short-sample cloning was muddy and inconsistent. The fine-tuned model was better than base, but nowhere near commercial quality. A single $48/mo droplet calling a commercial TTS API outperformed a $200/mo GPU instance running our own model.
Commercial TTS won on speed, quality, and — after accounting for GPU costs — price. We shelved self-hosted TTS entirely.
### The Integration Layer Was in the Wrong Place
First iteration: CRM integrations lived in the client-facing application. The voice engine would fire a webhook, the client app would process it, and the client app would call the CRM.
This broke the moment we added the second client-facing application (the white-label portal). Now two apps needed the same integration logic. We refactored integrations into the orchestrator layer where they belong — a platform concern, not a presentation concern.
Lesson: if two frontends need the same business logic, it belongs in the platform.
### Silent Failures When Swapping Providers
Swapping a TTS provider sounds like changing a URL and an API key. In practice, it surfaced audio format mismatches (PCM vs. MP3 vs. Opus), sample rate conflicts (16kHz vs. 24kHz vs. 44.1kHz), WAV header parsing edge cases, and in one memorable afternoon, a tensor device mismatch where the STT model was on GPU but the audio preprocessing was returning CPU tensors.
None of these threw obvious errors. The calls just sounded bad, or had weird pauses, or silently dropped audio frames. We now have integration tests that validate the full audio pipeline end-to-end for every provider combination.
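The core of those tests is unglamorous: open every provider's output and fail loudly on format drift. A sketch of the kind of check involved, using Python's standard `wave` module (the expected values here are examples, not our production settings):

```python
import io
import wave

def validate_wav(raw, expected_rate=16000, expected_channels=1):
    """Fail loudly on the format mismatches that otherwise degrade
    calls silently: wrong sample rate, channel count, or bit depth."""
    with wave.open(io.BytesIO(raw)) as wav:
        assert wav.getframerate() == expected_rate, (
            f"sample rate {wav.getframerate()} != {expected_rate}")
        assert wav.getnchannels() == expected_channels, "channel mismatch"
        assert wav.getsampwidth() == 2, "expected 16-bit PCM"
        assert wav.getnframes() > 0, "empty audio payload"
```

Equivalent checks for raw PCM and Opus streams need a decode step first, but the principle is the same: never trust that two providers mean the same thing by "audio".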
## Cost Breakdown
Per-minute cost at current volume:
| Component | Cost/min |
|---|---|
| STT (streaming) | ~$0.008 |
| LLM (inference) | ~$0.002 |
| TTS (commercial API) | ~$0.005 |
| Telephony (SIP trunking) | ~$0.010 |
| Platform overhead | ~$0.010 |
| Total | ~$0.035/min |
Compared to hosted platforms at $0.07–0.15/min, this gives us room to build a margin-positive SaaS without passing eye-watering per-minute rates to customers.
Infrastructure: the voice engine runs on a single $48/mo DigitalOcean droplet in SYD1 (Sydney region — latency to Australian callers matters). The orchestrator runs on Railway. No GPU required in production. Total infrastructure cost under $100/mo before per-minute API charges.
## The Expert Generalist Argument
Look at the breadth of this stack: SIP telephony, real-time audio streaming, WebRTC, asyncio concurrency, LLM prompt engineering, TTS/STT integration, OAuth 2.0 flows, Stripe billing, multi-tenant database design, React frontend, and half a dozen CRM API integrations.
A traditional team would split this across a telephony engineer, an ML/AI engineer, a backend developer, a frontend developer, and a DevOps person. Five people, five communication boundaries, five calendars to coordinate.
We shipped it with one engineer and AI-assisted development. Not because AI wrote the code — it didn't architect the adapter pattern or decide to separate the engine from the orchestrator. But it dramatically compressed the time spent on implementation details, boilerplate, and the long tail of integration edge cases.
The moat here isn't any single technology. Every component is available off-the-shelf: commercial STT, commercial TTS, hosted LLM inference, SIP trunking providers, open-source WebRTC libraries. The value is in the integration density — making all of these work together reliably, at production latency, with proper error handling, billing, and multi-tenancy. That's an orchestration problem, and orchestration is where the generalist thrives.
## What's Next
The adapter pattern for CRM integrations is the most leveraged piece of this architecture. Every new integration multiplies the platform's addressable market without touching the core voice engine. We're expanding the adapter interface to handle bidirectional sync — not just pushing call data into CRMs, but pulling customer context into the agent's prompt before the call even connects.
Voice AI is heading toward commodity infrastructure. The models will get faster and cheaper. Latency will shrink. The differentiation will be in what happens after the call — how well the data flows into the systems businesses already use, how seamlessly the voice agent fits into existing workflows.
The platform that wins won't be the one with the best voice model. It'll be the one that makes the phone call disappear into the business process, as if a competent human handled it and did all the paperwork too.
If you're building something similar and want to compare notes on SIP integration, TTS benchmarking, or the adapter pattern — I'm always up for a technical conversation.