We built a production voice AI platform that handles inbound calls for businesses — answering phones, booking appointments, qualifying leads, and pushing structured data into CRMs. Not a demo. Not a weekend hack. A multi-tenant platform serving real customers who get angry when calls drop.
This is what we learned.
## The Problem with Existing Platforms
The hosted voice AI platforms — Retell, Vapi, Bland, and others — solve a real bootstrapping problem. You can get a voice agent on a phone number in an afternoon. But the moment you need production-grade control, the walls close in.
Per-minute pricing at $0.07–0.15/min eats your margins alive when you're building a SaaS on top. You're locked into their prompt formats, their latency characteristics, their integration limitations. When something breaks at 2am, you're filing a support ticket instead of reading a stack trace.
We wanted three things: full control over the voice pipeline latency, the ability to plug into any CRM without waiting on someone else's roadmap, and unit economics that let us build a real business on top. So we built the platform ourselves.
## Architecture Overview
The system separates into two layers that communicate over internal APIs:
```
┌─────────────────────────────────────────────────────┐
│                    ORCHESTRATOR                     │
│  (Billing, Auth, CRM Adapters, Client Management)   │
│         Node.js / Express / MySQL / Stripe          │
└──────────────────────┬──────────────────────────────┘
                       │ REST + Webhooks
┌──────────────────────▼──────────────────────────────┐
│                    VOICE ENGINE                     │
│                                                     │
│  ┌─────────┐    ┌─────────┐    ┌─────────┐          │
│  │   SIP   │    │ WebRTC  │    │  PSTN   │          │
│  │  Trunk  │◄──►│ Gateway │    │ Bridge  │          │
│  └────┬────┘    └────┬────┘    └────┬────┘          │
│       │              │              │               │
│       └──────────────┼──────────────┘               │
│                      ▼                              │
│  ┌──────────────────────────────────────────┐       │
│  │            STREAMING PIPELINE            │       │
│  │                                          │       │
│  │  Audio In ──► STT ──► LLM ──► TTS ──►    │       │
│  │  Audio Out                               │       │
│  │                                          │       │
│  │  [Barge-in detector]  [Buffer manager]   │       │
│  └──────────────────────────────────────────┘       │
│                                                     │
│            Python / asyncio / SIP stack             │
└─────────────────────────────────────────────────────┘
```
The key architectural decision was this separation. The voice engine knows nothing about billing, CRM integrations, or client management. It handles calls, streams audio, manages the STT-LLM-TTS pipeline, and fires webhooks when things happen. The orchestrator handles everything else.
This separation means the engine can evolve independently. We can swap TTS providers, change STT models, or rearchitect the audio pipeline without touching billing code. It also means the orchestrator — a conventional Node.js app — handles all the "normal SaaS" concerns without being coupled to real-time audio processing.
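To make the boundary concrete, here is a sketch of the kind of event the engine emits when a call ends. The field names and URL are illustrative, not our actual schema; the point is that the payload carries call facts only, and the orchestrator owns everything downstream.

```python
import json

# Hypothetical shape of a call-completed webhook from the voice engine.
# The engine knows nothing about billing or CRMs — it just reports
# what happened; the orchestrator decides what to do with it.
event = {
    "type": "call.completed",
    "call_id": "c_123",
    "tenant_id": "t_456",
    "duration_sec": 184,
    "transcript_url": "https://engine.internal/transcripts/c_123",
}
payload = json.dumps(event)
```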
SIP trunking handles PSTN calls. WebRTC handles browser-based calls from our portal. Both feed into the same streaming pipeline.
## Latency Optimisation: Where the Milliseconds Go
Voice AI has one unforgiving constraint: if the pause between a caller finishing their sentence and the AI starting its response exceeds about 800ms, the conversation feels broken. People start saying "hello?" or talking over the agent. Our target was sub-600ms end-to-end.
Here's where the time budget goes:
### STT: Endpoint Detection Matters More Than Accuracy
We use streaming transcription — audio chunks flow to the STT provider continuously, and partial transcripts come back in real time. The critical tuning parameter isn't model accuracy. It's utterance_end_ms: how long the system waits after the caller stops speaking before it finalises the transcript and triggers the LLM.
Set it too low and you cut people off mid-sentence. Set it too high and you add hundreds of milliseconds of dead air. We settled on tuning this per-agent based on use case. A receptionist agent for a plumbing company gets a longer window than a booking confirmation flow.
Cost: ~$0.008/min. Latency contribution: 50–150ms depending on endpoint detection settings.
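The logic behind the parameter can be sketched in a few lines. Most streaming STT providers expose this as a connection setting rather than client code, so treat the following as an illustration of the mechanism, with illustrative threshold values rather than our production settings:

```python
class EndpointDetector:
    """Decide when an utterance has ended, based on trailing silence.

    Minimal sketch: track when speech was last heard, and finalise
    once the silence window exceeds utterance_end_ms.
    """

    def __init__(self, utterance_end_ms):
        self.utterance_end_ms = utterance_end_ms
        self.last_speech_at = None  # seconds, monotonic clock

    def on_partial_transcript(self, now):
        # Any new partial transcript means the caller is still speaking.
        self.last_speech_at = now

    def utterance_ended(self, now):
        # Finalise the transcript (and trigger the LLM) only after
        # the silence threshold has elapsed.
        if self.last_speech_at is None:
            return False
        return (now - self.last_speech_at) * 1000 >= self.utterance_end_ms

# Per-agent tuning: a receptionist gets a longer window than a
# yes/no confirmation flow (example values, not ours).
receptionist = EndpointDetector(utterance_end_ms=900)
confirmation = EndpointDetector(utterance_end_ms=400)
```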
### LLM: Time to First Token Is Everything
For voice, you don't care about tokens-per-second throughput. You care about TTFT — time to first token. That's the delay between sending your prompt and receiving the first token of the response. Everything after that streams.
We benchmarked extensively:
| Provider | TTFT (p50) | Notes |
|---|---|---|
| Groq (LPU) | ~350ms | Consistently fast, limited model selection |
| GPU-based inference (various) | 500–800ms | Varies wildly by load |
| Cerebras | ~200ms | Incredible raw speed, but US-only regions |
Cerebras looked phenomenal on paper. Then we measured from Sydney. The 150ms+ round-trip to US datacenters ate the advantage entirely. Geography matters when you're counting milliseconds.
We run Groq for the primary path with GPU-based fallback. Cost: ~$0.002/min.
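Measuring TTFT is straightforward if the provider SDK exposes the completion as an async iterator of text chunks (OpenAI-style streaming, which Groq also offers). A sketch of the benchmark harness, with a stand-in stream where the real provider call would go:

```python
import asyncio
import time

async def measure_ttft(token_stream):
    """Return (ttft_seconds, full_text) for an async token stream."""
    start = time.perf_counter()
    ttft = None
    parts = []
    async for token in token_stream:
        if ttft is None:
            # First token has arrived — this is the number that matters.
            ttft = time.perf_counter() - start
        parts.append(token)
    return ttft, "".join(parts)

# Usage with a stand-in stream; a real benchmark would iterate the
# provider's streaming completion instead.
async def fake_stream():
    await asyncio.sleep(0.05)  # simulated TTFT
    yield "Hello"
    yield ", world."

ttft, text = asyncio.run(measure_ttft(fake_stream()))
```

Run this from the same region as your voice engine — as the Cerebras example shows, a benchmark from the wrong geography measures the wrong thing.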
### TTS: The Streaming Trick
We evaluated eight TTS engines. The results were decisive:
| Engine | Latency to first audio | Quality | Notes |
|---|---|---|---|
| Commercial API A | ~100ms | Excellent | Streaming, good voice selection |
| Commercial API B | ~90ms | Good | Fast, limited voices |
| Self-hosted (GPU) | ~330ms | Fair | RunPod, required warm instances |
| Self-hosted (CPU) | ~2000ms | Fair | Unusable for real-time |
The insight that cut our end-to-end latency by 30%: start TTS on the first sentence boundary while the LLM is still generating. Don't wait for the complete response.
```python
# Pseudocode for the streaming pipeline
SENTENCE_ENDINGS = (".", "!", "?")

def ends_with_sentence_boundary(text):
    # Naive boundary check — production code also guards against
    # abbreviations ("Dr.", "St.") and decimal numbers.
    return text.rstrip().endswith(SENTENCE_ENDINGS)

async def handle_llm_stream(llm_response_stream, tts_engine):
    sentence_buffer = ""
    async for token in llm_response_stream:
        sentence_buffer += token
        if ends_with_sentence_boundary(sentence_buffer):
            # Fire TTS immediately on the completed sentence —
            # don't wait for the rest of the LLM response.
            await tts_engine.synthesize_streaming(sentence_buffer)
            sentence_buffer = ""
    # Flush any remaining partial sentence
    if sentence_buffer:
        await tts_engine.synthesize_streaming(sentence_buffer)
```
This means the caller hears the first sentence of the response while the LLM is still generating sentence two. The perceived latency drops dramatically.
### Barge-In: The Cancellable Buffer
When a caller interrupts, you need to stop playback immediately. This means the TTS output buffer must be cancellable — you can't just pipe audio to the SIP channel and forget about it. We maintain a reference to the current playback stream and clear it the moment the STT detects new speech during agent output.
Get this wrong and the agent talks over the caller. Get it right and the conversation feels natural.
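The cancellable-buffer idea maps naturally onto asyncio task cancellation. A minimal sketch, assuming a `send` callable that writes an audio frame to the telephony channel (the class and function names here are illustrative, not our engine's API):

```python
import asyncio

class PlaybackManager:
    """Holds a reference to the current TTS playback so it can be
    cancelled the instant the caller starts speaking again."""

    def __init__(self):
        self._current = None

    def start(self, coro):
        # Replace any in-flight playback with the new one.
        self.cancel()
        self._current = asyncio.ensure_future(coro)
        return self._current

    def cancel(self):
        # Called by the barge-in detector when STT reports new
        # speech while the agent is still talking.
        if self._current is not None and not self._current.done():
            self._current.cancel()
        self._current = None

async def play_audio(chunks, send):
    try:
        for chunk in chunks:
            send(chunk)
            await asyncio.sleep(0.02)  # pace playback at ~20ms frames
    except asyncio.CancelledError:
        # A real engine would also flush the SIP channel's
        # output buffer here before re-raising.
        raise
```

The important property is that cancellation stops frames mid-stream rather than letting an already-queued buffer drain into the call.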
Total pipeline latency (p50): ~500ms from end of caller speech to first audio response.
## CRM Integration: The Adapter Pattern
A voice AI agent that can't push data into your CRM is a parlour trick. The call ends, and then what? Someone reads a transcript and manually creates a job? That's not automation.
We built a provider-agnostic adapter layer:
```python
# Pseudocode — the adapter interface
from abc import ABC, abstractmethod

class CRMAdapter(ABC):
    @abstractmethod
    async def find_customer(self, phone, name) -> "Customer": ...

    @abstractmethod
    async def create_lead(self, lead_data) -> "Lead": ...

    @abstractmethod
    async def create_appointment(self, slot, customer) -> "Appointment": ...

    @abstractmethod
    async def create_job(self, description, customer) -> "Job": ...

# Each CRM gets one implementation file
class ServiceM8Adapter(CRMAdapter): ...
class FergusAdapter(CRMAdapter): ...
class XeroAdapter(CRMAdapter): ...
```
Post-call, the LLM runs a structured analysis pass over the transcript: extract caller intent, job type, urgency, contact details, preferred appointment times. That structured output feeds directly into whichever CRM the client has connected.
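A sketch of what that structured output might look like on the consuming side. The field names here are illustrative, not our exact schema, and a production pipeline would validate the model's JSON and retry on malformed output:

```python
import json
from dataclasses import dataclass
from typing import Optional

@dataclass
class CallAnalysis:
    """Fields the post-call analysis prompt asks the LLM to extract."""
    intent: str                    # e.g. "book_job", "quote_request", "other"
    job_type: Optional[str]
    urgency: str                   # "low" | "normal" | "emergency"
    caller_name: Optional[str]
    caller_phone: Optional[str]
    preferred_times: list

def parse_analysis(llm_json):
    # The analysis prompt instructs the model to answer in strict JSON.
    data = json.loads(llm_json)
    return CallAnalysis(
        intent=data.get("intent", "other"),
        job_type=data.get("job_type"),
        urgency=data.get("urgency", "normal"),
        caller_name=data.get("caller_name"),
        caller_phone=data.get("caller_phone"),
        preferred_times=data.get("preferred_times", []),
    )
```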
Adding a new CRM integration means writing one adapter file. No database schema changes, no new API routes, no changes to the voice engine. The orchestrator resolves the correct adapter at runtime based on the client's configuration.
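Runtime resolution reduces to a registry keyed by the client's stored configuration. A self-contained sketch (with a stub base class and stub adapter so it runs standalone — the real adapters implement the full interface above):

```python
import asyncio

class CRMAdapterBase:
    async def create_lead(self, lead_data):
        raise NotImplementedError

class ServiceM8Adapter(CRMAdapterBase):
    async def create_lead(self, lead_data):
        # Stub: a real adapter would call the ServiceM8 API here.
        return {"provider": "servicem8", **lead_data}

# One registry entry per adapter file; keys come from client config.
ADAPTERS = {
    "servicem8": ServiceM8Adapter,
    # "fergus": FergusAdapter, "xero": XeroAdapter, ...
}

def resolve_adapter(client_config):
    # The orchestrator looks up the client's record and instantiates
    # the matching adapter — no branching anywhere else in the code.
    return ADAPTERS[client_config["crm_provider"]]()
```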
This sounds simple. It took us three attempts to get the abstraction right (more on that below).
## Multi-Tenancy and White-Labelling
The platform serves multiple businesses, each with their own:
- Phone numbers and SIP configuration
- Agent personalities, system prompts, and voice selection
- CRM connections and integration credentials
- Billing profiles and usage tiers
On top of that, we support resellers who white-label the entire platform — custom domains, custom branding, their own billing relationship with their clients. OAuth callback flows for third-party integrations (Google Calendar, CRM providers) route through a single callback domain regardless of white-label configuration. The tenant context is encoded in the OAuth state parameter and resolved on callback.
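One way to carry tenant context through the `state` parameter is to sign it server-side so the callback handler can trust what comes back. A minimal sketch under that assumption — the payload format and secret handling here are illustrative, not our exact implementation:

```python
import base64
import hashlib
import hmac
import json

SECRET = b"server-side-secret"  # stand-in; load from config in practice

def encode_state(tenant_id, nonce):
    """Pack tenant context into the OAuth state parameter, HMAC-signed."""
    payload = json.dumps({"t": tenant_id, "n": nonce}).encode()
    sig = hmac.new(SECRET, payload, hashlib.sha256).hexdigest().encode()
    return base64.urlsafe_b64encode(payload + b"." + sig).decode()

def decode_state(state):
    """Verify the signature, then recover the tenant context."""
    raw = base64.urlsafe_b64decode(state.encode())
    payload, sig = raw.rsplit(b".", 1)  # hex digest contains no "."
    expected = hmac.new(SECRET, payload, hashlib.sha256).hexdigest().encode()
    if not hmac.compare_digest(sig, expected):
        raise ValueError("state signature mismatch")
    return json.loads(payload)
```

The single callback domain then resolves the white-label tenant from the decoded state rather than from the request's hostname.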
The key constraint: tenant isolation must be absolute. One client's call data, transcripts, and CRM credentials must never leak to another. This is table-stakes for B2B SaaS but easy to mess up when you're also managing real-time audio streams and concurrent call sessions.
## What We Got Wrong
### The Self-Hosted TTS Detour
We spent weeks trying to own the TTS layer. The reasoning was sound: TTS is the single most expensive component per-minute, and if we could self-host, we'd control both cost and quality.
We fine-tuned StyleTTS2 on Australian English speech data sourced from public domain audiobooks. We ran inference on RunPod GPU instances. We deployed to Google Cloud Run with L4 GPUs for auto-scaling.
The result: latency was 3x worse than commercial APIs, and voice quality from short-sample cloning was muddy and inconsistent. The fine-tuned model was better than base, but nowhere near commercial quality. A single $48/mo droplet calling a commercial TTS API outperformed a $200/mo GPU instance running our own model.
Commercial TTS won on speed, quality, and — after accounting for GPU costs — price. We shelved self-hosted TTS entirely.
### The Integration Layer Was in the Wrong Place
First iteration: CRM integrations lived in the client-facing application. The voice engine would fire a webhook, the client app would process it, and the client app would call the CRM.
This broke the moment we added the second client-facing application (the white-label portal). Now two apps needed the same integration logic. We refactored integrations into the orchestrator layer where they belong — a platform concern, not a presentation concern.
Lesson: if two frontends need the same business logic, it belongs in the platform.
### Silent Failures When Swapping Providers
Swapping a TTS provider sounds like changing a URL and an API key. In practice, it surfaced audio format mismatches (PCM vs. MP3 vs. Opus), sample rate conflicts (16kHz vs. 24kHz vs. 44.1kHz), WAV header parsing edge cases, and in one memorable afternoon, a tensor device mismatch where the STT model was on GPU but the audio preprocessing was returning CPU tensors.
None of these threw obvious errors. The calls just sounded bad, or had weird pauses, or silently dropped audio frames. We now have integration tests that validate the full audio pipeline end-to-end for every provider combination.
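The core of those tests is unglamorous: open every provider's output and fail loudly on format drift. A sketch of the kind of check involved, using Python's standard `wave` module (the expected values here are examples, not our production settings):

```python
import io
import wave

def validate_wav(raw, expected_rate=16000, expected_channels=1):
    """Fail loudly on the format mismatches that otherwise degrade
    calls silently: wrong sample rate, channel count, or bit depth."""
    with wave.open(io.BytesIO(raw)) as wav:
        assert wav.getframerate() == expected_rate, (
            f"sample rate {wav.getframerate()} != {expected_rate}")
        assert wav.getnchannels() == expected_channels, "channel mismatch"
        assert wav.getsampwidth() == 2, "expected 16-bit PCM"
        assert wav.getnframes() > 0, "empty audio payload"
```

Equivalent checks for raw PCM and Opus streams need a decode step first, but the principle is the same: never trust that two providers mean the same thing by "audio".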
## Cost Breakdown
Per-minute cost at current volume:
| Component | Cost/min |
|---|---|
| STT (streaming) | ~$0.008 |
| LLM (inference) | ~$0.002 |
| TTS (commercial API) | ~$0.005 |
| Telephony (SIP trunking) | ~$0.010 |
| Platform overhead | ~$0.010 |
| Total | ~$0.035/min |
Compared to hosted platforms at $0.07–0.15/min, this gives us room to build a margin-positive SaaS without passing eye-watering per-minute rates to customers.
Infrastructure: the voice engine runs on a single $48/mo DigitalOcean droplet in SYD1 (Sydney region — latency to Australian callers matters). The orchestrator runs on Railway. No GPU required in production. Total infrastructure cost under $100/mo before per-minute API charges.
## The Expert Generalist Argument
Look at the breadth of this stack: SIP telephony, real-time audio streaming, WebRTC, asyncio concurrency, LLM prompt engineering, TTS/STT integration, OAuth 2.0 flows, Stripe billing, multi-tenant database design, React frontend, and half a dozen CRM API integrations.
A traditional team would split this across a telephony engineer, an ML/AI engineer, a backend developer, a frontend developer, and a DevOps person. Five people, five communication boundaries, five calendars to coordinate.
We shipped it with one engineer and AI-assisted development. Not because AI wrote the code — it didn't architect the adapter pattern or decide to separate the engine from the orchestrator. But it dramatically compressed the time spent on implementation details, boilerplate, and the long tail of integration edge cases.
The moat here isn't any single technology. Every component is available off-the-shelf: commercial STT, commercial TTS, hosted LLM inference, SIP trunking providers, open-source WebRTC libraries. The value is in the integration density — making all of these work together reliably, at production latency, with proper error handling, billing, and multi-tenancy. That's an orchestration problem, and orchestration is where the generalist thrives.
## What's Next
The adapter pattern for CRM integrations is the most leveraged piece of this architecture. Every new integration multiplies the platform's addressable market without touching the core voice engine. We're expanding the adapter interface to handle bidirectional sync — not just pushing call data into CRMs, but pulling customer context into the agent's prompt before the call even connects.
Voice AI is heading toward commodity infrastructure. The models will get faster and cheaper. Latency will shrink. The differentiation will be in what happens after the call — how well the data flows into the systems businesses already use, how seamlessly the voice agent fits into existing workflows.
The platform that wins won't be the one with the best voice model. It'll be the one that makes the phone call disappear into the business process, as if a competent human handled it and did all the paperwork too.
If you're building something similar and want to compare notes on SIP integration, TTS benchmarking, or the adapter pattern — I'm always up for a technical conversation.