Key takeaways
- Deepgram consistently outperformed Whisper by 3x to 4x on word error rate in our benchmarks (1.53% vs 6.82% on clean test audio). On noisy production calls, the gap widened further. Switching STT was the highest-impact decision of the project.
- End-to-end latency dropped from 10 seconds to 1.5 seconds across four optimization stages. The biggest initial bottleneck was not the LLM. It was the embedding workers for RAG.
- The real cost of running an AI voice agent is approximately $0.50 to $0.70 per minute. Not the $0.05 per minute that platform vendors advertise. LLM inference across Groq and OpenAI GPT accounts for roughly half of that.
- Kubernetes with Karpenter autoscaling handles 50 parallel call workers across up to 18 nodes. AWS Lambda was rejected because voice calls can exceed its 15-minute execution limit.
- Pre-scripted phrases cut perceived latency to under 500 milliseconds for common responses. The trade-off: less flexibility, more stability.
This is not an API tutorial
Most guides on how to build an AI voice agent follow the same formula. Connect a speech-to-text API. Wire it to an LLM. Pipe the response through text-to-speech. Make a test call to your own phone. Publish the tutorial.
That is not what this article covers.
This is the architecture behind a production system built for a US-based B2B client who needs to validate business contacts at scale. The system dials real phone numbers, navigates IVR menus, identifies whether a human or a machine answered, converses with receptionists, and logs structured verification data into the client's CRM.
The client's target is 200,000 outbound calls per month. The system is currently in production testing at several hundred calls per day, scaling toward that volume. A previous team spent months attempting to build the same system and did not reach production. Our team of four engineers plus DevOps picked it up in October 2025 and reached production by mid-March 2026.
What follows is everything I learned about building the backend infrastructure. The parts that broke. The decisions that survived. The numbers nobody else publishes.
The pipeline that looks simple and is not
Every AI voice agent runs on the same conceptual pipeline.
Audio comes in from the phone line. Speech-to-text converts it to a transcript. The transcript feeds into an LLM that generates a response. Text-to-speech converts that response back to audio. The audio goes out through the phone line.
Five steps. Five API calls. It should take a weekend to build.
It does not.
Each step operates under constraints that do not exist in a chatbot. Latency is measured in milliseconds, not seconds. Audio quality is degraded by telephony compression (8kHz G.711, which strips frequencies above 4kHz). Background noise triggers false transcriptions. Users interrupt mid-sentence. IVR menus play hold music that confuses every model in the pipeline.
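The five steps reduce to a loop like the one below. This is a minimal sketch with stand-in functions (`transcribe`, `generate_reply`, `synthesize` are hypothetical placeholders, not the production APIs), shown only to make the pipeline shape concrete:

```python
def transcribe(audio_chunk: bytes) -> str:
    """Stand-in for a streaming STT call (e.g. Deepgram)."""
    return audio_chunk.decode("utf-8", errors="ignore")

def generate_reply(transcript: str) -> str:
    """Stand-in for an LLM inference call."""
    return f"Agent reply to: {transcript}"

def synthesize(text: str) -> bytes:
    """Stand-in for a TTS call (e.g. ElevenLabs)."""
    return text.encode("utf-8")

def handle_turn(inbound_audio: bytes) -> bytes:
    """One conversational turn: audio in, audio out."""
    transcript = transcribe(inbound_audio)   # step 2: speech-to-text
    reply = generate_reply(transcript)       # step 3: LLM response
    return synthesize(reply)                 # step 4: text-to-speech
```

Every hard problem in this article lives inside one of those three calls, or in the timing between them.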
Our stack evolved through three major phases before reaching its current state:
- Speech-to-text: Google STT → AssemblyAI → Whisper → Deepgram Nova-3
- LLM inference: GPT OSS 120B on Groq → Llama 70B on Groq (with rate limit management across 50 parallel calls via Groq's enterprise tier)
- Text-to-speech: ElevenLabs (pre-scripted phrases for common responses, generated audio for complex dialogue)
- Voice Activity Detection: FunASR-VAD (with custom DSP heuristics for music filtering) → TEN-VAD (tested and rejected: detected music as speech) → Silero VAD + Deepgram's built-in VAD (hybrid approach, requiring both systems to confirm silence before triggering transcription)
- NLU classification: Regex rules → LLM classification → fine-tuned BERT neural network (for detecting voicemail, IVR menus, and live operators)
- Infrastructure: Single AWS instance → Kubernetes + Terraform with Karpenter autoscaling
Every transition in this list represents weeks of work and at least one production failure that forced the change.
Deepgram vs Whisper: production benchmarks from real calls
This section exists because every comparison of Deepgram vs Whisper available online is written by someone selling one of those products. We tested both in production. Here is what we found.
Word error rate: Deepgram outperformed Whisper by 3x to 4x
Our team ran internal benchmarks across both models on call audio samples.
| Model | WER | CER | Avg latency | Best for |
|---|---|---|---|---|
| Deepgram Nova-3 | 1.53% | 0.57% | ~300ms (streaming) | Production voice agents |
| Groq Whisper Large v3 | 6.82% | 2.23% | ~650ms (batch) | Offline transcription |
Important context: these benchmarks were run on relatively clean test recordings. Deepgram's own published benchmarks report a median WER of 5.26% on diverse enterprise audio, and independent evaluations place it around 5.4%. Whisper Large v3 scores between 5.6% and 10% depending on the dataset, with catastrophic spikes on noisy telephony audio where hallucinations push error rates far higher.
Our test samples skewed cleaner than typical production calls, which explains the unusually low absolute numbers. The relative gap is what matters: Deepgram consistently outperformed Whisper by 3x to 4x regardless of audio conditions. On noisy production lines, the gap widened, not narrowed.
The keyword boosting problem that killed our calls
Whisper does not handle keyword boosting well. In our system, correctly transcribing the name of a company and the name of a contact person is the entire point of the call. If the STT misses the company name, the LLM cannot verify it. The call fails. The client rejects the result.
Whisper's prompting mechanism is designed for style control, not keyword reinforcement. We observed two failure patterns:
Hallucinated company names from silence. During hold music or dead air, Whisper generated company names that did not exist in the audio. Our Voice Activity Detection was weaker at that stage (FunASR-VAD, before we built custom DSP filtering and eventually switched to Silero), so the system fed silent audio or music into Whisper. Whisper filled the void with fabricated transcriptions.
Dropped speech on voicemail messages. When a voicemail greeting played ("Please leave a message after the tone"), Whisper sometimes returned an empty transcript. The system logged no context. The agent sat in silence. The call received an incorrect status.
Deepgram supports dedicated keyword parameter arrays that adjust probability weighting during transcription. Company names and contact names are injected before each call. The difference in downstream accuracy was immediate.
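In practice this means assembling the streaming connection URL per call. A sketch, assuming Deepgram's `keyterm` query parameter for Nova-3 keyword boosting (the exact parameter name and weighting syntax vary by model version; check Deepgram's docs):

```python
from urllib.parse import urlencode

def build_stream_url(company: str, contact: str) -> str:
    """Build a Deepgram streaming URL with per-call keyword boosting.

    `keyterm` is Nova-3's keyword-boosting mechanism; earlier Nova
    models use `keywords=term:weight` instead. Treat as a sketch."""
    params = [
        ("model", "nova-3"),
        ("encoding", "mulaw"),      # 8 kHz G.711 telephony audio
        ("sample_rate", "8000"),
        ("keyterm", company),       # injected fresh before each call
        ("keyterm", contact),
    ]
    return "wss://api.deepgram.com/v1/listen?" + urlencode(params)
```

Because the names change on every dial, the URL is rebuilt per call rather than reusing a shared connection configuration.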
Streaming latency: 300ms vs 650ms
Whisper is not a streaming model. It processes audio in batch, consuming fixed-length chunks. To simulate real-time behavior, we sliced audio into short segments and sent overlapping payloads. This introduced latency, increased API request volume, and created transcript errors at chunk boundaries where words were split between requests.
We spent approximately one month trying to make Whisper work in a streaming pipeline. We experimented with passing previous context, accumulating audio buffers, and adjusting chunk sizes. None of it produced acceptable results for live calls.
Deepgram is a streaming-first architecture. Final transcripts arrive in approximately 300 milliseconds after end of speech. Partial transcripts arrive continuously as the person speaks, which allows our system to begin preparing the LLM context before the speaker finishes their sentence.
The switch from Whisper to Deepgram was the single highest-impact technical decision of the project. If I were starting over, I would use a streaming STT from day one and keep Whisper strictly for offline post-call transcription.
How we cut voice agent latency from 10 seconds to 1.5
When the system first went live, end-to-end latency was 5 to 10 seconds. A person would finish speaking and wait in silence for up to 10 seconds before the agent responded. That is not a conversation. That is a failed phone call.
Research on conversational turn-taking shows the natural gap between speakers averages approximately 200 milliseconds. Anything above 800 milliseconds feels unnatural. Above 1.5 seconds, callers assume they are speaking to a machine.
We reached 1.5 to 2.5 seconds through four sequential optimizations.
Stage 1: The embedding bottleneck nobody warns you about
The largest initial bottleneck was not the LLM. It was the embedding workers handling RAG (Retrieval-Augmented Generation) lookups.
Our agent uses a knowledge base to retrieve call scripts, company information, and dialogue rules. Each RAG query required generating embeddings, running vector similarity search, and returning context to the LLM. Our initial deployment allocated insufficient resources to this process.
Embedding queries were taking 1 to 2 seconds per call turn. Under load, they timed out entirely after 3 seconds, stalling the entire pipeline.
We increased the number of embedding pods and allocated more powerful nodes. Embedding latency dropped to approximately 250 milliseconds on average. This single infrastructure change cut total latency by more than a second.
Stage 2: Trading GPT-120B for Llama 70B
We initially ran GPT OSS 120B on Groq for LLM inference. The model produced high-quality responses but consumed significant processing time.
Switching to Llama 70B Versatile reduced LLM response latency by roughly half. The quality trade-off was manageable. Llama 70B is harder to prompt-engineer for strict compliance with complex call scripts, but it generates responses 2 to 3 times faster for the same token count.
This is a fundamental trade-off in voice AI. Heavier models reason better. Lighter models respond faster. In a voice pipeline where every 100 milliseconds matters, speed usually wins.
Stage 3: Custom local models replacing LLM calls
Not every decision in the pipeline requires a full LLM inference call. Determining whether the audio contains a human voice, a voicemail greeting, or an IVR menu prompt is a classification task, not a generation task.
We replaced several LLM-based classification steps with locally trained models, including a fine-tuned BERT network for call type classification and answering machine detection. These run on the call worker pods without external API calls, eliminating network round-trip latency for classification decisions entirely.
The VAD problem: three attempts before it worked
Voice Activity Detection determines when someone is speaking and when to start processing. It sounds simple. It is the source of some of our worst failures.
FunASR-VAD (first attempt). The model could not distinguish hold music from human speech. When a call went to hold, the system treated the music as someone talking and fed it into the STT pipeline. I built custom digital signal processing heuristics on top of FunASR to filter music and tonal audio. It helped, but the underlying model was still unreliable.
TEN-VAD (second attempt). We tested it as a replacement. It had the same critical failure: music detected as speech. For a cold calling system where 30% or more of calls encounter hold music or IVR prompts, this is not a minor issue. It breaks the entire pipeline downstream. We rejected it.
Silero VAD (current). Combined with Deepgram's built-in VAD in a hybrid configuration. Our rule: both systems must independently confirm silence before we trigger transcription. This dual-confirmation approach eliminated nearly all false positives from music and background noise, at the cost of slightly slower endpointing.
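The dual-confirmation rule itself is small. A sketch of the endpointing logic (frame sizes and hold duration are illustrative, not our tuned values):

```python
class HybridEndpointer:
    """Require BOTH VADs (e.g. Silero + Deepgram's built-in) to report
    silence continuously for `hold_ms` before declaring end of speech.

    Sketch of the dual-confirmation rule, not production code."""

    def __init__(self, hold_ms: int = 300):
        self.hold_ms = hold_ms
        self.silence_ms = 0

    def update(self, vad_a_silent: bool, vad_b_silent: bool,
               frame_ms: int = 20) -> bool:
        # Any disagreement resets the counter: hold music that fools
        # one VAD no longer triggers a spurious transcription.
        if vad_a_silent and vad_b_silent:
            self.silence_ms += frame_ms
        else:
            self.silence_ms = 0
        return self.silence_ms >= self.hold_ms
```

The reset-on-disagreement line is the whole trick: it trades a few extra frames of endpointing delay for near-zero false triggers, which is the trade-off described above.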
Stage 4: Pre-scripted phrases as a latency hack
The most counterintuitive optimization was removing the LLM from the initial response entirely.
For predictable conversational turns (greetings, confirmation phrases, filler responses), we pre-recorded audio using ElevenLabs and stored it locally. When the system detects a greeting or a simple acknowledgment, it fires the pre-recorded audio instantly while the LLM processes the complex response in the background.
Scripted phrase latency: under 500 milliseconds. LLM-generated response latency: 1.5 to 2.5 seconds.
The client did not object to partially scripted responses. They cared about one metric: prospects stopped hanging up during the first three seconds.
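The dispatch logic amounts to a lookup with a TTS fallback. A sketch (intent names and file paths are illustrative):

```python
# Known intents map to pre-rendered ElevenLabs audio stored on the
# worker; everything else falls through to live TTS generation.
PRERECORDED = {
    "greeting": "audio/greeting.ulaw",
    "acknowledge": "audio/acknowledge.ulaw",
    "hold_on": "audio/hold_on.ulaw",
}

def pick_response(intent: str):
    """Return ("file", path) for a cached phrase -- fires in <500 ms --
    or ("tts", None) to signal the slow LLM+TTS path."""
    path = PRERECORDED.get(intent)
    return ("file", path) if path else ("tts", None)
```

In production the fast path also runs while the LLM works in the background, so a scripted acknowledgment buys 1 to 2 seconds of thinking time without dead air.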
Current latency budget
| Component | Current latency | At project start |
|---|---|---|
| Deepgram STT | 300ms (avg), up to 1s on noisy audio | 500-700ms (Whisper) |
| LLM inference (Groq) | 500ms (median), 700ms (mean), 1.2s (P90) | 2-3s (GPT 120B) |
| ElevenLabs TTS | ~300ms time to first byte | Same |
| RAG embeddings | ~250ms | 1-2s (timing out at 3s) |
| Total end-to-end | 1.5-2.5s | 5-10s |
The current bottleneck remains LLM inference. It accounts for roughly 40% of total pipeline latency. Further improvement requires either a lighter model (with quality trade-offs) or self-hosted inference on GPU nodes (with cost trade-offs the client is not willing to absorb).
Kubernetes architecture for scaling to 200,000 calls
The system started on a single large AWS instance. Expensive during idle hours. A bottleneck during peak load. We rebuilt the entire infrastructure on Kubernetes with Terraform and Karpenter for dynamic autoscaling.
Why Lambda fails for voice agents
AWS Lambda was considered and rejected for two reasons.
Lambda has a hard execution limit of 15 minutes. Our average call lasts 2 to 3 minutes, but outliers run 10 to 20 minutes, and the system allows up to 30 minutes per call. A 30-minute call simply cannot run inside Lambda's ceiling.
Each call worker requires approximately 2 GB of RAM to hold local classification models, audio buffers, and connection state. At that memory footprint, Lambda's per-millisecond billing model becomes significantly more expensive than dedicated Kubernetes pods running on-demand instances.
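The memory argument holds up to back-of-envelope arithmetic. The prices below are approximate on-demand us-east-1 figures used purely for illustration; check current AWS pricing before relying on them:

```python
# Approximate prices (illustrative; verify against current AWS pricing).
LAMBDA_PER_GB_SECOND = 0.0000166667   # Lambda compute, per GB-second
C7I_XLARGE_PER_HOUR = 0.1785          # c7i.xlarge on-demand, per hour
WORKERS_PER_NODE = 4                  # 4 call-worker pods per node
WORKER_RAM_GB = 2                     # RAM per call worker

# Cost of keeping one 2 GB worker hot for an hour on each platform.
lambda_worker_hour = LAMBDA_PER_GB_SECOND * WORKER_RAM_GB * 3600
ec2_worker_hour = C7I_XLARGE_PER_HOUR / WORKERS_PER_NODE

print(f"Lambda: ${lambda_worker_hour:.3f}/worker-hour")  # ~ $0.120
print(f"EC2:    ${ec2_worker_hour:.3f}/worker-hour")     # ~ $0.045
```

At sustained load, a 2 GB Lambda worker costs roughly 2 to 3 times a pod on a shared c7i.xlarge, before the 15-minute ceiling even enters the picture. Lambda's model only wins when workers sit idle most of the time, which an autoscaled queue-fed dialer avoids by design.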
Pod layout: call workers, embedders, and the scheduler
The cluster runs on AWS c7i.xlarge instances:
- Call worker nodes: Up to 18 nodes, each running 4 call worker pods at 2 GB RAM each. Karpenter dynamically scales from zero to 50 workers based on active call load.
- Embedding nodes: 5 dedicated nodes run the RAG embedding workers. Separated from call workers to prevent memory contention during high-concurrency periods.
- Scheduler node: A dedicated node runs the call scheduling service that manages the queue, assigns calls to available workers, and handles retry logic for failed calls.
- Backend node: Runs the API server, CRM integration layer, and post-call result processing pipeline.
At maximum capacity (50 parallel calls running simultaneously), the system processes a full cycle of approximately 200 calls every 20 minutes.
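Those throughput numbers imply each worker spends about 5 minutes per call end to end (dialing, talking, post-processing), and they set the calling hours needed for the client's target:

```python
workers = 50
cycle_min = 20
calls_per_cycle = 200
target_monthly = 200_000          # the client's stated goal

calls_per_hour = calls_per_cycle * 60 / cycle_min         # 600 calls/hour
minutes_per_call = cycle_min * workers / calls_per_cycle  # ~5 min incl. overhead
hours_for_target = target_monthly / calls_per_hour        # ~333 calling hours/month
```

Roughly 333 calling hours per month at full capacity, i.e. about 15 hours of dialing per business day across 22 workdays, which is why the worker ceiling and Karpenter's scale-up speed both matter at that volume.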
Infrastructure costs
| Scenario | Monthly AWS cost |
|---|---|
| Base infrastructure (no active calls) | ~$1,200 |
| Production load (1-2 hours/day calling) | ~$1,600-1,800 |
| Full capacity (all 18 worker nodes 24/7) | ~$7,800 |
The base cost covers everything that runs whether or not calls are happening: the EKS control plane ($73), master and admin nodes ($130 each), RDS database ($261), VPC NAT gateway ($66), load balancer ($22), ECR ($20), and storage.
What it actually costs: ~$0.50 to $0.70 per minute
This is the number nobody in the voice AI industry publishes clearly. Platform vendors advertise $0.05 to $0.10 per minute. That covers the orchestration fee. The real cost, including every API call and every minute of compute, is roughly 10x higher.
Our measured cost breakdown per minute of call time
| Component | Cost per minute | Notes |
|---|---|---|
| Deepgram STT | $0.0092 | Streaming transcription |
| LLM inference (Groq + OpenAI GPT) | ~$0.31 | ~20 LLM calls/min across live dialogue and post-call analysis |
| AssemblyAI | $0.0035 | Post-call transcription for QA |
| ElevenLabs TTS | ~$0.10 | Generated speech for non-scripted responses |
| Infrastructure + telephony | ~$0.20 | AWS, SIP, service subscriptions |
| Total | ~$0.62 | Varies by call complexity |
The LLM line requires a breakdown. During a live call, the system makes approximately 10 Llama 70B calls per minute through Groq plus 2 supervisor calls per minute through a larger model. After each call ends, 3 GPT calls handle status extraction and summarization, and another 10 Llama calls handle structured field population.
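The table totals can be sanity-checked directly from its rows (values are our measurements from above):

```python
# Per-minute cost components from our measurements.
components = {
    "deepgram_stt": 0.0092,
    "llm_inference": 0.31,     # live dialogue + post-call analysis
    "assemblyai_qa": 0.0035,   # post-call transcription for QA
    "elevenlabs_tts": 0.10,    # non-scripted responses only
    "infra_telephony": 0.20,   # AWS, SIP, subscriptions
}

total_per_minute = sum(components.values())   # ~$0.62/min
cost_per_call = total_per_minute * 3          # a typical 2-3 minute call
```

A typical 3-minute call lands near $1.87 all-in, which is the per-unit number to hold against whatever a platform vendor quotes you.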
This cost structure determines whether building a custom engineering team makes financial sense. At low call volumes, platforms like Retell AI may be more cost-effective. At high volumes with complex multi-step call logic, a custom system gives you control over every cost lever.
When one component fails, everything fails
Voice AI pipelines are cascade-failure machines. A problem in one component does not degrade gracefully. It propagates through the entire system and destroys the call.
The IVR cascade: how a greeting destroyed a menu
An automated phone system answered with a pre-recorded message: "Hello, thank you for calling. Press 1 for billing, press 2 for sales, press 0 to reach an operator."
The word "Hello" at the beginning triggered our human detection classifier. The system concluded a real person had answered and immediately played the agent's greeting.
That greeting interrupted the IVR menu playback. The system missed the list of options. Because the IVR was voice-activated, it interpreted our agent's greeting as input. The menu started responding to our premature speech. We could no longer press the correct button.
The call failed. The client rejected the result.
There is no reliable way to determine from a transcript alone whether a voice is pre-recorded or live. This class of failure required building a separate classification layer: a fine-tuned BERT model that detects IVR patterns from transcript structure rather than acoustic features.
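Transcript structure gives the classifier strong signals even before the BERT model weighs in. A simplified sketch of the pattern-matching layer (the regex and threshold are illustrative, not our production rules):

```python
import re

# "press N for X" repeated is a strong IVR signal: humans say it
# rarely, and almost never more than once in a greeting.
MENU_RE = re.compile(
    r"press\s+(\d|zero|one|two|three|four|five|six|seven|eight|nine)\b",
    re.IGNORECASE,
)

def looks_like_ivr(transcript: str, min_options: int = 2) -> bool:
    """Flag transcripts containing multiple DTMF options as IVR menus.

    A lone 'press 1' can occur in live speech; two or more almost
    never do. The production system layers a fine-tuned BERT model
    on top of heuristics like this one."""
    return len(MENU_RE.findall(transcript)) >= min_options
```

Applied to the failure above, the full "Press 1 for billing, press 2 for sales, press 0..." menu classifies correctly; the problem was that the classifier fired on "Hello" before any of those options had played, which is why the fix also involved delaying commitment to the human-answered branch.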
The noise cascade: transcription error to delay to hang-up
On a noisy phone line, the STT returned a garbled transcript of the operator's greeting. Instead of proceeding with the call script, the agent asked the operator to repeat themselves.
This request took time to generate. Noisy audio requires longer STT processing, which delayed the LLM's input. By the time the agent's response reached the operator, the operator had already repeated their greeting independently.
Both parties were now speaking simultaneously. The agent tried to respond to the first greeting while the operator's second greeting arrived. Responses went out of sync. The operator hung up.
One transcription error caused a timing delay. The delay caused speaker overlap. The overlap caused call abandonment. This is what cascade failure looks like in voice AI. It does not degrade. It collapses.
How we handle failures now
The system relies on hard timeouts for all external services: Groq, Deepgram, ElevenLabs. If a service call exceeds its timeout threshold, the system retries. If retries exceed a configurable limit, the call is terminated and marked for re-dial.
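The policy can be sketched as a small async wrapper; this is the shape of the rule, not our exact implementation, and the timeout and retry values are configurable per service:

```python
import asyncio

async def call_with_timeout(coro_factory, timeout_s: float, max_retries: int):
    """Run an external-service call (Groq, Deepgram, ElevenLabs)
    under a hard timeout. Retry up to max_retries, then re-raise so
    the caller can terminate the call and mark it for re-dial."""
    for attempt in range(max_retries + 1):
        try:
            return await asyncio.wait_for(coro_factory(), timeout=timeout_s)
        except asyncio.TimeoutError:
            if attempt == max_retries:
                raise  # caller terminates the call and queues a re-dial
```

The factory argument matters: a coroutine can only be awaited once, so each retry must construct a fresh request rather than re-awaiting the timed-out one.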
What I would do differently on day one
Start with streaming STT. We spent a month trying to force Whisper into a streaming pipeline. That month produced no usable results. Deepgram, AssemblyAI, or Google Cloud Speech-to-Text should have been the starting point. Whisper belongs in offline transcription, not real-time voice agents.
Constrain the scope before scaling the logic. The most expensive mistake was trying to handle every possible call scenario simultaneously. The correct approach: identify the 10 most common call outcomes. Build perfect handling for those 10. Implement a hard fallback for everything else. Then expand incrementally.
Build automated testing before writing call logic. Our team eventually built a dataset of 1,100 real call recordings that we run before every deployment. Once this testing system was in place, accuracy jumped from 60% to 80% in weeks. If we had built it on day one, we would have saved at least a month of regression cycles.
Build vs buy: when a custom voice agent makes sense
If you are making a few thousand calls per month with a straightforward script, use a platform. Retell AI, Vapi, and similar tools handle orchestration, scaling, and basic dialogue management at a fraction of the engineering cost.
If your use case involves any of the following, a custom pipeline becomes necessary:
- You need to navigate complex IVR menus and company directories
- You need custom classification models trained on your specific call data
- You need strict control over call flow logic with phase-based dialogue management
- You need to scale beyond 10,000 to 20,000 calls per month
- You need deep CRM integration with structured data from post-call analysis
Our system required all five. That is why we built it from scratch.
Frequently asked questions
How long does it take to build an AI voice agent?
Our system reached production in approximately five months (October 2025 to mid-March 2026). The first two months covered PoC and MVP development. The remaining three months were consumed by accuracy tuning: reducing latency from 10 seconds to under 2.5 seconds, increasing classification accuracy from roughly 50% to 80%, and building the regression testing infrastructure. A previous team worked on the same project for months and did not reach production.
What is the best speech-to-text API for voice agents in 2026?
For real-time voice agents on telephony audio, Deepgram Nova-3 is our recommendation based on production testing. It provides streaming transcription at approximately 300 milliseconds latency, strong keyword boosting for proper nouns, and consistently lower word error rates than Whisper on call audio. Deepgram's published median WER is 5.26% on diverse enterprise datasets. On our cleaner test samples it scored 1.53%, while Whisper scored 6.82% on the same data.
How much does it cost to run an AI voice agent?
Our measured cost is approximately $0.50 to $0.70 per minute of call time. That includes STT (Deepgram), approximately 20 LLM inference calls per minute across Groq and OpenAI GPT, TTS (ElevenLabs), post-call analysis, and infrastructure. Platform vendors advertise $0.05 to $0.10 per minute, but that covers only orchestration, not the underlying AI provider costs.
Can AI voice agents navigate IVR phone menus?
Yes, but it is one of the hardest problems in the pipeline. Our system uses a fine-tuned BERT model for call type classification, pattern matching for extracting DTMF options from transcripts, and LLM reasoning for complex menu structures. Early versions would get trapped in IVR loops for over 10 minutes. Current versions use a supervisor mechanism that detects loops and forces corrective action or call termination.
What latency is acceptable for a voice agent?
Research on human conversational turn-taking (Stivers et al., 2009) places the natural gap between speakers at approximately 200 milliseconds. In production voice agents, the industry target is under 800 milliseconds total round-trip time. Above 1.5 seconds, callers begin to suspect automation. Our system operates at 1.5 to 2.5 seconds, with pre-scripted phrases achieving sub-500 millisecond response times for common interactions.