How a Voice AI call works: from dial to response


Disclosure: This post contains affiliate links, including a link to Vapi. If you sign up for a paid plan through my link, I may earn a commission at no extra cost to you. I only recommend platforms I have personally evaluated. Full affiliate disclosure here.


Priyanka · Senior Voice AI PM · April 3, 2026 · 9 min read · 1,800 words
Voice AI · SIP telephony · Technical explainer

The short answer

When a caller dials a Voice AI number, six distinct systems activate in sequence - SIP signalling, audio streaming, speech recognition, the language model, text-to-speech, and the return audio path. The entire round-trip must complete in under 500 milliseconds for the call to feel natural. This post walks through exactly what happens at each stage and where things go wrong.

Most people who work in Voice AI - including many PMs and developers - cannot fully explain what happens between the moment a caller dials a number and the moment the AI responds. They know the pieces exist: there is some speech recognition, there is a language model, there is a voice synthesis engine. But the sequence, the timing, and the failure points in between? That part stays murky.

That murkiness is expensive. When a Voice AI call fails - when the AI speaks over the caller, pauses too long, mishears a word, or drops the call entirely - the root cause is almost always in the handoff between one of these systems and the next. If you cannot picture the full call flow, you cannot diagnose the failure, set accurate latency expectations, or scope an integration correctly.

This post walks through the complete flow of a Voice AI call, stage by stage, from the moment a caller dials to the moment they hear a response. No assumed technical knowledge required.

- 6 systems involved in every call
- <500ms target end-to-end latency
- 3 most common failure stages

Stage 1: The call enters the network (SIP signalling)

Everything begins with a phone number. When a caller dials a number connected to a Voice AI system, their call travels through the public telephone network - the PSTN - and arrives at a SIP trunk. SIP (Session Initiation Protocol) is the signalling layer that tells your Voice AI platform a call is incoming and negotiates how that call will be handled.

At this stage, three things happen almost simultaneously. The SIP trunk sends an INVITE message to the Voice AI platform, signalling that a call is waiting. The platform sends back a 100 Trying response to acknowledge receipt. Then, once the platform is ready to accept the call, it sends a 200 OK - and the caller is connected. The whole SIP handshake typically takes between 50 and 150 milliseconds. If it takes longer, the caller hears dead silence before anything happens, which immediately creates a bad first impression.
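The handshake above can be pictured as a small ordered state machine. The sketch below is purely illustrative - real deployments use a full SIP stack such as PJSIP or Kamailio rather than anything hand-rolled - and it includes the caller's ACK, the standard acknowledgement of the 200 OK, which I omitted from the prose for simplicity.

```python
# Illustrative only: the basic SIP call-setup sequence as an ordered list.
# Real SIP flows include provisional responses (e.g. 180 Ringing) and
# retransmission rules that this sketch deliberately ignores.

HANDSHAKE_ORDER = ["INVITE", "100 Trying", "200 OK", "ACK"]

def next_sip_message(last_message: str) -> str:
    """Return the message that should follow `last_message` in basic call setup."""
    idx = HANDSHAKE_ORDER.index(last_message)
    if idx == len(HANDSHAKE_ORDER) - 1:
        raise ValueError("Handshake already complete")
    return HANDSHAKE_ORDER[idx + 1]
```

If your platform's logs show an INVITE with no 100 Trying inside the expected window, the problem is upstream of the AI entirely.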

What can go wrong here

Codec mismatches are the most common failure at Stage 1. The SIP trunk and the Voice AI platform need to agree on how audio will be encoded - G.711, G.729, or Opus are the most common options. If they cannot agree, the call is rejected before it ever reaches the AI. Firewall rules that block SIP packets are the second most common cause of failures at this stage.
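The negotiation itself reduces to a set intersection. A minimal sketch, with made-up codec lists (the real exchange happens in SDP offer/answer, which this skips entirely):

```python
# Toy codec negotiation: the call proceeds only if the trunk's offer and the
# platform's supported list share at least one codec.
from typing import Optional

def negotiate_codec(offered: list[str], supported: list[str]) -> Optional[str]:
    """Return the first mutually supported codec, honouring the offer order."""
    for codec in offered:
        if codec in supported:
            return codec
    return None  # no overlap -> the call is rejected before it reaches the AI
```

A trunk offering only G.729 against a platform that speaks G.711 and Opus fails exactly the way Stage 1 failures look in production: no common codec, no call.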

Stage 2: Audio streams to the platform (RTP)

Once the SIP handshake completes, the actual audio begins flowing over a separate protocol called RTP - Real-time Transport Protocol. RTP handles the continuous stream of compressed audio packets travelling in both directions: the caller's voice going in, the AI's voice coming back out.

The Voice AI platform receives this raw audio stream and immediately needs to do two things: detect when the caller has stopped speaking so it knows when to respond, and feed the captured audio into the speech recognition engine. The first task - called Voice Activity Detection, or VAD - is more complex than it sounds. Background noise, breathing, filler words like "um" or "er", and incomplete sentences all make it difficult to know with certainty whether the caller has finished their thought.

Getting VAD wrong is one of the most frustrating experiences in Voice AI. If the threshold is too aggressive, the AI interrupts the caller mid-sentence. If it is too passive, there is a long, awkward silence before the AI responds. Tuning VAD for different acoustic environments - a quiet office versus a busy call centre floor - is one of the more underestimated tasks in a Voice AI deployment.
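To make the trade-off concrete, here is a crude energy-based VAD sketch: declare end-of-speech once N consecutive frames fall below an energy threshold. The threshold and hangover values are invented for illustration; production systems (WebRTC VAD, Silero, and the like) are far more sophisticated.

```python
# Crude energy-based end-of-speech detection. `hangover_frames` is the knob
# discussed above: too small and the AI barges in, too large and the caller
# hears an awkward silence before every response.

def end_of_speech(frame_energies: list[float],
                  threshold: float = 0.01,
                  hangover_frames: int = 20) -> bool:
    """True if the last `hangover_frames` frames are all below `threshold`."""
    if len(frame_energies) < hangover_frames:
        return False
    return all(e < threshold for e in frame_energies[-hangover_frames:])
```

Tuning for a busy call centre floor typically means raising the threshold and lengthening the hangover, which is exactly why one global default rarely survives contact with a second acoustic environment.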

Stage 3: Speech-to-text converts audio to words (STT)

Once the platform captures what it believes is a complete utterance from the caller, that audio is sent to a speech-to-text engine. The STT engine converts the raw audio waveform into a text transcript - the words the caller actually said. Common STT providers used in Voice AI platforms include Deepgram, OpenAI Whisper, Google Speech-to-Text, and AssemblyAI.

STT latency is one of the biggest variables in total call latency. A well-optimised STT pipeline running on dedicated infrastructure can return a transcript in 100 to 200 milliseconds. An overloaded shared API can take 500 milliseconds or more - which, added to everything else in the pipeline, pushes total response time well past the point where the call feels natural.

Accuracy matters too. An STT engine that regularly garbles short commands like "cancel", or confuses product names and account numbers, creates downstream problems the LLM cannot easily recover from. Choosing an STT model trained on your specific domain - customer service, healthcare, finance - makes a measurable difference in both accuracy and latency.

From my experience

On one enterprise deployment I managed, our STT accuracy rate looked acceptable in testing - around 94% - but in production we discovered that the specific product names and account terminology our client used were being consistently mistranscribed. The LLM was generating polite, coherent responses to the wrong question. It took us two weeks to identify STT as the root cause because we were looking at LLM outputs rather than transcripts.


What I do now: I add a transcript logging step to every deployment from day one, and I review raw transcripts alongside call recordings for the first two weeks. You cannot improve what you cannot see.
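The logging step itself can be trivially simple. A minimal sketch of the habit described above - persist the raw STT output next to what the LLM actually received, per turn. The field names and the JSONL file are my own convention, not any platform's schema:

```python
# Log each turn's raw transcript alongside the LLM's input so STT errors
# are visible instead of hiding behind polite LLM output.
import json
import time

def log_turn(call_id: str, raw_transcript: str, llm_input: str,
             path: str = "transcripts.jsonl") -> None:
    record = {
        "ts": time.time(),
        "call_id": call_id,
        "stt_raw": raw_transcript,   # what the STT engine heard
        "llm_input": llm_input,      # what the LLM was asked to answer
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
```

Reviewing these records next to call recordings is how the mistranscribed-product-names problem above would have surfaced in days instead of weeks.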

Stage 4: The language model generates a response (LLM)

The transcript from the STT engine is passed to the language model - the LLM - along with the conversation history and a system prompt that defines how the AI agent should behave. The LLM reads all of this context and generates the next response: what the AI agent should say next.

LLM latency in Voice AI is measured differently to how most people think about it. In a chat interface, a response that streams in over two or three seconds feels fast. In a phone call, the same latency is a problem - silence on a phone line triggers anxiety in callers almost immediately. This is why Voice AI platforms use streaming LLM responses: the model begins generating text token by token, and the moment the first few words are available, they are immediately piped to the next stage - text-to-speech - rather than waiting for the full response to complete. This technique, called streaming or partial synthesis, is responsible for much of the latency improvement in modern Voice AI systems.
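The buffering logic behind partial synthesis can be sketched in a few lines: accumulate streamed tokens and flush a chunk to TTS at each sentence boundary instead of waiting for the full reply. The punctuation-based flush rule here is deliberately simplified, and `tokens` stands in for whatever streaming API your LLM provider exposes:

```python
# Group a token stream into sentence-sized chunks for immediate TTS handoff.
from typing import Iterable, Iterator

def sentence_chunks(tokens: Iterable[str]) -> Iterator[str]:
    buffer = ""
    for tok in tokens:
        buffer += tok
        if buffer.rstrip().endswith((".", "?", "!")):
            yield buffer.strip()   # hand this chunk to TTS immediately
            buffer = ""
    if buffer.strip():
        yield buffer.strip()       # flush whatever remains at stream end
```

The caller starts hearing the first sentence while the model is still generating the second - that overlap is where most of the perceived latency win comes from.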

The LLM stage is also where function calls happen. If the caller asks to check an order status, book an appointment, or look up an account balance, the LLM generates a function call - a structured instruction to retrieve data from an external system. That API call takes additional time, and if it is slow, it adds directly to the caller's wait. Building fast fallback responses ("let me pull that up for you...") into the system prompt is a practical way to bridge the gap while the data loads.
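One way to implement that bridge is a timeout race: if the tool call has not returned within a short deadline, speak the filler line, then finish the lookup. A minimal asyncio sketch - `lookup` and `speak` are placeholder callables, not a real platform API:

```python
# Speak a bridging line if the external API call exceeds `filler_after`
# seconds, so the caller never sits in unexplained silence.
import asyncio

async def answer_with_filler(lookup, speak, filler_after: float = 0.8):
    task = asyncio.ensure_future(lookup())
    try:
        # shield() keeps the lookup running even if the wait times out
        return await asyncio.wait_for(asyncio.shield(task), timeout=filler_after)
    except asyncio.TimeoutError:
        await speak("Let me pull that up for you...")  # bridge the silence
        return await task                              # then finish the lookup
```

Fast lookups return silently; slow ones get the filler for free, which is usually the cheapest latency fix in the whole pipeline.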

"The LLM is the part of the stack that gets 90% of the attention in a demo. It is also the part that is rarely the bottleneck in production. The bottlenecks are almost always in the stages before and after it."


- Something I say in every Voice AI architecture review

Stage 5: Text-to-speech converts the response to audio (TTS)

The LLM's text output is passed - in streaming chunks - to a text-to-speech engine. The TTS engine converts the words into an audio waveform using a pre-trained voice model. Popular TTS providers in Voice AI include ElevenLabs, PlayHT, Cartesia, and OpenAI TTS. Each has different trade-offs across voice naturalness, language support, latency, and cost per character.

TTS quality has improved dramatically in the last two years. The gap between a synthetic voice and a human voice is now small enough that many callers do not notice it in normal conversation. The remaining tells are subtle: unnatural stress patterns on unusual words, slightly mechanical rhythm on long technical strings like reference numbers, and occasional mispronunciation of domain-specific terminology. Good TTS configuration - phoneme corrections, SSML markup for pauses and emphasis, and voice selection testing - closes most of these gaps.

TTS latency - the time between receiving text and returning audio - is typically 100 to 300 milliseconds for a well-optimised pipeline. Like the STT stage, this varies significantly under load and between providers. The streaming approach mentioned in Stage 4 is particularly important here: the first audio chunk can begin playing to the caller while the LLM is still generating the rest of the sentence, shaving 200 to 400 milliseconds off perceived latency.

Stage 6: Audio returns to the caller

The audio generated by the TTS engine is encoded and sent back over the RTP stream to the caller's phone. This is the reverse of Stage 2 - the same real-time audio transport protocol, now carrying the AI's voice back out to the PSTN and through to the caller's handset.

Network jitter - small irregularities in packet arrival timing - can cause the caller's audio to sound choppy or robotic at this stage, even if everything earlier in the pipeline worked perfectly. This is why Voice AI platforms use jitter buffers: small queues that smooth out irregular packet delivery by introducing a tiny, controlled delay. Too small a buffer and audio is choppy; too large and latency increases.
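The reorder-and-delay idea at the heart of a jitter buffer fits in a few lines. Real RTP jitter buffers are adaptive and timestamp-driven; this toy version just holds packets until a fixed depth is reached, then releases them in sequence order:

```python
# Toy jitter buffer: absorb out-of-order arrival by queueing `depth` packets
# before releasing any, then always releasing in sequence-number order.
import heapq

class JitterBuffer:
    def __init__(self, depth: int = 3):
        self.depth = depth
        self.heap: list[tuple[int, bytes]] = []

    def push(self, seq: int, payload: bytes) -> None:
        heapq.heappush(self.heap, (seq, payload))

    def pop_ready(self) -> list[bytes]:
        """Release the oldest packets once the buffer exceeds `depth`."""
        out = []
        while len(self.heap) > self.depth:
            out.append(heapq.heappop(self.heap)[1])
        return out
```

The `depth` parameter is the trade-off from the paragraph above made explicit: more depth absorbs more jitter but adds delay on every single packet.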

The call then continues in a loop - the caller speaks again, and the entire six-stage process repeats for every turn of the conversation. A typical customer service call might involve eight to twelve full turns. That is eight to twelve complete passes through the entire pipeline, each one needing to complete in under 500 milliseconds to feel natural throughout.

Typical latency budget per turn

| Stage | Target | Problem zone |
|---|---|---|
| SIP handshake | 50–150ms | >300ms |
| VAD (end of speech detection) | 20–80ms | >200ms |
| Speech-to-text (STT) | 100–200ms | >500ms |
| LLM first token (streaming) | 100–250ms | >600ms |
| Text-to-speech (first chunk) | 100–300ms | >500ms |
| Total end-to-end | <500ms | >800ms |
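One detail worth noticing in this budget: the per-turn stage maxima sum to more than the 500ms total target, so not every stage can sit at the top of its range on the same turn. A quick arithmetic check (SIP setup happens once per call, so it is excluded from the per-turn sum):

```python
# Per-turn latency budget from the table above: (target_low, target_high) in ms.
BUDGET_MS = {
    "vad": (20, 80),
    "stt": (100, 200),
    "llm_first_token": (100, 250),
    "tts_first_chunk": (100, 300),
}

best = sum(low for low, _ in BUDGET_MS.values())     # every stage at its floor
worst = sum(high for _, high in BUDGET_MS.values())  # every stage at its ceiling
```

Best case is 320ms, comfortably under target; worst case is 830ms, already in the problem zone - which is why per-stage monitoring matters more than any single end-to-end number.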

Why understanding the call flow matters for PMs and buyers

Most Voice AI failures are blamed on the wrong component. A call that sounds choppy is blamed on the AI model. A long pause before a response is attributed to the LLM being "slow". A caller who says "the AI didn't understand me" triggers a conversation about LLM context windows. In reality, these three symptoms are more often caused by network jitter, VAD misconfiguration, and STT accuracy issues respectively.

If you understand the call flow, you can ask the right diagnostic questions. When a client says "the AI sounds unnatural", you now know to check: Is it the TTS voice selection? Is it SSML configuration? Is it audio codec degradation in the RTP stream? Is it latency making pauses sound mechanical? Each of those has a different fix, and getting to the right one quickly is what separates a PM who can manage these projects from one who cannot.

For buyers evaluating Voice AI platforms, this architecture also gives you sharper questions for vendor demos. Ask: what STT providers do you support and what are the latency benchmarks? How does your platform handle VAD tuning for different acoustic environments? Can I bring my own SIP trunk, or am I locked into your carrier? What is your end-to-end latency P95 - meaning the 95th percentile, not the average? These questions reveal far more about platform quality than any demo script.

Platform I recommend for Voice AI deployments

  Vapi - Voice AI Platform
  Full pipeline control · Bring your own STT / LLM / TTS · Native SIP · Real-time latency <500ms · Pay per minute
  Vapi is the platform I recommend when a team needs visibility and control over each stage of the call pipeline. You can swap in your own STT provider, LLM, and TTS engine independently - which means you can optimise each stage for latency or accuracy without being locked into one vendor's choices. For teams who have read this post and want to tune the pipeline themselves rather than accept default settings, Vapi is the right starting point.
  <a href="https://vapi.ai?via=priyanka" rel="nofollow sponsored">Try Vapi free</a>
  <span>affiliate link</span>

What to do with this knowledge

Every Voice AI project goes better when the team understands the full call flow before they start building. Not because everyone needs to be a telephony engineer, but because a shared mental model of what happens between dial and response makes every conversation - with engineers, with clients, with vendors - more precise and more productive.

The next time a Voice AI call does not behave the way you expected, walk through the six stages in order. Is the SIP handshake completing cleanly? Is VAD triggering at the right moment? Is the STT transcript accurate? Is the LLM receiving the right context? Is TTS returning audio fast enough? Is the RTP stream stable? One of those questions will point you to the answer faster than any amount of staring at LLM outputs.

The call flow is not magic. It is six systems working in sequence, each one measurable and improvable. Once you can see it clearly, you can build it well.

Want more plain-English Voice AI guides?


I publish every week on Voice AI platforms, SIP telephony, and what it actually looks like to ship these systems in production - written from inside the industry, not from a distance.

<a href="https://www.voiceaipm.com/p/about-voice-ai-insider-by-priyanka.html">About this blog</a>
<a href="https://www.voiceaipm.com/p/contact-voice-ai-insider.html">Get in touch</a>








Tags: Voice AI · SIP telephony · Call flow · STT · TTS · Latency · Technical explainer


Priyanka
Senior Voice AI PM  ·  Voice AI Insider
I work daily on SIP telephony integrations and Voice AI orchestration for enterprise clients. This blog is the resource I wish had existed when I started. I write about what actually happens when Voice AI meets the real world.

  <a href="https://www.linkedin.com/in/-priyanka/">LinkedIn</a>
  <span>·</span>
  <a href="mailto:mcapriya11@gmail.com">mcapriya11@gmail.com</a>
