Voice-based agents are becoming a powerful, more human and intuitive way for users to interact with AI. I'm currently exploring how to build production-ready AI voice agents, and I'm excited to share some key practices and insights here.
We began by developing a conversational agent capable of choosing context-aware responses: essentially, making it say things the way I would respond in different scenarios. This was made possible through an agentic workflow that drives the decision-making logic of the system.
Then we added:
- Speech-to-Text to understand users
- Agentic workflow to decide what to say
- Text-to-Speech to speak in my own voice.
Building Blocks of a Voice Agent
Voice Activity Detection (VAD) : Detects when a user is speaking, ensuring the system knows when to pay attention.
End-of-Turn Detection : Detects when the user is done speaking. This is tricky, as humans often pause mid-sentence.
Speech-to-Text (STT) : Transcribes the spoken words into text for the agent to process.
LLM / Agentic Workflow : Processes the text to decide what the AI should say next, often powered by GPT-like models.
Text-to-Speech (TTS) : Converts the AI's response from text back into speech for the user to hear.
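To make the flow concrete, here is a minimal sketch of one conversational turn through these blocks. The `transcribe`, `decide_reply`, and `synthesize` functions are hypothetical stubs standing in for whichever STT, LLM, and TTS providers you pick:

```python
# Conceptual sketch of one voice-agent turn: STT -> LLM -> TTS.
# transcribe(), decide_reply() and synthesize() are hypothetical
# stubs for your chosen STT, LLM and TTS providers.

def transcribe(audio_chunk: bytes) -> str:
    """STT: spoken audio -> text (stub)."""
    raise NotImplementedError

def decide_reply(user_text: str, history: list[str]) -> str:
    """LLM / agentic workflow: decide what to say next (stub)."""
    raise NotImplementedError

def synthesize(reply_text: str) -> bytes:
    """TTS: text -> audio to play back (stub)."""
    raise NotImplementedError

def handle_turn(audio_chunk: bytes, history: list[str]) -> bytes:
    """Run one conversational turn through the pipeline."""
    user_text = transcribe(audio_chunk)
    history.append(user_text)
    reply = decide_reply(user_text, history)
    history.append(reply)
    return synthesize(reply)
```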
Why is Latency so important?
Latency refers to the delay between a user's action and the system's response.
Especially in voice conversation, even a one-second lag can make the exchange feel awkward and unnatural. This is why minimising latency is so important.
To deal with latency and keep the communication smooth and near real-time, technologies like LiveKit and WebRTC come into the picture.
The Core Challenge
In voice interactions, latency can significantly impact user experience. The goal is to achieve:
- Low Latency: Quick responses to maintain conversational flow.
- High Reliability : Accurate and complete data transmission.
Achieving both simultaneously is challenging, necessitating a careful selection of communication protocols.
Networking Protocols : TCP vs. UDP
- TCP (Transmission Control Protocol) : Guarantees that all data packets arrive and in order, but retransmitting lost packets introduces delays. That makes it less than ideal for real-time voice applications where speed is paramount.
- UDP (User Datagram Protocol) : Delivers packets quickly without waiting for lost ones to be resent. Some packets may be dropped or arrive out of order, but that makes it better suited to real-time applications like voice agents, where occasional data loss is an acceptable price for faster responses.
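To see why UDP is the faster option, here is a minimal fire-and-forget sketch using Python's standard socket module (the server address is a placeholder):

```python
import socket

# UDP socket: no handshake, no retransmission, no ordering guarantees.
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

AUDIO_SERVER = ("127.0.0.1", 5004)  # placeholder host/port

def send_audio_frame(frame: bytes) -> None:
    # sendto() returns immediately; a dropped frame is simply gone,
    # which is acceptable for live audio where a late frame is useless anyway.
    sock.sendto(frame, AUDIO_SERVER)
```

This is exactly the trade described above: speed first, delivery guarantees second.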
How to make UDP Usable?
UDP is like a fast but no-rules delivery person.
While UDP offers speed, it is low-level and complex to implement directly. Higher-level protocols built on UDP provide more accessible solutions.
HTTP : Built on TCP, not ideal for continuous data streaming.
WebSocket : Also built on TCP, better than HTTP for real-time communication but still inherits TCP's latency issues.
WebRTC (Web Real-Time Communication) : Built on UDP, designed for real-time audio and video communication.
It offers low-latency, peer-to-peer communication, making it an ideal choice for voice agents.
Leveraging LiveKit for Simplified Integration
Implementing WebRTC can be complex, but tools like LiveKit simplify the process :
Client-Side: Provides SDKs for various platforms to establish persistent connections and stream audio seamlessly.
Server-Side: Offers frameworks to build and scale AI voice agents efficiently.
By integrating LiveKit, developers can focus on building the voice-agent's capabilities without getting bogged down by the intricacies of real-time communication protocols.
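As a rough sketch, wiring these pieces together with the LiveKit Agents Python SDK looks something like the following. The class and plugin names here reflect one version of the SDK and the Deepgram/OpenAI/Silero plugins; the API evolves, so treat this as illustrative rather than canonical:

```python
from livekit.agents import AutoSubscribe, JobContext, WorkerOptions, cli
from livekit.agents.pipeline import VoicePipelineAgent
from livekit.plugins import deepgram, openai, silero  # example plugins


async def entrypoint(ctx: JobContext):
    # Join the LiveKit room, subscribing only to audio tracks.
    await ctx.connect(auto_subscribe=AutoSubscribe.AUDIO_ONLY)

    # Assemble the voice pipeline: VAD + STT + LLM + TTS.
    agent = VoicePipelineAgent(
        vad=silero.VAD.load(),
        stt=deepgram.STT(),
        llm=openai.LLM(),
        tts=openai.TTS(),
    )
    agent.start(ctx.room)


if __name__ == "__main__":
    cli.run_app(WorkerOptions(entrypoint_fnc=entrypoint))
```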
Understanding Turn-taking in Conversations
In human interactions, we instinctively know when to speak and when to listen. Replicating this in AI voice agents is challenging.
The system must determine:
- When the user has finished speaking, to avoid interrupting or responding too early.
- When the user is pausing, to prevent cutting off mid-thought.
To address this, the architecture incorporates:
Voice Activity Detection (VAD) : Monitors audio signals to detect the presence or absence of speech.
Semantic Turn Detection : Analyzes the content of the speech to understand if the user has completed their thought.
These components work together to ensure the AI responds at appropriate times, enhancing the fluidity of the conversation.
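A simplified sketch of the VAD half of this: treat the turn as over only after a sustained pause, not at the first instant of silence. The `webrtcvad` package is a real library, but the frame size and 700 ms threshold below are illustrative assumptions, and a production system would combine this timer with semantic turn detection on the transcript:

```python
import webrtcvad  # pip install webrtcvad

vad = webrtcvad.Vad(2)     # aggressiveness 0 (least) to 3 (most)
SAMPLE_RATE = 16000        # webrtcvad supports 8/16/32/48 kHz
FRAME_MS = 30              # frames must be 10, 20 or 30 ms long
END_OF_TURN_MS = 700       # illustrative pause threshold

def wait_for_end_of_turn(frames) -> None:
    """Consume 30 ms chunks of 16-bit mono PCM until a sustained pause."""
    silence_ms = 0
    for frame in frames:
        if vad.is_speech(frame, SAMPLE_RATE):
            silence_ms = 0           # any speech resets the pause timer
        else:
            silence_ms += FRAME_MS
        if silence_ms >= END_OF_TURN_MS:
            return                   # pause long enough: turn is over
```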
Handling Interruptions Gracefully
Conversations are dynamic, and users may interrupt the AI for various reasons. The system is designed to :
Detect interruptions using VAD.
Halt the current response generation process.
Clear any queued responses.
Allow the user to take control of the conversation seamlessly.
This ensures that the AI remains responsive and adaptable to the user's needs.
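One common way to implement this (a sketch, not the only approach) is to run the agent's speech as a cancellable asyncio task and cancel it the moment VAD reports user speech:

```python
import asyncio


class InterruptHandler:
    """Cancels in-flight agent speech when the user starts talking."""

    def __init__(self) -> None:
        self._speaking_task: asyncio.Task | None = None
        self._queued_replies: list[str] = []

    def start_speaking(self, playback_coro) -> None:
        # Run TTS playback as a task so it can be cancelled mid-utterance.
        self._speaking_task = asyncio.create_task(playback_coro)

    def on_user_speech(self) -> None:
        # Halt the current response generation/playback...
        if self._speaking_task and not self._speaking_task.done():
            self._speaking_task.cancel()
        # ...and clear queued responses so the user takes the floor.
        self._queued_replies.clear()
```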
Modular System Design
This architecture is modular, meaning each component is independent but works together with the others.
This brings flexibility, since parts can be swapped or upgraded without rebuilding the whole system, and scalability, since individual modules can be replicated or expanded to serve more users.
Key components:
- STT - Converts spoken audio into text.
- NLP - Understands the text and decides how to respond.
- TTS - Converts the agent's reply back to speech.
- VAD - Helps detect when a user starts/stops speaking.
- Interrupt Handler - Stops responses if the user interrupts.
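One way to keep these modules independent and swappable is to code against small interfaces rather than concrete services. A sketch using Python's `typing.Protocol` (the interface names are my own, not from any particular framework):

```python
from typing import Protocol


class STT(Protocol):
    def transcribe(self, audio: bytes) -> str: ...

class NLP(Protocol):
    def respond(self, text: str) -> str: ...

class TTS(Protocol):
    def synthesize(self, text: str) -> bytes: ...


class VoiceAgent:
    """Depends only on the interfaces, so any backend can be swapped in."""

    def __init__(self, stt: STT, nlp: NLP, tts: TTS) -> None:
        self.stt, self.nlp, self.tts = stt, nlp, tts

    def handle(self, audio: bytes) -> bytes:
        text = self.stt.transcribe(audio)
        reply = self.nlp.respond(text)
        return self.tts.synthesize(reply)
```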
LiveKit - Real-Time Infrastructure
The system needs to work live, so LiveKit is integrated to handle real-time audio streaming, ensuring low-latency, high-quality communication while managing audio routing between users and agents smoothly.
Two Main Architectures : Pipeline vs. Speech-to-Speech
| Aspect | Pipeline Approach | Speech-to-Speech (Real-Time) Approach |
| --- | --- | --- |
| Structure | STT → LLM → TTS (sequential processing) | End-to-end speech-to-speech conversion |
| Advantages | Greater control over components; easier to debug and optimize | Simpler to implement; more natural-sounding interactions |
| Trade-offs | Higher latency due to multiple steps; more complex integration | Less transparency; difficult to customize/intervene in specific stages |
Balancing Trade-offs: Latency, Quality and Cost
Designing an effective voice agent involves making strategic decisions:
Latency : Critical for real-time interactions; lower latency enhances the user experience.
Quality : High-quality responses require advanced models, which may increase processing time.
Cost: More sophisticated models and infrastructure can lead to higher operational costs.
Latency Benchmarks
Hitting the following targets helps ensure a responsive and natural conversational experience.
| Component | Target Latency |
| --- | --- |
| VAD | ~15-20 ms |
| STT | Real-time |
| LLM | < 300 ms |
| TTS | < 200 ms |
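To verify where time actually goes, a simple per-stage timer is enough to start (a sketch; the stage calls in the usage comment are placeholders):

```python
import time
from contextlib import contextmanager


@contextmanager
def stage_timer(name: str, budget_ms: float):
    """Times a pipeline stage and flags it when it blows its latency budget."""
    start = time.perf_counter()
    yield
    elapsed_ms = (time.perf_counter() - start) * 1000
    status = "OK" if elapsed_ms <= budget_ms else "OVER BUDGET"
    print(f"{name}: {elapsed_ms:.1f} ms (budget {budget_ms:.0f} ms) {status}")


# Usage (placeholder stage functions):
# with stage_timer("LLM", 300):
#     reply = decide_reply(user_text, history)
# with stage_timer("TTS", 200):
#     audio = synthesize(reply)
```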
Conclusion
Learning how to build AI voice agents has been a fascinating dive into the real-world mechanics behind voice-based systems. From understanding key components like STT, LLM and TTS to comparing pipeline and end-to-end architectures, I realised how much goes into making conversations feel natural. Optimising for low latency is crucial to ensure smooth, real-time interactions.