🔙 Previously: Real-World UDP: How WebRTC and DNS Use the Fast but Unreliable Protocol
In our last post, we saw how technologies like WebRTC (used in video calls) and DNS (used to find websites) depend on UDP, a fast but unreliable network protocol. Even though UDP doesn’t guarantee delivery or order, it powers some of the most important real-time systems we use every day.
In this final post of the series, we’ll bring all of that learning into action by building a real-time AI voice assistant using LiveKit — an open-source platform built on WebRTC and UDP.
Imagine talking to a smart assistant during a live call, or asking a virtual agent for help while driving — those systems need to hear you, understand you, and respond quickly, even when the network is unstable.
To make that possible, developers must design carefully over UDP, using smart techniques like buffering, retries, and low-latency media pipelines. We’ll walk through how LiveKit’s agents SDK makes this easier — and how UDP, when combined with the right tools, can help you build robust and responsive voice applications.
UDP Challenges in Voice Applications
UDP is chosen for low-latency communication, but it comes with some trade-offs:
- No retransmission: Lost packets aren’t automatically resent.
- No ordering: Packets can arrive out of order.
- No delivery guarantee: Voice data can be dropped during transmission.
These are big issues for voice-based AI systems. Imagine missing part of a user’s request: “Send an email to John” might become “…email to John” or just “Send an…”
So how do we make it work?
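Part of the answer lives below the application layer: real-time audio stacks smooth over UDP's gaps with jitter buffers and loss concealment. Here's a minimal, illustrative sketch of that idea in plain Python (the 20 ms frame size and payload format are assumptions, not a real RTP implementation):

```python
# Minimal jitter-buffer sketch: reorder packets by sequence number and conceal
# missing ones with silence. Frame size and payload format are illustrative.
SILENCE = b"\x00" * 320  # one 20 ms frame of 16-bit PCM at 8 kHz

class JitterBuffer:
    def __init__(self):
        self.pending = {}    # seq -> payload, possibly arriving out of order
        self.next_seq = 0    # next sequence number due for playout

    def push(self, seq: int, payload: bytes) -> None:
        """Store an incoming packet; packets older than the playout point are dropped."""
        if seq >= self.next_seq:
            self.pending[seq] = payload

    def pop(self) -> bytes:
        """Return the next frame for playout, substituting silence for missing packets."""
        frame = self.pending.pop(self.next_seq, SILENCE)
        self.next_seq += 1
        return frame

# Packets 0, 3, 2 arrive (out of order); packet 1 never shows up.
buf = JitterBuffer()
for seq in (0, 3, 2):
    buf.push(seq, f"frame-{seq}".encode())
print([buf.pop()[:7] for _ in range(4)])
# [b'frame-0', b'\x00\x00\x00\x00\x00\x00\x00', b'frame-2', b'frame-3']
```

A real pipeline would also delay playout by a few frames to give late packets a chance to arrive; that delay is exactly the buffering-versus-latency trade-off this post keeps coming back to.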
Enter LiveKit's Agents SDK
LiveKit provides a WebRTC-based media layer over UDP. Its Agents SDK lets you plug a Python or Node.js program into a LiveKit room as a full participant. This makes it easier to:
- Receive real-time audio
- Process it with ASR (Automatic Speech Recognition)
- Pass it to an LLM for reasoning
- Synthesize speech (TTS)
- Publish the result back into the room
Here’s a high-level look at the pipeline:
User Voice → STT (Speech-to-Text) → LLM → TTS (Text-to-Speech) → Voice Response
And yes — it’s all built on streaming media over UDP!
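Stripped of any specific SDK, the loop has this shape. Everything below is a hypothetical, self-contained sketch: the stub coroutines only stand in for real STT, LLM, and TTS services and are not LiveKit APIs:

```python
import asyncio

# Hypothetical stand-ins for real STT, LLM, and TTS services; only the shape
# of the data flow matters here, not the implementations.
async def transcribe_stream(frames):            # STT: audio frames -> utterances
    for frame in frames:
        yield f"text of {frame!r}"

async def ask_llm(text: str) -> str:            # LLM: user text -> reply text
    return f"reply to: {text}"

async def synthesize(text: str) -> bytes:       # TTS: reply text -> audio bytes
    return text.encode()

async def publish_audio(audio: bytes) -> None:  # publish back into the room (over UDP)
    print("publishing", audio)

async def voice_loop(audio_frames):
    # Naive sequential loop: hear -> understand -> speak, one stage at a time.
    async for utterance in transcribe_stream(audio_frames):
        reply_text = await ask_llm(utterance)
        reply_audio = await synthesize(reply_text)
        await publish_audio(reply_audio)

asyncio.run(voice_loop([b"frame-1", b"frame-2"]))
```

The problem with this naive sequential version is that each stage waits for the previous one to finish completely; the next two sections look at the components involved and how to overlap them.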
Real-Time Voice Agent Stack
Each component of the voice assistant stack adds latency. Here’s what we used:
- VAD (Voice Activity Detection): Detect presence/absence of human speech in audio.
- EOU (End of Turn / Utterance Detection): Detect when a speaker is done.
- STT (Speech-to-Text): Convert audio to text.
- LLM Agent: Understand and generate replies.
- TTS (Text-to-Speech): Speak the response back.
LiveKit supports streaming APIs for each of these, enabling low-latency conversation loops.
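Each stage maps to a pluggable component in the Agents SDK. As a rough sketch of an explicit pipeline configuration (assuming the silero, deepgram, openai, cartesia, and turn-detector plugin packages are installed; the providers and model names are illustrative choices, not part of the example later in this post):

```python
from livekit.agents import AgentSession
from livekit.plugins import silero, deepgram, openai, cartesia
from livekit.plugins.turn_detector.multilingual import MultilingualModel

# Sketch of an explicit VAD -> EOU -> STT -> LLM -> TTS pipeline. The providers
# and model names below are illustrative assumptions; swap in whatever you use.
session = AgentSession(
    vad=silero.VAD.load(),                # VAD: is anyone speaking right now?
    turn_detection=MultilingualModel(),   # EOU: has the speaker finished their turn?
    stt=deepgram.STT(model="nova-3"),     # STT: streaming speech-to-text
    llm=openai.LLM(model="gpt-4o-mini"),  # LLM: understand and generate replies
    tts=cartesia.TTS(),                   # TTS: streaming text-to-speech
)
```

With a speech-to-speech realtime model, as in the full example below, the separate STT and TTS stages can be omitted because the model consumes and produces audio directly.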
⏳ In natural conversation, humans expect a reply within ~236 ms on average, so keeping the pipeline's end-to-end latency low is key.
Optimizing for Latency
Based on RFC 5405 (the IETF's guidelines for UDP usage, since superseded by RFC 8085) and LiveKit's architecture, here are some best practices:
- Use streaming APIs: Avoid waiting for full input before processing (see the sketch after this list)
- Shrink model size: Use quantized LLMs for faster token generation
- Asynchronous flows: Handle audio, text, and output in parallel
- Retry logic: Implement intelligent buffering and fallback on packet loss
- Shorten replies: Use LLM prompting to create faster first responses
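To make the streaming point concrete, here is a small, self-contained asyncio sketch (the LLM and TTS stubs are hypothetical stand-ins) in which speech synthesis starts on the first complete sentence while the LLM is still producing the rest of the reply, instead of waiting for the full response:

```python
import asyncio

# Hypothetical stand-ins for a streaming LLM and a streaming TTS engine.
async def llm_tokens():
    for token in ["Sure, ", "I can ", "help. ", "What city ", "is John in?"]:
        await asyncio.sleep(0.05)   # simulate token-by-token generation
        yield token

async def speak(sentence: str) -> None:
    print(f"TTS speaking: {sentence!r}")

async def respond():
    # Start speaking each sentence as soon as it is complete, while the LLM
    # keeps generating. This shrinks time-to-first-audio compared to waiting
    # for the whole reply.
    buffer = ""
    tts_tasks = []
    async for token in llm_tokens():
        buffer += token
        if buffer.rstrip().endswith((".", "?", "!")):
            tts_tasks.append(asyncio.create_task(speak(buffer)))
            buffer = ""
    if buffer:
        tts_tasks.append(asyncio.create_task(speak(buffer)))
    await asyncio.gather(*tts_tasks)

asyncio.run(respond())
```

This is the pattern that the streaming APIs mentioned above make practical end to end.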
Full Example: LiveKit Voice Agent with Python
Here's a Python-based voice assistant using LiveKit's Agents SDK and Google's real-time model:
```python
from livekit import agents
from livekit.agents import AgentSession, Agent, RoomInputOptions
from livekit.plugins import google, noise_cancellation

# Local module defining the agent's function tools (weather, web search, email).
from tools import get_weather, search_web, send_email


class Assistant(Agent):
    def __init__(self):
        super().__init__(
            # System prompt that shapes the agent's behavior.
            instructions="You are a helpful assistant.",
            # Google's real-time (speech-to-speech) model drives the conversation.
            llm=google.beta.realtime.RealtimeModel(voice="Charon", temperature=0.8),
            # Tools the LLM can call on the user's behalf.
            tools=[get_weather, search_web, send_email],
        )


async def entrypoint(ctx: agents.JobContext):
    session = AgentSession()

    # Join the room as a full participant, with noise cancellation (and video) enabled.
    await session.start(
        room=ctx.room,
        agent=Assistant(),
        room_input_options=RoomInputOptions(
            video_enabled=True,
            noise_cancellation=noise_cancellation.BVC(),
        ),
    )

    await ctx.connect()

    # Speak a greeting as soon as the agent is connected.
    await session.generate_reply(instructions="Welcome to the AI voice assistant!")


if __name__ == "__main__":
    agents.cli.run_app(agents.WorkerOptions(entrypoint_fnc=entrypoint))
```
- Assistant class: This defines the AI logic. It uses Google’s real-time model to generate voice responses and supports external tools like get_weather, search_web, and send_email.
- AgentSession: Manages the life cycle of the agent inside a LiveKit room.
- RoomInputOptions: Enables noise cancellation and video if needed.
- generate_reply(): Produces a spoken welcome message when the agent connects.
- agents.cli.run_app(): Starts the agent as a background worker.
This setup makes the assistant act like a human participant in a LiveKit room: joining, listening, and responding in real time.
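The tools module imported at the top of the example isn't shown. As a sketch of what one of those tools could look like (assuming the function_tool decorator and RunContext from livekit.agents, with a hard-coded result standing in for a real weather API), get_weather might be defined roughly like this in tools.py:

```python
# tools.py (sketch): a function tool the LLM can call.
# The weather data is hard-coded for illustration; a real tool would call an API.
from livekit.agents import function_tool, RunContext

@function_tool()
async def get_weather(context: RunContext, city: str) -> dict:
    """Look up the current weather for a city."""
    return {"city": city, "condition": "sunny", "temperature_c": 24}
```

The docstring doubles as the tool description the LLM reads when deciding whether and how to call it.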
How UDP Makes It All Work
- Speed over reliability: For voice, it’s better to get most of the message fast than all of it late.
- Adaptive flows: Tools like LiveKit and WebRTC adjust bitrate and buffering based on network quality.
- Packet loss tolerance: Voice codecs like Opus can recover gracefully from missing packets.
- Minimal overhead: UDP avoids TCP's connection handshakes and retransmission machinery, making it ideal for real-time communication (see the sketch below).
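To make that last point concrete, here is a tiny, self-contained sketch using Python's standard socket module: the sender fires datagrams at the receiver with no handshake, no connection state, and no retransmission (the 2-byte sequence header is an RTP-like simplification):

```python
import socket
import struct

ADDR = ("127.0.0.1", 50007)  # illustrative local address

# Receiver: bind and read whatever datagrams happen to arrive.
rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
rx.bind(ADDR)
rx.settimeout(0.5)

# Sender: no connect/handshake needed; each frame is fired off independently.
tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
for seq in range(3):
    header = struct.pack("!H", seq)           # 2-byte sequence number
    tx.sendto(header + b"20ms-of-opus", ADDR)

# Read back what arrived; on a real network some datagrams may simply be missing.
while True:
    try:
        data, _ = rx.recvfrom(2048)
    except socket.timeout:
        break
    seq, = struct.unpack("!H", data[:2])
    print(f"got frame seq={seq}, {len(data) - 2} payload bytes")
```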
Final Thoughts
With the help of LiveKit, WebRTC, and a well-designed UDP stack, we can now build robust, real-time AI voice agents that feel natural to talk to. Whether it's a customer support bot, an in-car assistant, or a real-time translator, these systems can work reliably, even over an unreliable network.
UDP may not guarantee delivery, but with the right architecture and tooling, you can still deliver great experiences.
Thanks for following along with this series!