<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Mohammad</title>
    <description>The latest articles on DEV Community by Mohammad (@mohammad_palla).</description>
    <link>https://dev.to/mohammad_palla</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3387529%2F9fbe3362-a9a2-43bf-91c5-7d2550a600b4.jpeg</url>
      <title>DEV Community: Mohammad</title>
      <link>https://dev.to/mohammad_palla</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/mohammad_palla"/>
    <language>en</language>
    <item>
      <title>What Makes Real-Time Voice AI Agents Feel Real</title>
      <dc:creator>Mohammad</dc:creator>
      <pubDate>Sat, 06 Sep 2025 11:41:43 +0000</pubDate>
      <link>https://dev.to/mohammad_palla/what-makes-real-time-voice-ai-agents-feel-real-22j2</link>
      <guid>https://dev.to/mohammad_palla/what-makes-real-time-voice-ai-agents-feel-real-22j2</guid>
      <description>&lt;p&gt;&lt;strong&gt;She interrupted me.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Mid sentence.&lt;/p&gt;

&lt;p&gt;And weirdly… I loved it.&lt;/p&gt;

&lt;p&gt;Not because I enjoy being cut off, but because for the first time, an AI assistant felt &lt;strong&gt;human&lt;/strong&gt; enough to jump into the conversation.&lt;/p&gt;

&lt;p&gt;That’s the magic of &lt;strong&gt;real-time voice AI.&lt;/strong&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  The Story Behind the Silence
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Turn-Based Voice AI&lt;/strong&gt; feels like a classroom: you speak, then the AI waits silently until you’re done. Only then does it think, respond, and speak. Predictable, but awkward.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Real-Time Voice AI&lt;/strong&gt;, however, listens and responds &lt;em&gt;as you speak&lt;/em&gt;. It interrupts to clarify, builds anticipation, and makes the interaction feel alive. It’s not just hearing you; it’s &lt;em&gt;conversing&lt;/em&gt; with you.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  What Makes Real-Time Feel Real?
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Turn-Based Flow&lt;/th&gt;
&lt;th&gt;Real-Time Flow&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;STT&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Waits for full sentence before transcribing.&lt;/td&gt;
&lt;td&gt;Streams partial transcriptions (chunks) on the fly.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;LLM&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Starts after transcription completes.&lt;/td&gt;
&lt;td&gt;Begins processing as soon as partial input arrives.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;TTS&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Generates full output before speaking.&lt;/td&gt;
&lt;td&gt;Speaks as soon as first tokens are ready.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;UX&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Delayed, segmented.&lt;/td&gt;
&lt;td&gt;Smooth, conversational, anticipatory.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;But under the hood? It’s &lt;strong&gt;orchestration chaos&lt;/strong&gt;: managing barge-in detection, aligning streams, handling interruptions, and keeping latency under one second.&lt;/p&gt;
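Barge-in is the trickiest part of that chaos: the moment voice activity detection hears the user, the agent’s own speech has to be cancelled mid-stream. A toy sketch of that cancellation logic (illustrative only, not any real voice SDK’s API):

```python
import asyncio

async def speak(text: str, cancel: asyncio.Event) -> list[str]:
    # Play TTS word by word; stop immediately if the user barges in.
    spoken = []
    for word in text.split():
        if cancel.is_set():
            break  # user started talking: drop the rest of the utterance
        spoken.append(word)
        await asyncio.sleep(0)  # stand-in for audio playback time
    return spoken

async def demo() -> list[str]:
    cancel = asyncio.Event()
    task = asyncio.create_task(speak("let me explain this in great detail", cancel))
    await asyncio.sleep(0)  # user speech arrives mid-utterance...
    cancel.set()            # ...so the VAD fires and we cut the agent off
    return await task

print(asyncio.run(demo()))  # only a prefix of the sentence is "spoken"
```

Production systems layer echo cancellation and VAD confidence thresholds on top of this, but the core idea is the same: playback must be interruptible at chunk granularity.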




&lt;h3&gt;
  
  
  When to Pick Which?
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Turn-Based&lt;/strong&gt; (classic STT → LLM → TTS pipeline):&lt;/p&gt;

&lt;p&gt;✅ Easier to build and debug.&lt;/p&gt;

&lt;p&gt;❌ Feels robotic, with 0.7 to 3 s delays.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Real-Time (Speech-to-Speech)&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;✅ Natural, fluid, human-like.&lt;/p&gt;

&lt;p&gt;❌ Architecturally complex, less modular.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;




&lt;h3&gt;
  
  
  In Practice
&lt;/h3&gt;

&lt;p&gt;Modern systems still rely on &lt;strong&gt;STT → NLP → TTS&lt;/strong&gt;, but optimized with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Streaming ASR&lt;/strong&gt; (&amp;lt;300 ms)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Low-latency inference&lt;/strong&gt; (&amp;lt;500 ms)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Chunked TTS&lt;/strong&gt; (&amp;lt;200 ms to first audio)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Done right, the whole pipeline feels instant.&lt;/p&gt;
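The three latency budgets above compose because the stages stream into one another rather than running back to back. A minimal sketch of that overlap using chained async generators (toy stand-ins, not a real ASR/LLM/TTS stack):

```python
import asyncio

async def stt_stream(audio_chunks):
    # Streaming ASR stand-in: emit partial transcripts as audio arrives,
    # instead of waiting for the full utterance.
    for chunk in audio_chunks:
        await asyncio.sleep(0)  # stand-in for per-chunk model latency
        yield chunk

async def llm_stream(partials):
    # Low-latency LLM stand-in: start generating as soon as the first
    # partial transcript arrives.
    async for text in partials:
        yield f"reply({text})"

async def tts_stream(tokens):
    # Chunked TTS stand-in: "speak" each token as it is ready rather than
    # buffering the whole response.
    spoken = []
    async for tok in tokens:
        spoken.append(tok)
    return spoken

async def pipeline(audio_chunks):
    # Chained generators: the first audio chunk can reach TTS before the
    # last chunk has even been recorded.
    return await tts_stream(llm_stream(stt_stream(audio_chunks)))

print(asyncio.run(pipeline(["hel", "lo"])))  # → ['reply(hel)', 'reply(lo)']
```

The point is structural: no stage waits for its upstream neighbor to finish, which is what keeps time-to-first-audio inside the budgets listed above.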




&lt;h3&gt;
  
  
  TL;DR
&lt;/h3&gt;

&lt;p&gt;Turn-based AI listens.&lt;/p&gt;

&lt;p&gt;Real-time AI &lt;em&gt;converses&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;And that tiny shift from waiting to weaving makes the difference between talking &lt;em&gt;to&lt;/em&gt; a machine and talking &lt;em&gt;with&lt;/em&gt; one.&lt;/p&gt;

&lt;p&gt;I write more posts like this on my &lt;a href="https://blog.palla.co.in" rel="noopener noreferrer"&gt;blog&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>voiceai</category>
      <category>ai</category>
      <category>chatgpt</category>
      <category>learning</category>
    </item>
    <item>
      <title>How I Built a Talking, Knowledgeable AI Sidekick (and How You Can Build a Voice AI RAG Agent, Too)</title>
      <dc:creator>Mohammad</dc:creator>
      <pubDate>Fri, 25 Jul 2025 10:40:26 +0000</pubDate>
      <link>https://dev.to/mohammad_palla/how-i-built-a-talking-knowledgeable-ai-sidekick-and-how-you-can-too-build-a-voice-ai-rag-agent--53k0</link>
      <guid>https://dev.to/mohammad_palla/how-i-built-a-talking-knowledgeable-ai-sidekick-and-how-you-can-too-build-a-voice-ai-rag-agent--53k0</guid>
      <description>&lt;h2&gt;
  
  
  A Story of Code and a Chatty Voice AI Agent That Actually Knows Your Docs
&lt;/h2&gt;

&lt;h3&gt;
  Chapter 1: The Dream
&lt;/h3&gt;

&lt;p&gt;It all started on a rainy afternoon. I was talking to my computer (as one does when working remotely) and realized: wouldn’t it be cool if my computer could actually listen, understand, and answer me with real knowledge from my own files? Not just “Hey Siri, what’s the weather?” but “Hey AI, what’s in my project docs?” or “Remind me what the HR policy says about bringing cats to work.”&lt;/p&gt;

&lt;p&gt;And so the quest began: I would build a Voice AI RAG agent! (That’s Retrieval-Augmented Generation, but let’s just call it “RAG” because it sounds like a pirate.)&lt;/p&gt;

&lt;h3&gt;
  Chapter 2: The Ingredients
&lt;/h3&gt;

&lt;p&gt;Before you can summon your own digital sidekick, you’ll need a few magical artifacts:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Python 3.11+ (the spellbook)

Cartesia (for making your AI talk like a human, not a fax machine)

AssemblyAI (so your AI can understand your voice, even if you mumble)

Anthropic Claude (the brain—OpenAI is cool, but Claude is the new wizard in town)

LiveKit (for real-time voice rooms, so your AI can join you in a virtual “room”)

A pile of your own documents (so your AI knows your world)

API keys (the secret runes—don’t lose them!)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  Chapter 3: The Spell (a.k.a. The Code)
&lt;/h3&gt;

&lt;p&gt;Here’s the full incantation. Don’t worry, I’ll explain every part after you read it. (Copy, paste, and prepare to be amazed!)&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import logging
import os

from dotenv import load_dotenv
from livekit.agents import JobContext, JobProcess, WorkerOptions, cli
from livekit.agents.job import AutoSubscribe
from livekit.agents.llm import ChatContext
from livekit.agents.pipeline import VoicePipelineAgent
from livekit.plugins import cartesia, silero, llama_index, assemblyai

from llama_index.llms.anthropic import Anthropic
from llama_index.core import (
    SimpleDirectoryReader,
    StorageContext,
    VectorStoreIndex,
    load_index_from_storage,
    Settings,
)
from llama_index.core.chat_engine.types import ChatMode
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

load_dotenv()

logger = logging.getLogger("voice-assistant")

# Set up the embedding model and LLM
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
llm = Anthropic(model="claude-3-haiku-20240307", max_tokens=512)
Settings.llm = llm
Settings.embed_model = embed_model

# Check if storage already exists
PERSIST_DIR = "./chat-engine-storage"
if not os.path.exists(PERSIST_DIR):
    # Load the documents and create the index
    documents = SimpleDirectoryReader("docs").load_data()
    index = VectorStoreIndex.from_documents(documents)
    # Store it for later
    index.storage_context.persist(persist_dir=PERSIST_DIR)
else:
    # Load the existing index
    storage_context = StorageContext.from_defaults(persist_dir=PERSIST_DIR)
    index = load_index_from_storage(storage_context)


def prewarm(proc: JobProcess):
    proc.userdata["vad"] = silero.VAD.load()


async def entrypoint(ctx: JobContext):
    chat_context = ChatContext().append(
        role="system",
        text=(
            "You are a funny, witty assistant. "
            "Respond with short and concise answers. "
            "Avoid using unpronounceable punctuation or emojis."
        ),
    )

    chat_engine = index.as_chat_engine(chat_mode=ChatMode.CONTEXT)

    logger.info(f"Connecting to room {ctx.room.name}")
    await ctx.connect(auto_subscribe=AutoSubscribe.AUDIO_ONLY)

    participant = await ctx.wait_for_participant()
    logger.info(f"Starting voice assistant for participant {participant.identity}")

    agent = VoicePipelineAgent(
        vad=ctx.proc.userdata["vad"],
        stt=assemblyai.STT(),
        llm=llama_index.LLM(chat_engine=chat_engine),
        tts=cartesia.TTS(
            model="sonic-2",
            voice="bf0a246a-8642-498a-9950-80c35e9276b5",
        ),
        chat_ctx=chat_context,
    )

    agent.start(ctx.room, participant)

    await agent.say(
        "Hey there! How can I help you today?",
        allow_interruptions=True,
    )


if __name__ == "__main__":
    print("Starting voice agent with Anthropic...")
    cli.run_app(
        WorkerOptions(
            entrypoint_fnc=entrypoint,
            prewarm_fnc=prewarm,
        ),
    )
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;h3&gt;
  Chapter 4: The Magic Explained
&lt;/h3&gt;

&lt;p&gt;Let’s break down this spellbook, step by step:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Imports and Setup&lt;/strong&gt;&lt;br&gt;
We import all the libraries:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;livekit&lt;/code&gt; for voice rooms&lt;/li&gt;
&lt;li&gt;&lt;code&gt;cartesia&lt;/code&gt; for text-to-speech&lt;/li&gt;
&lt;li&gt;&lt;code&gt;assemblyai&lt;/code&gt; for speech-to-text&lt;/li&gt;
&lt;li&gt;&lt;code&gt;llama_index&lt;/code&gt; for RAG (so your AI can actually know things from your docs)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Anthropic&lt;/code&gt; for the LLM (the brain)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We also load environment variables with dotenv—because hardcoding API keys is a rookie mistake.&lt;/p&gt;

&lt;ol start="2"&gt;
&lt;li&gt;&lt;strong&gt;Embeddings and LLM&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
llm = Anthropic(model="claude-3-haiku-20240307", max_tokens=512)
Settings.llm = llm
Settings.embed_model = embed_model
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;The embedding model turns your docs into “AI food” (vectors).

The LLM (Claude) is the brain that answers questions using those vectors.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
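Retrieval then works by embedding the question the same way and picking the most similar document vectors. A toy sketch with made-up 3-dimensional vectors and hypothetical file names (real embeddings from bge-small-en-v1.5 have 384 dimensions):

```python
from math import sqrt

def cosine(a, b):
    # Cosine similarity: how closely two vectors point in the same direction.
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm

# Made-up 3-dimensional "embeddings" for two documents.
docs = {
    "cat-policy.md": [0.9, 0.1, 0.0],
    "hr-handbook.md": [0.1, 0.9, 0.2],
}
query = [0.8, 0.2, 0.1]  # e.g. the embedded question "can I bring my cat to work?"

# Pick the document whose vector is most similar to the query vector.
best = max(docs.items(), key=lambda kv: cosine(query, kv[1]))
print(best[0])  # → cat-policy.md
```

llama_index does this at scale (chunking, top-k retrieval, prompt stuffing), but the similarity search at its core is exactly this comparison.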

&lt;ol start="3"&gt;
&lt;li&gt;&lt;strong&gt;Document Indexing&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;PERSIST_DIR = "./chat-engine-storage"
if not os.path.exists(PERSIST_DIR):
    documents = SimpleDirectoryReader("docs").load_data()
    index = VectorStoreIndex.from_documents(documents)
    index.storage_context.persist(persist_dir=PERSIST_DIR)
else:
    storage_context = StorageContext.from_defaults(persist_dir=PERSIST_DIR)
    index = load_index_from_storage(storage_context)
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;If you haven’t indexed your docs before, it reads everything in docs/ and builds a knowledge base.

If you have, it loads the existing index (so it doesn’t have to re-read your 500-page PDF every time).
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;ol start="4"&gt;
&lt;li&gt;&lt;strong&gt;Voice Activity Detection (VAD)&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def prewarm(proc: JobProcess):
    proc.userdata["vad"] = silero.VAD.load()
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;This makes sure your AI only listens when you’re actually talking, not when you’re yelling at your cat.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;ol start="5"&gt;
&lt;li&gt;&lt;strong&gt;The Entrypoint: Where the Magic Happens&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;async def entrypoint(ctx: JobContext):
    chat_context = ChatContext().append(
        role="system",
        text=(
            "You are a funny, witty assistant. "
            "Respond with short and concise answers. "
            "Avoid using unpronounceable punctuation or emojis."
        ),
    )
    chat_engine = index.as_chat_engine(chat_mode=ChatMode.CONTEXT)
    ...
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Sets the “personality” of your AI (witty, concise, no weird punctuation).

Prepares the chat engine with your indexed docs.

logger.info(f"Connecting to room {ctx.room.name}")
await ctx.connect(auto_subscribe=AutoSubscribe.AUDIO_ONLY)
participant = await ctx.wait_for_participant()
logger.info(f"Starting voice assistant for participant {participant.identity}")

Connects to a LiveKit room (more on this soon).

Waits for a participant (that’s you!) to join.

agent = VoicePipelineAgent(
    vad=ctx.proc.userdata["vad"],
    stt=assemblyai.STT(),
    llm=llama_index.LLM(chat_engine=chat_engine),
    tts=cartesia.TTS(
        model="sonic-2",
        voice="bf0a246a-8642-498a-9950-80c35e9276b5",
    ),
    chat_ctx=chat_context,
)
agent.start(ctx.room, participant)
await agent.say(
    "Hey there! How can I help you today?",
    allow_interruptions=True,
)

Sets up the full voice pipeline: listens, understands, thinks, and talks back.

Greets you with a friendly message.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;ol start="6"&gt;
&lt;li&gt;&lt;strong&gt;The Main Event&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;if __name__ == "__main__":
    print("Starting voice agent with Anthropic...")
    cli.run_app(
        WorkerOptions(
            entrypoint_fnc=entrypoint,
            prewarm_fnc=prewarm,
        ),
    )
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;h3&gt;
  Chapter 5: Summoning Your AI (a.k.a. Running the Code)
&lt;/h3&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Install your dependencies (see requirements.txt).

Put your API keys in a .env file:

ANTHROPIC_API_KEY=your_anthropic_key
ASSEMBLYAI_API_KEY=your_assemblyai_key
CARTESIA_API_KEY=your_cartesia_api_key
LIVEKIT_URL=your_livekit_url
LIVEKIT_API_KEY=your_livekit_api_key
LIVEKIT_API_SECRET=your_livekit_api_secret

Add your documents to the docs/ folder.

Run

python voice_agent_anthropic.py start
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  Chapter 6: Entering the LiveKit Room
&lt;/h3&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;What’s a LiveKit room?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Think of it as a virtual meeting room where your AI is always waiting for you.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;How do you join?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Use the LiveKit Playground or your own LiveKit client, enter your room name, and your AI will greet you like an old friend (one who actually remembers your last conversation).&lt;/p&gt;

&lt;h3&gt;
  Chapter 7: The Result
&lt;/h3&gt;

&lt;p&gt;Now you can:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Talk to your AI: Ask questions, get answers from your own docs.

Get witty, concise responses: No more boring bots!

Impress your friends: “Yeah, my Voice AI actually knows what’s in my files.”
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;If you have any doubts, you can contact me 🙂&lt;br&gt;
Happy coding! 🚀&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>tutorial</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
