<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Mohammad</title>
    <description>The latest articles on DEV Community by Mohammad (@mohammad_palla).</description>
    <link>https://dev.to/mohammad_palla</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3387529%2F9fbe3362-a9a2-43bf-91c5-7d2550a600b4.jpeg</url>
      <title>DEV Community: Mohammad</title>
      <link>https://dev.to/mohammad_palla</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/mohammad_palla"/>
    <language>en</language>
    <item>
      <title>What Makes Real-Time Voice AI Agents Feel Real</title>
      <dc:creator>Mohammad</dc:creator>
      <pubDate>Sat, 06 Sep 2025 11:41:43 +0000</pubDate>
      <link>https://dev.to/mohammad_palla/what-makes-real-time-voice-ai-agents-feel-real-22j2</link>
      <guid>https://dev.to/mohammad_palla/what-makes-real-time-voice-ai-agents-feel-real-22j2</guid>
      <description>&lt;p&gt;&lt;strong&gt;She interrupted me.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Mid sentence.&lt;/p&gt;

&lt;p&gt;And weirdly… I loved it.&lt;/p&gt;

&lt;p&gt;Not because I enjoy being cut off, but because for the first time, an AI assistant felt &lt;strong&gt;human&lt;/strong&gt; enough to jump into the conversation.&lt;/p&gt;

&lt;p&gt;That’s the magic of &lt;strong&gt;real-time voice AI.&lt;/strong&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  The Story Behind the Silence
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Turn-Based Voice AI&lt;/strong&gt; feels like a classroom: you speak, then the AI waits silently until you’re done. Only then does it think, respond, and speak. Predictable, but awkward.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Real-Time Voice AI&lt;/strong&gt;, however, listens and responds &lt;em&gt;as you speak&lt;/em&gt;. It interrupts to clarify, builds anticipation, and makes the interaction feel alive. It’s not just hearing you; it’s &lt;em&gt;conversing&lt;/em&gt; with you.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  What Makes Real-Time Feel Real?
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Turn-Based Flow&lt;/th&gt;
&lt;th&gt;Real-Time Flow&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;STT&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Waits for full sentence before transcribing.&lt;/td&gt;
&lt;td&gt;Streams partial transcriptions (chunks) on the fly.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;LLM&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Starts after transcription completes.&lt;/td&gt;
&lt;td&gt;Begins processing as soon as partial input arrives.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;TTS&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Generates full output before speaking.&lt;/td&gt;
&lt;td&gt;Speaks as soon as first tokens are ready.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;UX&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Delayed, segmented.&lt;/td&gt;
&lt;td&gt;Smooth, conversational, anticipatory.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;But under the hood? It’s &lt;strong&gt;orchestration chaos&lt;/strong&gt;: managing barge-in detection, aligning streams, handling interruptions, and keeping latency under one second.&lt;/p&gt;
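Barge-in is the trickiest part of that chaos: the moment voice activity detection hears the user, the agent’s own speech has to be cancelled mid-stream. A toy sketch of that cancellation logic (illustrative only, not any real voice SDK’s API):

```python
import asyncio

async def speak(text: str, cancel: asyncio.Event) -> list[str]:
    # Play TTS word by word; stop immediately if the user barges in.
    spoken = []
    for word in text.split():
        if cancel.is_set():
            break  # user started talking: drop the rest of the utterance
        spoken.append(word)
        await asyncio.sleep(0)  # stand-in for audio playback time
    return spoken

async def demo() -> list[str]:
    cancel = asyncio.Event()
    task = asyncio.create_task(speak("let me explain this in great detail", cancel))
    await asyncio.sleep(0)  # user speech arrives mid-utterance...
    cancel.set()            # ...so the VAD fires and we cut the agent off
    return await task

print(asyncio.run(demo()))  # only a prefix of the sentence is "spoken"
```

Production systems layer echo cancellation and VAD confidence thresholds on top of this, but the core idea is the same: playback must be interruptible at chunk granularity.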




&lt;h3&gt;
  
  
  When to Pick Which?
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Turn-Based&lt;/strong&gt; (classic STT → LLM → TTS pipeline):&lt;/p&gt;

&lt;p&gt;✅ Easier to build and debug.&lt;/p&gt;

&lt;p&gt;❌ Feels robotic, with 0.7 to 3 s delays.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Real-Time (Speech-to-Speech)&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;✅ Natural, fluid, human-like.&lt;/p&gt;

&lt;p&gt;❌ Architecturally complex, less modular.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;




&lt;h3&gt;
  
  
  In Practice
&lt;/h3&gt;

&lt;p&gt;Modern systems still rely on &lt;strong&gt;STT → NLP → TTS&lt;/strong&gt;, but optimized with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Streaming ASR&lt;/strong&gt; (&amp;lt;300 ms)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Low-latency inference&lt;/strong&gt; (&amp;lt;500 ms)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Chunked TTS&lt;/strong&gt; (&amp;lt;200 ms to first audio)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Done right, the whole pipeline feels instant.&lt;/p&gt;
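The three latency budgets above compose because the stages stream into one another rather than running back to back. A minimal sketch of that overlap using chained async generators (toy stand-ins, not a real ASR/LLM/TTS stack):

```python
import asyncio

async def stt_stream(audio_chunks):
    # Streaming ASR stand-in: emit partial transcripts as audio arrives,
    # instead of waiting for the full utterance.
    for chunk in audio_chunks:
        await asyncio.sleep(0)  # stand-in for per-chunk model latency
        yield chunk

async def llm_stream(partials):
    # Low-latency LLM stand-in: start generating as soon as the first
    # partial transcript arrives.
    async for text in partials:
        yield f"reply({text})"

async def tts_stream(tokens):
    # Chunked TTS stand-in: "speak" each token as it is ready rather than
    # buffering the whole response.
    spoken = []
    async for tok in tokens:
        spoken.append(tok)
    return spoken

async def pipeline(audio_chunks):
    # Chained generators: the first audio chunk can reach TTS before the
    # last chunk has even been recorded.
    return await tts_stream(llm_stream(stt_stream(audio_chunks)))

print(asyncio.run(pipeline(["hel", "lo"])))  # → ['reply(hel)', 'reply(lo)']
```

The point is structural: no stage waits for its upstream neighbor to finish, which is what keeps time-to-first-audio inside the budgets listed above.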




&lt;h3&gt;
  
  
  TL;DR
&lt;/h3&gt;

&lt;p&gt;Turn-based AI listens.&lt;/p&gt;

&lt;p&gt;Real-time AI &lt;em&gt;converses&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;And that tiny shift from waiting to weaving makes the difference between talking &lt;em&gt;to&lt;/em&gt; a machine and talking &lt;em&gt;with&lt;/em&gt; one.&lt;/p&gt;

&lt;p&gt;I write more posts like this on my &lt;a href="https://blog.palla.co.in" rel="noopener noreferrer"&gt;blog&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>voiceai</category>
      <category>ai</category>
      <category>chatgpt</category>
      <category>learning</category>
    </item>
    <item>
      <title>How I Built a Talking, Knowledgeable AI Sidekick (and How You Can Build a Voice AI RAG Agent, Too)</title>
      <dc:creator>Mohammad</dc:creator>
      <pubDate>Fri, 25 Jul 2025 10:40:26 +0000</pubDate>
      <link>https://dev.to/mohammad_palla/how-i-built-a-talking-knowledgeable-ai-sidekick-and-how-you-can-too-build-a-voice-ai-rag-agent--53k0</link>
      <guid>https://dev.to/mohammad_palla/how-i-built-a-talking-knowledgeable-ai-sidekick-and-how-you-can-too-build-a-voice-ai-rag-agent--53k0</guid>
      <description>&lt;h2&gt;
  
  
  A Story of Code and a Chatty Voice AI Agent That Actually Knows Your Docs
&lt;/h2&gt;

&lt;h3&gt;
  Chapter 1: The Dream
&lt;/h3&gt;

&lt;p&gt;It all started on a rainy afternoon. I was talking to my computer (as one does when working remotely) and realized: wouldn’t it be cool if my computer could actually listen, understand, and answer me with real knowledge from my own files? Not just “Hey Siri, what’s the weather?” but “Hey AI, what’s in my project docs?” or “Remind me what the HR policy says about bringing cats to work.”&lt;/p&gt;

&lt;p&gt;And so the quest began: I would build a Voice AI RAG agent! (That’s Retrieval-Augmented Generation, but let’s just call it “RAG” because it sounds like a pirate.)&lt;/p&gt;

&lt;h3&gt;
  Chapter 2: The Ingredients
&lt;/h3&gt;

&lt;p&gt;Before you can summon your own digital sidekick, you’ll need a few magical artifacts:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Python 3.11+ (the spellbook)

Cartesia (for making your AI talk like a human, not a fax machine)

AssemblyAI (so your AI can understand your voice, even if you mumble)

Anthropic Claude (the brain—OpenAI is cool, but Claude is the new wizard in town)

LiveKit (for real-time voice rooms, so your AI can join you in a virtual “room”)

A pile of your own documents (so your AI knows your world)

API keys (the secret runes—don’t lose them!)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  Chapter 3: The Spell (a.k.a. The Code)
&lt;/h3&gt;

&lt;p&gt;Here’s the full incantation. Don’t worry, I’ll explain every part after you read it. (Copy, paste, and prepare to be amazed!)&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import logging
import os

from dotenv import load_dotenv
from livekit.agents import JobContext, JobProcess, WorkerOptions, cli
from livekit.agents.job import AutoSubscribe
from livekit.agents.llm import ChatContext
from livekit.agents.pipeline import VoicePipelineAgent
from livekit.plugins import cartesia, silero, llama_index, assemblyai

from llama_index.llms.anthropic import Anthropic
from llama_index.core import (
    SimpleDirectoryReader,
    StorageContext,
    VectorStoreIndex,
    load_index_from_storage,
    Settings,
)
from llama_index.core.chat_engine.types import ChatMode
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

load_dotenv()

logger = logging.getLogger("voice-assistant")

# Set up the embedding model and LLM
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
llm = Anthropic(model="claude-3-haiku-20240307", max_tokens=512)
Settings.llm = llm
Settings.embed_model = embed_model

# Check if storage already exists
PERSIST_DIR = "./chat-engine-storage"
if not os.path.exists(PERSIST_DIR):
    # Load the documents and create the index
    documents = SimpleDirectoryReader("docs").load_data()
    index = VectorStoreIndex.from_documents(documents)
    # Store it for later
    index.storage_context.persist(persist_dir=PERSIST_DIR)
else:
    # Load the existing index
    storage_context = StorageContext.from_defaults(persist_dir=PERSIST_DIR)
    index = load_index_from_storage(storage_context)


def prewarm(proc: JobProcess):
    proc.userdata["vad"] = silero.VAD.load()


async def entrypoint(ctx: JobContext):
    chat_context = ChatContext().append(
        role="system",
        text=(
            "You are a funny, witty assistant. "
            "Respond with short and concise answers. "
            "Avoid using unpronounceable punctuation or emojis."
        ),
    )

    chat_engine = index.as_chat_engine(chat_mode=ChatMode.CONTEXT)

    logger.info(f"Connecting to room {ctx.room.name}")
    await ctx.connect(auto_subscribe=AutoSubscribe.AUDIO_ONLY)

    participant = await ctx.wait_for_participant()
    logger.info(f"Starting voice assistant for participant {participant.identity}")

    agent = VoicePipelineAgent(
        vad=ctx.proc.userdata["vad"],
        stt=assemblyai.STT(),
        llm=llama_index.LLM(chat_engine=chat_engine),
        tts=cartesia.TTS(
            model="sonic-2",
            voice="bf0a246a-8642-498a-9950-80c35e9276b5",
        ),
        chat_ctx=chat_context,
    )

    agent.start(ctx.room, participant)

    await agent.say(
        "Hey there! How can I help you today?",
        allow_interruptions=True,
    )


if __name__ == "__main__":
    print("Starting voice agent with Anthropic...")
    cli.run_app(
        WorkerOptions(
            entrypoint_fnc=entrypoint,
            prewarm_fnc=prewarm,
        ),
    )
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;h3&gt;
  Chapter 4: The Magic Explained
&lt;/h3&gt;

&lt;p&gt;Let’s break down this spellbook, step by step:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Imports and Setup&lt;/strong&gt;&lt;br&gt;
We import all the libraries:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;livekit&lt;/code&gt; for voice rooms&lt;/li&gt;
&lt;li&gt;&lt;code&gt;cartesia&lt;/code&gt; for text-to-speech&lt;/li&gt;
&lt;li&gt;&lt;code&gt;assemblyai&lt;/code&gt; for speech-to-text&lt;/li&gt;
&lt;li&gt;&lt;code&gt;llama_index&lt;/code&gt; for RAG (so your AI can actually know things from your docs)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Anthropic&lt;/code&gt; for the LLM (the brain)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We also load environment variables with dotenv—because hardcoding API keys is a rookie mistake.&lt;/p&gt;

&lt;ol start="2"&gt;
&lt;li&gt;&lt;strong&gt;Embeddings and LLM&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
llm = Anthropic(model="claude-3-haiku-20240307", max_tokens=512)
Settings.llm = llm
Settings.embed_model = embed_model
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;The embedding model turns your docs into “AI food” (vectors).

The LLM (Claude) is the brain that answers questions using those vectors.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
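Retrieval then works by embedding the question the same way and picking the most similar document vectors. A toy sketch with made-up 3-dimensional vectors and hypothetical file names (real embeddings from bge-small-en-v1.5 have 384 dimensions):

```python
from math import sqrt

def cosine(a, b):
    # Cosine similarity: how closely two vectors point in the same direction.
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm

# Made-up 3-dimensional "embeddings" for two documents.
docs = {
    "cat-policy.md": [0.9, 0.1, 0.0],
    "hr-handbook.md": [0.1, 0.9, 0.2],
}
query = [0.8, 0.2, 0.1]  # e.g. the embedded question "can I bring my cat to work?"

# Pick the document whose vector is most similar to the query vector.
best = max(docs.items(), key=lambda kv: cosine(query, kv[1]))
print(best[0])  # → cat-policy.md
```

llama_index does this at scale (chunking, top-k retrieval, prompt stuffing), but the similarity search at its core is exactly this comparison.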

&lt;ol start="3"&gt;
&lt;li&gt;&lt;strong&gt;Document Indexing&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;PERSIST_DIR = "./chat-engine-storage"
if not os.path.exists(PERSIST_DIR):
    documents = SimpleDirectoryReader("docs").load_data()
    index = VectorStoreIndex.from_documents(documents)
    index.storage_context.persist(persist_dir=PERSIST_DIR)
else:
    storage_context = StorageContext.from_defaults(persist_dir=PERSIST_DIR)
    index = load_index_from_storage(storage_context)
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;If you haven’t indexed your docs before, it reads everything in docs/ and builds a knowledge base.

If you have, it loads the existing index (so it doesn’t have to re-read your 500-page PDF every time).
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;ol start="4"&gt;
&lt;li&gt;&lt;strong&gt;Voice Activity Detection (VAD)&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def prewarm(proc: JobProcess):
    proc.userdata["vad"] = silero.VAD.load()
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;This makes sure your AI only listens when you’re actually talking, not when you’re yelling at your cat.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;ol start="5"&gt;
&lt;li&gt;&lt;strong&gt;The Entrypoint: Where the Magic Happens&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;async def entrypoint(ctx: JobContext):
    chat_context = ChatContext().append(
        role="system",
        text=(
            "You are a funny, witty assistant. "
            "Respond with short and concise answers. "
            "Avoid using unpronounceable punctuation or emojis."
        ),
    )
    chat_engine = index.as_chat_engine(chat_mode=ChatMode.CONTEXT)
    ...
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Sets the “personality” of your AI (witty, concise, no weird punctuation).

Prepares the chat engine with your indexed docs.

logger.info(f"Connecting to room {ctx.room.name}")
await ctx.connect(auto_subscribe=AutoSubscribe.AUDIO_ONLY)
participant = await ctx.wait_for_participant()
logger.info(f"Starting voice assistant for participant {participant.identity}")

Connects to a LiveKit room (more on this soon).

Waits for a participant (that’s you!) to join.

agent = VoicePipelineAgent(
    vad=ctx.proc.userdata["vad"],
    stt=assemblyai.STT(),
    llm=llama_index.LLM(chat_engine=chat_engine),
    tts=cartesia.TTS(
        model="sonic-2",
        voice="bf0a246a-8642-498a-9950-80c35e9276b5",
    ),
    chat_ctx=chat_context,
)
agent.start(ctx.room, participant)
await agent.say(
    "Hey there! How can I help you today?",
    allow_interruptions=True,
)

Sets up the full voice pipeline: listens, understands, thinks, and talks back.

Greets you with a friendly message.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;ol start="6"&gt;
&lt;li&gt;&lt;strong&gt;The Main Event&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;if __name__ == "__main__":
    print("Starting voice agent with Anthropic...")
    cli.run_app(
        WorkerOptions(
            entrypoint_fnc=entrypoint,
            prewarm_fnc=prewarm,
        ),
    )
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;h3&gt;
  Chapter 5: Summoning Your AI (a.k.a. Running the Code)
&lt;/h3&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Install your dependencies (see requirements.txt).

Put your API keys in a .env file:

ANTHROPIC_API_KEY=your_anthropic_key
ASSEMBLYAI_API_KEY=your_assemblyai_key
CARTESIA_API_KEY=your_cartesia_api_key
LIVEKIT_URL=your_livekit_url
LIVEKIT_API_KEY=your_livekit_api_key
LIVEKIT_API_SECRET=your_livekit_api_secret

Add your documents to the docs/ folder.

Run

python voice_agent_anthropic.py start
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  Chapter 6: Entering the LiveKit Room
&lt;/h3&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;What’s a LiveKit room?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Think of it as a virtual meeting room where your AI is always waiting for you.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;How do you join?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Use the LiveKit Playground or your own LiveKit client, enter your room name, and your AI will greet you like an old friend (one who actually remembers your last conversation).&lt;/p&gt;

&lt;h3&gt;
  Chapter 7: The Result
&lt;/h3&gt;

&lt;p&gt;Now you can:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Talk to your AI: Ask questions, get answers from your own docs.

Get witty, concise responses: No more boring bots!

Impress your friends: “Yeah, my Voice AI actually knows what’s in my files.”
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;If you have any doubts, you can contact me 🙂&lt;br&gt;
Happy coding! 🚀&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>tutorial</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
