How I Built a Talking, Knowledgeable AI Sidekick (and How You Can Build a Voice AI RAG Agent, Too)

Mohammad

A Story of Code and a Chatty Voice AI Agent That Actually Knows Stuff from Your Docs

Chapter 1: The Dream
It all started on a rainy afternoon. I was talking to my computer (as one does when working remotely) and realized:
Wouldn’t it be cool if my computer could actually listen, understand, and answer me with real knowledge from my own files?
Not just “Hey Siri, what’s the weather?” but “Hey AI, what’s in my project docs?” or “Remind me what the HR policy says about bringing cats to work?”
And so, the quest began:
I would build a Voice AI RAG agent!
(That’s Retrieval-Augmented Generation, but let’s just call it “RAG” because it sounds like a pirate.)
Chapter 2: The Ingredients
Before you can summon your own digital sidekick, you’ll need a few magical artifacts:

Python 3.11+ (the spellbook)

Cartesia (for making your AI talk like a human, not a fax machine)

AssemblyAI (so your AI can understand your voice, even if you mumble)

Anthropic Claude (the brain—OpenAI is cool, but Claude is the new wizard in town)

LiveKit (for real-time voice rooms, so your AI can join you in a virtual “room”)

A pile of your own documents (so your AI knows your world)

API keys (the secret runes—don’t lose them!)
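
If you like preparing your potions early, the dependency list looks roughly like this. Package names can drift between plugin releases, so treat this as a sketch and check it against the LiveKit and LlamaIndex docs (and the requirements.txt mentioned in Chapter 5):

pip install python-dotenv \
    livekit-agents \
    livekit-plugins-assemblyai livekit-plugins-cartesia livekit-plugins-silero \
    livekit-plugins-llama-index \
    llama-index llama-index-llms-anthropic llama-index-embeddings-huggingface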

Chapter 3: The Spell (a.k.a. The Code)
Here’s the full incantation. Don’t worry, I’ll explain every part after you read it.
(Copy, paste, and prepare to be amazed!)

import logging
import os

from dotenv import load_dotenv
from livekit.agents import JobContext, JobProcess, WorkerOptions, cli
from livekit.agents.job import AutoSubscribe
from livekit.agents.llm import ChatContext
from livekit.agents.pipeline import VoicePipelineAgent
from livekit.plugins import cartesia, silero, llama_index, assemblyai

from llama_index.llms.anthropic import Anthropic
from llama_index.core import (
    SimpleDirectoryReader,
    StorageContext,
    VectorStoreIndex,
    load_index_from_storage,
    Settings,
)
from llama_index.core.chat_engine.types import ChatMode
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

load_dotenv()

logger = logging.getLogger("voice-assistant")

# Set up the embedding model and LLM
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
llm = Anthropic(model="claude-3-haiku-20240307", max_tokens=512)
Settings.llm = llm
Settings.embed_model = embed_model

# Check if storage already exists
PERSIST_DIR = "./chat-engine-storage"
if not os.path.exists(PERSIST_DIR):
    # Load the documents and create the index
    documents = SimpleDirectoryReader("docs").load_data()
    index = VectorStoreIndex.from_documents(documents)
    # Store it for later
    index.storage_context.persist(persist_dir=PERSIST_DIR)
else:
    # Load the existing index
    storage_context = StorageContext.from_defaults(persist_dir=PERSIST_DIR)
    index = load_index_from_storage(storage_context)


def prewarm(proc: JobProcess):
    proc.userdata["vad"] = silero.VAD.load()


async def entrypoint(ctx: JobContext):
    chat_context = ChatContext().append(
        role="system",
        text=(
            "You are a funny, witty assistant. "
            "Respond with short and concise answers. "
            "Avoid using unpronounceable punctuation or emojis."
        ),
    )

    chat_engine = index.as_chat_engine(chat_mode=ChatMode.CONTEXT)

    logger.info(f"Connecting to room {ctx.room.name}")
    await ctx.connect(auto_subscribe=AutoSubscribe.AUDIO_ONLY)

    participant = await ctx.wait_for_participant()
    logger.info(f"Starting voice assistant for participant {participant.identity}")

    agent = VoicePipelineAgent(
        vad=ctx.proc.userdata["vad"],
        stt=assemblyai.STT(),
        llm=llama_index.LLM(chat_engine=chat_engine),
        tts=cartesia.TTS(
            model="sonic-2",
            voice="bf0a246a-8642-498a-9950-80c35e9276b5",
        ),
        chat_ctx=chat_context,
    )

    agent.start(ctx.room, participant)

    await agent.say(
        "Hey there! How can I help you today?",
        allow_interruptions=True,
    )


if __name__ == "__main__":
    print("Starting voice agent with Anthropic...")
    cli.run_app(
        WorkerOptions(
            entrypoint_fnc=entrypoint,
            prewarm_fnc=prewarm,
        ),
    )

Chapter 4: The Magic Explained
Let’s break down this spellbook, line by line:

1. Imports and Setup
    We import all the libraries:

    livekit for voice rooms

    cartesia for text-to-speech

    assemblyai for speech-to-text

    llama_index for RAG (so your AI can actually know things from your docs)

    Anthropic for the LLM (the brain)

We also load environment variables with dotenv—because hardcoding API keys is a rookie mistake.
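
If you want to double-check that the keys actually loaded before the agent starts talking, a tiny sanity check like this works; the variable names simply mirror the .env file from Chapter 5:

from dotenv import load_dotenv
import os

load_dotenv()  # reads the .env file in the current directory

# Fail fast if a key is missing instead of failing mid-conversation
for key in ("ANTHROPIC_API_KEY", "ASSEMBLYAI_API_KEY", "CARTESIA_API_KEY"):
    assert os.getenv(key), f"{key} is not set - check your .env file"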

2. Embeddings and LLM

embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
llm = Anthropic(model="claude-3-haiku-20240307", max_tokens=512)
Settings.llm = llm
Settings.embed_model = embed_model

The embedding model turns your docs into “AI food” (vectors).

The LLM (Claude) is the brain that answers questions using those vectors.
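
If you're curious what that "AI food" actually looks like, you can poke the embedding model directly. This is just an illustrative check, not part of the agent:

# Turn a sentence into a vector (a plain list of floats)
vector = embed_model.get_text_embedding("Can I bring my cat to work?")
print(len(vector))  # 384 dimensions for bge-small-en-v1.5
print(vector[:5])   # the first few numbers
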
3. Document Indexing

PERSIST_DIR = "./chat-engine-storage"
if not os.path.exists(PERSIST_DIR):
    documents = SimpleDirectoryReader("docs").load_data()
    index = VectorStoreIndex.from_documents(documents)
    index.storage_context.persist(persist_dir=PERSIST_DIR)
else:
    storage_context = StorageContext.from_defaults(persist_dir=PERSIST_DIR)
    index = load_index_from_storage(storage_context)

If you haven’t indexed your docs before, it reads everything in docs/ and builds a knowledge base.

If you have, it loads the existing index (so it doesn’t have to re-read your 500-page PDF every time).
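
Before adding voice on top, it's worth confirming that retrieval works at all. A quick text-only test against the same index could look like this (the question is just an example):

# One-off retrieval test: no voice, no LiveKit, just the index
query_engine = index.as_query_engine()
response = query_engine.query("What does the HR policy say about bringing cats to work?")
print(response)

One gotcha: if you add new files to docs/ later, delete the chat-engine-storage folder so the index gets rebuilt.
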
4. Voice Activity Detection (VAD)

def prewarm(proc: JobProcess):
    proc.userdata["vad"] = silero.VAD.load()

This makes sure your AI only listens when you’re actually talking, not when you’re yelling at your cat.
5. The Entrypoint: Where the Magic Happens

async def entrypoint(ctx: JobContext):
    chat_context = ChatContext().append(
        role="system",
        text=(
            "You are a funny, witty assistant. "
            "Respond with short and concise answers. "
            "Avoid using unpronounceable punctuation or emojis."
        ),
    )
    chat_engine = index.as_chat_engine(chat_mode=ChatMode.CONTEXT)
    ...

Sets the “personality” of your AI (witty, concise, no weird punctuation).

Prepares the chat engine with your indexed docs.
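
ChatMode.CONTEXT means every user message first retrieves the most relevant chunks from the index and hands them to Claude together with the chat history. You can exercise the same engine without any audio, for example:

# Text-only smoke test of the RAG chat engine
response = chat_engine.chat("Summarize my project docs in two sentences.")
print(response)
follow_up = chat_engine.chat("And what did I just ask you?")  # chat history is kept
print(follow_up)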

logger.info(f"Connecting to room {ctx.room.name}")
await ctx.connect(auto_subscribe=AutoSubscribe.AUDIO_ONLY)
participant = await ctx.wait_for_participant()
logger.info(f"Starting voice assistant for participant {participant.identity}")

Connects to a LiveKit room (more on this soon).

Waits for a participant (that’s you!) to join.

agent = VoicePipelineAgent(
    vad=ctx.proc.userdata["vad"],
    stt=assemblyai.STT(),
    llm=llama_index.LLM(chat_engine=chat_engine),
    tts=cartesia.TTS(
        model="sonic-2",
        voice="bf0a246a-8642-498a-9950-80c35e9276b5",
    ),
    chat_ctx=chat_context,
)
agent.start(ctx.room, participant)
await agent.say(
    "Hey there! How can I help you today?",
    allow_interruptions=True,
)

Sets up the full voice pipeline: listens, understands, thinks, and talks back.

Greets you with a friendly message.
6. The Main Event

if __name__ == "__main__":
    print("Starting voice agent with Anthropic...")
    cli.run_app(
        WorkerOptions(
            entrypoint_fnc=entrypoint,
            prewarm_fnc=prewarm,
        ),
    )

This boots the LiveKit worker: prewarm loads the Silero VAD model ahead of time, and entrypoint runs each time the agent is dispatched to a room.

Chapter 5: Summoning Your AI (a.k.a. Running the Code)

Install your dependencies (see requirements.txt).

Put your API keys in a .env file:

ANTHROPIC_API_KEY=your_anthropic_key
ASSEMBLYAI_API_KEY=your_assemblyai_key
CARTESIA_API_KEY=your_cartesia_api_key
LIVEKIT_URL=your_livekit_url
LIVEKIT_API_KEY=your_livekit_api_key
LIVEKIT_API_SECRET=your_livekit_api_secret

Add your documents to the docs/ folder.

Run

python voice_agent_anthropic.py start
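
Depending on your livekit-agents version, the CLI also has a couple of modes that are handy while developing (check the LiveKit docs for your version):

python voice_agent_anthropic.py download-files   # pre-download model weights (e.g. the Silero VAD)
python voice_agent_anthropic.py dev              # development mode for local testing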

Chapter 6: Entering the LiveKit Room

What’s a LiveKit room?

Think of it as a virtual meeting room where your AI is always waiting for you.

How do you join?

Use the LiveKit Playground or your own LiveKit client, enter your room name, and your AI will greet you like an old friend (one who actually knows what's in your files).
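
The Playground takes care of access tokens for you. If you build your own client, you (or rather your backend) have to mint a join token with the livekit-api server SDK. Here's a minimal sketch; the room and identity names are just placeholders:

# Server-side: mint a join token for your own LiveKit client
from livekit import api  # pip install livekit-api

token = (
    api.AccessToken()  # falls back to LIVEKIT_API_KEY / LIVEKIT_API_SECRET from the environment
    .with_identity("mohammad")
    .with_name("Mohammad")
    .with_grants(api.VideoGrants(room_join=True, room="my-ai-room"))
    .to_jwt()
)
print(token)  # hand this to your client along with LIVEKIT_URL
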
Chapter 7: The Result
Now, you can:

Talk to your AI: Ask questions, get answers from your own docs.

Get witty, concise responses: No more boring bots!

Impress your friends: “Yeah, my Voice AI actually knows what’s in my files.”

If you have any questions, feel free to reach out to me 🙂
Happy coding! 🚀
