Ever wish you could just ask someone to explain your product or API as you build, or handle technical questions in meetings without pinging an engineer or sifting through the docs on the spot? Picture pulling a real-time AI agent (think of it as an AI teammate) into your 1:1 calls, group sessions, or ad-hoc hangouts. It can teach you about the product directly, field questions from folks on the team, or stand by so people can schedule quick deep-dive sessions with it, no need to bug the lead dev or wait for timezone overlap.
This walkthrough shows how to build and plug in an AI voice agent that sits in on your calls, listens for questions, and pulls up answers from your docs in seconds. Whether you're technical or not, it’s like pairing with someone who knows where everything is and doesn’t mind repeating the basics even after hours.
We wire everything up with Vision Agents as the voice agent framework, Stream for WebRTC audio and video, OpenAI Realtime for speech in and speech out, Anam so the agent shows up as a face on the video, and Supermemory so answers come from search over your uploaded documents instead of guesswork. The code stays small and most of the behavior lives in one registered function that asks the memory store for relevant chunks and returns them to the model.
Demo
Prerequisites
You will need the following to get going with the implementation:
- Python and the uv package manager installed
- A Stream account, for WebRTC audio and video
- An OpenAI account with access to the Realtime API
- An Anam account, for the avatar
- A Supermemory account, for document indexing and search
What is RAG?
Retrieval-Augmented Generation (RAG) means the language model answers using text retrieved from your own sources (documentation, wikis, tickets) in addition to its weights. A typical loop is to turn the user question into a search query, pull the top matching passages from an index, and pass those passages into the model as context so it grounds its reply in your material. That cuts hallucinations on proprietary facts and lets answers track updates to your docs without retraining the model.
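In code, that loop is small. Here is a minimal sketch, where search_index and llm_complete are hypothetical stand-ins for whatever retriever and model client you use:

def answer_with_rag(question: str, search_index, llm_complete) -> str:
    # 1. Use the user's question as the search query
    passages = search_index(question, top_k=5)
    # 2. Ground the model in the retrieved passages
    context = "\n\n".join(passages)
    prompt = (
        "Answer using ONLY the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    # 3. The model replies from your material, not just its weights
    return llm_complete(prompt)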
Let's look at two best practices for building high-quality RAG systems in production today:
Hybrid Search
Think of hybrid search as bringing together two different search engines to get the best of both worlds: one that understands meaning (semantic search) and one that matches keywords (like a classic text search). Semantic search is great at catching the “vibe” or intent, even if someone asks for something in a totally different way. Meanwhile, keyword search is your go-to for finding specifics, like exact function names, class names, or error codes.
By letting both approaches take a shot and then combining their results (using methods like reciprocal rank fusion), you dodge the common failure modes of each. Doing this means you’re much less likely to miss the chunk of doc you really wanted, even if your question is a little funky.
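Reciprocal rank fusion itself is only a few lines. As a rough sketch (not tied to any particular search library), each document's fused score is the sum of 1/(k + rank) over every result list it appears in:

def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    # Each inner list is document IDs ranked by one engine (semantic or keyword)
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            # Documents that rank well in either engine accumulate score
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

# "auth_errors" ranks well in both lists, so it fuses to the top
semantic = ["auth_errors", "rate_limits", "webhooks"]
keyword = ["auth_errors", "sdk_install", "rate_limits"]
print(reciprocal_rank_fusion([semantic, keyword]))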
Re-ranking
Re-ranking is basically a quality filter for search results. First, you cast a wide net and pull back a bunch of potentially relevant chunks, ensuring that you are not missing anything useful. Then, a smarter model (cross-encoder or dedicated reranker) takes a closer look, comparing your question directly to each chunk, and shuffles the list to put the best matches at the top.
This two-step approach pays off in the sense that you get all the coverage of a broad search but with the accuracy of a model that actually “reads” for context. In practice, this means you get better, more grounded answers from your search compared to just using a simple top-k retrieval.
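Here is a sketch of that second stage, using the sentence-transformers CrossEncoder as one common reranker choice (an assumption on our part; any cross-encoder or hosted reranker follows the same pattern):

from sentence_transformers import CrossEncoder

def rerank(question: str, candidates: list[str], top_k: int = 5) -> list[str]:
    # The cross-encoder reads the question and each chunk together,
    # which is slower than embedding search but much more precise
    model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    scores = model.predict([(question, chunk) for chunk in candidates])
    # Put the best-scoring chunks first and keep only the top few
    ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in ranked[:top_k]]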
In the Supermemory example below, passing rerank=True to the document search API enables this second-stage refinement: the platform reranks the initially retrieved results so that only the best chunks are provided to the model.
Setting Up Environment Variables
To configure your application, set up the necessary environment variables for each integration. Follow the steps below for each provider:
Create a Stream application
- Go to the Stream dashboard.
- Choose + Create an App.
- In the dialog, type a name for your application and pick the region for your edge-server location(s).
- After creation, locate the API Key and Secret under Your Credentials.
- Add these credentials to your .env file as follows:
STREAM_API_KEY="your-api-key"
STREAM_API_SECRET="your-api-secret"
Configure OpenAI
- Access the OpenAI API Key dashboard.
- Click + Create new secret key.
- Add it to .env:
OPENAI_API_KEY="your-openai-api-key"
Ensure your organization can use the Realtime API and the model you pick in code (here, gpt-realtime-1.5).
Configure Anam
- Access the Anam API Key dashboard.
- Click + to create a new key.
- Add the generated key to your .env file:
ANAM_API_KEY="your-anam-api-key"
- Open Anam Build.
- Click Avatar, hover the avatar you want, and click Copy ID.
- Add the ID to your .env file:
ANAM_AVATAR_ID="your-anam-avatar-id"
Configure Supermemory
- Open the Supermemory API keys page.
- Use + CREATE KEY to create a key.
- Add the key to .env:
SUPERMEMORY_API_KEY="your-supermemory-api-key"
That's it for configuring your environment variables. Next, let's move on to scripting the AI voice agent.
Set up a new Python application
In this section, you will learn how to create a new Python application, set up vision agents in it, and install relevant libraries for a quick implementation.
Let’s get started by creating a new Python project. Open your terminal and run the following commands:
mkdir voice-agent && cd voice-agent
uv init && uv add "vision-agents[anam,getstream,openai,redis]" python-dotenv supermemory ultimate-sitemap-parser
Now, create a .env file at the root of your project. You are going to add the items we saved from the above sections.
It should look something like this:
# .env
## Stream
STREAM_API_KEY="..."
STREAM_API_SECRET="..."
## OpenAI (Realtime)
OPENAI_API_KEY="sk-proj-..."
## Anam
ANAM_API_KEY="..."
ANAM_AVATAR_ID="...-...-...-...-..."
## Supermemory
SUPERMEMORY_API_KEY="..."
Then add main.py with a minimal agent that uses Stream for WebRTC and OpenAI Realtime for speech:
from dotenv import load_dotenv
from vision_agents.core import Agent, AgentLauncher, Runner
from vision_agents.core.edge.types import User
from vision_agents.plugins import getstream, openai
load_dotenv()
async def create_agent(**kwargs) -> Agent:
agent = Agent(
edge=getstream.Edge(), # GetStream edge network for AV transport
agent_user=User(name="Docs AI", id="agent"), # Agent's identity
instructions=("You are a helpful assistant on this call."),
llm=openai.Realtime(model="gpt-realtime-1.5", voice="ash"),
)
return agent
async def join_call(agent: Agent, call_type: str, call_id: str, **kwargs) -> None:
call = await agent.create_call(call_type, call_id)
# Context manager handles join and clean-up automatically
async with agent.join(call):
await agent.finish()
if __name__ == "__main__":
# Entrypoint for CLI: launches agent and runs the join_call automatically when invoked from the command line
Runner(AgentLauncher(create_agent=create_agent, join_call=join_call)).cli()
The code block above loads the environment variables, then create_agent returns an Agent that connects to Stream using getstream.Edge() and converses in real time using openai.Realtime with a chosen model and voice. The Docs AI name is what people see in the call. The join_call function creates the call, joins it, and exits cleanly when the call finishes.
Bring Your Voice Agent to Life with an Anam Avatar
Anam adds a real-time, interactive avatar to the agent. The avatar speaks with natural movements and automatic lip sync, which makes a voice agent feel like a present participant on a video call instead of an invisible bot.
Vision Agents ships its Anam integration as a processor. A processor in this context is a flexible building block: it can be an avatar like Anam, a computer vision model (like YOLO), or any custom Python code that intercepts raw WebRTC audio and video streams, performs some manipulation or transformation, and republishes the processed result back into the Stream call. For Anam, the processor takes the agent's audio output, streams it to the Anam service, and publishes the resulting animated avatar video and audio as part of the group call experience.
Make the following changes to add an Anam avatar to your agent:
# existing imports
+ from vision_agents.plugins.anam import AnamAvatarPublisher
async def create_agent(**kwargs) -> Agent:
...
agent = Agent(
...
+ processors=[AnamAvatarPublisher()],
)
return agent
# rest of the file
The code block above integrates Anam avatars into your agent by importing AnamAvatarPublisher and adding it to the agent's processors list, enabling the agent to produce a real-time animated avatar with expressions in calls.
Index your documentation
Supermemory is the memory layer for AI agents, a context-engineering platform that exposes APIs for adding knowledge and retrieving it later.
When you add documents, you send raw content (plain text, URLs, uploads, and more) and then Supermemory runs a processing pipeline that validates and stores the request, extracts text when needed (scraping for URLs), chunks that material into searchable pieces, embeds those chunks for vector search, and indexes them for retrieval.
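At its simplest, adding a document is one API call. For example (a sketch; content can be raw text or a public URL, which Supermemory scrapes before indexing):

from supermemory import Supermemory

client = Supermemory()  # reads SUPERMEMORY_API_KEY from the environment

# Raw text is chunked, embedded, and indexed directly
client.documents.add(content="Our API rate limit is 100 requests per minute.")

# A URL is scraped first, then run through the same pipeline
client.documents.add(content="https://your-website-url.com/docs/getting-started")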
Create a sitemap.py file with the code below to automatically crawl and index your website's public docs pages into Supermemory for retrieval by your Docs AI agent:
from dotenv import load_dotenv
from supermemory import Supermemory
from usp.tree import sitemap_tree_for_homepage
load_dotenv()
supermemory_client = Supermemory()
def add_sitemap_to_memory(url: str) -> list[str]:
    tree = sitemap_tree_for_homepage(url)
    indexed_urls: list[str] = []
    for page in tree.all_pages():
        # Each page URL becomes a document; Supermemory scrapes and indexes it
        supermemory_client.documents.add(content=page.url)
        indexed_urls.append(page.url)
    return indexed_urls
if __name__ == "__main__":
add_sitemap_to_memory("https://your-website-url.com/")
Then, run the indexing script via the following command in your terminal:
uv run sitemap.py
Finally, wait until indexing finishes for those documents (you can watch progress in the Supermemory console) before relying on search in the voice agent.
Ask your documentation
The voice agent in your calls needs a way to search the documents you have indexed in Supermemory. In the agentic world, one way to achieve this is function calling: the model can call a tool you define instead of only replying in plain text. Your code runs, returns a result, and the model uses that result in what it says next. The function can be anything you would run in normal app code, such as HTTP calls, database queries, or, in our case, supermemory_client.search.documents to pull ranked chunks from the index you built earlier.
Vision Agents wires this up with a decorator on your LLM, @agent.llm.register_function(), which exposes an async Python function to the model along with a short description so the model knows when and how to use it. The framework handles registration and the call loop, letting you focus on the function body.
Now, make the following changes to add the function call for searching the documents:
# existing imports
+ from supermemory import Supermemory
async def create_agent(**kwargs) -> Agent:
+ supermemory_client = Supermemory()
agent = Agent(
...
instructions=(
- "You are a helpful assistant on this call."
+ """You are a helpful document Q&A assistant. When asked a question, use the ask_docs function.
+
+ INSTRUCTIONS:
+ 1. Answer the question using ONLY the information from the provided response.
+ 2. If the response doesn't contain enough information, say so clearly
+ 3. Be accurate and quote directly when possible
+ 4. Maintain a helpful, professional tone
+ 5. If you do not know the answer, say so clearly
+ """
),
)
+ @agent.llm.register_function(
+ description="Ask a question in natural language about the documentation"
+ )
+ async def ask_docs(question: str) -> str:
+ """Ask a question in natural language about the documentation"""
+ search_results = supermemory_client.search.documents(
+ q=question, rerank=True, include_full_docs=False, include_summary=True
+ )
+ if not search_results.results:
+ return "I couldn't find any relevant information in the uploaded documents to answer your question."
+ context_parts = []
+ for index, result in enumerate(search_results.results):
+ relevant_chunks = [
+ chunk.content for chunk in result.chunks if chunk.is_relevant
+ ][:3]
+ chunk_text = "\n\n".join(relevant_chunks)
+ context_parts.append(
+ f'[Document {index + 1}: "{result.title}"]\n{chunk_text}'
+ )
+ return "\n\n---\n\n".join(context_parts)
return agent
# rest of the file
The code block above imports the Supermemory client, initializes it, and updates the agent's instructions so the model only uses retrieved document context, stays accurate, and admits any knowledge gap. The ask_docs function is registered as a callable tool for the model: when the agent receives a question, it queries Supermemory for relevant documents, extracts and joins their most relevant chunks, and structures the output with document titles for clear answers. If no relevant results are found, it tells the user so. This lets the model reliably and transparently answer questions using only information from your uploaded documentation.
How to Run
To launch the service locally, use the following command in your terminal:
uv run main.py run
It will open a demo at demo.visionagents.ai and automatically join the session for you. You can then talk to your AI teammate in real time, as shown in the demo above.
The terminal will display the following output prior to the agent joining the call, confirming that the ask_docs tool has been automatically registered:
% uv run main.py run
14:16:18.214 | INFO | 🚀 Launching agent...
...
14:16:19.101 | INFO | [Agent: agent] | 🤖 Agent joining call: 27d3df64-8df7-404f-be3a-2c25374a521b
+ 14:16:19.101 | INFO | Added 1 tools to session config: ['ask_docs']
14:16:19.733 | INFO | 🌐 Opening browser to: https://demo.visionagents.ai/join/...
...
Ending thoughts
You now have a voice participant that can sit in on a call, pull answers from your indexed docs, and speak them back with a face on the video. This is especially valuable for teammates who aren't deep in the codebase or may not be available due to timezone differences. It's exciting to see how agents like this are evolving to meet real-world use cases, and how frameworks such as Vision Agents make it straightforward to build and integrate these capabilities into your workflows.