Torkian

Posted on May 23 • Edited on Jul 15

From Manual RAG to Real Retrieval — Embedding-Based RAG with NVIDIA NIM

#nvidia #ai #python #tutorial

In Part 1, we built a USC campus assistant by pasting a five-line knowledge base directly into the prompt. That works when "the data" fits in your head. It stops being cute the moment the campus handbook, club docs, and workshop notes all want a seat at the same prompt window.

The fix is retrieval — store the chunks once, and at query time pull only the few that look relevant. That's what RAG (Retrieval-Augmented Generation) actually means once you strip away the marketing.

This post takes the assistant from Part 1 and bolts on a real retriever, using NVIDIA's hosted embedding model. No vector database, no LangChain, no abstraction layer. A Python list and NumPy are enough to understand what's actually happening. Once you've seen the moving parts, swapping in pgvector or Pinecone later is a fifteen-minute job.

I'm B Torkian, NVIDIA Developer Champion at USC. Same workshop series, same campus, one more capability added.

What you're adding

User question → embed query → compare to stored chunks → pick top-k → send only those to the LLM → answer

The model call itself barely changes. The work is in steps 2–4: turn text into vectors, compare vectors, return the closest chunks.

Why the manual approach from Part 1 breaks

In Part 1, the entire knowledge base sat inside the prompt:

campus_info = """
The USC AI Club meets every Thursday at 5 PM...
The USC GPU computing lab is open Monday to Friday...
...
"""

Five lines is fine. But every model has a context window, and every token costs money and latency. You don't want to paste the entire USC student handbook into every question — most of it is irrelevant to "when does the AI Club meet?"

Retrieval is the answer to "which 3 paragraphs out of 3000 are actually about this question?" You compute that before calling the LLM, then send only the winners.
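To put rough numbers on that, here is a back-of-the-envelope sketch. The 4-characters-per-token ratio is a common rule of thumb for English text, not the model's real tokenizer, and `estimate_tokens` plus the handbook sizes are made up for illustration.

```python
# Rough arithmetic only: real token counts depend on the model's tokenizer.
CHARS_PER_TOKEN = 4  # assumption: common rule of thumb for English text

def estimate_tokens(text):
    """Very rough token estimate based on character count."""
    return max(1, len(text) // CHARS_PER_TOKEN)

# Hypothetical handbook: 3000 paragraphs of about 400 characters each.
paragraph = 'x' * 400
full_handbook_tokens = estimate_tokens(paragraph) * 3000
top_3_tokens = estimate_tokens(paragraph) * 3

print('Paste everything:', full_handbook_tokens, 'tokens per question')
print('Send top 3 only: ', top_3_tokens, 'tokens per question')
```

Under these assumptions the gap is three orders of magnitude, and you pay it on every single question.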

What an embedding actually is

An embedding is a list of numbers (a vector) that represents the meaning of a piece of text. Two texts that mean similar things land near each other in vector space. Two texts that mean different things land far apart.
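A toy illustration of "near" and "far" may help. The three vectors below are typed by hand and have only three dimensions; they are not output from any model, and the variable names are invented for this sketch.

```python
import numpy as np

# Hand-made 3-dimensional "embeddings" for illustration only.
# Real embedding vectors are much longer and come from a model.
club_meeting = np.array([0.9, 0.1, 0.0])
club_schedule = np.array([0.8, 0.2, 0.1])
soldering_rules = np.array([0.0, 0.1, 0.9])

def cosine(a, b):
    """Cosine of the angle between two non-zero vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

similar = cosine(club_meeting, club_schedule)
different = cosine(club_meeting, soldering_rules)

print(f'meeting vs schedule:  {similar:.2f}')
print(f'meeting vs soldering: {different:.2f}')
```

The two club vectors point in almost the same direction and score close to 1. The soldering vector points elsewhere and scores close to 0.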

NVIDIA's nv-embedqa-e5-v5 is an embedding model tuned specifically for question-answer retrieval. It has a quirk worth knowing about up front: it treats queries and passages differently, and you tell it which one you're embedding via an input_type parameter. Mixing the two up is an easy beginner mistake because nothing fails. The call still succeeds, but retrieval quality drops noticeably.

input_type='passage' → use for the documents you store
input_type='query' → use for the user's question at search time

That's it. Same model, two modes.

Step 1: Set up the client and `ask()` from Part 1

If you're continuing from Part 1, you already have these defined and can skip this cell. If you're starting fresh, paste this in first — everything later builds on it.

%pip install -q openai numpy

import os, getpass
from openai import OpenAI

if not os.getenv('NVIDIA_API_KEY'):
    os.environ['NVIDIA_API_KEY'] = getpass.getpass('Paste your NVIDIA API key (starts with nvapi-): ')

client = OpenAI(
    base_url='https://integrate.api.nvidia.com/v1',
    api_key=os.environ['NVIDIA_API_KEY'],
)

MODEL = 'meta/llama-3.1-8b-instruct'

def ask(system_prompt, user_message):
    response = client.chat.completions.create(
        model=MODEL,
        messages=[
            {'role': 'system', 'content': system_prompt},
            {'role': 'user',   'content': user_message},
        ],
        temperature=0.3,
        max_tokens=400,
    )
    return response.choices[0].message.content

The client talks to NVIDIA's API Catalog, and ask() is the same chat-completion wrapper from Part 1. The retriever we're about to build sits alongside them, not in place of them.

Step 2: Build a small knowledge base and embed it as passages

import numpy as np

EMBED_MODEL = 'nvidia/nv-embedqa-e5-v5'

knowledge_base = [
    {'title': 'USC AI Club meeting',
     'text': 'The USC AI Club meets every Thursday at 5 PM in the engineering building, room 204.'},
    {'title': 'USC GPU lab hours',
     'text': 'The USC GPU computing lab is open Monday to Friday from 10 AM to 6 PM.'},
    {'title': 'NVIDIA Developer Program',
     'text': 'USC students can join the NVIDIA Developer Program for free.'},
    {'title': 'Next USC workshop',
     'text': 'The next USC AI Club workshop will cover Retrieval Augmented Generation (RAG).'},
    {'title': 'USC AI/ML office hours',
     'text': 'Office hours for the USC AI/ML faculty are Tuesdays 2-4 PM.'},
    {'title': 'USC robotics lab',
     'text': 'The USC robotics lab requires safety training before students can use the soldering station.'},
    {'title': 'USC tutoring',
     'text': 'Peer tutoring for introductory Python at USC is available Wednesdays from 1 PM to 3 PM.'},
]

def embed_texts(texts, input_type='passage'):
    response = client.embeddings.create(
        model=EMBED_MODEL,
        input=texts,
        extra_body={'input_type': input_type},
    )
    return [np.array(item.embedding, dtype=np.float32) for item in response.data]

# Embed every chunk once, as a passage. Store the vector alongside the text.
embeddings = embed_texts([item['text'] for item in knowledge_base], input_type='passage')
for item, embedding in zip(knowledge_base, embeddings):
    item['embedding'] = embedding

print(f'Embedded {len(knowledge_base)} chunks. Vector dim:', embeddings[0].shape[0])

Two things to notice:

The OpenAI Python client doesn't have a native field for NVIDIA's input_type, so we pass it through extra_body. That's the right way to send provider-specific arguments without forking the client.
We're storing the embeddings in plain Python dicts. For seven chunks this is fine. For seven thousand, you'd reach for a vector database (and the only thing that changes is where the vectors live; the cosine math is identical).
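Before reaching for a database, there is a middle step worth seeing: stack the vectors into one NumPy matrix and score every chunk with a single matrix product. This is a sketch with random stand-in vectors rather than real embeddings, and it assumes no vector is all zeros.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in data: 7 random vectors instead of real embeddings.
chunk_vectors = [rng.normal(size=8).astype(np.float32) for _ in range(7)]
query_vector = rng.normal(size=8).astype(np.float32)

# Stack into one (num_chunks, dim) matrix and normalize each row once.
# Assumes no all-zero rows, which would cause a division by zero.
matrix = np.vstack(chunk_vectors)
matrix_unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)

# With unit-length vectors, cosine similarity is just a dot product,
# so one matrix-vector product yields a score for every chunk.
query_unit = query_vector / np.linalg.norm(query_vector)
scores = matrix_unit @ query_unit

print(scores.shape)  # one score per chunk
```

The scores match what a per-chunk cosine loop would produce. The difference is that the normalization of stored vectors happens once, not on every question.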

Step 3: Retrieve the top-k chunks for a question

def cosine_similarity(a, b):
    denominator = np.linalg.norm(a) * np.linalg.norm(b)
    if denominator == 0:
        return 0.0
    return float(np.dot(a, b) / denominator)

def retrieve_context(question, k=3):
    question_embedding = embed_texts([question], input_type='query')[0]

    scored = []
    for item in knowledge_base:
        score = cosine_similarity(question_embedding, item['embedding'])
        scored.append((score, item))

    scored.sort(key=lambda pair: pair[0], reverse=True)
    top_items = [item for score, item in scored[:k]]

    return '\n'.join(f"- {item['text']}" for item in top_items)

Three things are happening here:

The question is embedded as a query, not a passage. This is the part beginners trip over. Same model, different mode.
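Cosine similarity scores how close the question vector is to each stored chunk vector. It ranges from -1 to 1: numbers near 1.0 mean very similar, and numbers near 0 mean unrelated. What matters for retrieval is the ranking, so compare scores against each other rather than reading any single value as absolute.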
Top-k picks the highest-scoring chunks. Three is a reasonable default for a tiny knowledge base; tune it for yours.

There is no magic in step 3. A vector database would do the same comparison but use indexing tricks to do it fast at scale.
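If the list grows but still fits in memory, sorting every score just to keep three of them becomes wasteful. The sketch below is one possible in-memory alternative using NumPy's `argpartition`; the function name `top_k_indices` and the sample scores are invented, and this is not a description of how any particular vector database works internally.

```python
import numpy as np

def top_k_indices(scores, k):
    """Return the indices of the k highest scores, best first."""
    scores = np.asarray(scores)
    k = min(k, len(scores))  # never ask for more items than exist
    if k == 0:
        return []
    # argpartition moves the k largest values to the end without a full sort.
    candidates = np.argpartition(scores, -k)[-k:]
    # Only those k candidates are then sorted, highest score first.
    ordered = candidates[np.argsort(scores[candidates])[::-1]]
    return ordered.tolist()

scores = [0.12, 0.81, 0.33, 0.79, 0.05]
print(top_k_indices(scores, 3))  # [1, 3, 2]
```

The returned indices can be used to look up the matching entries in knowledge_base.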

Step 4: Plug retrieval into the same `ask()` from Part 1

def ask_with_retrieval(question):
    context = retrieve_context(question)

    system_prompt = f"""You are a USC campus assistant. Answer ONLY using the
context below. If the answer is not in the context, say
"I don't have that information — check with the USC AI Club."

CONTEXT:
{context}
"""

    return ask(system_prompt, question)

for question in [
    'Where does the USC AI Club meet?',
    'When can I get Python tutoring at USC?',
    'What is the wifi password?',
]:
    print(f'Q: {question}')
    print(f'Context:\n{retrieve_context(question)}')
    print(f'A: {ask_with_retrieval(question)}\n')

Run it. Three things to read carefully:

The first question retrieves the AI Club chunk and answers from it. Good.
The second retrieves the tutoring chunk and answers from it. The stored text says "peer tutoring for introductory Python" — not the exact phrase "Python tutoring" — and the embedding model matches them on meaning. (A keyword search would also have found this one; the semantic win gets bigger as your data grows and the wording diverges from the question.)
The wifi question retrieves three chunks anyway (top-k always returns k items), but none of them contains a password. The assistant should fall back to the refusal line because the system prompt tells it to answer ONLY using the context. That's the guardrail from Part 1 doing its job, and it's exactly the bridge into Part 3.
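Because top-k always returns k items, a common complement is a minimum-similarity cutoff that drops weak matches before they reach the prompt. This is a sketch, not part of the workshop code: `filter_by_score` and `min_score` are hypothetical names, the 0.3 value is an arbitrary placeholder, and the scores are made up. Print the real scores for your own questions before choosing a cutoff, since the useful range depends on the model and the data.

```python
def filter_by_score(scored, k=3, min_score=0.3):
    """Keep at most k items whose similarity is at least min_score.

    `scored` is a list of (score, item) pairs, like the one built
    inside retrieve_context. min_score=0.3 is a placeholder value.
    """
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [(score, item) for score, item in ranked[:k] if score >= min_score]

# Made-up scores for illustration; real values depend on the model and data.
wifi_scores = [(0.21, 'GPU lab hours'), (0.18, 'AI Club meeting'), (0.15, 'tutoring')]
club_scores = [(0.62, 'AI Club meeting'), (0.35, 'Next workshop'), (0.20, 'GPU lab hours')]

print(filter_by_score(wifi_scores))  # nothing clears the cutoff
print(filter_by_score(club_scores))  # two of the three survive
```

An empty result is useful information in itself: the code can skip the LLM call and return the refusal line directly.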

Step 5: What you actually did

You replaced the hand-picked campus_info string from Part 1 with a real retrieval step. The model call is identical, and the system prompt follows the same guardrail pattern — answer only from the provided context, otherwise fall back. The only structural change is that {context} now comes from a function instead of a hardcoded constant.

That swap is the entire mental model behind RAG. Real production systems add chunking strategies, hybrid search, re-ranking, and a vector database — but the spine stays the same: embed once, embed query, compare, pass top-k to the LLM.

In your own work, the seven-entry knowledge_base becomes hundreds of paragraphs scraped from PDFs, lecture notes, club Slack archives, Notion pages, or a wiki. The retrieval logic doesn't change. The dict-with-vector storage gets replaced by something like pgvector, Qdrant, or Pinecone the moment you outgrow a Python list.
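Longer sources have to be cut into chunks before they can be embedded. The sketch below shows one simple strategy, fixed-size word windows with overlap. The function name, the default sizes, and the fake document are all assumptions for illustration; real projects often split on headings, paragraphs, or sentences instead.

```python
def chunk_words(text, chunk_size=50, overlap=10):
    """Split text into chunks of up to chunk_size words.

    Consecutive chunks share `overlap` words so a sentence cut at a
    boundary still appears whole in at least one chunk.
    """
    if overlap >= chunk_size:
        raise ValueError('overlap must be smaller than chunk_size')
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(' '.join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks

# Fake 120-word document standing in for a scraped page.
document = ' '.join(f'word{i}' for i in range(120))
chunks = chunk_words(document, chunk_size=50, overlap=10)

# Same dict shape as the knowledge_base entries used in this post.
knowledge_base = [
    {'title': f'doc chunk {i}', 'text': chunk} for i, chunk in enumerate(chunks)
]
print(len(knowledge_base))  # 3
```

Each resulting entry can then be embedded as a passage exactly like the hand-written ones.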

Get the code

Repo: github.com/torkian/nvidia-nim-workshop
One-click Colab for Part 2: Open part2_rag.ipynb
Local Python: part2_rag.py in the repo (python3 part2_rag.py after pip install -r requirements.txt).

MIT licensed. I run this at USC — fork it, swap the knowledge base for your school, your club, your project, and run it wherever you are.

The full series

Part 1: Build Your First AI App with NVIDIA NIM in 30 Minutes
Part 2 (this post): From Manual RAG to Real Retrieval — Embedding-Based RAG with NVIDIA NIM
Part 3: Add Guardrails So Your AI App Doesn't Lie
Part 4: Run NVIDIA NIM on Your Own GPU
Part 5: From Chatbot to Agent — Tool Calling with NVIDIA NIM
Part 6: From One Tool to a Plan — Multi-Step Agents with NVIDIA NIM
Part 7: Giving Your Agent a Memory — Multi-Turn Conversations with NVIDIA NIM
Part 8: Make Your Agent Feel Real-Time — Streaming with NVIDIA NIM
Part 9: Make Your Agent Return Data, Not Prose — Structured Outputs with NVIDIA NIM
Part 10: See What Your Agent Did — Tracing and Observability with NVIDIA NIM

Follow this series on dev.to (the series widget at the top of each post lists every published part in order).
