DEV Community

Torkian

From Manual RAG to Real Retrieval — Embedding-Based RAG with NVIDIA NIM

In Part 1, we built a USC campus assistant by pasting a five-line knowledge base directly into the prompt. That works when "the data" fits in your head. It stops being cute the moment the campus handbook, club docs, and workshop notes all want a seat at the same prompt window.

The fix is retrieval — store the chunks once, and at query time pull only the few that look relevant. That's what RAG (Retrieval-Augmented Generation) actually means once you strip away the marketing.

This post takes the assistant from Part 1 and bolts on a real retriever, using NVIDIA's hosted embedding model. No vector database, no LangChain, no abstraction layer. A Python list and NumPy are enough to understand what's actually happening. Once you've seen the moving parts, swapping in pgvector or Pinecone later is a fifteen-minute job.

I'm B Torkian, NVIDIA Developer Champion at USC. Same workshop series, same campus, one more capability added.


What you're adding

User question → embed query → compare to stored chunks → pick top-k → send only those to the LLM → answer

The model call itself barely changes. The work is in steps 2–4: turn text into vectors, compare vectors, return the closest chunks.


Why the manual approach from Part 1 breaks

In Part 1, the entire knowledge base sat inside the prompt:

campus_info = """
The USC AI Club meets every Thursday at 5 PM...
The USC GPU computing lab is open Monday to Friday...
...
"""

Five lines is fine. But every model has a context window, and every token costs money and latency. You don't want to paste the entire USC student handbook into every question — most of it is irrelevant to "when does the AI Club meet?"

Retrieval is the answer to "which 3 paragraphs out of 3000 are actually about this question?" You compute that before calling the LLM, then send only the winners.


What an embedding actually is

An embedding is a list of numbers (a vector) that represents the meaning of a piece of text. Two texts that mean similar things land near each other in vector space. Two texts that mean different things land far apart.
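
To make "near" and "far" concrete, here is a minimal sketch with hand-made three-dimensional vectors. The numbers and labels are invented for illustration; a real embedding model returns much longer vectors, and the values come from the model, not from you.

```python
import numpy as np

# Hand-made 3-dimensional vectors, invented for illustration only.
# Real embedding models return far more dimensions than this.
toy_vectors = {
    'club meeting time': np.array([0.9, 0.1, 0.0]),
    'when the club meets': np.array([0.8, 0.2, 0.1]),
    'soldering safety training': np.array([0.0, 0.1, 0.9]),
}

def cosine(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

similar = cosine(toy_vectors['club meeting time'], toy_vectors['when the club meets'])
different = cosine(toy_vectors['club meeting time'], toy_vectors['soldering safety training'])

print(f'similar pair:   {similar:.3f}')
print(f'different pair: {different:.3f}')
```

The two club-related vectors point in almost the same direction, so their score lands close to 1. The soldering vector points elsewhere, so its score lands close to 0.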

NVIDIA's nv-embedqa-e5-v5 is an embedding model tuned specifically for question-answer retrieval. It has a quirk worth knowing about up front: it treats queries and passages differently, and you tell it which one you're embedding via an input_type parameter. Getting this wrong is the most common beginner mistake. The call still runs, but retrieval quality drops noticeably.

  • input_type='passage' → use for the documents you store
  • input_type='query' → use for the user's question at search time

That's it. Same model, two modes.
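
One way to make the mode hard to get wrong is to put it in the function name. The sketch below only builds the keyword arguments for an embeddings call, in the same shape this post uses in Step 2; passage_request and query_request are hypothetical helper names, not part of any SDK.

```python
EMBED_MODEL = 'nvidia/nv-embedqa-e5-v5'

def _embedding_request(texts, input_type):
    """Build keyword arguments for an embeddings call with an explicit mode."""
    return {
        'model': EMBED_MODEL,
        'input': list(texts),
        'extra_body': {'input_type': input_type},
    }

def passage_request(texts):
    """Arguments for embedding the documents you store (hypothetical helper)."""
    return _embedding_request(texts, 'passage')

def query_request(question):
    """Arguments for embedding one user question at search time (hypothetical helper)."""
    return _embedding_request([question], 'query')

stored = passage_request(['The USC GPU computing lab is open Monday to Friday.'])
asked = query_request('When is the GPU lab open?')
print(stored['extra_body'], asked['extra_body'])
```

With a client set up as in Step 1, you would pass the result along as client.embeddings.create(**query_request(question)). The point is only that the call site reads "query" or "passage" out loud.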


Step 1: Set up the client and ask() from Part 1

If you're continuing from Part 1, you already have these defined and can skip this cell. If you're starting fresh, paste this in first — everything later builds on it.

%pip install -q openai numpy

import os, getpass
from openai import OpenAI

if not os.getenv('NVIDIA_API_KEY'):
    os.environ['NVIDIA_API_KEY'] = getpass.getpass('Paste your NVIDIA API key (starts with nvapi-): ')

client = OpenAI(
    base_url='https://integrate.api.nvidia.com/v1',
    api_key=os.environ['NVIDIA_API_KEY'],
)

MODEL = 'meta/llama-3.1-8b-instruct'

def ask(system_prompt, user_message):
    response = client.chat.completions.create(
        model=MODEL,
        messages=[
            {'role': 'system', 'content': system_prompt},
            {'role': 'user',   'content': user_message},
        ],
        temperature=0.3,
        max_tokens=400,
    )
    return response.choices[0].message.content

client points the OpenAI SDK at NVIDIA's API Catalog endpoint. ask() is the same chat-completion shape from Part 1. The retriever we're about to build slots in next to these, not instead of them.


Step 2: Build a small knowledge base and embed it as passages

import numpy as np

EMBED_MODEL = 'nvidia/nv-embedqa-e5-v5'

knowledge_base = [
    {'title': 'USC AI Club meeting',
     'text': 'The USC AI Club meets every Thursday at 5 PM in the engineering building, room 204.'},
    {'title': 'USC GPU lab hours',
     'text': 'The USC GPU computing lab is open Monday to Friday from 10 AM to 6 PM.'},
    {'title': 'NVIDIA Developer Program',
     'text': 'USC students can join the NVIDIA Developer Program for free.'},
    {'title': 'Next USC workshop',
     'text': 'The next USC AI Club workshop will cover Retrieval Augmented Generation (RAG).'},
    {'title': 'USC AI/ML office hours',
     'text': 'Office hours for the USC AI/ML faculty are Tuesdays 2-4 PM.'},
    {'title': 'USC robotics lab',
     'text': 'The USC robotics lab requires safety training before students can use the soldering station.'},
    {'title': 'USC tutoring',
     'text': 'Peer tutoring for introductory Python at USC is available Wednesdays from 1 PM to 3 PM.'},
]

def embed_texts(texts, input_type='passage'):
    response = client.embeddings.create(
        model=EMBED_MODEL,
        input=texts,
        extra_body={'input_type': input_type},
    )
    return [np.array(item.embedding, dtype=np.float32) for item in response.data]

# Embed every chunk once, as a passage. Store the vector alongside the text.
embeddings = embed_texts([item['text'] for item in knowledge_base], input_type='passage')
for item, embedding in zip(knowledge_base, embeddings):
    item['embedding'] = embedding

print(f'Embedded {len(knowledge_base)} chunks. Vector dim:', embeddings[0].shape[0])

Two things to notice:

  • The OpenAI Python client doesn't have a native field for NVIDIA's input_type, so we pass it through extra_body. That's the right way to send provider-specific arguments without forking the client.
  • We're storing the embeddings in plain Python dicts. For seven chunks this is fine. For seven thousand, you'd reach for a vector database (and the only thing that changes is where the vectors live; the cosine math is identical).
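
If the list grows but still fits in memory, a common middle step before a vector database is to stack the vectors into one NumPy matrix and score every chunk in a single matrix product. The sketch below uses random stand-in vectors (not real embeddings) and an arbitrary dimension of 8, just to show that the result matches the one-pair-at-a-time formula.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in data: 7 random vectors of dimension 8 instead of real embeddings.
chunk_vectors = [rng.normal(size=8).astype(np.float32) for _ in range(7)]
query_vector = rng.normal(size=8).astype(np.float32)

# Stack once into a (num_chunks, dim) matrix and L2-normalize each row.
matrix = np.stack(chunk_vectors)
matrix_unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)

# Normalize the query too; the dot product of unit vectors is the cosine similarity.
query_unit = query_vector / np.linalg.norm(query_vector)
scores = matrix_unit @ query_unit  # shape: (num_chunks,)

# Cross-check against the one-pair-at-a-time formula.
loop_scores = [
    float(np.dot(v, query_vector) / (np.linalg.norm(v) * np.linalg.norm(query_vector)))
    for v in chunk_vectors
]
print(np.allclose(scores, loop_scores, atol=1e-5))
```

Normalizing the stored rows is a one-time cost, so each query afterwards is one normalization plus one matrix-vector product.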

Step 3: Retrieve the top-k chunks for a question

def cosine_similarity(a, b):
    denominator = np.linalg.norm(a) * np.linalg.norm(b)
    if denominator == 0:
        return 0.0
    return float(np.dot(a, b) / denominator)

def retrieve_context(question, k=3):
    question_embedding = embed_texts([question], input_type='query')[0]

    scored = []
    for item in knowledge_base:
        score = cosine_similarity(question_embedding, item['embedding'])
        scored.append((score, item))

    scored.sort(key=lambda pair: pair[0], reverse=True)
    top_items = [item for score, item in scored[:k]]

    return '\n'.join(f"- {item['text']}" for item in top_items)

Three things are happening here:

  1. The question is embedded as a query, not a passage. This is the part beginners trip over. Same model, different mode.
  2. Cosine similarity scores how close the question vector is to each stored chunk vector. Scores range from -1 to 1: numbers near 1.0 mean very similar, and numbers near 0 mean unrelated.
  3. Top-k picks the highest-scoring chunks. Three is a reasonable default for a tiny knowledge base; tune it for yours.

There is no magic in step 3. A vector database would do the same comparison but use indexing tricks to do it fast at scale.
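
For reference, here is the same top-k selection written against a plain array of scores with np.argsort. This is a sketch, not a replacement for retrieve_context; top_k_indices is a hypothetical helper and the scores are made up.

```python
import numpy as np

def top_k_indices(scores, k=3):
    """Return the indices of the k highest scores, best first."""
    scores = np.asarray(scores)
    k = min(k, len(scores))            # never ask for more items than exist
    order = np.argsort(scores)[::-1]   # ascending sort, reversed to descending
    return order[:k].tolist()

example_scores = [0.12, 0.81, 0.33, 0.79, 0.05]  # made-up similarity scores
print(top_k_indices(example_scores, k=3))
```

Sorting everything is fine for a handful of chunks. It is the full sort over every stored vector that stops scaling, which is the part a vector index avoids.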


Step 4: Plug retrieval into the same ask() from Part 1

def ask_with_retrieval(question):
    context = retrieve_context(question)

    system_prompt = f"""You are a USC campus assistant. Answer ONLY using the
context below. If the answer is not in the context, say
"I don't have that information — check with the USC AI Club."

CONTEXT:
{context}
"""

    return ask(system_prompt, question)

for question in [
    'Where does the USC AI Club meet?',
    'When can I get Python tutoring at USC?',
    'What is the wifi password?',
]:
    print(f'Q: {question}')
    print(f'Context:\n{retrieve_context(question)}')
    print(f'A: {ask_with_retrieval(question)}\n')

Run it. Three things to read carefully:

  • The first question retrieves the AI Club chunk and answers from it. Good.
  • The second retrieves the tutoring chunk and answers from it. Notice that "Python tutoring" doesn't appear verbatim in the stored text — the chunk says "introductory Python" — but the embedding model knows those are semantically close. That's the whole point of vector search over keyword search.
  • The wifi question retrieves three chunks anyway (top-k returns k items no matter how low they score), but none of them contain a password. The assistant should fall back to the refusal line because the prompt tells it to answer ONLY using the context. That's the guardrail from Part 1 doing its job, and it's exactly the bridge into Part 3.
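
Because top-k hands back its best guesses even when nothing matches, one optional addition is a minimum-score cutoff applied before building the context. The sketch below is an assumption layered on top of this post's code: filter_by_score is a hypothetical helper, the scores are invented, and 0.3 is an arbitrary threshold you would need to tune against your own data and embedding model.

```python
def filter_by_score(scored_items, min_score=0.3):
    """Keep (score, item) pairs at or above min_score, best first."""
    kept = [(score, item) for score, item in scored_items if score >= min_score]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return kept

# Invented scores for a question with no good match in the knowledge base.
weak_matches = [(0.21, 'USC GPU lab hours'), (0.18, 'USC AI Club meeting'), (0.15, 'USC tutoring')]
# Invented scores for a question with one clear match.
mixed_matches = [(0.2, 'USC tutoring'), (0.72, 'USC AI Club meeting')]

print(filter_by_score(weak_matches))   # nothing clears the cutoff
print(filter_by_score(mixed_matches))  # only the strong match survives
```

An empty result is a signal you can act on in code, for example by returning the fallback line without calling the LLM at all. The prompt rule still matters, since a cutoff will not catch a chunk that scores well but lacks the answer.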

Step 5: What you actually did

You replaced the hand-picked campus_info string from Part 1 with a real retrieval step. The model call is identical, and the system prompt follows the same guardrail pattern — answer only from the provided context, otherwise fall back. The only structural change is that {context} now comes from a function instead of a hardcoded constant.

That swap is the entire mental model behind RAG. Real production systems add chunking strategies, hybrid search, re-ranking, and a vector database — but the spine stays the same: embed once, embed query, compare, pass top-k to the LLM.

In your own work, the seven-entry knowledge_base becomes hundreds of paragraphs scraped from PDFs, lecture notes, club Slack archives, Notion pages, or a wiki. The retriever code doesn't change. The dict-with-vector storage gets replaced by something like pgvector, Qdrant, or Pinecone the moment you outgrow a Python list.
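
Getting from a long document to knowledge_base entries needs some kind of chunking step. Here is a minimal sketch that splits on blank lines and packs paragraphs into chunks under a word budget. chunk_paragraphs, the 120-word default, and the 'chunk N' titles are all assumptions for illustration; real pipelines often count tokens instead of words and add overlap between chunks.

```python
def chunk_paragraphs(text, max_words=120):
    """Split text on blank lines, then pack whole paragraphs into chunks.

    A chunk is closed when adding the next paragraph would exceed max_words.
    A single paragraph longer than max_words is kept whole as its own chunk.
    """
    paragraphs = [p.strip() for p in text.split('\n\n') if p.strip()]
    chunks, current, count = [], [], 0
    for paragraph in paragraphs:
        words = len(paragraph.split())
        if current and count + words > max_words:
            chunks.append('\n\n'.join(current))
            current, count = [], 0
        current.append(paragraph)
        count += words
    if current:
        chunks.append('\n\n'.join(current))
    return chunks

sample = (
    "First paragraph about club meetings.\n\n"
    "Second paragraph about lab hours.\n\n"
    "Third paragraph about tutoring."
)
chunks = chunk_paragraphs(sample, max_words=10)

# Shape the chunks like the knowledge_base entries used in this post.
entries = [{'title': f'chunk {i}', 'text': chunk} for i, chunk in enumerate(chunks)]
print(len(entries))
```

Each entry can then go through the same passage-embedding step as the hand-written ones.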


Get the code

Repo: github.com/torkian/nvidia-nim-workshop
One-click Colab: Open the Part 1 notebook — paste the Part 2 cells underneath.

MIT licensed. I run this at USC — fork it, swap the knowledge base for your school, your club, your project, and run it wherever you are.


Previously / next in this series

  • Part 1: Build Your First AI App with NVIDIA NIM in 30 Minutes
  • Part 3 (next): Add Guardrails So It Doesn't Lie — a two-layer approach using prompt scope + a tiny verifier call. The fallback line that fired on the wifi question above is the foundation we build on.
