Alright — this is where things finally get interesting. In Part II, we already prepared everything:
Our documents are processed
Chunks are created
Embeddings are stored inside ChromaDB
So technically… our chatbot already has knowledge. But here's the problem: it still doesn't know how to use it. Right now, our system is just "smart storage", not a "smart chatbot". In this part, we're going to fix that.
From Storage → Intelligence
Let’s quickly recall how RAG actually works. Instead of answering directly, the system:
Converts the user question into an embedding
Searches the vector database
Retrieves relevant chunks
Sends them to the LLM as context
Generates an answer
This is what transforms a normal chatbot into a domain-specific assistant.
Step 1 — Converting the User Query into an Embedding
When a user asks something, we don’t send it directly to the model. We first convert it into a vector.
# `embed` is the embedding helper we set up in Part II
query = "How much is cash balances (Kas) for 2024?"
query_embedding = embed([query])[0]
Why? Because our database doesn’t understand raw text — it understands vectors.
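To build intuition for what "understanding vectors" means, here is a tiny stdlib-only sketch of nearest-neighbor search by cosine similarity. The 3-dimensional vectors are toy stand-ins, not real embeddings; real models produce hundreds of dimensions, but the search idea is the same:

```python
import math

def cosine(a, b):
    # cosine similarity: dot(a, b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# toy vectors standing in for real chunk embeddings
chunks = {
    "cash balances 2024": [0.9, 0.1, 0.0],
    "employee handbook":  [0.1, 0.8, 0.3],
}
query_vec = [0.85, 0.15, 0.05]

# pick the chunk whose vector points in the most similar direction
best = max(chunks, key=lambda name: cosine(query_vec, chunks[name]))
print(best)
```

A vector database does exactly this comparison, just at scale and with clever indexing so it doesn't have to scan every chunk.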
Step 2 — Searching the Vector Database
Now we use that embedding to find the most relevant chunks.
results = collection.query(
    query_embeddings=[query_embedding],
    n_results=3
)
This will return the top 3 most relevant pieces of text.
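One detail that trips people up: ChromaDB's `query` returns a dict of lists-of-lists, with one inner list per query embedding. Here is a sketch of that shape (the values are mocked for illustration) and how to unpack it:

```python
# mocked ChromaDB query result, showing the typical shape
results = {
    "ids": [["chunk-12", "chunk-7", "chunk-31"]],
    "documents": [["Kas 2024: ...", "Neraca: ...", "Catatan: ..."]],
    "distances": [[0.21, 0.35, 0.48]],  # lower distance = more similar
}

# index [0] selects the answer set for our single query
top_chunks = results["documents"][0]
for doc_id, text, dist in zip(results["ids"][0], top_chunks, results["distances"][0]):
    print(f"{doc_id} (distance {dist}): {text}")
```

Because we sent one query embedding, everything we need sits at index `[0]` of each list.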
Step 3 — Preparing Context
Combine those chunks into a single context string for the LLM. Note that `collection.query` returns a dict of lists (one inner list per query), so the retrieved texts live under results["documents"][0] — joining `results` directly would join the dict's keys instead.
context = "\n\n".join(results["documents"][0])
Step 4 — Sending Context to the LLM
Pass the combined context plus the original user question to the LLM so it can generate an informed, domain-specific answer.
prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
response = llm.generate(prompt)
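In practice it also helps to tell the model to stay grounded in the retrieved context rather than falling back on its general knowledge. A minimal prompt-builder sketch (the exact wording is just an example, not a fixed recipe):

```python
def build_prompt(context: str, question: str) -> str:
    # instruct the model to answer ONLY from the retrieved context,
    # which reduces hallucinated answers on out-of-scope questions
    return (
        "You are a helpful assistant. Answer using ONLY the context below. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

prompt = build_prompt("<retrieved chunks>", "<user question>")
```

This small instruction is often the difference between a grounded answer and a confident guess.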
Step 5 — Generating the Answer
The LLM uses the retrieved context to produce a response grounded in your documents. This is the step that turns a plain vector store into an intelligent, domain-aware chatbot.
That’s the core RAG flow — convert query → retrieve relevant chunks → provide context → generate answer. In the next part we’ll look at optimizing retrieval quality, prompt engineering, and handling long contexts.
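The whole flow above can be sketched as one function. The embedding helper, ChromaDB collection, and LLM client are passed in as parameters here so the sketch stays self-contained; the names `embed`, `collection.query`, and `llm.generate` match the snippets above, but your actual objects may differ:

```python
def rag_answer(query, embed, collection, llm, n_results=3):
    # Step 1: turn the user question into a vector
    query_embedding = embed([query])[0]
    # Steps 2-3: retrieve the most relevant chunks and build the context
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=n_results,
    )
    context = "\n\n".join(results["documents"][0])
    # Steps 4-5: ask the LLM to answer from the retrieved context
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    return llm.generate(prompt)
```

Keeping the pipeline in one small function like this also makes it easy to unit-test each stage with stubs before wiring in the real models.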
If you want to try it out yourself, check out Nebula Lab here: https://ai-nebula.com/.