RAG stands for Retrieval-Augmented Generation. The idea is simple: before your model answers a question, it first searches a knowledge base for relevant information and uses what it finds to produce a better answer.
I built this entirely without OpenAI — my own embedding model, my own vector database, my own retrieval logic.
Why RAG matters
Without RAG, your model answers purely from what it learned during training. It might hallucinate, its knowledge might be outdated, and it can't cite sources.
With RAG:
- Question comes in
- System searches a database for relevant facts
- Those facts get added to the prompt
- Model answers using the retrieved context
Think of it like the difference between a doctor answering from memory versus a doctor who can look up references before answering.
The three components
1. Embedder — converts text to vectors
I used pritamdeka/S-PubMedBert-MS-MARCO — a sentence transformer specifically trained on medical and scientific text. It converts any text into a list of 768 numbers that represent its meaning.
"Meningitis causes fever and neck stiffness" → [0.23, -0.41, 0.87, ...]
Similar sentences produce similar vectors. This is what makes search work.
```python
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("pritamdeka/S-PubMedBert-MS-MARCO")
embedding = embedder.encode("Patient has fever and stiff neck")
```
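As a quick sanity check, you can compare embeddings directly with cosine similarity (sentence-transformers ships a `util.cos_sim` helper). The example sentences here are mine, but related sentences should score noticeably higher than unrelated ones:

```python
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("pritamdeka/S-PubMedBert-MS-MARCO")

# Two related medical sentences and one unrelated sentence
a = embedder.encode("Meningitis causes fever and neck stiffness")
b = embedder.encode("Patient has fever and a stiff neck")
c = embedder.encode("The invoice is due at the end of the month")

# Related pair scores high, unrelated pair scores low (exact values will vary)
print(util.cos_sim(a, b))
print(util.cos_sim(a, c))
```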
2. ChromaDB — stores and searches vectors
ChromaDB is a local vector database. I embedded 2,000 medical Q&A pairs and stored them. When a new question comes in, ChromaDB finds the most similar stored examples in milliseconds.
```python
import chromadb

client = chromadb.PersistentClient(path="data/embeddings")
collection = client.create_collection("medical_knowledge")

# Store
collection.add(embeddings=embeddings, documents=texts, ids=ids)

# Search
results = collection.query(query_embeddings=[query_vec], n_results=3)
```
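Putting the embedder and ChromaDB together, indexing the knowledge base looks roughly like this. The file path and field names are placeholders for however your Q&A pairs happen to be stored:

```python
import json

import chromadb
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("pritamdeka/S-PubMedBert-MS-MARCO")
client = chromadb.PersistentClient(path="data/embeddings")
collection = client.get_or_create_collection("medical_knowledge")

# Placeholder file: one {"question": ..., "answer": ...} object per pair
with open("data/medical_qa.json") as f:
    qa_pairs = json.load(f)

# Store each pair as a single factual chunk
texts = [f"Q: {p['question']}\nA: {p['answer']}" for p in qa_pairs]
ids = [f"qa-{i}" for i in range(len(texts))]
embeddings = embedder.encode(texts).tolist()

collection.add(embeddings=embeddings, documents=texts, ids=ids)
```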
3. Pipeline — connects everything
The full flow:
```python
def answer(question):
    # 1. Retrieve relevant knowledge
    docs = retrieve(question, top_k=3)

    # 2. Build context
    context = "\n".join([d['content'] for d in docs])

    # 3. Build prompt with context injected
    prompt = f"Medical references:\n{context}\n\nQuestion: {question}\nAnswer:"

    # 4. Generate using fine-tuned model
    output = model.generate(prompt)
    return output, docs
```
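The `retrieve` helper isn't shown above; a minimal sketch, assuming the embedder and ChromaDB collection from earlier, could look like this:

```python
def retrieve(question, top_k=3):
    # Embed the question with the same model used to index the knowledge base
    query_vec = embedder.encode(question).tolist()

    # Ask ChromaDB for the top_k most similar stored chunks
    results = collection.query(query_embeddings=[query_vec], n_results=top_k)

    # Normalize into the list of dicts the pipeline expects
    return [{"content": doc} for doc in results["documents"][0]]
```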
What I learned
The quality of your knowledge base matters as much as the model. I made a mistake early on: I stored raw MCQ training text as knowledge, which caused retrieval to return irrelevant results. Clean, factual knowledge chunks work much better.
RAG is not magic. It helps when your knowledge base is relevant and clean. It hurts when it isn't.