Akhilesh
Building a RAG pipeline without OpenAI

RAG stands for Retrieval Augmented Generation. The idea is simple: before your model answers a question, it first searches a database of relevant knowledge and uses that information to answer better.

I built this entirely without OpenAI — my own embedding model, my own vector database, my own retrieval logic.

Why RAG matters

Without RAG, your model answers purely from what it learned during training: it might hallucinate, its knowledge might be outdated, and it can't cite sources.

With RAG:

  1. Question comes in
  2. System searches a database for relevant facts
  3. Those facts get added to the prompt
  4. Model answers using the retrieved context

Think of it like the difference between a doctor answering from memory versus a doctor who can look up references before answering.

The three components

1. Embedder — converts text to vectors

I used pritamdeka/S-PubMedBert-MS-MARCO — a sentence transformer specifically trained on medical and scientific text. It converts any text into a list of 768 numbers that represent its meaning.

"Meningitis causes fever and neck stiffness" → [0.23, -0.41, 0.87, ...]

Similar sentences produce similar vectors. This is what makes search work.

from sentence_transformers import SentenceTransformer
embedder = SentenceTransformer("pritamdeka/S-PubMedBert-MS-MARCO")
embedding = embedder.encode("Patient has fever and stiff neck")
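"Similar vectors" is usually measured with a metric like cosine similarity. A minimal NumPy sketch — the short vectors and values here are made up for illustration; real embeddings from the model above have 768 dimensions:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: close to 1.0 means same direction."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors standing in for sentence embeddings
fever_a = [0.23, -0.41, 0.87]   # "Meningitis causes fever and neck stiffness"
fever_b = [0.25, -0.38, 0.90]   # "Patient has fever and stiff neck"
billing = [-0.70, 0.60, -0.10]  # an unrelated sentence

print(cosine_similarity(fever_a, fever_b))  # close to 1.0
print(cosine_similarity(fever_a, billing))  # much lower
```

Sentences about the same thing score near 1.0; unrelated sentences score much lower, which is exactly what retrieval exploits.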

2. ChromaDB — stores and searches vectors

ChromaDB is a local vector database. I embedded 2,000 medical Q&A pairs and stored them. When a new question comes in, ChromaDB finds the most similar stored examples in milliseconds.

import chromadb
client = chromadb.PersistentClient(path="data/embeddings")
# get_or_create avoids an error if the collection already exists on a re-run
collection = client.get_or_create_collection("medical_knowledge")

# Store
collection.add(embeddings=embeddings, documents=texts, ids=ids)

# Search
results = collection.query(query_embeddings=[query_vec], n_results=3)
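Conceptually, what the query step does is a nearest-neighbor search over the stored vectors. A brute-force NumPy version using L2 distance — purely illustrative, not how ChromaDB is implemented internally:

```python
import numpy as np

def top_k(query_vec, stored_vecs, k=3):
    """Return indices of the k stored vectors closest to the query (L2 distance)."""
    stored = np.asarray(stored_vecs, dtype=float)
    dists = np.linalg.norm(stored - np.asarray(query_vec, dtype=float), axis=1)
    return np.argsort(dists)[:k].tolist()

# Toy 2-D "embeddings"; real ones would be 768-dimensional
stored = [[0.0, 0.0], [1.0, 1.0], [0.1, 0.1], [5.0, 5.0]]
print(top_k([0.0, 0.05], stored, k=2))  # indices of the two closest rows
```

A real vector database uses index structures to avoid comparing against every stored vector, but the result is the same: the IDs of the closest matches.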

3. Pipeline — connects everything

The full flow:

def answer(question):
    # 1. Retrieve relevant knowledge
    docs = retrieve(question, top_k=3)

    # 2. Build context
    context = "\n".join([d['content'] for d in docs])

    # 3. Build prompt with context injected
    prompt = f"Medical references:\n{context}\n\nQuestion: {question}\nAnswer:"

    # 4. Generate using fine-tuned model
    output = model.generate(prompt)
    return output, docs

What I learned

The quality of your knowledge base matters as much as the model. I made a mistake early on — I stored MCQ training text as knowledge, which caused the retrieval to return irrelevant results. Clean, factual knowledge chunks work much better.
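One simple way to get clean, self-contained chunks is to split reference text on sentence boundaries up to a word budget. This is a hypothetical helper, not code from my pipeline — real projects often use a library splitter instead:

```python
def chunk_text(text, max_words=100):
    """Split text into chunks of whole sentences, each at most max_words words."""
    sentences = [s.strip() for s in text.replace("\n", " ").split(". ") if s.strip()]
    chunks, current = [], []
    for sentence in sentences:
        words = sentence.split()
        # Start a new chunk when adding this sentence would exceed the budget
        if current and len(current) + len(words) > max_words:
            chunks.append(" ".join(current))
            current = []
        current.extend(words)
    if current:
        chunks.append(" ".join(current))
    return chunks

doc = "Meningitis is inflammation of the meninges. Symptoms include fever. Treatment is urgent."
print(chunk_text(doc, max_words=8))
```

Each chunk stays a coherent factual unit, so its embedding represents one idea — unlike MCQ text, where question, options, and answer get mashed into one confusing vector.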

RAG is not magic. It helps when your knowledge base is relevant and clean. It hurts when it isn't.
