Ayesha
Build a RAG System in Python (Without Overcomplicating It)

A few months ago, I built a chatbot that sounded very smart…

Until it started confidently giving completely wrong answers.

It hallucinated:

Product details that didn’t exist
Outdated policies
Even made-up information

That’s when I realized something important:

LLMs are great at reasoning
But terrible at remembering accurate, up-to-date facts

That’s exactly where RAG (Retrieval-Augmented Generation) comes in.

What is a RAG System (In Simple Terms)?

Instead of relying on memory, a RAG system:

Retrieves relevant data
Feeds it to the model
Generates an answer based on real context

Think of it like:

Closed-book exam → LLM alone
Open-book exam → RAG system

The Core Architecture

A basic RAG pipeline looks like this:

Documents → Chunking → Embeddings → Vector DB

User Query → Retrieval → LLM → Answer

The key idea:
The model doesn’t guess — it looks things up first

Minimal Working Example (Python)

Let’s build a simple version step-by-step.

1. Install Dependencies

```bash
pip install sentence-transformers faiss-cpu openai
```

2. Sample Data

```python
documents = [
    "Refunds are allowed within 30 days.",
    "Shipping takes 3-5 business days.",
    "We support Visa and PayPal.",
    "Support is available 24/7.",
]
```

3. Create Embeddings

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(documents)
```

4. Store in FAISS

```python
import faiss
import numpy as np

# FAISS expects float32 vectors
dimension = embeddings.shape[1]
index = faiss.IndexFlatL2(dimension)
index.add(np.array(embeddings, dtype=np.float32))
```

5. Retrieval Function

```python
def retrieve(query, k=2):
    query_embedding = model.encode([query])
    distances, indices = index.search(np.array(query_embedding, dtype=np.float32), k)
    return [documents[i] for i in indices[0]]
```

6. Generate Answer (uses the `openai>=1.0` client API)

```python
from openai import OpenAI

client = OpenAI(api_key="YOUR_API_KEY")

def rag_query(question):
    context = "\n".join(retrieve(question))
    prompt = f"""Answer using only this context:
{context}

Question: {question}
"""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```

7. Test It

```python
print(rag_query("What is the refund policy?"))
```

Boom — you’ve built a basic RAG system.

Where Most RAG Tutorials Fall Short

This is where things get interesting.

Most examples stop here — but real systems fail because of:

Poor chunking
No overlap between chunks
Weak retrieval
No reranking
Blind trust in results

Want the Full Production-Level Breakdown?

If you want a complete step-by-step guide (including LangChain, improvements, and scaling tips), check this:

How to Build a RAG System (Step-by-Step Guide)

How to Actually Improve Your RAG System

If you're serious about building something real:

✅ Better Chunking
500–800 tokens
Add 10–20% overlap
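The overlap idea above can be sketched with a simple sliding window. This is a minimal version that splits on whitespace words as a rough stand-in for tokens (a real pipeline would count tokens with the model's tokenizer); `chunk_text` is a name I'm using for illustration:

```python
def chunk_text(text, chunk_size=500, overlap_ratio=0.15):
    """Split text into fixed-size word windows with overlap.

    Words approximate tokens here; swap in a real tokenizer
    for production use.
    """
    words = text.split()
    overlap = int(chunk_size * overlap_ratio)
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks

# A 1200-word document with 500-word chunks and 15% overlap:
doc = " ".join(f"word{i}" for i in range(1200))
print(len(chunk_text(doc)))  # → 3
```

Each chunk shares its last 75 words with the start of the next one, so a sentence split at a chunk boundary still appears whole in at least one chunk.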
✅ Hybrid Search

Combine:

Semantic search (embeddings)
Keyword search (BM25)
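One common way to combine the two result lists is reciprocal rank fusion (RRF): each document gets a score based on its rank in every list, so a document that ranks well in both semantic and keyword search rises to the top. A minimal sketch (the function name and example IDs are mine):

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked result lists into one.

    rankings: list of ranked doc-id lists, best first.
    Each doc scores sum(1 / (k + rank)) over the lists it appears in;
    k=60 is a commonly used damping constant.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

semantic = ["d3", "d1", "d2"]  # e.g. from FAISS
keyword = ["d1", "d4", "d3"]   # e.g. from BM25
print(reciprocal_rank_fusion([semantic, keyword]))
# → ['d1', 'd3', 'd4', 'd2']
```

`d1` wins because it is near the top of both lists, even though neither ranked it first.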
✅ Use Better Vector DBs
FAISS → Learning
ChromaDB → Intermediate
Pinecone / Qdrant → Production
✅ Add Reranking

Use a second model to refine retrieved results.
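The reranking step might look like the sketch below. In practice the scorer would be a cross-encoder (for example sentence-transformers' `cross-encoder/ms-marco-MiniLM-L-6-v2`) that scores each (query, passage) pair jointly; here a trivial word-overlap scorer stands in so the shape of the code is visible:

```python
def rerank(query, candidates, score_fn, top_n=2):
    """Re-score retrieved candidates and keep the best top_n.

    score_fn stands in for a cross-encoder scoring (query, passage) pairs.
    """
    return sorted(candidates, key=lambda doc: score_fn(query, doc), reverse=True)[:top_n]

def word_overlap(query, doc):
    # Placeholder scorer: fraction of query words found in the doc.
    q = set(query.lower().split())
    return len(q & set(doc.lower().split())) / max(len(q), 1)

docs = [
    "Shipping takes 3-5 business days.",
    "Refunds are allowed within 30 days.",
    "Support is available 24/7.",
]
print(rerank("are refunds allowed", docs, word_overlap, top_n=1))
```

The retriever casts a wide net; the reranker, which sees query and passage together, makes the final cut.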

Common Mistakes

Avoid these early:

❌ Tiny chunks → bad context
❌ No overlap → broken answers
❌ Ignoring retrieval quality
❌ Over-relying on LLM

When Should You Use RAG?

Use it when:

Data changes frequently
Accuracy matters
You need grounded answers

Skip it when:

You only need tone or style control (prompting or fine-tuning handles that better)
