My first RAG system answered "I don't know" to questions that were clearly in the documents. The information was right there — paragraph three, page seven — and the AI couldn't find it.
Turns out, my chunking strategy was destroying context. I was splitting documents every 1,000 characters like every tutorial told me to. The split landed in the middle of a sentence about quarterly revenue targets. The first half ended up in one chunk, the second half in another, and the embedding for each half was meaningless.
That was the moment I understood: RAG isn't a retrieval problem or a generation problem. It's an architecture problem. And most tutorials stop at step one.
Here's how to build a RAG system that actually works — from loading documents to generating accurate answers.
What RAG Actually Is
RAG stands for Retrieval-Augmented Generation. Instead of asking an LLM to answer from its training data (which might be outdated or wrong), you give it YOUR documents and let it answer from those.
The pipeline:
```
Documents → Chunk → Embed → Store in Vector DB →
User asks question → Embed question → Search for similar chunks →
Feed chunks + question to LLM → Generate answer
```
Simple in theory. The devil is in each arrow.
Step 1: Load and Chunk Your Documents
Loading is straightforward. Chunking is where most RAG systems break.
```python
from langchain.document_loaders import DirectoryLoader, TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load documents
loader = DirectoryLoader('./docs', glob="**/*.md", loader_cls=TextLoader)
documents = loader.load()

# Bad chunking (what most tutorials teach)
bad_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=0  # No overlap = lost context
)

# Good chunking
good_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,     # Smaller = more precise retrieval
    chunk_overlap=100,  # Overlap preserves context at boundaries
    separators=["\n## ", "\n### ", "\n\n", "\n", " "]  # Respect document structure
)
chunks = good_splitter.split_documents(documents)
```
Why 500 instead of 1,000? Smaller chunks mean more precise retrieval. When a user asks "What's the refund policy?", you want to retrieve the exact paragraph about refunds — not a 1,000-character block that's half refund policy and half shipping information.
Why overlap of 100? Because sentences at chunk boundaries get split. Overlapping by 100 characters means the end of chunk N appears at the start of chunk N+1. Context preserved.
Why respect separators? Splitting on headings and paragraphs keeps semantic units together. Splitting on character count doesn't care if it lands mid-sentence.
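To make the overlap concrete, here's a toy character-based splitter — a simplification of the idea, not `RecursiveCharacterTextSplitter`'s actual implementation:

```python
def split_with_overlap(text, chunk_size=500, overlap=100):
    """Naive fixed-size splitter: each chunk starts where the
    previous one ended, minus `overlap` characters."""
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks

text = "".join(str(i % 10) for i in range(1200))
chunks = split_with_overlap(text)
# The last 100 characters of each chunk reappear at the start of the
# next one, so a sentence cut at a boundary still appears whole somewhere.
```

Run it on any 1,200-character string and you get three chunks whose edges share 100 characters — that shared region is what saves the sentence about quarterly revenue targets.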
Step 2: Embed and Store
Turn each chunk into a vector embedding and store it in a vector database.
```python
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma  # or Pinecone, Weaviate, pgvector

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Create vector store
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db"
)
```
Embedding model choice matters. `text-embedding-3-small` is fast and cheap. For higher accuracy on technical content, `text-embedding-3-large` is worth the extra cost. For completely local/private data, consider open-source alternatives like `sentence-transformers/all-MiniLM-L6-v2`.
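What does the vector store do with those embeddings at query time? It compares vectors, most commonly by cosine similarity. A toy illustration with made-up 3-dimensional vectors (real embeddings have hundreds or thousands of dimensions):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two vectors: closer to 1.0 = more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Pretend embeddings: the query points roughly the same direction
# as the refunds chunk and away from the shipping chunk
query = [0.9, 0.1, 0.0]
chunk_about_refunds = [0.8, 0.2, 0.1]
chunk_about_shipping = [0.1, 0.1, 0.9]

refund_score = cosine_similarity(query, chunk_about_refunds)     # ~0.98
shipping_score = cosine_similarity(query, chunk_about_shipping)  # ~0.12
```

This is why chunk quality matters so much: the vector store ranks chunks purely by this geometric score, so a chunk that mixes two topics gets a muddled vector that scores mediocre against everything.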
Step 3: Retrieval — Where Most Systems Fail
Naive retrieval: find the top-k most similar vectors to the user's question. This works for simple questions but fails badly when:
- The question uses different words than the document (semantic gap)
- Multiple chunks contain partial answers
- The most similar chunk isn't the most relevant chunk
Here are three upgrades that dramatically improve retrieval:
Upgrade A: Hybrid Search (Vector + Keyword)
```python
from langchain.retrievers import BM25Retriever, EnsembleRetriever

# Vector retriever (semantic similarity)
vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 10})

# BM25 retriever (keyword matching)
bm25_retriever = BM25Retriever.from_documents(chunks)
bm25_retriever.k = 10

# Combine both
ensemble_retriever = EnsembleRetriever(
    retrievers=[vector_retriever, bm25_retriever],
    weights=[0.6, 0.4]  # Favor semantic, supplement with keywords
)
```
Why hybrid? Vector search finds semantically similar content. BM25 finds exact keyword matches. Together they catch what each misses alone. User asks "What is the PTO policy?" — vector search finds "vacation days and time off" while BM25 catches the exact term "PTO."
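`EnsembleRetriever` merges the two ranked lists using weighted Reciprocal Rank Fusion. A minimal sketch of that merge, with hypothetical document IDs for the PTO example:

```python
def rrf_merge(rankings, weights, k=60):
    """Reciprocal Rank Fusion: a doc's score is the weighted sum of
    weight / (k + rank) over every ranked list it appears in."""
    scores = {}
    for ranking, weight in zip(rankings, weights):
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + weight / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Vector search surfaces the "vacation days" chunk; BM25 surfaces the
# chunk containing the literal string "PTO"
vector_ranked = ["vacation_days", "pto_definition", "benefits_faq"]
bm25_ranked = ["pto_definition", "payroll_codes"]

merged = rrf_merge([vector_ranked, bm25_ranked], weights=[0.6, 0.4])
```

The chunk that appears in both lists wins even though neither retriever ranked it first — exactly the "catch what each misses alone" effect.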
Upgrade B: HyDE (Hypothetical Document Embeddings)
Instead of embedding the question, generate a hypothetical answer first, then search with that.
```python
from langchain.chains import HypotheticalDocumentEmbedder
from langchain.chat_models import ChatOpenAI

hyde_embeddings = HypotheticalDocumentEmbedder.from_llm(
    llm=ChatOpenAI(model="gpt-4o-mini"),
    base_embeddings=embeddings,
    prompt_key="web_search"
)
```
The intuition: a question like "What's the refund policy?" is semantically different from the answer paragraph that describes the actual policy. But a hypothetical answer ABOUT refund policies is much closer to the real document. HyDE bridges that gap.
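Stripped of the framework, HyDE is three steps. Here's a sketch where `generate_answer`, `embed`, and `vectorstore_search` are stand-ins for your LLM, embedding model, and vector store:

```python
def hyde_search(question, generate_answer, embed, vectorstore_search):
    """HyDE: embed a hypothetical answer instead of the question itself."""
    hypothetical = generate_answer(question)  # an LLM call in practice
    query_vector = embed(hypothetical)        # embedding model in practice
    return vectorstore_search(query_vector)   # vector DB lookup in practice

# Stub implementations, just to show the data flow
def answer_stub(question):
    return "Refunds are issued within 30 days of purchase."

def embed_stub(text):
    return [float(len(text))]  # fake one-dimensional "embedding"

def search_stub(vector):
    return ["chunk retrieved for vector " + str(vector)]

results = hyde_search("What's the refund policy?", answer_stub, embed_stub, search_stub)
```

The key point is in the first line: the question never gets embedded. Only the hypothetical answer does, because it lives in the same "answer-shaped" region of embedding space as your real documents.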
Upgrade C: Re-ranking
Retrieve broadly (top 20), then re-rank to find the best 5.
```python
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CrossEncoderReranker
from langchain_community.cross_encoders import HuggingFaceCrossEncoder

reranker = CrossEncoderReranker(
    model=HuggingFaceCrossEncoder(model_name="cross-encoder/ms-marco-MiniLM-L-6-v2"),
    top_n=5
)

compression_retriever = ContextualCompressionRetriever(
    base_compressor=reranker,
    base_retriever=vectorstore.as_retriever(search_kwargs={"k": 20})  # Retrieve 20, re-rank to 5
)
```
Why re-rank? Bi-encoder similarity (what vector search uses) is fast but approximate. Cross-encoder re-ranking is slower but far more accurate. Retrieve broadly, then re-rank precisely.
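The two-stage pattern is simple enough to sketch without the framework. Here `fast_score` stands in for bi-encoder similarity and `accurate_score` for the cross-encoder:

```python
def retrieve_then_rerank(query, docs, fast_score, accurate_score,
                         fetch_k=20, top_n=5):
    """Stage 1: rank all docs with the cheap scorer, keep fetch_k.
    Stage 2: re-rank only those candidates with the expensive scorer."""
    candidates = sorted(docs, key=lambda d: fast_score(query, d),
                        reverse=True)[:fetch_k]
    return sorted(candidates, key=lambda d: accurate_score(query, d),
                  reverse=True)[:top_n]

# Toy scorers over integer "documents": the cheap scorer thinks docs
# near 10 are relevant; the accurate one knows the truth is near 15
docs = list(range(100))
fast = lambda q, d: -abs(d - 10)
accurate = lambda q, d: -abs(d - 15)

top5 = retrieve_then_rerank("query", docs, fast, accurate)
```

The expensive scorer only ever sees 20 documents instead of 100, which is the whole economics of re-ranking: broad and cheap first, narrow and precise second.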
Step 4: Generation — The Prompt Matters
The difference between a hallucinating RAG system and an honest one is usually the system prompt.
```python
from langchain.prompts import ChatPromptTemplate
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA

prompt = ChatPromptTemplate.from_template("""
Answer the user's question using ONLY the provided context.
If the context doesn't contain the answer, say "I don't have
enough information to answer this question."
Never make up facts. Never extrapolate beyond the context.

Context:
{context}

Question: {question}

Answer:
""")

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

# Build the chain
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=compression_retriever,
    chain_type_kwargs={"prompt": prompt},
    return_source_documents=True
)

# Ask a question
result = qa_chain.invoke({"query": "What is the refund policy?"})
print(result["result"])
print("Sources:", [doc.metadata for doc in result["source_documents"]])
```
Three critical instructions in that prompt: (1) ONLY use the provided context, (2) admit when you don't know, (3) never make up facts. Remove any of these and your system will hallucinate confidently.
Step 5: Evaluate (The Step Everyone Skips)
Most RAG systems ship with zero evaluation. That's like deploying a web app without testing.
Three things to measure:
```python
# 1. Retrieval quality: did we fetch the right chunks?
def retrieval_precision(query, retrieved_docs, relevant_docs):
    relevant_retrieved = set(retrieved_docs) & set(relevant_docs)
    return len(relevant_retrieved) / len(retrieved_docs)

# 2. Answer faithfulness: does the answer match the context?
#    (Use an LLM to verify — no hallucination beyond context)

# 3. Answer relevance: does the answer address the question?
#    (Use an LLM to score 1-5)
```
Build a test set of 20-50 question/answer pairs from your documents. Run your RAG against them. Track precision, faithfulness, and relevance over time. When you change chunking, retrieval, or prompts — re-run the eval.
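A minimal harness for the retrieval-precision part, using hypothetical document IDs and a stubbed retriever in place of a real one:

```python
def evaluate_retrieval(test_set, retrieve):
    """Average retrieval precision over (question, relevant_ids) pairs.
    `retrieve` is anything that returns document IDs for a question."""
    total = 0.0
    for question, relevant_ids in test_set:
        retrieved = retrieve(question)
        hits = set(retrieved) & set(relevant_ids)
        total += len(hits) / len(retrieved) if retrieved else 0.0
    return total / len(test_set)

# Hypothetical test set: each entry pairs a question with the chunk
# IDs a human judged relevant
test_set = [
    ("What is the refund policy?", {"refunds_01", "refunds_02"}),
    ("How much PTO do new hires get?", {"pto_03"}),
]

def retrieve_stub(question):
    if "refund" in question:
        return ["refunds_01", "shipping_09"]  # one hit, one miss: 0.5
    return ["pto_03"]                         # perfect: 1.0

score = evaluate_retrieval(test_set, retrieve_stub)  # (0.5 + 1.0) / 2 = 0.75
```

Swap `retrieve_stub` for your real retriever and this number becomes the regression metric you re-run after every chunking, retrieval, or prompt change.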
RAG Architecture Comparison
Not all RAG is created equal:
| Architecture | How It Works | Best For |
|---|---|---|
| Naive RAG | Top-k vector search → generate | Simple docs, getting started |
| HyDE | Hypothetical answer → search with that | When questions ≠ document language |
| Hybrid | Vector + BM25 keyword search | Most production systems |
| Corrective RAG | Check retrieval quality → retry if bad | High-accuracy requirements |
| Graph RAG | Knowledge graph + vector search | Complex entity relationships |
| Agentic RAG | Agent decides retrieval strategy | Multi-step reasoning |
Most production systems should start with Hybrid + Re-ranking. It handles 80% of use cases well.
RAG Debugging Checklist
When your RAG isn't working, check these in order:
□ 1. Are chunks too large? (>1000 chars = probably too big)
□ 2. Is there chunk overlap? (0 overlap = context loss)
□ 3. Are you respecting document structure? (split on headings)
□ 4. Is retrieval returning relevant chunks? (print them!)
□ 5. Is your prompt explicit about "only use context"?
□ 6. Are you using temperature=0 for factual answers?
□ 7. Have you tried hybrid search? (vector + keyword)
□ 8. Do you have an evaluation set? (20+ Q&A pairs minimum)
If your RAG answers are mediocre, it's almost never the model. It's the retrieval. Fix retrieval first, always.
What's the hardest question your RAG system can't answer?
Tags: ai, rag, tutorial, python