Retrieval-Augmented Generation (RAG) has moved far beyond simple chat-over-PDF demos. In 2026, if your RAG system hallucinates on important queries, returns irrelevant chunks, or costs a fortune to run, it won't survive production.
This tutorial walks you through building a reliable, evaluable, and scalable RAG pipeline that you can actually put behind an API or in a product. We'll use your own documents (PDFs, Markdown, text files, etc.) and focus on the parts that actually matter in real deployments: smart chunking, hybrid retrieval, reranking, evaluation, and basic guardrails.
Why Most RAG Projects Fail in Production
- Bad chunking destroys context.
- Pure vector search misses exact keywords.
- No evaluation = you have no idea if it's improving.
- No reranking or metadata filtering = noisy results.
- No separation between indexing and querying pipelines.
We'll address all of these.
Tech Stack (2026 Edition – Balanced & Practical)
- Orchestration: LangChain (flexible) or LlamaIndex (stronger for document-heavy RAG). I'll use LangChain here.
- Embeddings: text-embedding-3-large (OpenAI) or open-source alternatives like Snowflake Arctic Embed.
- Vector Store: Chroma (dev) → Qdrant or Weaviate (production).
- LLM: Grok, Claude, GPT-4o, or a local model via Ollama.
- Reranking: Cohere Rerank or BGE reranker.
- Evaluation: Ragas.
Prerequisites
pip install langchain langchain-community langchain-openai langchain-qdrant \
pypdf sentence-transformers chromadb ragas cohere
Step 1: Document Loading & Cleaning
from langchain_community.document_loaders import PyPDFDirectoryLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
loader = PyPDFDirectoryLoader("your_documents_folder/")
docs = loader.load()
print(f"Loaded {len(docs)} documents")
Step 2: Strategic Chunking
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,        # characters, not tokens
    chunk_overlap=150,     # overlap keeps sentences intact across chunk boundaries
    separators=["\n\n", "\n", ". ", " ", ""]  # try paragraphs first, then sentences
)
chunks = text_splitter.split_documents(docs)
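To see what the overlap parameter buys you, here is a toy fixed-width chunker (not the LangChain splitter, just the concept): with overlap, text cut at a chunk boundary reappears whole at the start of the next chunk, so a sentence is never lost to the seam.

```python
def chunk_text(text: str, size: int = 20, overlap: int = 5) -> list[str]:
    """Toy fixed-width chunker illustrating why overlap matters."""
    step = size - overlap  # advance less than the window width
    return [text[i:i + size] for i in range(0, len(text), step)]

text = "The quarterly revenue grew by twelve percent."
for chunk in chunk_text(text):
    print(repr(chunk))  # adjacent chunks share their boundary text
```

The real splitter above is smarter (it prefers paragraph and sentence boundaries), but the overlap mechanic is the same.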
Step 3: Embeddings & Vector Store
from langchain_openai import OpenAIEmbeddings
from langchain_qdrant import QdrantVectorStore
from qdrant_client import QdrantClient
embeddings = OpenAIEmbeddings(model="text-embedding-3-large")
client = QdrantClient(":memory:")  # in-memory for dev; point at a Qdrant server URL in production
vector_store = QdrantVectorStore.from_documents(
    documents=chunks,
    embedding=embeddings,
    client=client,
    collection_name="my_knowledge_base"
)
Step 4: Retrieval with Reranking
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CrossEncoderReranker
from langchain_community.cross_encoders import HuggingFaceCrossEncoder
# Cast a wide net first, then let the cross-encoder rerank down to the best 5
retriever = vector_store.as_retriever(search_kwargs={"k": 20})
compressor = CrossEncoderReranker(
    model=HuggingFaceCrossEncoder(model_name="BAAI/bge-reranker-large"),
    top_n=5
)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=retriever
)
Step 5: The RAG Chain
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
llm = ChatOpenAI(model="gpt-4o", temperature=0.0)
template = """Answer the question based only on the following context.
If you don't know the answer, say "I don't have enough information."
Context:
{context}
Question: {question}
Answer:"""
prompt = ChatPromptTemplate.from_template(template)
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

rag_chain = (
    {"context": compression_retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
print(rag_chain.invoke("What are the key points from the Q3 report?"))
Step 6: Evaluation with Ragas
Use Ragas to measure faithfulness, answer relevancy, context precision, and recall on a test dataset of questions and ground truth answers.
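A minimal sketch of what that looks like. The sample rows below are made up for illustration, and Ragas column names and metric imports have shifted between versions (e.g. `ground_truth` vs. `ground_truths`), so check the docs for the release you install. Scoring calls an LLM under the hood, so the evaluation itself is guarded behind a dependency and API-key check:

```python
import os

# A tiny hypothetical test set; in practice collect 20-50 real questions.
eval_data = {
    "question": ["What are the key points from the Q3 report?"],
    "answer": ["Revenue grew 12% year over year."],          # from rag_chain.invoke(...)
    "contexts": [["Q3 revenue increased 12% versus Q3 last year."]],  # retrieved chunks
    "ground_truth": ["Q3 revenue grew 12% year over year."],
}

try:
    from datasets import Dataset
    from ragas import evaluate
    from ragas.metrics import (
        faithfulness, answer_relevancy, context_precision, context_recall,
    )
    HAVE_RAGAS = True
except ImportError:
    HAVE_RAGAS = False

# Metric computation uses an LLM judge, so only run with credentials present
if HAVE_RAGAS and os.environ.get("OPENAI_API_KEY"):
    scores = evaluate(
        Dataset.from_dict(eval_data),
        metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
    )
    print(scores)
```

Track these scores across every change to chunking, retrieval, or prompts; a regression here is your early warning before users see it.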
Going Production-Ready
- Separate indexing and querying pipelines
- Add semantic caching to reduce costs
- Implement guardrails (e.g., Guardrails AI or NeMo)
- Set up monitoring with LangSmith, Phoenix, or Prometheus
- Deploy using FastAPI with async endpoints
- Build a proper re-indexing strategy for fresh documents
Final Thoughts
Building a basic RAG takes an afternoon. Building one that stays accurate, cheap, and trustworthy at scale takes discipline around retrieval quality and continuous evaluation.
Start small: load your documents, get decent retrieval, add evaluation, then iterate based on real metrics — not gut feel.
The code above gives you a solid foundation you can extend today. Drop your documents in a folder and start experimenting.
Happy building!
Have you built a production RAG system? What was the biggest surprise or pain point? Share your experiences in the comments.