Beyond Basic RAG: Learn LangChain + RAG End-to-End
Introduction
Retrieval-Augmented Generation (RAG) is one of the most important concepts in modern Generative AI.
Large Language Models (LLMs) like GPT-4, Claude, LLaMA, and Gemini are powerful. However, they suffer from one major issue:
Hallucination
Hallucination means:
The model confidently generates incorrect information.
Example:
Question:
Who is the CEO of my company?
Without access to your internal company data, an LLM may generate a completely wrong answer.
This is where RAG (Retrieval-Augmented Generation) becomes useful.
Instead of relying only on pretrained knowledge, RAG retrieves relevant information from external sources and provides context to the LLM before generating a response.
What is RAG?
RAG stands for:
Retrieval-Augmented Generation
Instead of:
Question → LLM → Answer
We do:
Question
↓
Retrieve Relevant Documents
↓
Provide Context to LLM
↓
Generate Grounded Response
This makes responses:
- More accurate
- Context-aware
- Less prone to hallucination
- Enterprise-ready
Complete RAG Architecture
Documents (PDFs, DOCX, TXT)
↓
Document Loading
↓
Chunking
↓
Embeddings
↓
Vector Database
↓
Similarity Search
↓
Reranking
↓
Context Building
↓
LLM
↓
Final Answer
↓
Monitoring & Evaluation
Required Installation
Before starting, install all dependencies.
pip install langchain
pip install langchain-community
pip install langchain-core
pip install langchain-openai
pip install langchain-text-splitters
pip install langchain-nvidia-ai-endpoints
pip install pymilvus
pip install pymupdf
pip install pypdf
pip install langfuse
pip install python-dotenv
Project Structure
project/
├── data/
│   ├── pdf/
│   └── text/
├── .env
├── rag_pipeline.py
└── requirements.txt
Environment Variables (.env)
Never hardcode API keys.
Create a .env file.
NVIDIA_API_KEY=your_key
AZURE_OPENAI_ENDPOINT=your_endpoint
AZURE_OPENAI_KEY=your_key
AZURE_OPENAI_DEPLOYMENT=gpt-4o
LANGFUSE_PUBLIC_KEY=your_key
LANGFUSE_SECRET_KEY=your_key
LANGFUSE_BASE_URL=https://cloud.langfuse.com
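Since every later step reads these values with os.getenv, a missing variable only shows up as a confusing error deep inside a client library. A minimal sketch of a fail-fast check, assuming the variable names from the .env file above (missing_vars is a hypothetical helper, not part of any library):

```python
import os

# Variable names copied from the .env example above.
REQUIRED_VARS = [
    "NVIDIA_API_KEY",
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_KEY",
    "AZURE_OPENAI_DEPLOYMENT",
    "LANGFUSE_PUBLIC_KEY",
    "LANGFUSE_SECRET_KEY",
    "LANGFUSE_BASE_URL",
]


def missing_vars(env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_vars()
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All required environment variables are set.")
```

In the real pipeline this would run right after load_dotenv(), before any client is created.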
1. Understanding LangChain Document Structure
LangChain stores documents in a standardized format.
A document contains:
- page_content
- metadata
page_content
This contains actual text.
Example:
page_content = "Generative AI is growing rapidly."
metadata
Metadata stores additional information.
Examples:
- file name
- author
- created date
- source
- page number
Creating a LangChain Document
Import
from langchain_core.documents import Document
Code
from langchain_core.documents import Document
doc = Document(
page_content="""
Generative AI is a subset of Artificial Intelligence
focused on creating content.
""",
metadata={
"source": "genai.pdf",
"author": "Sridhar",
"pages": 10
}
)
print(doc)
Output
Document(
page_content='Generative AI...',
metadata={
'source': 'genai.pdf',
'author': 'Sridhar',
'pages': 10
}
)
Why metadata matters?
In enterprise AI, you often want answers that point back to their origin:
"Show answer from document X, page 5"
Metadata makes that traceability possible.
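As an illustration of that traceability, the sketch below turns metadata into a short source label. Doc is a stand-in dataclass with the same two fields as a LangChain Document so the example runs without dependencies, and format_citation is a hypothetical helper.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Stand-in with the same two fields as a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_citation(doc):
    """Build a short source label from whatever metadata is present."""
    source = doc.metadata.get("source", "unknown source")
    page = doc.metadata.get("page")
    return f"{source}, page {page}" if page is not None else source


doc = Doc(
    page_content="Generative AI is a subset of Artificial Intelligence.",
    metadata={"source": "genai.pdf", "page": 5},
)
print(format_citation(doc))  # genai.pdf, page 5
```

The same function should work on real Document objects, since it only reads the metadata dictionary.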
2. Loading Documents
Before processing documents, we must load them.
LangChain provides multiple loaders.
TextLoader
Used for:
- .txt files (plain text files)
Import
from langchain_community.document_loaders import TextLoader
Example
loader = TextLoader(
"data/text/sample.txt",
encoding="utf-8"
)
documents = loader.load()
print(documents)
DirectoryLoader
Loads multiple files from a folder.
Useful when:
You have:
100 PDFs
50 TXT files
many documents
Import
from langchain_community.document_loaders import DirectoryLoader
Example
loader = DirectoryLoader(
"data/text",
glob="*.txt",
loader_cls=TextLoader,
loader_kwargs={
"encoding":"utf-8"
}
)
documents = loader.load()
print(documents)
PDF Loader
Most enterprise RAG systems use PDFs.
LangChain supports:
PyPDFLoader
Simple and fast.
Import
from langchain_community.document_loaders import PyPDFLoader
Example
loader = PyPDFLoader(
"data/pdf/rag_guide.pdf"
)
documents = loader.load()
print(documents[0])
Each page becomes:
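Document(
    page_content="Page text",
    metadata={"source": "data/pdf/rag_guide.pdf", "page": 0}
)
PyPDFLoader numbers pages from 0, so the first page has "page": 0. Newer versions may add further metadata fields.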
3. Chunking Documents
Chunking is one of the most important parts of RAG.
Why?
Because LLMs have token limits.
You cannot send a 500-page PDF to GPT in one prompt.
Instead, we split documents into smaller chunks.
Why Chunking Matters?
Bad chunking causes:
- poor retrieval
- hallucination
- context loss
Good chunking improves:
- retrieval quality
- relevance
- accuracy
RecursiveCharacterTextSplitter
Most commonly used splitter.
Import
from langchain_text_splitters import (
RecursiveCharacterTextSplitter
)
Code
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,      # maximum characters per chunk
    chunk_overlap=50,    # characters shared between neighbouring chunks
    length_function=len,
    separators=["\n\n", "\n", " ", ""],  # tried in order, coarse to fine
)
chunks = text_splitter.split_documents(documents)
print(len(chunks))
Parameters Explained
chunk_size
How large each chunk should be.
Example:
chunk_size=500
means:
At most 500 characters per chunk (measured with length_function=len).
chunk_overlap
Prevents context loss.
Example:
Chunk 1:
Artificial Intelligence is...
Chunk 2 starts with:
Intelligence is...
This preserves continuity.
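To see what chunk_overlap does mechanically, here is a minimal fixed-window sketch in plain Python. It is not how RecursiveCharacterTextSplitter works internally (that splitter also tries the separators in order); split_with_overlap is a hypothetical helper used only to make the overlap visible.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into fixed-size windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


text = "Artificial Intelligence is transforming industry."
chunks = split_with_overlap(text, chunk_size=30, chunk_overlap=10)
for chunk in chunks:
    print(repr(chunk))
```

The last 10 characters of the first chunk are repeated at the start of the second, so a sentence cut at the boundary still appears whole in one of them.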
Best Practices
Recommended:
chunk_size = 300 to 800
chunk_overlap = 30 to 100
for most enterprise RAG systems.
4. Understanding Embeddings
Once chunking is completed, we need to convert text into a format machines can understand.
Similarity search works on numbers (vectors), not raw text.
This is where Embeddings come in.
What are Embeddings?
Embeddings convert text into numerical vector representations.
Example:
Text:
"Artificial Intelligence"
becomes:
[0.24, -0.76, 0.88, ....]
These vectors help us find:
Semantic Meaning
Example:
What is AI?
and
Explain Artificial Intelligence
have similar meanings.
Embedding models place them close together in vector space.
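"Close together" is usually measured with cosine similarity. The sketch below uses hand-made 3-dimensional vectors purely for illustration; real embeddings come from the model and have far more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors, invented for this example (not produced by any model).
what_is_ai = [0.9, 0.1, 0.0]
explain_ai = [0.8, 0.2, 0.1]
pizza_recipe = [0.0, 0.1, 0.9]

print(cosine_similarity(what_is_ai, explain_ai))    # close to 1
print(cosine_similarity(what_is_ai, pizza_recipe))  # close to 0
```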
Why Embeddings are Important in RAG?
Without embeddings:
Search becomes:
Keyword matching
Example:
Searching:
CEO
Only returns exact keyword matches.
With embeddings:
Search becomes:
Semantic Search
Meaning-based retrieval.
Even if wording differs.
NVIDIA Embeddings
We will use:
NVIDIA Llama Nemotron Embedding Model
Advantages:
- Fast
- High-quality embeddings
- Good semantic understanding
- Free developer tier
Import Required Libraries
import os
from dotenv import load_dotenv
from langchain_nvidia_ai_endpoints import (
NVIDIAEmbeddings
)
Load Environment Variables
load_dotenv()
Initialize Embedding Model
embedding_model = (
NVIDIAEmbeddings(
model=
"nvidia/llama-nemotron-embed-vl-1b-v2",
nvidia_api_key=
os.getenv(
"NVIDIA_API_KEY"
)
)
)
Convert Chunks into Embeddings
Before embedding:
We only need:
page_content
from chunks.
Extract Text
texts = [
chunk.page_content
for chunk in chunks
]
Generate Embeddings
embedded_vectors = (
embedding_model.embed_documents(
texts
)
)
Check Embedding Dimension
print(
len(
embedded_vectors
)
)
print(
len(
embedded_vectors[0]
)
)
Example output:
50
2048
Meaning: 50 chunks (this count depends on your documents), each a 2048-dimensional vector.
Query Embedding
User questions also need embeddings.
Example:
query = (
"What is RAG?"
)
query_embedding = (
embedding_model.embed_query(
query
)
)
Now query and document vectors can be compared.
5. Vector Databases (Milvus)
Imagine storing millions of embeddings in a regular SQL table.
Comparing a query against every row would be very slow.
Traditional databases are not optimized for:
Similarity Search
We need:
Vector Database
Examples:
- Pinecone
- FAISS
- Chroma
- Milvus
- Weaviate
We will use:
Milvus
Why?
- Fast retrieval
- Open-source
- Enterprise-ready
- Optimized for vectors
Install Milvus
pip install pymilvus
Import Milvus
from pymilvus import (
MilvusClient
)
Create Milvus Connection
client = MilvusClient(
uri="milvus_demo.db"
)
print(
"Connected Successfully"
)
Create Collection
A collection is like:
SQL Table
for vector data.
try:
    client.create_collection(
        collection_name="rag_collection",
        dimension=2048
    )
    print("Collection Created")
except Exception as e:
    print(e)
Why Dimension Matters?
Embedding vector size:
2048
Collection dimension must match embedding dimension.
Otherwise:
Insertion will fail
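One way to catch a mismatch early is to check vector lengths before calling insert. This is a small sketch; EXPECTED_DIM and find_dimension_mismatches are hypothetical names, and the sample vectors are placeholders rather than real embeddings.

```python
EXPECTED_DIM = 2048  # must equal the dimension used when the collection was created


def find_dimension_mismatches(vectors, expected_dim=EXPECTED_DIM):
    """Return the positions of vectors whose length differs from expected_dim."""
    return [i for i, vec in enumerate(vectors) if len(vec) != expected_dim]


# Placeholder vectors: the second one has the wrong size on purpose.
sample = [[0.0] * 2048, [0.0] * 1024]
bad = find_dimension_mismatches(sample)
if bad:
    print(f"Vectors at positions {bad} do not have {EXPECTED_DIM} dimensions")
```

In the pipeline, the same check could be run on embedded_vectors right before building the insert payload.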
Insert Data into Milvus
We store:
- ID
- Embedding vector
- Chunk text
Prepare Data
data = []
for i, (chunk, embedding) in enumerate(zip(chunks, embedded_vectors)):
    data.append({
        "id": i,
        "vector": embedding,
        "text": chunk.page_content
    })
Insert into Collection
client.insert(
collection_name=
"rag_collection",
data=data
)
print(
"Inserted Successfully"
)
6. Similarity Retrieval
Now comes the real magic.
When user asks:
"What is RAG?"
We do:
- Convert query → embedding
- Search similar vectors
- Return relevant chunks
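Conceptually, the three steps above amount to scoring every stored vector against the query vector and keeping the best few. The sketch below does that with an exhaustive scan in plain Python. It only illustrates the idea and is not how Milvus is implemented (a vector database uses indexes to avoid scanning everything). The rows and their 3-dimensional vectors are invented for the example.

```python
import heapq
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def top_k(query_vec, rows, k):
    """Score every row against the query and keep the k most similar."""
    return heapq.nlargest(k, rows, key=lambda row: cosine(query_vec, row["vector"]))


# Toy rows in the same shape as the data inserted into the collection.
rows = [
    {"id": 0, "vector": [0.9, 0.1, 0.0], "text": "RAG retrieves documents before generating."},
    {"id": 1, "vector": [0.1, 0.9, 0.0], "text": "Milvus stores vectors."},
    {"id": 2, "vector": [0.0, 0.2, 0.9], "text": "Unrelated text."},
]

hits = top_k([1.0, 0.0, 0.0], rows, k=2)
for hit in hits:
    print(hit["id"], hit["text"])
```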
Generate Query Embedding
query = (
"What is RAG?"
)
query_embedding = (
embedding_model.embed_query(
query
)
)
Search in Milvus
results = client.search(
collection_name=
"rag_collection",
data=[
query_embedding
],
limit=5,
output_fields=[
"text"
]
)
Understanding Parameters
limit
How many chunks to retrieve.
Example:
limit=5
returns:
Top 5 relevant chunks
output_fields
Fields to return.
Example:
"text"
returns chunk text.
View Retrieved Chunks
for result in results[0]:
    print(result["entity"]["text"])
    print("----------------")
Problem with Similarity Search
Sometimes:
Top results are not the best.
Example:
Query:
What is RAG?
Retrieved:
Machine Learning
instead of:
Retrieval-Augmented Generation
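This happens because:
Embedding similarity is only a proxy for relevance. A chunk can sit close to the query in vector space without actually answering it.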
Solution?
Reranking
7. Reranking
Reranking improves retrieval quality.
Instead of trusting:
Top K vectors
We re-score chunks.
Why Reranking Matters?
Without reranking:
Bad chunks may enter context.
Result:
- hallucination
- irrelevant answers
With reranking:
Only most relevant chunks are sent to LLM.
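The two-stage idea can be sketched without any model: take the candidates from vector search, score each one against the query again, and keep the best. The scorer here is a deliberately crude word-overlap function standing in for a real reranking model. overlap_score and rerank are hypothetical helpers, not the NVIDIARerank API.

```python
import re


def tokens(text):
    """Lowercased set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_score(query, chunk):
    """Toy relevance score: fraction of query words that appear in the chunk."""
    q = tokens(query)
    return len(q & tokens(chunk)) / len(q) if q else 0.0


def rerank(query, chunks, top_n, score_fn=overlap_score):
    """Re-score retrieved chunks against the query and keep the top_n."""
    ranked = sorted(chunks, key=lambda c: score_fn(query, c), reverse=True)
    return ranked[:top_n]


candidates = [
    "Machine learning models learn patterns from data.",
    "RAG is retrieval-augmented generation: retrieve context, then generate.",
    "What to cook for dinner tonight.",
]
best = rerank("What is RAG?", candidates, top_n=1)
print(best[0])
```

A production reranker replaces overlap_score with a model that reads the query and chunk together, but the surrounding control flow stays the same.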
Import Reranker
from langchain_nvidia_ai_endpoints import (
NVIDIARerank
)
Initialize Reranker
reranker = (
NVIDIARerank(
nvidia_api_key=
os.getenv(
"NVIDIA_API_KEY"
)
)
)
Convert Milvus Results → Documents
Reranker expects:
LangChain Documents
not strings.
from langchain_core.documents import (
Document
)
retrieved_docs = [
Document(
page_content=
r["entity"]
["text"]
)
for r in results[0]
]
Run Reranking
reranked_docs = (
reranker.compress_documents(
documents=
retrieved_docs,
query=query
)
)
View Reranked Results
for doc in reranked_docs:
    print(doc.page_content)
The chunks are now ordered by the reranker's relevance score, so the most relevant ones reach the LLM first.
8. Azure OpenAI Response Generation
Finally:
We generate answer.
Import Azure OpenAI
from langchain_openai import (
AzureChatOpenAI
)
Initialize LLM
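llm = AzureChatOpenAI(
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
    api_key=os.getenv("AZURE_OPENAI_KEY"),
    # Read the deployment name from .env instead of hardcoding it
    azure_deployment=os.getenv("AZURE_OPENAI_DEPLOYMENT", "gpt-4o"),
    # The client also needs an API version. AZURE_OPENAI_API_VERSION is an
    # assumed variable name (not in the .env above); use a version your resource supports.
    api_version=os.getenv("AZURE_OPENAI_API_VERSION", "2024-06-01"),
    temperature=0.2
)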
Why Low Temperature?
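A low value such as:
temperature=0.2
makes sampling more deterministic, so the model stays closer to the supplied context.
That suits RAG systems, where consistent answers matter more than creative variation.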
Build Context
context = "\n".join([
doc.page_content
for doc in reranked_docs
])
Prompt Engineering
prompt = f"""
Answer ONLY
from context.
Context:
{context}
Question:
{query}
"""
A strict prompt like this reduces hallucination, although it cannot eliminate it.
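A slightly more defensive version of the same prompt is sketched below. It numbers the chunks, caps the context size, and tells the model what to say when the context has no answer. build_prompt, the max_chars budget, and the exact refusal wording are assumptions for illustration, not part of the original pipeline.

```python
def build_prompt(query, chunks, max_chars=4000):
    """Assemble a grounded prompt from chunk texts, stopping at a character budget."""
    parts, used = [], 0
    for i, chunk in enumerate(chunks, start=1):
        block = f"[{i}] {chunk.strip()}"
        if used + len(block) > max_chars:
            break  # drop the remaining (lower-ranked) chunks
        parts.append(block)
        used += len(block)
    context = "\n\n".join(parts)
    return (
        "Answer ONLY from the context below. "
        "If the context does not contain the answer, reply: I don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question:\n{query}\n"
    )


prompt = build_prompt(
    "What is RAG?",
    ["RAG retrieves relevant documents before generation.", "Milvus stores vectors."],
)
print(prompt)
```

In the pipeline, chunks would be [doc.page_content for doc in reranked_docs], so the highest-ranked chunks are the ones that survive the budget.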
Generate Answer
response = llm.invoke(
prompt
)
print(
response.content
)
9. Langfuse Observability
Production AI systems require monitoring.
Questions:
Did retrieval work?
Did hallucination happen?
Was response relevant?
Langfuse solves this.
Install
pip install langfuse
Import
from langfuse import (
Langfuse
)
Initialize Langfuse
langfuse = Langfuse(
public_key=
os.getenv(
"LANGFUSE_PUBLIC_KEY"
),
secret_key=
os.getenv(
"LANGFUSE_SECRET_KEY"
),
host=
os.getenv(
"LANGFUSE_BASE_URL"
)
)
Log Retrieval
langfuse.create_event(
name="retrieval",
input={
"query":
query
},
output={
"chunks":
context
}
)
10. RAG Evaluation
We evaluate:
Retrieval Quality
Were chunks relevant?
Faithfulness
Was answer grounded?
Hallucination Score
Did model invent information?
Answer Relevance
Did answer actually solve query?
Example evaluation prompt:
evaluation_prompt = f"""
Evaluate:
Question:
{query}
Answer:
{response.content}
Context:
{context}
Score:
1. faithfulness
2. hallucination
3. relevance
"""
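Alongside an LLM-based judge, a cheap lexical check can flag obviously ungrounded answers. The sketch below counts how many answer sentences have most of their words present in the context. It is a crude proxy for faithfulness, not a standard metric, and groundedness and its 0.6 threshold are assumptions chosen for this example.

```python
import re


def tokens(text):
    """Lowercased set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def groundedness(answer, context, threshold=0.6):
    """Fraction of answer sentences whose words mostly appear in the context."""
    ctx = tokens(context)
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", answer.strip()) if s]
    if not sentences:
        return 0.0
    supported = 0
    for sentence in sentences:
        words = tokens(sentence)
        if words and len(words & ctx) / len(words) >= threshold:
            supported += 1
    return supported / len(sentences)


context = "RAG retrieves relevant documents and passes them to the LLM as context."
answer = "RAG retrieves relevant documents. It was invented on Mars in 1850."
print(groundedness(answer, context))  # 0.5: one of the two sentences is supported
```

A low score does not prove hallucination (paraphrases share few words), so this works best as a filter that decides which answers get a closer look.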
Production RAG Pipeline
PDFs
↓
Loaders
↓
Chunking
↓
Embeddings
↓
Milvus
↓
Retrieval
↓
Reranking
↓
Prompt Building
↓
GPT-4o
↓
Answer
↓
Langfuse Monitoring
↓
Evaluation
Common Challenges
Bad Retrieval
Fix:
- Better chunking
- Reranking
- Hybrid Search
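Hybrid search means combining a keyword ranking with a vector ranking. The article does not say how to merge them; one common option is reciprocal rank fusion (RRF), sketched here as an assumption. The document IDs are made up, and k=60 is the conventional default for the smoothing constant.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked ID lists: each list adds 1 / (k + rank) to an ID's score."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


keyword_hits = ["doc_b", "doc_a", "doc_c"]  # e.g. from a keyword search
vector_hits = ["doc_a", "doc_d", "doc_b"]   # e.g. from vector search
fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)  # ['doc_a', 'doc_b', 'doc_d', 'doc_c']
```

Documents that rank well in both lists rise to the top, while documents found by only one method are still kept.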
Hallucination
Fix:
- Strict prompts
- Low temperature
- Better retrieval
Large PDFs
Fix:
- Chunking strategy
- Metadata filtering
Advanced RAG Techniques
Multi-Vector Retrieval
One chunk → multiple embeddings.
Better retrieval.
HyDE
Generate hypothetical answer first.
Then search.
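The control flow is short enough to sketch with injected callables. hyde_search is a hypothetical helper, and the three fake_* functions are stand-ins so the example runs without any service. In the pipeline above they would roughly correspond to the LLM call, embedding_model.embed_query, and the Milvus search.

```python
def hyde_search(query, generate, embed, search, k=5):
    """HyDE-style retrieval: embed a generated draft answer instead of the raw query."""
    # 1. Ask the LLM for a hypothetical answer (it may be wrong; only its wording matters).
    hypothetical = generate(f"Write a short passage that answers: {query}")
    # 2. Embed the hypothetical passage.
    vector = embed(hypothetical)
    # 3. Search the vector store with that embedding.
    return search(vector, k)


# Stand-in callables so the sketch runs on its own.
prompts_seen = []


def fake_generate(prompt):
    prompts_seen.append(prompt)
    return "RAG retrieves documents and feeds them to an LLM."


def fake_embed(text):
    return [float(len(text))]


def fake_search(vector, k):
    return [f"chunk near {vector[0]}"][:k]


results = hyde_search("What is RAG?", fake_generate, fake_embed, fake_search)
print(results)
```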
RAPTOR
Hierarchical retrieval tree.
Better long document understanding.
Semantic Routing
Route query dynamically.
ColBERT
Token-level retrieval.
Highly accurate.
Final Thoughts
Basic RAG:
Retrieve → Generate
Production RAG:
Retrieve
→ Rerank
→ Evaluate
→ Monitor
→ Improve
That is how enterprise AI systems are built.
