Beyond Basic RAG: Learn LangChain + RAG End-to-End 🚀
Introduction
Retrieval-Augmented Generation (RAG) is one of the most important concepts in modern Generative AI.
Large Language Models (LLMs) like GPT-4, Claude, LLaMA, and Gemini are powerful. However, they suffer from one major issue:
Hallucination
Hallucination means:
The model confidently generates incorrect information.
Example:
Question:
Who is the CEO of my company?
Without access to your internal company data, an LLM may generate a completely wrong answer.
This is where RAG (Retrieval-Augmented Generation) becomes useful.
Instead of relying only on pretrained knowledge, RAG retrieves relevant information from external sources and provides context to the LLM before generating a response.
What is RAG?
RAG stands for:
Retrieval-Augmented Generation
Instead of:
Question → LLM → Answer
We do:
Question
↓
Retrieve Relevant Documents
↓
Provide Context to LLM
↓
Generate Grounded Response
This makes responses:
✅ More accurate
✅ Context-aware
✅ Less prone to hallucination
✅ Enterprise-ready
Complete RAG Architecture
Documents (PDFs, DOCX, TXT)
↓
Document Loading
↓
Chunking
↓
Embeddings
↓
Vector Database
↓
Similarity Search
↓
Reranking
↓
Context Building
↓
LLM
↓
Final Answer
↓
Monitoring & Evaluation
Required Installation
Before starting, install all dependencies.
pip install langchain
pip install langchain-community
pip install langchain-core
pip install langchain-openai
pip install langchain-text-splitters
pip install langchain-nvidia-ai-endpoints
pip install pymilvus
pip install pymupdf
pip install pypdf
pip install langfuse
pip install python-dotenv
Project Structure
project/
│
├── data/
│ ├── pdf/
│ └── text/
│
├── .env
├── rag_pipeline.py
└── requirements.txt
Environment Variables (.env)
Never hardcode API keys.
Create a .env file.
NVIDIA_API_KEY=your_key
AZURE_OPENAI_ENDPOINT=your_endpoint
AZURE_OPENAI_KEY=your_key
AZURE_OPENAI_DEPLOYMENT=gpt-4o
LANGFUSE_PUBLIC_KEY=your_key
LANGFUSE_SECRET_KEY=your_key
LANGFUSE_BASE_URL=https://cloud.langfuse.com
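A small startup check can catch a missing key before the first API call fails. The sketch below is one way to do that with the standard library only. The helper name missing_env_vars is hypothetical; the variable names are the ones from the .env file above.

```python
import os

# Variable names taken from the .env file above.
REQUIRED_VARS = [
    "NVIDIA_API_KEY",
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_KEY",
    "AZURE_OPENAI_DEPLOYMENT",
    "LANGFUSE_PUBLIC_KEY",
    "LANGFUSE_SECRET_KEY",
    "LANGFUSE_BASE_URL",
]


def missing_env_vars(required, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars(REQUIRED_VARS)
    if missing:
        print("Missing environment variables:", ", ".join(missing))
    else:
        print("All required environment variables are set.")
```

Call it right after load_dotenv() so a typo in .env shows up immediately rather than deep inside the pipeline.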
1. Understanding LangChain Document Structure
LangChain stores documents in a standardized format.
A document contains:
- page_content
- metadata
page_content
This contains actual text.
Example:
page_content = "Generative AI is growing rapidly."
metadata
Metadata stores additional information.
Examples:
- file name
- author
- created date
- source
- page number
Creating a LangChain Document
Import
from langchain_core.documents import Document
Code
from langchain_core.documents import Document
doc = Document(
page_content="""
Generative AI is a subset of Artificial Intelligence
focused on creating content.
""",
metadata={
"source": "genai.pdf",
"author": "Sridhar",
"pages": 10
}
)
print(doc)
Output (abbreviated)
page_content='Generative AI...' metadata={'source': 'genai.pdf', 'author': 'Sridhar', 'pages': 10}
Why does metadata matter?
In enterprise AI, you often want answers such as:
“Show answer from document X page 5”
Metadata makes this kind of traceability possible.
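To make the traceability idea concrete, here is a minimal sketch that turns metadata into a citation label. It uses a small stand-in dataclass (Doc) instead of the LangChain Document so it runs without dependencies; Doc and format_citation are hypothetical names, and the "page" key is an assumption about what your loader stores.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Dependency-free stand-in for a LangChain Document (hypothetical)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_citation(doc):
    """Build a short 'source, page N' label from a document's metadata."""
    source = doc.metadata.get("source", "unknown source")
    page = doc.metadata.get("page")
    if page is None:
        return source
    return f"{source}, page {page}"


chunk = Doc(
    page_content="Generative AI is a subset of Artificial Intelligence.",
    metadata={"source": "genai.pdf", "page": 5},
)
print(f"{chunk.page_content} [{format_citation(chunk)}]")
```

The same function works on real Document objects, since it only reads the metadata dictionary.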
2. Loading Documents
Before processing documents, we must load them.
LangChain provides multiple loaders.
TextLoader
Used for:
- .txt files (plain text files)
Import
from langchain_community.document_loaders import TextLoader
Example
loader = TextLoader(
"data/text/sample.txt",
encoding="utf-8"
)
documents = loader.load()
print(documents)
DirectoryLoader
Loads multiple files from a folder.
Useful when:
You have:
100 PDFs
50 TXT files
many documents
Import
from langchain_community.document_loaders import DirectoryLoader
Example
loader = DirectoryLoader(
"data/text",
glob="*.txt",
loader_cls=TextLoader,
loader_kwargs={
"encoding":"utf-8"
}
)
documents = loader.load()
print(documents)
PDF Loader
Most enterprise RAG systems use PDFs.
LangChain supports:
PyPDFLoader
Simple and fast.
Import
from langchain_community.document_loaders import PyPDFLoader
Example
loader = PyPDFLoader(
"data/pdf/rag_guide.pdf"
)
documents = loader.load()
print(documents[0])
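Each page becomes its own Document. PyPDFLoader records the file path and a zero-based page number, so the first page looks like:
Document(
    page_content="Page text",
    metadata={"source": "data/pdf/rag_guide.pdf", "page": 0}
)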
3. Chunking Documents
Chunking is one of the most important parts of RAG.
Why?
Because LLMs have a limited context window, and retrieval works best on small, focused passages.
You cannot send:
a 500-page PDF
to GPT in a single prompt.
Instead:
We split documents into smaller chunks.
Why Chunking Matters?
Bad chunking causes:
❌ poor retrieval
❌ hallucination
❌ context loss
Good chunking improves:
✅ retrieval quality
✅ relevance
✅ accuracy
RecursiveCharacterTextSplitter
Most commonly used splitter.
Import
from langchain_text_splitters import (
RecursiveCharacterTextSplitter
)
Code
text_splitter = (
RecursiveCharacterTextSplitter(
chunk_size=500,
chunk_overlap=50,
length_function=len,
separators=[
"\n\n",
"\n",
" ",
""
]
)
)
chunks = text_splitter.split_documents(
documents
)
print(len(chunks))
Parameters Explained
chunk_size
How large each chunk should be.
Example:
chunk_size=500
means:
At most 500 characters per chunk (length is measured by length_function, here len).
chunk_overlap
Prevents context loss.
Example:
Chunk 1:
Artificial Intelligence is...
Chunk 2 starts with:
Intelligence is...
This preserves continuity.
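The overlap idea can be seen with a simplified fixed-window splitter. This is only an illustration: RecursiveCharacterTextSplitter splits on separators rather than fixed offsets, and the function name split_with_overlap is hypothetical.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Split text into character windows where neighbours share an overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


text = "Artificial Intelligence is changing how software is built."
chunks = split_with_overlap(text, chunk_size=30, chunk_overlap=10)
for chunk in chunks:
    print(repr(chunk))
```

Each chunk starts with the last 10 characters of the previous one, so a sentence cut at a boundary still appears whole in at least one chunk when it is short enough.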
Best Practices
Recommended:
chunk_size = 300–800
chunk_overlap = 30–100
for most enterprise RAG systems.
4. Understanding Embeddings
Once chunking is completed, we need to convert text into a format machines can understand.
Similarity search compares:
Numbers (Vectors)
Not raw text.
This is where Embeddings come in.
What are Embeddings?
Embeddings convert text into numerical vector representations.
Example:
Text:
"Artificial Intelligence"
becomes:
[0.24, -0.76, 0.88, ....]
These vectors help us find:
Semantic Meaning
Example:
What is AI?
and
Explain Artificial Intelligence
have similar meanings.
Embedding models place them close together in vector space.
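"Close together" is usually measured with cosine similarity. The sketch below uses made-up 3-dimensional vectors purely for illustration; real embeddings come from the model and have far more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # similarity is undefined for a zero vector; treat as 0
    return dot / (norm_a * norm_b)


# Made-up vectors, not real model output.
what_is_ai = [0.9, 0.1, 0.2]
explain_ai = [0.8, 0.2, 0.3]
pizza_recipe = [0.1, 0.9, 0.1]

print(cosine_similarity(what_is_ai, explain_ai))    # similar meaning
print(cosine_similarity(what_is_ai, pizza_recipe))  # unrelated meaning
```

The first score comes out much higher than the second, which is the property semantic search relies on.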
Why Embeddings are Important in RAG?
Without embeddings:
Search becomes:
Keyword matching
Example:
Searching:
CEO
Only returns exact keyword matches.
With embeddings:
Search becomes:
Semantic Search
Meaning-based retrieval.
Even if wording differs.
NVIDIA Embeddings
We will use:
NVIDIA Llama Nemotron Embedding Model
Advantages:
✅ Fast
✅ High-quality embeddings
✅ Good semantic understanding
✅ Free developer tier
Import Required Libraries
import os
from dotenv import load_dotenv
from langchain_nvidia_ai_endpoints import (
NVIDIAEmbeddings
)
Load Environment Variables
load_dotenv()
Initialize Embedding Model
embedding_model = (
NVIDIAEmbeddings(
model=
"nvidia/llama-nemotron-embed-vl-1b-v2",
nvidia_api_key=
os.getenv(
"NVIDIA_API_KEY"
)
)
)
Convert Chunks into Embeddings
Before embedding:
We only need:
page_content
from chunks.
Extract Text
texts = [
chunk.page_content
for chunk in chunks
]
Generate Embeddings
embedded_vectors = (
embedding_model.embed_documents(
texts
)
)
Check Embedding Dimension
print(
len(
embedded_vectors
)
)
print(
len(
embedded_vectors[0]
)
)
Output:
50
2048
Meaning:
50 chunks
2048 dimensional vector
Query Embedding
User questions also need embeddings.
Example:
query = (
"What is RAG?"
)
query_embedding = (
embedding_model.embed_query(
query
)
)
Now query and document vectors can be compared.
5. Vector Databases (Milvus)
Imagine storing:
Millions of embeddings
in SQL.
A plain SQL table would have to compare the query against every row, which is very slow at that scale.
Traditional databases are not optimized for:
Similarity Search
We need:
Vector Database
Examples:
- Pinecone
- FAISS
- Chroma
- Milvus
- Weaviate
We will use:
Milvus
Why?
✅ Fast retrieval
✅ Open-source
✅ Enterprise-ready
✅ Optimized for vectors
Install Milvus
pip install pymilvus
Import Milvus
from pymilvus import (
MilvusClient
)
Create Milvus Connection
client = MilvusClient(
uri="milvus_demo.db"
)
print(
"Connected Successfully"
)
Create Collection
A collection is like:
SQL Table
for vector data.
Create Collection
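try:
    # dimension must equal the embedding size (2048 for this model)
    client.create_collection(
        collection_name="rag_collection",
        dimension=2048,
    )
    print("Collection Created")
except Exception as e:
    # Report the failure instead of crashing the script
    print(e)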
Why Dimension Matters?
Embedding vector size:
2048
Collection dimension must match embedding dimension.
Otherwise:
Insertion will fail
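A cheap guard before inserting is to verify every vector's length yourself. This is a minimal sketch; check_dimensions is a hypothetical helper, and the toy vectors stand in for embedded_vectors.

```python
def check_dimensions(vectors, expected_dim):
    """Raise ValueError if any vector's length differs from expected_dim."""
    for index, vector in enumerate(vectors):
        if len(vector) != expected_dim:
            raise ValueError(
                f"vector {index} has dimension {len(vector)}, "
                f"expected {expected_dim}"
            )
    return len(vectors)  # number of vectors that passed the check


# Toy 4-dimensional vectors; in the pipeline this would be
# check_dimensions(embedded_vectors, 2048).
vectors = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]]
print(check_dimensions(vectors, expected_dim=4))
```

Failing early with a clear message is easier to debug than an error raised from inside the database client.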
Insert Data into Milvus
We store:
- ID
- Embedding vector
- Chunk text
Prepare Data
data = []
# Pair each chunk with its embedding; the list index serves as the ID
for i, (chunk, embedding) in enumerate(zip(chunks, embedded_vectors)):
    data.append({
        "id": i,
        "vector": embedding,
        "text": chunk.page_content,
    })
Insert into Collection
client.insert(
collection_name=
"rag_collection",
data=data
)
print(
"Inserted Successfully"
)
6. Similarity Retrieval
Now comes the real magic.
When a user asks:
"What is RAG?"
We do:
- Convert query → embedding
- Search similar vectors
- Return relevant chunks
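The three steps above can be sketched as a brute-force search in plain Python. A vector database does the same job with indexes so it scales; the toy 2-dimensional vectors and the function name top_k are assumptions for illustration.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vector, stored, k):
    """Return the k (score, text) pairs most similar to the query, best first.

    `stored` is a list of (text, vector) pairs.
    """
    scored = ((cosine_similarity(query_vector, vec), text) for text, vec in stored)
    return heapq.nlargest(k, scored, key=lambda pair: pair[0])


# Made-up vectors standing in for real embeddings.
stored = [
    ("RAG retrieves documents before generation.", [0.9, 0.1]),
    ("Milvus stores vectors.", [0.5, 0.5]),
    ("Pizza dough needs yeast.", [0.1, 0.9]),
]
query_vector = [1.0, 0.0]

for score, text in top_k(query_vector, stored, k=2):
    print(round(score, 3), text)
```

With these toy values the RAG sentence ranks first and the pizza sentence is dropped.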
Generate Query Embedding
query = (
"What is RAG?"
)
query_embedding = (
embedding_model.embed_query(
query
)
)
Search in Milvus
results = client.search(
collection_name=
"rag_collection",
data=[
query_embedding
],
limit=5,
output_fields=[
"text"
]
)
Understanding Parameters
limit
How many chunks to retrieve.
Example:
limit=5
returns:
Top 5 relevant chunks
output_fields
Fields to return.
Example:
"text"
returns chunk text.
View Retrieved Chunks
# client.search returns one list of hits per query vector
for result in results[0]:
    print(result["entity"]["text"])
    print("----------------")
Problem with Similarity Search
Sometimes:
Top results are not the best.
Example:
Query:
What is RAG?
Retrieved:
Machine Learning
instead of:
Retrieval-Augmented Generation
This happens because:
Embedding similarity measures how close two texts are in topic, not whether a chunk actually answers the question.
Solution?
Reranking
7. Reranking
Reranking improves retrieval quality.
Instead of trusting:
Top K vectors
We re-score chunks.
Why Reranking Matters?
Without reranking:
Bad chunks may enter context.
Result:
❌ hallucination
❌ irrelevant answers
With reranking:
Only the most relevant chunks are sent to the LLM.
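The re-scoring step can be illustrated with a toy scorer. A real reranker uses a trained model to score each (query, chunk) pair; the word-overlap score below is only a stand-in, and overlap_score and rerank are hypothetical names.

```python
def overlap_score(query, text):
    """Toy relevance score: fraction of query words that appear in the text."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & text_words) / len(query_words)


def rerank(query, candidates, score_fn=overlap_score, top_n=2):
    """Re-score every candidate against the query and keep the best top_n."""
    ranked = sorted(
        candidates,
        key=lambda text: score_fn(query, text),
        reverse=True,
    )
    return ranked[:top_n]


candidates = [
    "machine learning is a field of ai",
    "rag combines retrieval with generation",
    "retrieval quality depends on chunking",
]
print(rerank("rag retrieval generation", candidates, top_n=2))
```

The pattern is the point: retrieve a wider set cheaply, then spend a more expensive scorer on just those candidates.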
Import Reranker
from langchain_nvidia_ai_endpoints import (
NVIDIARerank
)
Initialize Reranker
reranker = (
NVIDIARerank(
nvidia_api_key=
os.getenv(
"NVIDIA_API_KEY"
)
)
)
Convert Milvus Results → Documents
Reranker expects:
LangChain Documents
not strings.
from langchain_core.documents import (
Document
)
retrieved_docs = [
Document(
page_content=
r["entity"]
["text"]
)
for r in results[0]
]
Run Reranking
reranked_docs = (
reranker.compress_documents(
documents=
retrieved_docs,
query=query
)
)
View Reranked Results
for doc in reranked_docs:
    print(doc.page_content)
Reranking usually improves the quality of the chunks that reach the LLM.
8. Azure OpenAI Response Generation
Finally, we generate the answer.
Import Azure OpenAI
from langchain_openai import (
AzureChatOpenAI
)
Initialize LLM
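llm = AzureChatOpenAI(
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
    api_key=os.getenv("AZURE_OPENAI_KEY"),
    # Deployment name from .env; falls back to the name used in this guide
    azure_deployment=os.getenv("AZURE_OPENAI_DEPLOYMENT", "gpt-4o"),
    # The Azure client needs an API version. The variable name and default
    # here are assumptions; use the version your Azure resource supports.
    api_version=os.getenv("AZURE_OPENAI_API_VERSION", "2024-06-01"),
    temperature=0.2,
)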
Why Low Temperature?
Lower:
temperature=0.2
means:
More deterministic, focused output. It does not guarantee facts, but it keeps answers closer to the supplied context.
Good for:
RAG systems
Build Context
context = "\n".join([
doc.page_content
for doc in reranked_docs
])
Prompt Engineering
prompt = f"""
Answer ONLY
from context.
Context:
{context}
Question:
{query}
"""
A strict prompt reduces hallucination, but it does not eliminate it.
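One way to tighten the prompt further is to number the chunks and tell the model what to do when the context has no answer. This is a sketch under those assumptions; build_prompt and the exact wording are illustrative, not a fixed recipe.

```python
def build_prompt(question, chunks):
    """Build a grounded prompt with numbered context chunks."""
    numbered = "\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1)
    )
    return (
        "Answer ONLY from the context below. "
        "If the context does not contain the answer, reply \"I don't know\".\n\n"
        f"Context:\n{numbered}\n\n"
        f"Question:\n{question}\n"
    )


example_prompt = build_prompt(
    "What is RAG?",
    ["RAG retrieves documents.", "Milvus stores vectors."],
)
print(example_prompt)
```

Numbered chunks also make it possible to ask the model to cite which chunk it used, which helps when reviewing answers later.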
Generate Answer
response = llm.invoke(
prompt
)
print(
response.content
)
9. Langfuse Observability
Production AI systems require monitoring.
Questions:
Did retrieval work?
Did hallucination happen?
Was response relevant?
Langfuse solves this.
Install
pip install langfuse
Import
from langfuse import (
Langfuse
)
Initialize Langfuse
langfuse = Langfuse(
public_key=
os.getenv(
"LANGFUSE_PUBLIC_KEY"
),
secret_key=
os.getenv(
"LANGFUSE_SECRET_KEY"
),
host=
os.getenv(
"LANGFUSE_BASE_URL"
)
)
Log Retrieval
langfuse.create_event(
name="retrieval",
input={
"query":
query
},
output={
"chunks":
context
}
)
10. RAG Evaluation
We evaluate:
Retrieval Quality
Were chunks relevant?
Faithfulness
Was answer grounded?
Hallucination Score
Did model invent information?
Answer Relevance
Did answer actually solve query?
Example evaluation prompt:
evaluation_prompt = f"""
Evaluate:
Question:
{query}
Answer:
{response.content}
Context:
{context}
Score:
1. faithfulness
2. hallucination
3. relevance
"""
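Alongside an LLM-based evaluation, a crude lexical check can flag answers that drift far from the context. The sketch below measures what fraction of the answer's words also occur in the context. It is a rough proxy only, not a real faithfulness metric, and grounded_fraction is a hypothetical name.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def grounded_fraction(answer, context):
    """Fraction of answer tokens that also occur in the context."""
    answer_tokens = tokenize(answer)
    if not answer_tokens:
        return 0.0
    context_tokens = set(tokenize(context))
    hits = sum(1 for token in answer_tokens if token in context_tokens)
    return hits / len(answer_tokens)


context_text = "RAG retrieves relevant documents and passes them to the LLM."
print(grounded_fraction("RAG retrieves relevant documents.", context_text))
print(grounded_fraction("RAG was invented in 1995.", context_text))
```

A low score is a reason to look closer, nothing more: a correct paraphrase can score low and a wrong answer built from context words can score high.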
Production RAG Pipeline
PDFs
↓
Loaders
↓
Chunking
↓
Embeddings
↓
Milvus
↓
Retrieval
↓
Reranking
↓
Prompt Building
↓
GPT-4o
↓
Answer
↓
Langfuse Monitoring
↓
Evaluation
Common Challenges
Bad Retrieval
Fix:
✅ Better chunking
✅ Reranking
✅ Hybrid Search
Hallucination
Fix:
✅ Strict prompts
✅ Low temperature
✅ Better retrieval
Large PDFs
Fix:
✅ Chunking strategy
✅ Metadata filtering
Advanced RAG Techniques
Multi-Vector Retrieval
One chunk → multiple embeddings.
Better retrieval.
HyDE
Generate a hypothetical answer first.
Then search with the embedding of that answer instead of the raw question.
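The HyDE flow can be sketched with the LLM, embedding model, and vector store passed in as functions. Everything here is an assumption for illustration: hyde_search and the three stub functions are hypothetical and stand in for real components.

```python
def hyde_search(question, generate, embed, search, k=3):
    """Search with the embedding of a hypothetical answer (HyDE).

    The callables are supplied by the caller:
    generate(question) -> str, embed(text) -> list of floats,
    search(vector, k) -> list of results.
    """
    hypothetical_answer = generate(question)
    query_vector = embed(hypothetical_answer)
    return search(query_vector, k)


# Stubs standing in for a real LLM, embedding model and vector store.
def fake_generate(question):
    return "RAG retrieves documents and passes them to an LLM as context."


def fake_embed(text):
    return [float(len(text)), float(text.count(" "))]


def fake_search(vector, k):
    return [f"chunk-{i}" for i in range(k)]


print(hyde_search("What is RAG?", fake_generate, fake_embed, fake_search, k=2))
```

In the pipeline from this guide, generate could wrap llm.invoke, embed could wrap embedding_model.embed_query, and search could wrap client.search.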
RAPTOR
Hierarchical retrieval tree.
Better long document understanding.
Semantic Routing
Route query dynamically.
ColBERT
Token-level retrieval.
Highly accurate.
Final Thoughts
Basic RAG:
Retrieve → Generate
Production RAG:
Retrieve
→ Rerank
→ Evaluate
→ Monitor
→ Improve
That is how enterprise AI systems are built 🚀
