DEV Community

Sridhar S


Master RAG Systems: Build an End-to-End LangChain Pipeline with Milvus, Reranking & Azure OpenAI 🚀

Beyond Basic RAG: Learn LangChain + RAG End-to-End 🚀


Introduction

Retrieval-Augmented Generation (RAG) is one of the most important concepts in modern Generative AI.

Large Language Models (LLMs) like GPT-4, Claude, LLaMA, and Gemini are powerful. However, they suffer from one major issue:

Hallucination

Hallucination means:

The model confidently generates incorrect information.

Example:

Question:

Who is the CEO of my company?

Without access to your internal company data, an LLM may generate a completely wrong answer.

This is where RAG (Retrieval-Augmented Generation) becomes useful.

Instead of relying only on pretrained knowledge, RAG retrieves relevant information from external sources and provides context to the LLM before generating a response.


What is RAG?

RAG stands for:

Retrieval-Augmented Generation

Instead of:

Question → LLM → Answer

We do:

Question
   ↓
Retrieve Relevant Documents
   ↓
Provide Context to LLM
   ↓
Generate Grounded Response

This makes responses:

✅ More accurate

✅ Context-aware

✅ Less prone to hallucination

✅ Enterprise-ready


Complete RAG Architecture

Documents (PDFs, DOCX, TXT)
            ↓
      Document Loading
            ↓
         Chunking
            ↓
         Embeddings
            ↓
      Vector Database
            ↓
      Similarity Search
            ↓
         Reranking
            ↓
       Context Building
            ↓
            LLM
            ↓
         Final Answer
            ↓
     Monitoring & Evaluation

Required Installation

Before starting, install all dependencies.

pip install langchain
pip install langchain-community
pip install langchain-core
pip install langchain-openai
pip install langchain-text-splitters
pip install langchain-nvidia-ai-endpoints
pip install pymilvus
pip install pymupdf
pip install pypdf
pip install langfuse
pip install python-dotenv

Project Structure

project/
│
├── data/
│   ├── pdf/
│   └── text/
│
├── .env
├── rag_pipeline.py
└── requirements.txt

Environment Variables (.env)

Never hardcode API keys.

Create a .env file.

NVIDIA_API_KEY=your_key
AZURE_OPENAI_ENDPOINT=your_endpoint
AZURE_OPENAI_KEY=your_key
AZURE_OPENAI_DEPLOYMENT=gpt-4o

LANGFUSE_PUBLIC_KEY=your_key
LANGFUSE_SECRET_KEY=your_key
LANGFUSE_BASE_URL=https://cloud.langfuse.com
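
A small startup check can catch a missing key before the first API call fails. The sketch below is one way to do it, assuming the variable names from the .env example above; `REQUIRED_VARS` and `missing_env_vars` are illustrative names, not part of any library.

```python
import os

# Variable names taken from the .env example above.
REQUIRED_VARS = [
    "NVIDIA_API_KEY",
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_KEY",
    "AZURE_OPENAI_DEPLOYMENT",
]


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in env (defaults to os.environ)."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


# Demo with a fake environment so the example does not depend on real keys.
demo_env = {"NVIDIA_API_KEY": "abc", "AZURE_OPENAI_ENDPOINT": ""}
print(missing_env_vars(REQUIRED_VARS, demo_env))
```

In the real script you would call `missing_env_vars(REQUIRED_VARS)` right after `load_dotenv()` and stop early if the list is not empty.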

1. Understanding LangChain Document Structure

LangChain stores documents in a standardized format.

A document contains:

  1. page_content
  2. metadata

page_content

This contains actual text.

Example:

page_content = "Generative AI is growing rapidly."

metadata

Metadata stores additional information.

Examples:

  • file name
  • author
  • created date
  • source
  • page number

Creating a LangChain Document

Import

from langchain_core.documents import Document

Code

from langchain_core.documents import Document

doc = Document(
    page_content="""
    Generative AI is a subset of Artificial Intelligence
    focused on creating content.
    """,
    metadata={
        "source": "genai.pdf",
        "author": "Sridhar",
        "pages": 10
    }
)

print(doc)

Output

Document(
    page_content='Generative AI...',
    metadata={
        'source': 'genai.pdf',
        'author': 'Sridhar',
        'pages': 10
    }
)

Why does metadata matter?

In enterprise AI, you often need answers that cite their source, for example:

“Show the answer from document X, page 5”

Metadata makes this traceability possible.
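
As a rough illustration of traceability, the sketch below turns a metadata dictionary into a short source label that can be shown next to an answer. `format_citation` is a hypothetical helper, and the `source` and `page` keys are assumed to be present in the way the examples above use them.

```python
def format_citation(metadata):
    """Build a short source label from a metadata dict."""
    source = metadata.get("source", "unknown source")
    page = metadata.get("page")
    if page is None:
        return source
    return f"{source}, page {page}"


print(format_citation({"source": "genai.pdf", "page": 5}))
print(format_citation({"author": "Sridhar"}))
```

Some loaders count pages from 0, so you may want to add 1 before showing the number to users.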


2. Loading Documents

Before processing documents, we must load them.

LangChain provides multiple loaders.


TextLoader

Used for:

  • .txt files
  • plain text files

Import

from langchain_community.document_loaders import TextLoader

Example

loader = TextLoader(
    "data/text/sample.txt",
    encoding="utf-8"
)

documents = loader.load()

print(documents)

DirectoryLoader

Loads multiple files from a folder.

Useful when you have many files, for example:

100 PDFs
50 TXT files

Import

from langchain_community.document_loaders import DirectoryLoader

Example

loader = DirectoryLoader(
    "data/text",
    glob="*.txt",
    loader_cls=TextLoader,
    loader_kwargs={
        "encoding":"utf-8"
    }
)

documents = loader.load()

print(documents)

PDF Loader

Many enterprise RAG systems ingest PDFs.

LangChain supports:

PyPDFLoader

Simple and fast.

Import

from langchain_community.document_loaders import PyPDFLoader

Example

loader = PyPDFLoader(
    "data/pdf/rag_guide.pdf"
)

documents = loader.load()

print(documents[0])

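Each page becomes one Document. PyPDFLoader numbers pages from 0, so the first page looks roughly like this (the metadata may include additional keys depending on the version):

Document(
    page_content="Page text",
    metadata={"source": "data/pdf/rag_guide.pdf", "page": 0}
)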

3. Chunking Documents

Chunking is one of the most important parts of RAG.

Why?

Because LLMs have token limits.

You cannot send an entire 500-page PDF to GPT in a single prompt.

Instead, we split documents into smaller chunks.


Why Does Chunking Matter?

Bad chunking causes:

❌ poor retrieval

❌ hallucination

❌ context loss

Good chunking improves:

✅ retrieval quality

✅ relevance

✅ accuracy


RecursiveCharacterTextSplitter

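The most commonly used splitter. It tries each separator in order (paragraphs, lines, words, then single characters) until the pieces fit within chunk_size.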

Import

from langchain_text_splitters import (
    RecursiveCharacterTextSplitter
)

Code

text_splitter = (
    RecursiveCharacterTextSplitter(
        chunk_size=500,
        chunk_overlap=50,
        length_function=len,
        separators=[
            "\n\n",
            "\n",
            " ",
            ""
        ]
    )
)

chunks = text_splitter.split_documents(
    documents
)

print(len(chunks))

Parameters Explained

chunk_size

The maximum size of each chunk, measured by length_function.

Example:

chunk_size=500

means at most 500 characters per chunk, because length_function=len counts characters, not tokens.


chunk_overlap

Repeats the tail of one chunk at the start of the next, so text cut at a chunk boundary keeps its context.

Example:

Chunk 1:

Artificial Intelligence is...

Chunk 2 starts with:

Intelligence is...

This preserves continuity.
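
To make the overlap idea concrete, here is a minimal fixed-window splitter. It is a simplified stand-in, not the algorithm RecursiveCharacterTextSplitter uses (that one also respects separators); `split_with_overlap` and the sample text are made up for illustration.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Split text into fixed-size windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


text = "Artificial Intelligence is transforming how enterprises search documents."
chunks = split_with_overlap(text, chunk_size=30, chunk_overlap=10)
for chunk in chunks:
    print(repr(chunk))
```

Each chunk after the first begins with the last 10 characters of the previous one, which is the continuity described above.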


Best Practices

Recommended starting ranges for most enterprise RAG systems:

chunk_size = 300-800
chunk_overlap = 30-100

4. Understanding Embeddings

Once chunking is completed, we need to convert text into a format machines can compare.

Models work with numbers (vectors), not raw text.

This is where Embeddings come in.


What are Embeddings?

Embeddings convert text into numerical vector representations.

Example:

Text:

"Artificial Intelligence"

becomes:

[0.24, -0.76, 0.88, ....]

These vectors help us find:

Semantic Meaning

Example:

What is AI?

and

Explain Artificial Intelligence

have similar meanings.

Embedding models place them close together in vector space.
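
"Close together" is usually measured with cosine similarity. The sketch below computes it by hand; the three-dimensional vectors are invented stand-ins for real embeddings, which have far more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up vectors: the two AI questions point in a similar direction.
what_is_ai = [0.9, 0.1, 0.0]
explain_ai = [0.8, 0.2, 0.1]
pizza_recipe = [0.0, 0.1, 0.9]

print(cosine_similarity(what_is_ai, explain_ai))
print(cosine_similarity(what_is_ai, pizza_recipe))
```

The first score is much higher than the second, which is what lets a vector search match questions that are phrased differently.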


Why Are Embeddings Important in RAG?

Without embeddings, search is keyword matching.

Example: searching for

CEO

only returns chunks that contain that exact keyword, and misses text such as "chief executive officer".

With embeddings, search becomes semantic search: retrieval is based on meaning, even if the wording differs.


NVIDIA Embeddings

We will use:

NVIDIA Llama Nemotron Embedding Model

Advantages:

✅ Fast

✅ High-quality embeddings

✅ Good semantic understanding

✅ Free developer tier


Import Required Libraries

import os

from dotenv import load_dotenv

from langchain_nvidia_ai_endpoints import (
    NVIDIAEmbeddings
)

Load Environment Variables

load_dotenv()

Initialize Embedding Model

embedding_model = NVIDIAEmbeddings(
    model="nvidia/llama-nemotron-embed-vl-1b-v2",
    nvidia_api_key=os.getenv("NVIDIA_API_KEY"),
)

Convert Chunks into Embeddings

Before embedding, we only need the page_content of each chunk.

Extract Text

texts = [
    chunk.page_content
    for chunk in chunks
]

Generate Embeddings

embedded_vectors = embedding_model.embed_documents(texts)
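
For a large corpus it can help to embed in batches so a single oversized request does not fail. The helper below is a sketch under that assumption; `embed_in_batches` and `fake_embed` are illustrative names, and the embedding client may already batch internally, in which case this is unnecessary.

```python
def embed_in_batches(embed_fn, texts, batch_size=32):
    """Call embed_fn on slices of texts and return all vectors in input order."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    vectors = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        vectors.extend(embed_fn(batch))
    return vectors


# Stand-in for embedding_model.embed_documents so the sketch runs offline.
def fake_embed(batch):
    return [[float(len(text)), 1.0] for text in batch]


demo_texts = ["a", "bb", "ccc", "dddd", "eeeee"]
demo_vectors = embed_in_batches(fake_embed, demo_texts, batch_size=2)
print(len(demo_vectors))
```

With the real model you would pass `embedding_model.embed_documents` as `embed_fn`.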

Check Embedding Dimension

print(len(embedded_vectors))     # number of chunks
print(len(embedded_vectors[0]))  # embedding dimension

Example output (the first number depends on your documents):

50
2048

Meaning:

50 chunks
2048-dimensional vectors

Query Embedding

User questions also need embeddings.

Example:

query = "What is RAG?"

query_embedding = embedding_model.embed_query(query)

Now query and document vectors can be compared.


5. Vector Databases (Milvus)

Imagine storing millions of embeddings in a plain SQL table and comparing the query against every row.

Very slow.

Traditional databases are not optimized for similarity search.

We need a vector database.

Examples:

  • Pinecone
  • FAISS
  • Chroma
  • Milvus
  • Weaviate

We will use:

Milvus

Why?

✅ Fast retrieval

✅ Open-source

✅ Enterprise-ready

✅ Optimized for vectors


Install Milvus

pip install pymilvus

Import Milvus

from pymilvus import (
    MilvusClient
)

Create Milvus Connection

# A local file path makes pymilvus use Milvus Lite (embedded, no server needed).
client = MilvusClient(uri="milvus_demo.db")

print("Connected Successfully")

Create Collection

A collection is like a SQL table for vector data.



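# Create the collection only if it does not exist yet.
if not client.has_collection(collection_name="rag_collection"):
    client.create_collection(
        collection_name="rag_collection",
        dimension=2048,  # must match the embedding dimension
    )
    print("Collection Created")
else:
    print("Collection already exists")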

Why Does Dimension Matter?

The embedding vectors have 2048 dimensions, so the collection dimension must also be 2048.

If the two do not match, insertion will fail.
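
One way to catch a mismatch early is to check vector lengths before inserting. This is a minimal sketch; `find_dimension_mismatches` is a hypothetical helper and the demo vectors are tiny on purpose.

```python
def find_dimension_mismatches(vectors, expected_dim):
    """Return the indexes of vectors whose length differs from expected_dim."""
    return [i for i, vector in enumerate(vectors) if len(vector) != expected_dim]


demo_vectors = [[0.1, 0.2, 0.3], [0.4, 0.5], [0.6, 0.7, 0.8]]
bad_rows = find_dimension_mismatches(demo_vectors, expected_dim=3)
print(bad_rows)
```

A related option is to derive the collection dimension from the data, for example `len(embedded_vectors[0])`, instead of hardcoding it.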

Insert Data into Milvus

We store:

  1. ID
  2. Embedding vector
  3. Chunk text

Prepare Data

data = []

for i, (chunk, embedding) in enumerate(zip(chunks, embedded_vectors)):
    data.append({
        "id": i,
        "vector": embedding,
        "text": chunk.page_content,
    })

Insert into Collection

client.insert(collection_name="rag_collection", data=data)

print("Inserted Successfully")

6. Similarity Retrieval

Now comes the real magic.

When a user asks:

"What is RAG?"

We do:

  1. Convert query → embedding
  2. Search similar vectors
  3. Return relevant chunks
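
Conceptually, the three steps above amount to scoring every stored vector against the query vector and keeping the best ones. The brute-force sketch below shows that idea in plain Python; a vector database does the same job with indexes so it stays fast at scale. `top_k` and the toy rows are illustrative, with the rows shaped like the `data` entries built earlier.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vector, rows, k):
    """Return the k rows most similar to query_vector as (score, row) pairs."""
    scored = [(cosine_similarity(query_vector, row["vector"]), row) for row in rows]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


rows = [
    {"id": 0, "vector": [1.0, 0.0], "text": "about RAG"},
    {"id": 1, "vector": [0.0, 1.0], "text": "about pizza"},
    {"id": 2, "vector": [0.7, 0.7], "text": "about AI"},
]

for score, row in top_k([0.9, 0.1], rows, k=2):
    print(round(score, 3), row["text"])
```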

Generate Query Embedding

query = "What is RAG?"

query_embedding = embedding_model.embed_query(query)

Search in Milvus

results = client.search(
    collection_name="rag_collection",
    data=[query_embedding],
    limit=5,
    output_fields=["text"],
)

Understanding Parameters

limit

How many chunks to retrieve.

Example:

limit=5

returns the 5 most similar chunks.

output_fields

Fields to return with each hit.

Example:

"text"

returns the chunk text.


View Retrieved Chunks

for result in results[0]:
    print(result["entity"]["text"])
    print("----------------")
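
Each hit also carries a `distance` value, which can be used to drop weak matches before they reach the LLM. The sketch below works on sample data shaped like Milvus hits; `filter_hits`, the sample scores, and the 0.5 threshold are assumptions for illustration, and the comparison assumes a metric where a higher value means a closer match (cosine or inner product).

```python
# Sample data shaped like the hits in results[0]; the values are made up.
sample_hits = [
    {"id": 3, "distance": 0.82, "entity": {"text": "RAG retrieves documents before generating."}},
    {"id": 7, "distance": 0.41, "entity": {"text": "Machine learning is a subset of AI."}},
]


def filter_hits(hits, min_score):
    """Keep the text of hits whose similarity score is at least min_score."""
    return [hit["entity"]["text"] for hit in hits if hit["distance"] >= min_score]


print(filter_hits(sample_hits, min_score=0.5))
```

With a distance metric such as L2, smaller values mean closer matches, so the comparison would need to be reversed.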

Problem with Similarity Search

Sometimes the top results are not the best.

Example:

Query:

What is RAG?

Retrieved:

Machine Learning

instead of:

Retrieval-Augmented Generation

This happens because embedding similarity only approximates relevance: a chunk can be close to the query in vector space without actually answering it.

Solution?

Reranking


7. Reranking

Reranking improves retrieval quality.

Instead of trusting the top-K vector matches as they are, we re-score each retrieved chunk against the query.


Why Does Reranking Matter?

Without reranking, weak chunks may enter the context.

Result:

❌ hallucination

❌ irrelevant answers

With reranking, only the most relevant chunks are sent to the LLM.
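
The mechanics of reranking are simple: score every candidate against the query, sort, and keep the top few. The sketch below uses a toy word-overlap score purely as a stand-in; a real reranker such as the NVIDIA model used in this article scores query and chunk together with a trained model. `overlap_score` and `rerank` are illustrative names.

```python
import re


def overlap_score(query, text):
    """Toy relevance score: fraction of query words that appear in the text."""
    query_words = set(re.findall(r"\w+", query.lower()))
    text_words = set(re.findall(r"\w+", text.lower()))
    if not query_words:
        return 0.0
    return len(query_words & text_words) / len(query_words)


def rerank(query, texts, score_fn, top_n):
    """Sort texts by score_fn(query, text), best first, and keep top_n."""
    ranked = sorted(texts, key=lambda text: score_fn(query, text), reverse=True)
    return ranked[:top_n]


candidates = [
    "Machine learning models learn from data.",
    "RAG is retrieval-augmented generation.",
    "A vector database stores embeddings.",
]
query = "What is RAG?"
top = rerank(query, candidates, overlap_score, top_n=2)
print(top[0])
```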


Import Reranker

from langchain_nvidia_ai_endpoints import NVIDIARerank

Initialize Reranker

reranker = NVIDIARerank(
    nvidia_api_key=os.getenv("NVIDIA_API_KEY"),
)

Convert Milvus Results → Documents

The reranker expects LangChain Documents, not strings.

from langchain_core.documents import Document

retrieved_docs = [
    Document(page_content=r["entity"]["text"])
    for r in results[0]
]

Run Reranking

reranked_docs = reranker.compress_documents(
    documents=retrieved_docs,
    query=query,
)

View Reranked Results

for doc in reranked_docs:
    print(doc.page_content)

The most relevant chunks should now appear first.


8. Azure OpenAI Response Generation

Finally, we generate the answer.


Import Azure OpenAI

from langchain_openai import AzureChatOpenAI

Initialize LLM

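llm = AzureChatOpenAI(
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
    api_key=os.getenv("AZURE_OPENAI_KEY"),
    # Read the deployment name from .env instead of hardcoding it.
    azure_deployment=os.getenv("AZURE_OPENAI_DEPLOYMENT", "gpt-4o"),
    # An API version is required. AZURE_OPENAI_API_VERSION and the default
    # below are assumptions; use the version your Azure resource supports.
    api_version=os.getenv("AZURE_OPENAI_API_VERSION", "2024-06-01"),
    temperature=0.2,
)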

Why Low Temperature?

A lower value such as:

temperature=0.2

makes the output more deterministic and focused.

That suits RAG systems, where answers should stick to the retrieved context.

Build Context

context = "\n".join(
    doc.page_content for doc in reranked_docs
)
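
Joining every chunk works for small inputs, but a context can grow past what you want to send. The sketch below is one possible variant that numbers each chunk, labels its source, and stops at a character budget; `build_context`, the labels, and the budget are assumptions for illustration.

```python
def build_context(chunks, max_chars):
    """Join (source_label, text) pairs, best first, without exceeding max_chars."""
    parts = []
    used = 0
    for index, (source, text) in enumerate(chunks, start=1):
        block = f"[{index}] ({source})\n{text}"
        # Account for the blank line that separates blocks.
        cost = len(block) + (2 if parts else 0)
        if used + cost > max_chars:
            break
        parts.append(block)
        used += cost
    return "\n\n".join(parts)


demo_chunks = [
    ("genai.pdf p.5", "RAG grounds answers in retrieved text."),
    ("notes.txt", "Milvus stores vectors."),
]
print(build_context(demo_chunks, max_chars=1000))
```

Numbered, labeled blocks also make it easier to ask the model to cite which chunk an answer came from.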

Prompt Engineering

prompt = f"""

Answer ONLY
from context.

Context:

{context}

Question:

{query}

"""

A strict prompt reduces hallucination, though it cannot eliminate it.
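
A common refinement is to tell the model what to do when the context does not contain the answer. The function below is a sketch of that idea; `build_prompt` and the exact wording are illustrative, not a required format.

```python
def build_prompt(context, question):
    """Build a grounded prompt with an explicit fallback answer."""
    return (
        "Answer the question using ONLY the context below.\n"
        "If the context does not contain the answer, reply: I don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question:\n{question}\n"
    )


demo_prompt = build_prompt(
    context="RAG retrieves documents before generating.",
    question="What is RAG?",
)
print(demo_prompt)
```

Giving the model an allowed way to decline tends to work better than only forbidding outside knowledge.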


Generate Answer

response = llm.invoke(prompt)

print(response.content)

9. Langfuse Observability

Production AI systems require monitoring.

Questions:

Did retrieval work?
Did hallucination happen?
Was response relevant?

Langfuse helps answer these questions by recording what each step received and returned.


Install

pip install langfuse

Import

from langfuse import Langfuse

Initialize Langfuse

langfuse = Langfuse(
    public_key=os.getenv("LANGFUSE_PUBLIC_KEY"),
    secret_key=os.getenv("LANGFUSE_SECRET_KEY"),
    host=os.getenv("LANGFUSE_BASE_URL"),
)

Log Retrieval

langfuse.create_event(
    name="retrieval",
    input={"query": query},
    output={"chunks": context},
)

# Events are sent in the background; flush before a short script exits.
langfuse.flush()

10. RAG Evaluation

We evaluate:

Retrieval Quality

Were the retrieved chunks relevant?


Faithfulness

Was the answer grounded in the context?


Hallucination Score

Did the model invent information?


Answer Relevance

Did the answer actually address the query?
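
Retrieval quality can also be measured without an LLM when you have a small labeled set of which chunks are relevant to a question. The sketch below computes precision@k and recall@k; the chunk IDs and the labeled set are invented for the example.

```python
def precision_recall_at_k(retrieved_ids, relevant_ids, k):
    """Return (precision@k, recall@k) for one query."""
    top = retrieved_ids[:k]
    hits = sum(1 for doc_id in top if doc_id in relevant_ids)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall


retrieved = [3, 7, 1, 9, 4]   # chunk IDs in ranked order
relevant = {3, 1, 8}          # chunk IDs a human marked as relevant
precision, recall = precision_recall_at_k(retrieved, relevant, k=5)
print(precision, recall)
```

Here two of the five retrieved chunks are relevant, and two of the three relevant chunks were found.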


Example evaluation prompt:

evaluation_prompt = f"""

Evaluate:

Question:
{query}

Answer:
{response.content}

Context:
{context}

Score:
1. faithfulness
2. hallucination
3. relevance
"""
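
To use such scores in code, the judge's reply has to be machine-readable. The sketch below assumes the evaluation prompt is extended to request a JSON object with scores between 0 and 1, which the prompt above does not do as written; `parse_scores` and the sample reply are hypothetical.

```python
import json

SCORE_KEYS = ("faithfulness", "hallucination", "relevance")


def parse_scores(raw, keys=SCORE_KEYS):
    """Parse a JSON reply into a dict of floats, validating the 0..1 range."""
    data = json.loads(raw)
    scores = {}
    for key in keys:
        value = float(data[key])
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{key} must be between 0 and 1, got {value}")
        scores[key] = value
    return scores


# Made-up judge reply, for illustration only.
sample_reply = '{"faithfulness": 0.9, "hallucination": 0.1, "relevance": 1}'
print(parse_scores(sample_reply))
```

Validating the reply matters because an LLM judge can return malformed or out-of-range values.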

Production RAG Pipeline

PDFs
 ↓
Loaders
 ↓
Chunking
 ↓
Embeddings
 ↓
Milvus
 ↓
Retrieval
 ↓
Reranking
 ↓
Prompt Building
 ↓
GPT-4o
 ↓
Answer
 ↓
Langfuse Monitoring
 ↓
Evaluation

Common Challenges

Bad Retrieval

Fix:

✅ Better chunking

✅ Reranking

✅ Hybrid Search
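
Hybrid search combines a keyword ranking with a vector ranking. One widely used way to merge the two lists is reciprocal rank fusion (RRF), sketched below; the chunk IDs are made up, and k=60 is a commonly used constant rather than something this pipeline prescribes.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked ID lists: each list adds 1 / (k + rank) to an ID's score."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


vector_ranking = ["c2", "c5", "c1"]    # from similarity search
keyword_ranking = ["c5", "c3", "c2"]   # from keyword search
fused = reciprocal_rank_fusion([vector_ranking, keyword_ranking])
print(fused)
```

Chunks that rank well in both lists rise to the top, even if neither search put them first.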


Hallucination

Fix:

✅ Strict prompts

✅ Low temperature

✅ Better retrieval


Large PDFs

Fix:

✅ Chunking strategy

✅ Metadata filtering


Advanced RAG Techniques

Multi-Vector Retrieval

One chunk → multiple embeddings (for example, the chunk text plus a summary of it).

This can improve retrieval.


HyDE

Generate a hypothetical answer first.

Then search using the embedding of that answer instead of the raw query.
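
The HyDE flow can be sketched as a small function that takes the three steps as callables. Everything here is illustrative: `hyde_search` and the fake functions are stand-ins so the example runs offline, and in the pipeline from this article they would wrap the LLM call, the query embedding call, and the Milvus search.

```python
def hyde_search(question, generate_fn, embed_fn, search_fn, k=5):
    """HyDE: search with the embedding of a generated answer, not the question."""
    hypothetical_answer = generate_fn(question)
    answer_vector = embed_fn(hypothetical_answer)
    return search_fn(answer_vector, k)


# Offline stand-ins for the LLM, the embedding model, and the vector search.
def fake_generate(question):
    return "RAG retrieves documents and passes them to the LLM as context."


def fake_embed(text):
    return [float(len(text)), float(text.count(" "))]


def fake_search(vector, k):
    return [f"chunk-{i}" for i in range(k)]


hits = hyde_search("What is RAG?", fake_generate, fake_embed, fake_search, k=3)
print(hits)
```

The idea is that a generated answer often resembles the wording of the stored chunks more closely than a short question does.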


RAPTOR

Hierarchical retrieval tree.

Better long document understanding.


Semantic Routing

Route each query dynamically to the most suitable index or pipeline.


ColBERT

Token-level (late-interaction) retrieval.

Often more accurate than single-vector search, at a higher storage cost.


Final Thoughts

Basic RAG:

Retrieve → Generate

Production RAG:

Retrieve
→ Rerank
→ Evaluate
→ Monitor
→ Improve

That is how enterprise AI systems are built 🚀
