Sridhar S

Posted on May 26

Master RAG Systems: Build an End-to-End LangChain Pipeline with Milvus, Reranking & Azure OpenAI 🚀

#python #machinelearning #ai #tutorial

Beyond Basic RAG: Learn LangChain + RAG End-to-End 🚀

Introduction

Retrieval-Augmented Generation (RAG) is one of the most important concepts in modern Generative AI.

Large Language Models (LLMs) like GPT-4, Claude, LLaMA, and Gemini are powerful. However, they suffer from one major issue:

Hallucination

Hallucination means:

The model confidently generates incorrect information.

Example:

Question:

Who is the CEO of my company?

Without access to your internal company data, an LLM may generate a completely wrong answer.

This is where RAG (Retrieval-Augmented Generation) becomes useful.

Instead of relying only on pretrained knowledge, RAG retrieves relevant information from external sources and provides context to the LLM before generating a response.

What is RAG?

RAG stands for:

Retrieval-Augmented Generation

Instead of:

Question → LLM → Answer

We do:

Question
   ↓
Retrieve Relevant Documents
   ↓
Provide Context to LLM
   ↓
Generate Grounded Response

This makes responses:

✅ More accurate

✅ Context-aware

✅ Less prone to hallucination

✅ Enterprise-ready
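
The flow above can be sketched in plain Python. This is a toy illustration, not the pipeline built later in this article: retrieval is a simple word-overlap ranking instead of vector search, the company facts are invented, and the function names (tokenize, retrieve, build_prompt) are hypothetical. The LLM call is left out, so the sketch stops at the prompt that would be sent.

```python
import re

# Toy in-memory "knowledge base"; the contents are made up for illustration.
DOCS = [
    "Acme Corp was founded in 2011 and builds logistics software.",
    "The CEO of Acme Corp is Jane Doe.",
    "Acme Corp has offices in Chennai and Berlin.",
]


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, docs: list[str], k: int = 2) -> list[str]:
    """Rank documents by shared words with the question and keep the top k."""
    q = tokenize(question)
    ranked = sorted(docs, key=lambda d: len(q & tokenize(d)), reverse=True)
    return ranked[:k]


def build_prompt(question: str, context_docs: list[str]) -> str:
    """Place the retrieved text in front of the question."""
    context = "\n".join(context_docs)
    return f"Context:\n{context}\n\nQuestion:\n{question}"


question = "Who is the CEO of Acme Corp?"
top_docs = retrieve(question, DOCS)
prompt = build_prompt(question, top_docs)
print(prompt)
```

The model now sees the relevant sentence next to the question, which is the whole idea behind a grounded response.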

Complete RAG Architecture

Documents (PDFs, DOCX, TXT)
            ↓
      Document Loading
            ↓
         Chunking
            ↓
         Embeddings
            ↓
      Vector Database
            ↓
      Similarity Search
            ↓
         Reranking
            ↓
       Context Building
            ↓
            LLM
            ↓
         Final Answer
            ↓
     Monitoring & Evaluation

Required Installation

Before starting, install all dependencies.

pip install langchain
pip install langchain-community
pip install langchain-core
pip install langchain-openai
pip install langchain-text-splitters
pip install langchain-nvidia-ai-endpoints
pip install pymilvus
pip install pymupdf
pip install pypdf
pip install langfuse
pip install python-dotenv

Project Structure

project/
│
├── data/
│   ├── pdf/
│   └── text/
│
├── .env
├── rag_pipeline.py
└── requirements.txt

Environment Variables (.env)

Never hardcode API keys.

Create a .env file.

NVIDIA_API_KEY=your_key
AZURE_OPENAI_ENDPOINT=your_endpoint
AZURE_OPENAI_KEY=your_key
AZURE_OPENAI_DEPLOYMENT=gpt-4o

LANGFUSE_PUBLIC_KEY=your_key
LANGFUSE_SECRET_KEY=your_key
LANGFUSE_BASE_URL=https://cloud.langfuse.com

1. Understanding LangChain Document Structure

LangChain stores documents in a standardized format.

A document contains:

page_content
metadata

page_content

This contains actual text.

Example:

page_content = "Generative AI is growing rapidly."

metadata

Metadata stores additional information.

Examples:

file name
author
created date
source
page number

Creating a LangChain Document

Import

from langchain_core.documents import Document

Code

doc = Document(
    page_content="""
    Generative AI is a subset of Artificial Intelligence
    focused on creating content.
    """,
    metadata={
        "source": "genai.pdf",
        "author": "Sridhar",
        "pages": 10
    }
)

print(doc)

Output (abridged)

page_content='Generative AI...' metadata={'source': 'genai.pdf', 'author': 'Sridhar', 'pages': 10}

Why does metadata matter?

In enterprise AI, you often need to show where an answer came from:

“Show the answer from document X, page 5”

Metadata helps with traceability.
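
As a small sketch of traceability, the snippet below turns metadata into a citation label. The Doc class is a hypothetical stand-in with the same two fields as a LangChain Document, so the example runs without any dependency, and format_citation is an illustrative helper, not a LangChain function.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in with the same two fields as a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_citation(doc: Doc) -> str:
    """Build a human-readable source label from whatever metadata is present."""
    source = doc.metadata.get("source", "unknown source")
    page = doc.metadata.get("page")
    return f"{source}, page {page}" if page is not None else source


chunk = Doc(
    page_content="Generative AI is a subset of Artificial Intelligence.",
    metadata={"source": "genai.pdf", "page": 5},
)
citation = format_citation(chunk)
print(f"{chunk.page_content} [{citation}]")
```

If the metadata is lost during chunking or storage, this kind of citation is no longer possible, so it is worth carrying it through every step.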

2. Loading Documents

Before processing documents, we must load them.

LangChain provides multiple loaders.

TextLoader

Used for:

.txt files
plain text files

Import

from langchain_community.document_loaders import TextLoader

Example

loader = TextLoader(
    "data/text/sample.txt",
    encoding="utf-8"
)

documents = loader.load()

print(documents)

DirectoryLoader

Loads multiple files from a folder.

Useful when:

You have:

100 PDFs
50 TXT files
many documents

Import

from langchain_community.document_loaders import DirectoryLoader

Example

loader = DirectoryLoader(
    "data/text",
    glob="*.txt",
    loader_cls=TextLoader,
    loader_kwargs={
        "encoding":"utf-8"
    }
)

documents = loader.load()

print(documents)

PDF Loader

Most enterprise RAG systems use PDFs.

LangChain supports:

PyPDFLoader

Simple and fast.

Import

from langchain_community.document_loaders import PyPDFLoader

Example

loader = PyPDFLoader(
    "data/pdf/rag_guide.pdf"
)

documents = loader.load()

print(documents[0])

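Each page becomes one Document. PyPDFLoader numbers pages from 0, so the first page looks like:

Document(
    page_content="Page text",
    metadata={"source": "data/pdf/rag_guide.pdf", "page": 0}
)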

3. Chunking Documents

Chunking is one of the most important parts of RAG.

Why?

Because LLMs have limited context windows, and retrieval works best on small, focused pieces of text.

You cannot reliably send:

500 page PDF

to GPT in a single prompt.

Instead:

We split documents into smaller chunks.

Why Chunking Matters?

Bad chunking causes:

❌ poor retrieval

❌ hallucination

❌ context loss

Good chunking improves:

✅ retrieval quality

✅ relevance

✅ accuracy

RecursiveCharacterTextSplitter

Most commonly used splitter.

Import

from langchain_text_splitters import (
    RecursiveCharacterTextSplitter
)

Code

text_splitter = (
    RecursiveCharacterTextSplitter(
        chunk_size=500,
        chunk_overlap=50,
        length_function=len,
        separators=[
            "\n\n",
            "\n",
            " ",
            ""
        ]
    )
)

chunks = text_splitter.split_documents(
    documents
)

print(len(chunks))

Parameters Explained

chunk_size

How large each chunk should be.

Example:

chunk_size=500

means:

At most 500 characters per chunk (length_function=len counts characters, not tokens).

chunk_overlap

Prevents context loss.

Example:

Chunk 1:

Artificial Intelligence is...

Chunk 2 starts with:

Intelligence is...

This preserves continuity.
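
To make the overlap concrete, here is a simplified fixed-window splitter. It is only a sketch of the idea: RecursiveCharacterTextSplitter is smarter because it tries to cut at the separators first. The function name split_with_overlap and the sample sentence are made up for this example.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size windows that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


text = "Artificial Intelligence is transforming every industry."
pieces = split_with_overlap(text, chunk_size=30, chunk_overlap=10)
for piece in pieces:
    print(repr(piece))
```

The last 10 characters of each piece are repeated at the start of the next one, so a sentence cut at a boundary is still readable in at least one chunk.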

Best Practices

A reasonable starting point:

chunk_size = 300–800 characters
chunk_overlap = 30–100 characters

Tune both values against your own documents and queries.

4. Understanding Embeddings

Once chunking is completed, we need to convert text into a format machines can understand.

Similarity search works on:

Numbers (Vectors)

Not raw text.

This is where Embeddings come in.

What are Embeddings?

Embeddings convert text into numerical vector representations.

Example:

Text:

"Artificial Intelligence"

becomes:

[0.24, -0.76, 0.88, ....]

These vectors help us find:

Semantic Meaning

Example:

What is AI?

and

Explain Artificial Intelligence

have similar meanings.

Embedding models place them close together in vector space.
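
"Close together" is usually measured with cosine similarity. The sketch below uses made-up 3-dimensional vectors purely to show the calculation. Real embedding models produce hundreds or thousands of dimensions, and the variable names are only illustrative.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up vectors; they do not come from any real embedding model.
what_is_ai = [0.9, 0.1, 0.2]
explain_ai = [0.8, 0.2, 0.3]
pizza_recipe = [0.1, 0.9, 0.1]

close = cosine_similarity(what_is_ai, explain_ai)
far = cosine_similarity(what_is_ai, pizza_recipe)
print(f"similar meaning: {close:.3f}, unrelated: {far:.3f}")
```

The two AI questions score much higher with each other than with the unrelated text, which is exactly what retrieval relies on.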

Why Embeddings are Important in RAG?

Without embeddings:

Search becomes:

Keyword matching

Example:

Searching:

CEO

Only returns exact keyword matches.

With embeddings:

Search becomes:

Semantic Search

Meaning-based retrieval.

Even if wording differs.

NVIDIA Embeddings

We will use:

NVIDIA Llama Nemotron Embedding Model

Advantages:

✅ Fast

✅ High-quality embeddings

✅ Good semantic understanding

✅ Free developer tier

Import Required Libraries

import os

from dotenv import load_dotenv

from langchain_nvidia_ai_endpoints import (
    NVIDIAEmbeddings
)

Load Environment Variables

load_dotenv()

Initialize Embedding Model

embedding_model = (
    NVIDIAEmbeddings(
        model=
        "nvidia/llama-nemotron-embed-vl-1b-v2",

        nvidia_api_key=
        os.getenv(
            "NVIDIA_API_KEY"
        )
    )
)

Convert Chunks into Embeddings

Before embedding:

We only need:

page_content

from chunks.

Extract Text

texts = [
    chunk.page_content
    for chunk in chunks
]

Generate Embeddings

embedded_vectors = (
    embedding_model.embed_documents(
        texts
    )
)

Check Embedding Dimension

print(
    len(
        embedded_vectors
    )
)

print(
    len(
        embedded_vectors[0]
    )
)

Output (example):

50
2048

Meaning:

50 chunks (this number depends on your documents)
Each chunk is a 2048-dimensional vector

Query Embedding

User questions also need embeddings.

Example:

query = (
    "What is RAG?"
)

query_embedding = (
    embedding_model.embed_query(
        query
    )
)

Now query and document vectors can be compared.

5. Vector Databases (Milvus)

Imagine storing:

Millions of embeddings

in a regular SQL table.

Finding the nearest vectors would mean comparing the query against every row.

Very slow.

Traditional databases are not optimized for:

Similarity Search

We need:

Vector Database

Examples (vector databases and libraries):

Pinecone
FAISS
Chroma
Milvus
Weaviate

We will use:

Milvus

Why?

✅ Fast retrieval

✅ Open-source

✅ Enterprise-ready

✅ Optimized for vectors

Install Milvus

pip install pymilvus

Import Milvus

from pymilvus import (
    MilvusClient
)

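Create Milvus Connection

Passing a local file path as the uri uses Milvus Lite, which stores everything in that file. No server is needed.
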
client = MilvusClient(
    uri="milvus_demo.db"
)

print(
    "Connected Successfully"
)

Create Collection

A collection is like:

SQL Table

for vector data.

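Code

# Only create the collection if it
# does not exist yet
if not client.has_collection(
    collection_name="rag_collection"
):

    # Quick setup: creates an "id"
    # primary key and a "vector" field
    client.create_collection(
        collection_name=
        "rag_collection",

        dimension=2048
    )

    print(
        "Collection Created"
    )

else:

    print(
        "Collection Already Exists"
    )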

Why does dimension matter?

The collection dimension must match the embedding dimension (2048 for the model used here).

Otherwise:

Insertion will fail
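
A small guard before inserting can surface a mismatch early with a readable message. This is a minimal sketch: check_dimensions is a hypothetical helper written for this article, not part of pymilvus, and the demo uses 4 dimensions instead of 2048 to stay short.

```python
def check_dimensions(vectors: list[list[float]], expected_dim: int) -> None:
    """Raise ValueError if any vector's length differs from expected_dim."""
    for index, vector in enumerate(vectors):
        if len(vector) != expected_dim:
            raise ValueError(
                f"vector {index} has {len(vector)} dimensions, "
                f"expected {expected_dim}"
            )


COLLECTION_DIM = 4  # tiny value for the demo; the collection in this article uses 2048

good = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]]
bad = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6]]

check_dimensions(good, COLLECTION_DIM)  # passes silently

try:
    check_dimensions(bad, COLLECTION_DIM)
    error_message = ""
except ValueError as exc:
    error_message = str(exc)
    print(error_message)
```

A typical cause of a mismatch is switching the embedding model without recreating the collection.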

Insert Data into Milvus

We store:

ID
Embedding vector
Chunk text

Prepare Data

data = []

for i, (
    chunk,
    embedding
) in enumerate(
    zip(
        chunks,
        embedded_vectors
    )
):

    data.append({

        "id": i,

        "vector":
        embedding,

        "text":
        chunk.page_content
    })

Insert into Collection

client.insert(
    collection_name=
    "rag_collection",

    data=data
)

print(
    "Inserted Successfully"
)

6. Similarity Retrieval

Now comes the real magic.

When a user asks:

"What is RAG?"

We do:

Convert query → embedding
Search similar vectors
Return relevant chunks

Generate Query Embedding

query = (
    "What is RAG?"
)

query_embedding = (
    embedding_model.embed_query(
        query
    )
)

Search in Milvus

results = client.search(

    collection_name=
    "rag_collection",

    data=[
        query_embedding
    ],

    limit=5,

    output_fields=[
        "text"
    ]
)

Understanding Parameters

limit

How many chunks to retrieve.

Example:

limit=5

returns:

Top 5 relevant chunks

output_fields

Fields to return.

Example:

"text"

returns chunk text.

View Retrieved Chunks

for result in results[0]:

    print(
        result["entity"]
        ["text"]
    )

    print(
        "----------------"
    )
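
Each hit also carries a score under the "distance" key, next to "id" and "entity". The sketch below drops weak hits before they reach the next step. It runs on a hand-written list shaped like the search output, so no Milvus instance is needed. The MIN_SCORE cutoff and the keep_confident_hits helper are assumptions for illustration, and the filter only makes sense for metrics where a larger value means more similar (cosine similarity or inner product).

```python
# Hand-written stand-in shaped like MilvusClient.search output:
# one list of hits per query, each hit a dict with "id", "distance", "entity".
results = [[
    {"id": 3, "distance": 0.82, "entity": {"text": "RAG retrieves documents before generating."}},
    {"id": 7, "distance": 0.64, "entity": {"text": "RAG grounds answers in external context."}},
    {"id": 1, "distance": 0.21, "entity": {"text": "Milvus is a vector database."}},
]]

MIN_SCORE = 0.5  # hypothetical cutoff; tune it on your own data


def keep_confident_hits(hits: list[dict], min_score: float) -> list[dict]:
    """Keep hits whose similarity score is at least min_score.

    Assumes a metric where a larger "distance" means more similar.
    """
    return [hit for hit in hits if hit["distance"] >= min_score]


kept = keep_confident_hits(results[0], MIN_SCORE)
for hit in kept:
    print(f'{hit["distance"]:.2f}  {hit["entity"]["text"]}')
```

If nothing survives the cutoff, that is a useful signal that the answer may not be in the corpus at all.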

Problem with Similarity Search

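Sometimes the top results are close to the query in vector space but do not actually answer the question.

The fix:

Reranking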

7. Reranking

Reranking improves retrieval quality.

Instead of trusting the order of:

Top K vectors

We re-score each chunk against the query.

Why Reranking Matters?

Without reranking:

Bad chunks may enter context.

Result:

❌ hallucination

❌ irrelevant answers

With reranking:

Only the most relevant chunks are sent to the LLM.

Import Reranker

from langchain_nvidia_ai_endpoints import (
    NVIDIARerank
)

Initialize Reranker

reranker = (
    NVIDIARerank(
        nvidia_api_key=
        os.getenv(
            "NVIDIA_API_KEY"
        )
    )
)

Convert Milvus Results → Documents

Reranker expects:

LangChain Documents

not strings.

from langchain_core.documents import (
    Document
)

retrieved_docs = [

    Document(
        page_content=
        r["entity"]
        ["text"]
    )

    for r in results[0]
]

Run Reranking

reranked_docs = (
    reranker.compress_documents(

        documents=
        retrieved_docs,

        query=query
    )
)

View Reranked Results

for doc in reranked_docs:

    print(
        doc.page_content
    )

The chunks are now ordered by the reranker's relevance score, so the most relevant ones come first.
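
The pattern itself (score every query and chunk pair, sort, keep the best) can be shown without any API. In this sketch the scorer is a plain word-overlap ratio used as a stand-in. A real reranker such as NVIDIARerank uses a trained model instead, and the names tokens, overlap_score, and rerank are hypothetical.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercase word set used by the stand-in scorer."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_score(query: str, passage: str) -> float:
    """Fraction of query words that appear in the passage (stand-in scorer)."""
    q = tokens(query)
    return len(q & tokens(passage)) / len(q) if q else 0.0


def rerank(query: str, passages: list[str], top_n: int) -> list[tuple[float, str]]:
    """Score every (query, passage) pair, sort by score, keep the best top_n."""
    scored = [(overlap_score(query, p), p) for p in passages]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_n]


query = "What is RAG?"
# Pretend this is the order returned by vector search.
candidates = [
    "Milvus is a vector database.",
    "What RAG is: retrieval of documents followed by grounded generation.",
    "Chunk overlap preserves continuity.",
]
best = rerank(query, candidates, top_n=2)
for score, passage in best:
    print(f"{score:.2f}  {passage}")
```

The second candidate moves to the top and the weakest one is dropped, which mirrors what happens when you retrieve a generous Top K and let the reranker keep only a few.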

8. Azure OpenAI Response Generation

Finally:

We generate the answer.

Import Azure OpenAI

from langchain_openai import (
    AzureChatOpenAI
)

Initialize LLM

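llm = AzureChatOpenAI(

    azure_endpoint=
    os.getenv(
        "AZURE_OPENAI_ENDPOINT"
    ),

    api_key=
    os.getenv(
        "AZURE_OPENAI_KEY"
    ),

    # Read the deployment name from .env
    azure_deployment=
    os.getenv(
        "AZURE_OPENAI_DEPLOYMENT"
    ),

    # The client needs an API version.
    # "2024-06-01" is an assumption: use
    # one your Azure resource supports.
    api_version="2024-06-01",

    temperature=0.2
)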

Why Low Temperature?

Lower:

temperature=0.2

means:

More deterministic, less random answers that stay closer to the context.

Build Context

context = "\n".join([

    doc.page_content

    for doc in reranked_docs
])
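
When many chunks are retrieved, it can help to number them and cap the total size so the prompt stays within budget. The sketch below is one possible way to do that. The build_context helper, the [n] labels, and the character budget are assumptions for illustration, and a production system would more likely count tokens than characters.

```python
def build_context(passages: list[str], max_chars: int) -> str:
    """Join numbered passages in order, stopping before max_chars is exceeded."""
    parts = []
    used = 0
    for number, passage in enumerate(passages, start=1):
        block = f"[{number}] {passage.strip()}"
        cost = len(block) + (1 if parts else 0)  # +1 for the joining newline
        if used + cost > max_chars:
            break  # later passages are lower ranked, so drop them first
        parts.append(block)
        used += cost
    return "\n".join(parts)


passages = [
    "RAG retrieves documents before generating an answer.",
    "Reranking re-scores retrieved chunks.",
    "Milvus stores embedding vectors.",
]
context = build_context(passages, max_chars=120)
print(context)
```

Because the passages arrive in reranked order, cutting from the end removes the least relevant text first.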

Prompt Engineering

prompt = f"""

Answer ONLY
from context.

Context:

{context}

Question:

{query}

"""

Strict prompt:

Reduces hallucination, but does not eliminate it.
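
One way to make a strict prompt more robust is to tell the model what to say when the context has no answer. The sketch below wraps that in a function. The FALLBACK wording and the build_prompt name are assumptions for illustration, not something the pipeline above requires.

```python
FALLBACK = "I don't know based on the provided context."  # hypothetical wording


def build_prompt(context: str, question: str) -> str:
    """Return a grounded prompt with an explicit fallback instruction."""
    return (
        "Answer the question using ONLY the context below.\n"
        f'If the context does not contain the answer, reply exactly: "{FALLBACK}"\n\n'
        f"Context:\n{context}\n\n"
        f"Question:\n{question}\n"
    )


prompt = build_prompt(
    context="RAG retrieves relevant documents and passes them to the LLM.",
    question="What is RAG?",
)
print(prompt)
```

A fixed fallback sentence is also easy to detect in code, so the application can show a "no answer found" message instead of a guess.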

Generate Answer

response = llm.invoke(
    prompt
)

print(
    response.content
)

9. Langfuse Observability

Production AI systems require monitoring.

Questions:

Did retrieval work?
Did hallucination happen?
Was response relevant?

Langfuse solves this.

Install

pip install langfuse

Import

from langfuse import (
    Langfuse
)

Initialize Langfuse

langfuse = Langfuse(

    public_key=
    os.getenv(
        "LANGFUSE_PUBLIC_KEY"
    ),

    secret_key=
    os.getenv(
        "LANGFUSE_SECRET_KEY"
    ),

    host=
    os.getenv(
        "LANGFUSE_BASE_URL"
    )
)

Log Retrieval

langfuse.create_event(

    name="retrieval",

    input={
        "query":
        query
    },

    output={
        "chunks":
        context
    }
)

# Send buffered events before
# a short script exits
langfuse.flush()

10. RAG Evaluation

We evaluate:

Retrieval Quality

Were the retrieved chunks relevant?

Faithfulness

Was the answer grounded in the context?

Hallucination Score

Did the model invent information?

Answer Relevance

Did the answer actually address the query?

Example evaluation prompt:

evaluation_prompt = f"""

Evaluate:

Question:
{query}

Answer:
{response.content}

Context:
{context}

Score:
1. faithfulness
2. hallucination
3. relevance
"""
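
The judge model's reply still has to be turned into numbers. A common approach is to ask for JSON and validate it before logging. The sketch below assumes the evaluation prompt was extended to request a JSON object with scores between 0 and 1. That format, the parse_scores helper, and the sample reply are all assumptions, not output from a real model.

```python
import json

EXPECTED_KEYS = ("faithfulness", "hallucination", "relevance")


def parse_scores(raw_reply: str) -> dict[str, float]:
    """Parse a JSON reply and check every expected score is in [0, 1]."""
    data = json.loads(raw_reply)
    scores = {}
    for key in EXPECTED_KEYS:
        value = float(data[key])  # KeyError if the judge omitted a score
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{key} score {value} is outside [0, 1]")
        scores[key] = value
    return scores


# Hand-written stand-in for what a judge model might return.
judge_reply = '{"faithfulness": 0.9, "hallucination": 0.1, "relevance": 0.8}'
scores = parse_scores(judge_reply)
print(scores)
```

Failing loudly on a malformed reply is deliberate: a silently missing score would skew any averages computed later.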

Production RAG Pipeline

PDFs
 ↓
Loaders
 ↓
Chunking
 ↓
Embeddings
 ↓
Milvus
 ↓
Retrieval
 ↓
Reranking
 ↓
Prompt Building
 ↓
GPT-4o
 ↓
Answer
 ↓
Langfuse Monitoring
 ↓
Evaluation

Common Challenges

Bad Retrieval

Fix:

✅ Better chunking

✅ Reranking

✅ Hybrid Search
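
Hybrid search means combining vector results with keyword results. One widely used way to merge the two ranked lists is Reciprocal Rank Fusion (RRF), sketched below. The chunk IDs are hypothetical, k=60 is the constant commonly used with RRF, and this is not the only way to fuse results.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked ID lists; each list adds 1 / (k + rank) to an ID's score."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical chunk IDs returned by two separate searches for the same query.
vector_hits = ["c2", "c5", "c1"]
keyword_hits = ["c1", "c2", "c9"]

fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused)
```

Chunks that appear in both lists rise to the top, while chunks found by only one search are kept but ranked lower.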

Hallucination

Fix:

✅ Strict prompts

✅ Low temperature

✅ Better retrieval

Large PDFs

Fix:

✅ Chunking strategy

✅ Metadata filtering

Advanced RAG Techniques

Multi-Vector Retrieval

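One chunk → multiple embeddings (for example the raw text, a summary, and questions it could answer).

This gives a query more ways to match the chunk.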

HyDE

Generate a hypothetical answer first.

Then search with the embedding of that answer.
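
The HyDE flow can be sketched with stub components so it runs without any external service. Everything here is a stand-in: fake_generate plays the LLM, fake_embed is a two-feature toy embedding, and fake_search is a tiny in-memory index. In a real pipeline these would be the LLM, the embedding model, and the vector database.

```python
from typing import Callable


def hyde_search(
    question: str,
    generate: Callable[[str], str],
    embed: Callable[[str], list[float]],
    search: Callable[[list[float]], list[str]],
) -> list[str]:
    """Embed a generated hypothetical answer instead of the raw question."""
    hypothetical_answer = generate(question)
    return search(embed(hypothetical_answer))


def fake_generate(question: str) -> str:
    """Stub LLM: returns a canned hypothetical answer."""
    return "RAG retrieves documents and feeds them to an LLM as context."


def fake_embed(text: str) -> list[float]:
    """Toy embedding: does the text mention retrieval, does it mention an LLM."""
    lowered = text.lower()
    return [float("retriev" in lowered), float("llm" in lowered)]


def fake_search(vector: list[float]) -> list[str]:
    """Rank a tiny in-memory corpus by dot product with the query vector."""
    corpus = {"doc_rag": [1.0, 1.0], "doc_sql": [0.0, 0.0]}

    def score(name: str) -> float:
        return sum(a * b for a, b in zip(vector, corpus[name]))

    return sorted(corpus, key=score, reverse=True)


hits = hyde_search("What is RAG?", fake_generate, fake_embed, fake_search)
print(hits)
```

In this toy setup the short question embeds to a zero vector, while the hypothetical answer shares vocabulary with the stored document. That is the intuition behind HyDE: answers tend to look more like documents than questions do.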

RAPTOR

Hierarchical retrieval tree built from recursive summaries of chunk clusters.

Better long-document understanding.

Semantic Routing

Route each query to the most suitable retriever, prompt, or data source based on its meaning.

ColBERT

Token-level (late interaction) retrieval.

Often more accurate than single-vector search, at a higher storage cost.

Final Thoughts

Basic RAG:

Retrieve → Generate

Production RAG:

Retrieve
→ Rerank
→ Evaluate
→ Monitor
→ Improve

That is how enterprise AI systems are built 🚀

Top comments (1)

Harjot Singh • May 31

The reranking step is the part of this pipeline I'd point readers at, because it's the highest-leverage and most-skipped piece. Plain vector retrieval optimizes for embedding similarity, which is a decent first-pass filter but routinely ranks a surface-similar-but-irrelevant chunk above the actually-correct one. A reranker (cross-encoder scoring query+chunk together) fixes exactly that, and on most RAG systems adding rerank moves answer quality more than swapping the LLM does. Including it in an end-to-end pipeline instead of stopping at "embed and top-k" is what makes this production-shaped rather than a demo.

The thing I'd build on top: the abstain path. Even with Milvus + rerank, sometimes the right answer just isn't in the corpus, and the trustworthy system says "not supported" instead of letting Azure OpenAI generate a confident fill. That retrieve-rerank-then-verify discipline is core to how I build Moonshift, the thing I work on - a multi-agent pipeline that takes a prompt to a deployed SaaS, where a verify layer gates output against the retrieved evidence rather than trusting the model. Multi-model routing keeps a build ~$3 flat, first run free no card. Solid end-to-end writeup. What reranker did you use, and how much did it move retrieval quality vs vector-only? In my experience rerank is the single biggest quality lever in the whole stack.