Isaac Natarajan

Building an Advanced RAG System with Multiple Chunking Strategies — A Practical Guide

I built an Advanced RAG system that compares 4 chunking strategies (fixed-size, recursive, semantic, hierarchical) on Apple's 10-K filings using NVIDIA NIM models, Qdrant, and custom evaluation metrics. Semantic chunking won with an overall score of 0.86. Here's everything I learned.

Introduction
Retrieval-Augmented Generation (RAG) is one of the most practical applications of LLMs today. Instead of relying on a model's training data, RAG retrieves relevant information from your own documents and uses it to generate accurate, grounded answers.

But here's what most RAG tutorials skip: how you chunk your documents matters enormously. The same pipeline with different chunking strategies can produce wildly different results. I wanted to test this properly, so I built a system that runs 4 chunking strategies side by side on the same corpus and evaluates them with real metrics.

In this post I'll walk through everything — data ingestion, chunking, vector storage, retrieval, generation, evaluation, and a Streamlit chatbot UI.

Tech Stack

| Component | Tool / Technology |
| --- | --- |
| Embedding Model | NVIDIA llama-nemotron-embed-1b-v2 |
| LLM | NVIDIA llama-3.3-nemotron-super-49b-v1.5 |
| Vector Database | Qdrant (local via Docker) |
| Pipeline | LangChain |
| Tracing | LangSmith |
| Evaluation | Custom LLM-as-judge metrics |
| UI | Streamlit |

All NVIDIA models are accessed via https://integrate.api.nvidia.com/v1 which is OpenAI-compatible, making integration straightforward.

One important quirk with llama-3.3-nemotron-super-49b-v1.5: it has a thinking mode that must be explicitly disabled, and it needs a high max_tokens (8192+). Otherwise the model can spend its entire token budget on internal reasoning and return None as the content:

response = client.chat.completions.create(
    model=LLM_MODEL,
    messages=[...],
    max_tokens=8192,
    extra_body={"chat_template_kwargs": {"thinking": False}}
)
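
A small guard can make this failure mode easier to spot. The sketch below assumes the response follows the OpenAI chat-completion shape (choices[0].message.content and choices[0].finish_reason); extract_content is a hypothetical helper name, not something from the original project.

```python
from types import SimpleNamespace

def extract_content(response):
    """Return the assistant text, or raise a descriptive error if there is none."""
    choice = response.choices[0]
    content = choice.message.content
    if content is None:
        # finish_reason == "length" suggests the token budget ran out first
        raise RuntimeError(
            f"Empty completion (finish_reason={choice.finish_reason!r}); "
            "check that thinking is disabled and max_tokens is large enough"
        )
    return content.strip()

# Stand-in object shaped like a chat completion, for illustration only
fake = SimpleNamespace(choices=[SimpleNamespace(
    message=SimpleNamespace(content="  Revenue grew.  "),
    finish_reason="stop",
)])
print(extract_content(fake))
```

Failing loudly here is a design choice: a None that slips through tends to surface later as a confusing error inside the evaluation code.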

Similarly, the embedding model is asymmetric and requires an input_type parameter:

# For document chunks
client.embeddings.create(model=EMBED_MODEL, input=text, extra_body={"input_type": "passage"})

# For queries
client.embeddings.create(model=EMBED_MODEL, input=query, extra_body={"input_type": "query"})

Data Ingestion
I used Apple's 10-K annual reports for 2022 and 2023, downloaded directly from Apple's investor relations page as PDFs. Financial documents are ideal for this kind of project because they have mixed content — dense paragraphs, tables, numbered sections, and boilerplate — which makes chunking strategy comparison genuinely meaningful.

Extraction with pdfplumber:

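import os

import pdfplumber

def load_pdfs():
    """Extract and clean the text of every PDF in data/pdfs."""
    documents = []
    for filename in os.listdir("data/pdfs"):
        if filename.endswith(".pdf"):
            with pdfplumber.open(f"data/pdfs/{filename}") as pdf:
                full_text = ""
                for page in pdf.pages:
                    text = page.extract_text()
                    if text:  # skip pages with no extractable text layer
                        full_text += text + "\n"
            # clean_text is the project's own cleanup helper (not shown in this post)
            documents.append({"filename": filename, "text": clean_text(full_text)})
    return documents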

After cleaning, I ended up with ~221k characters from the 2022 report and ~207k from 2023.
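
The post does not show clean_text, so here is one possible version as a rough sketch. Every rule in it is an assumption about what "cleaning" might involve for PDF-extracted text, and the real helper in the repository may differ.

```python
import re

def clean_text(text: str) -> str:
    """Hypothetical cleaner; the project's actual clean_text is not shown."""
    # Rejoin words hyphenated across a line break: "in-\ncreased" -> "increased"
    # (crude: it would also join genuine hyphenated compounds split at a line end)
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Drop lines holding only a 1-3 digit number, which are usually page numbers
    text = re.sub(r"(?m)^[ \t]*\d{1,3}[ \t]*$\n?", "", text)
    # Collapse runs of spaces and tabs, then runs of blank lines
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()

sample = "Net sales  in-\ncreased 8%.\n\n\n\n12\nServices revenue grew."
print(clean_text(sample))
```

The page-number rule is deliberately conservative, but it can still delete a table cell that sits alone on a line, so it is worth spot-checking against the financial tables.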

The 4 Chunking Strategies
This is the heart of the project. Each strategy produces a different number and quality of chunks from the same documents.

1. Fixed-size Chunking
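The simplest approach: split roughly every N characters with some overlap, regardless of sentence or paragraph boundaries. Note that CharacterTextSplitter breaks only at the given separator (newlines here), so chunk sizes are approximate rather than exactly 512 characters.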

from langchain_text_splitters import CharacterTextSplitter

splitter = CharacterTextSplitter(chunk_size=512, chunk_overlap=50, separator="\n")
chunks = splitter.split_text(text)

Result: 951 chunks
Pros: Fast, simple, predictable
Cons: Cuts across sentences and paragraphs, losing context
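
To make the mechanics concrete, here is a dependency-free sketch of strictly fixed-width chunking with overlap. It is an illustration of the idea, not the LangChain implementation, and fixed_size_chunks is a hypothetical name.

```python
def fixed_size_chunks(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    """Slice text into chunk_size-character windows that overlap by `overlap`."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks

print(fixed_size_chunks("abcdefghij", chunk_size=4, overlap=1))
# → ['abcd', 'defg', 'ghij']
```

Each chunk starts with the last `overlap` characters of the previous one, which is what lets a sentence cut at a boundary still appear whole in at least one chunk when it is short enough.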

2. Recursive Character Splitting
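LangChain's recommended general-purpose splitter. It tries to split on paragraph breaks first, then line breaks, then periods, then spaces, and falls back to individual characters only as a last resort, so semantic units are preserved as much as possible.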

from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
    separators=["\n\n", "\n", ".", " ", ""]
)
chunks = splitter.split_text(text)

Result: 954 chunks
Pros: Smarter splits, respects natural language boundaries
Cons: Still fixed-size, just more intelligent about where to cut

3. Semantic Chunking
Instead of splitting by size, this approach embeds every sentence and splits where the semantic similarity between adjacent sentences drops below a threshold. Topics stay together, and topic boundaries become chunk boundaries.

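def semantic_chunking(documents, threshold=0.6, min_chunk_size=200):
    """Outline of the steps; the body is elided in this post."""
    # 1. Embed every sentence
    # 2. Split where cosine similarity between adjacent sentences drops below threshold
    # 3. Merge tiny chunks so each one reaches min_chunk_size
    ...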

Key insight: The threshold matters enormously. At 0.8 I got 3281 tiny chunks that couldn't answer questions. Lowering to 0.6 produced 1123 meaningful chunks that performed much better.

Result: 1123 chunks (after tuning)
Pros: Topically coherent chunks, great for complex documents
Cons: Slow (embeds every sentence), sensitive to threshold choice
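
The three steps described above (embed, split on similarity drops, merge tiny chunks) can be sketched without any external service. This is a minimal illustration under stated assumptions, not the project's code: the sentence splitter is a naive regex, embed is any callable returning a vector, and toy_embed is a stand-in for the real NVIDIA embedding call.

```python
import math
import re

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def semantic_chunks(text, embed, threshold=0.6, min_chunk_size=200):
    """Split text where adjacent-sentence similarity falls below threshold."""
    # Naive sentence splitter (assumption): break after ., ! or ? plus whitespace
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if not sentences:
        return []
    vectors = [embed(s) for s in sentences]
    chunks, current = [], [sentences[0]]
    for prev, vec, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine(prev, vec) < threshold:  # topic shift: close the current chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))

    # Keep growing a chunk until it reaches min_chunk_size characters
    merged = []
    for chunk in chunks:
        if merged and len(merged[-1]) < min_chunk_size:
            merged[-1] += " " + chunk
        else:
            merged.append(chunk)
    # A short trailing chunk is folded into its predecessor
    if len(merged) > 1 and len(merged[-1]) < min_chunk_size:
        tail = merged.pop()
        merged[-1] += " " + tail
    return merged

def toy_embed(sentence):
    # Toy 2-d "embedding": counts of revenue-like vs. risk-like words
    s = sentence.lower()
    return [s.count("revenue") + s.count("sales"), s.count("risk") + s.count("litigation")]

text = "Revenue rose. Sales of iPhone grew. Litigation risk remains. Risk factors include supply."
for chunk in semantic_chunks(text, toy_embed, threshold=0.6, min_chunk_size=10):
    print(repr(chunk))
```

Raising the threshold makes more adjacent pairs count as a topic shift, which is consistent with the jump from 1123 to 3281 chunks reported above; the min_chunk_size merge is the guard against that fragmentation.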

4. Hierarchical Chunking
Store small chunks for precise retrieval, but return their larger parent chunk to the LLM for rich context. Best of both worlds.

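from langchain_text_splitters import RecursiveCharacterTextSplitter

parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2048, chunk_overlap=50)
child_splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=50)

chunks = []
for parent in parent_splitter.split_text(text):
    # Each child is indexed for retrieval and carries its parent's full text
    for child in child_splitter.split_text(parent):
        chunks.append({"text": child, "parent_text": parent})  # other metadata fields elided in the post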

During retrieval, the child chunk is used to find the right section, but the parent text is returned to the LLM:

if strategy == "hierarchical" and "parent_text" in result.payload:
    text = result.payload["parent_text"]  # Return richer context
else:
    text = result.payload["text"]

Result: 1070 chunks
Pros: Precise retrieval + rich context, perfect faithfulness scores
Cons: More storage, context recall can suffer if parent chunks are too broad
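
One practical detail with parent-text retrieval: several child hits often share the same parent, so the same 2048-character block can end up in the prompt more than once. A possible way to collapse duplicates is sketched below. The hit shape (dicts with "score" and "payload") and the name parent_contexts are assumptions for illustration; Qdrant's client returns point objects rather than dicts.

```python
def parent_contexts(hits, limit=3):
    """Collapse child hits to unique parent texts, best-scoring first."""
    seen, contexts = set(), []
    for hit in sorted(hits, key=lambda h: h["score"], reverse=True):
        payload = hit["payload"]
        # Fall back to the child text when no parent was stored
        parent = payload.get("parent_text", payload["text"])
        if parent not in seen:
            seen.add(parent)
            contexts.append(parent)
        if len(contexts) == limit:
            break
    return contexts

hits = [
    {"score": 0.9, "payload": {"text": "child 1", "parent_text": "Parent A"}},
    {"score": 0.8, "payload": {"text": "child 2", "parent_text": "Parent A"}},
    {"score": 0.7, "payload": {"text": "child 3", "parent_text": "Parent B"}},
]
print(parent_contexts(hits))
# → ['Parent A', 'Parent B']
```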

Advanced RAG Techniques
Beyond chunking, I added three techniques to improve retrieval quality:

Query Rewriting
Before searching, the LLM generates 3 variations of the user's query to capture different aspects:

# Original: "What was Apple's total revenue in 2023?"
# Rewritten:
# 1. "What was Apple Inc.'s total revenue for fiscal year ending September 2023?"
# 2. "How much revenue did Apple generate during its 2023 fiscal period?"
# 3. "What is Apple's consolidated revenue for the twelve months ending 2023?"

Each variation searches the vector store independently; the results are then deduplicated and ranked by score.
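
The merge step can be sketched in a few lines. This version assumes each search returns (chunk_id, score) pairs and keeps the highest score seen for each chunk; the post does not say which score wins for a duplicate, so that rule and the name merge_results are assumptions.

```python
def merge_results(result_lists, top_k=5):
    """Union hits from several query variants, keeping each chunk's best score."""
    best = {}
    for hits in result_lists:
        for chunk_id, score in hits:
            if chunk_id not in best or score > best[chunk_id]:
                best[chunk_id] = score
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]

variant_hits = [
    [("a", 0.8), ("b", 0.6)],   # hits for rewritten query 1
    [("b", 0.9), ("c", 0.5)],   # hits for rewritten query 2
]
print(merge_results(variant_hits, top_k=2))
# → [('b', 0.9), ('a', 0.8)]
```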

Hybrid Search (Dense + BM25)
Combines dense vector search (semantic meaning) with BM25 keyword search (exact term matching). Financial documents have specific numbers and terminology where exact matching helps.

# Weighted sum: 70% dense (semantic) score, 30% BM25 (keyword) score
combined_score = dense_score * 0.7 + bm25_score * 0.3
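
One caveat with a weighted sum: cosine similarities are bounded, while raw BM25 scores are not, so the two are usually rescaled before mixing. The sketch below applies min-max normalization per result set and then the 0.7 / 0.3 weighting. The normalization step is an assumption on my part, and the post does not say whether the project rescales scores.

```python
def min_max(scores):
    """Rescale a {id: score} dict to [0, 1]; constant inputs map to 1.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {key: 1.0 for key in scores}
    return {key: (value - lo) / (hi - lo) for key, value in scores.items()}

def hybrid_scores(dense, bm25, dense_weight=0.7):
    """Blend normalized dense and BM25 scores; a missing id contributes 0."""
    dense_n, bm25_n = min_max(dense), min_max(bm25)
    return {
        chunk_id: dense_weight * dense_n.get(chunk_id, 0.0)
        + (1 - dense_weight) * bm25_n.get(chunk_id, 0.0)
        for chunk_id in set(dense_n) | set(bm25_n)
    }

dense = {"a": 0.9, "b": 0.5}    # cosine similarities
bm25 = {"a": 2.0, "b": 12.0}    # raw BM25 scores on a different scale
for chunk_id, score in sorted(hybrid_scores(dense, bm25).items()):
    print(chunk_id, round(score, 2))
```

Without rescaling, the BM25 value of 12.0 would dominate the sum regardless of the weights.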

Contextual Compression
Before passing chunks to the LLM, extract only the sentences relevant to the query. Reduces noise and token usage:

# From a 500-word chunk about Apple's products and revenue,
# extract only the 2 sentences about revenue figures
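
A compression step along these lines could look like the sketch below. The prompt wording, the NONE convention, and the name compress_context are all hypothetical; llm stands for any callable that takes a prompt string and returns the model's text.

```python
def compress_context(query, chunk, llm):
    """Ask an LLM (any callable str -> str) for only the relevant sentences."""
    prompt = (
        "Extract, verbatim, only the sentences from the passage that help answer "
        "the question. If none are relevant, reply with NONE.\n\n"
        f"Question: {query}\n\nPassage: {chunk}"
    )
    reply = llm(prompt).strip()
    if not reply:
        return chunk  # nothing usable came back: keep the original chunk
    return "" if reply.upper() == "NONE" else reply

# Fake model for illustration: always "extracts" the second sentence
fake_llm = lambda prompt: " Revenue rose year over year. "
print(compress_context("How did revenue change?",
                       "Apple sells phones. Revenue rose year over year.",
                       fake_llm))
```

Falling back to the full chunk on an empty reply trades some token savings for safety, since dropping a relevant chunk entirely would hurt context recall.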

Vector Storage with Qdrant
I chose Qdrant over ChromaDB for its better performance, built-in hybrid search support, and production-readiness. Running locally via Docker:

docker run -p 6333:6333 qdrant/qdrant

Each chunking strategy gets its own collection (2048-dimensional vectors from the NVIDIA embedding model):

COLLECTION_NAMES = {
    "fixed_size": "fixed_size_collection",
    "recursive": "recursive_collection",
    "semantic": "semantic_collection",
    "hierarchical": "hierarchical_collection"
}

One lesson learned: upsert in batches of 100, not all at once. Sending 1000+ points in a single request causes connection timeouts.
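
Batching is easy to add with a small generator. The helper below is a generic sketch (batched is a hypothetical name here); the commented usage assumes a qdrant-client instance named client and a prepared list named points.

```python
from itertools import islice

def batched(items, size=100):
    """Yield successive lists of at most `size` items."""
    iterator = iter(items)
    while batch := list(islice(iterator, size)):
        yield batch

# Usage with qdrant-client (not executed here; `client` and `points` are assumed):
# for batch in batched(points, 100):
#     client.upsert(collection_name="semantic_collection", points=batch)

print([len(batch) for batch in batched(range(250), 100)])
# → [100, 100, 50]
```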

Evaluation Framework
I originally planned to use RAGAS but ran into dependency conflicts with the latest version. Instead of spending hours fighting package versions, I built custom LLM-as-judge metrics — which actually gives more control and transparency.

The 4 Metrics
Faithfulness — Does the answer stick to the retrieved context, or does the model hallucinate?

Answer Relevance — Does the response actually address the question asked?

Context Precision — Of what was retrieved, how much was actually relevant?

Context Recall — Does the context contain enough information to answer the question?

Each metric prompts the LLM to return a score between 0.0 and 1.0:

def faithfulness(answer, contexts):
    context = "\n\n".join([c[:300] for c in contexts])
    prompt = f"""Given this context: {context}
And this answer: {answer}
Is the answer fully supported by the context? Reply with just a number: 1.0 for yes, 0.5 for partially, 0.0 for no."""
    return llm_score(prompt)
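
The llm_score helper is not shown in the post. Its parsing half might look something like the sketch below, which pulls the first number out of the judge's reply and clamps it to the 0.0 to 1.0 range. parse_score and its default of 0.0 are assumptions, not the project's actual code.

```python
import re

def parse_score(reply, default=0.0):
    """Pull the first number out of a judge reply and clamp it to [0, 1]."""
    if not reply:
        return default
    match = re.search(r"\d+(?:\.\d+)?", reply)
    if match is None:
        return default
    return max(0.0, min(1.0, float(match.group())))

for reply in ["0.5", "Score: 1.0 (fully supported)", "n/a", None]:
    print(repr(reply), "->", parse_score(reply))
```

Defaulting to 0.0 on an unparsable reply pulls averages down silently, so it may be worth logging those cases or excluding them from the mean instead.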

LangSmith Tracing
Every pipeline run — query, strategy, response, contexts, and all 4 metric scores — is logged to LangSmith automatically. This runs silently in the background and gives a full audit trail of every evaluation run.

Results
After evaluating all 4 strategies on 5 financial questions:

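| Strategy | Faithfulness | Ans. Relevance | Ctx. Precision | Ctx. Recall | Overall |
| --- | --- | --- | --- | --- | --- |
| Fixed-size | 0.70 | 1.00 | 0.62 | 0.60 | 0.73 |
| Recursive | 0.70 | 1.00 | 0.89 | 0.60 | 0.80 |
| Semantic | 0.90 | 1.00 | 0.76 | 0.80 | 0.86 🏆 |
| Hierarchical | 1.00 | 1.00 | 0.69 | 0.54 | 0.81 |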
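
The post does not state how Overall is computed, but the numbers are consistent with an unweighted mean of the four metrics. The check below recomputes that mean from the table values; the mean-of-four formula is my inference, not something the post confirms.

```python
from statistics import mean

# (faithfulness, answer relevance, context precision, context recall), reported overall
results = {
    "Fixed-size":   ([0.70, 1.00, 0.62, 0.60], 0.73),
    "Recursive":    ([0.70, 1.00, 0.89, 0.60], 0.80),
    "Semantic":     ([0.90, 1.00, 0.76, 0.80], 0.86),
    "Hierarchical": ([1.00, 1.00, 0.69, 0.54], 0.81),
}

for name, (metrics, reported) in results.items():
    print(f"{name:<13} mean={mean(metrics):.4f}  reported={reported:.2f}")
```

Each recomputed mean lands within rounding distance of the reported value. An equal weighting also means that answer relevance, which is 1.00 for every strategy, adds nothing to the ranking.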

Key Findings

Semantic chunking wins overall (0.86): after tuning the threshold from 0.8 to 0.6, semantic chunking produced the best context recall (0.80) and the second-best faithfulness (0.90). Topically coherent chunks mean the LLM gets focused, relevant context.

Hierarchical has perfect faithfulness (1.00): returning parent text to the LLM gives it rich, complete context to work with, and the judge flagged no hallucinations on these 5 questions.

Recursive has the best context precision (0.89): smarter split points mean the retrieved chunks are highly relevant to the query.

Fixed-size is weakest but simplest: it works fine as a baseline but leaves performance on the table.

Streamlit Chatbot
To make the project interactive, I built a Streamlit UI that lets you switch between chunking strategies in real time and see retrieved contexts:

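import streamlit as st

# rag_pipeline is the project's retrieval + generation entry point (defined elsewhere)
strategy = st.selectbox(
    "Chunking Strategy",
    ["fixed_size", "recursive", "semantic", "hierarchical"],
)

if prompt := st.chat_input("Ask about Apple's financials..."):
    result = rag_pipeline(prompt, strategy, use_rewriting=True, use_compression=True)
    st.markdown(result["response"])

    # Show the retrieved chunks so strategies can be compared side by side
    with st.expander("Retrieved Contexts"):
        for ctx in result["contexts"]:
            st.markdown(ctx)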

Run with streamlit run app.py. Try asking comparison questions like "How did iPhone revenue change between 2022 and 2023?" to see how different strategies handle multi-document retrieval.

Lessons Learned
1. The NVIDIA Nemotron model needs special handling
The model has a built-in thinking mode. Always set max_tokens=8192 and chat_template_kwargs: {"thinking": False} or you'll get None responses as the model exhausts its token budget on internal reasoning.

2. Semantic chunking threshold is critical
Threshold of 0.8 → 3281 tiny, useless chunks. Threshold of 0.6 → 1123 meaningful chunks. Always add a minimum chunk size as a guard.

3. Hierarchical chunking needs parent text for generation
If you retrieve child chunks but pass child text to the LLM, context recall suffers. Always return the parent text to the LLM while using the child for retrieval.

4. Batch your Qdrant upserts
Sending all vectors at once causes connection timeouts. Batch in groups of 100.

5. Build custom eval metrics when RAGAS doesn't cooperate
Dependency conflicts are real. Custom LLM-as-judge metrics are transparent, flexible, and work with any model.

6. Evaluation reveals what tuning hides
Without evaluation, I would never have caught that semantic was producing tiny useless chunks, or that hierarchical was ignoring parent text. Run eval early and often.

GitHub

Full source code available at: https://github.com/IsaacNatarajan/Advanced-RAG/

Built with NVIDIA NIM, Qdrant, LangChain, LangSmith, and Streamlit. If you found this useful, drop a ❤️ and feel free to ask questions in the comments.
