Vivek

Posted on May 24

Chunking Strategies for LLM Applications: A Practical Guide to Better RAG Systems

#ai #rag #learning #programming

Learn how chunking impacts retrieval quality, embedding performance, and the overall effectiveness of Retrieval-Augmented Generation (RAG) systems.

Introduction

When building AI applications using Retrieval-Augmented Generation (RAG), developers often focus on selecting the best LLM or embedding model. But one foundational step is frequently underestimated: chunking.

Chunking

Chunking is the process of breaking large documents into smaller, manageable pieces before generating embeddings and storing them in a vector database.

Poor chunking can lead to:

Irrelevant retrieval results
Hallucinated answers
Missing context
Higher inference costs

Good chunking, on the other hand, dramatically improves retrieval precision and response quality.

In this article, we'll explore the most common chunking strategies, their trade-offs, and when to use each.

Why Chunking Matters

LLMs and embedding models have limited context windows, and a single embedding of a very long document blurs many topics into one vector.

Consider a 200-page PDF.

Instead of embedding the entire file as one vector, we split it into smaller chunks:

Large Document
      ↓
 Chunking
      ↓
Embeddings
      ↓
Vector Database
      ↓
Semantic Retrieval
      ↓
LLM Response

Without Chunking

A single massive embedding:

loses semantic granularity
returns the whole document, including irrelevant sections
increases token cost, because the full text goes into the prompt

With Chunking

Relevant document sections become searchable and retrievable.

Understanding the Chunking Trade-Off

Chunk size affects retrieval quality.

Too small:

Missing context

Too large:

Noise + irrelevant information

The ideal chunk balances:

semantic meaning
retrieval precision
token efficiency

1. Fixed-Size Chunking

The simplest and most widely used approach.

Documents are split based on a fixed character or token limit.

Example:

500 tokens
1000 characters

How It Works

Document
──────────────────────────
Chunk 1 (500 tokens)
Chunk 2 (500 tokens)
Chunk 3 (500 tokens)

Python Example

Using LangChain:

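# Recent LangChain releases also ship the splitters in the separate
# langchain_text_splitters package; adjust the import if this path is unavailable.
from langchain.text_splitter import CharacterTextSplitter

# By default the text is split on "\n\n" first, then pieces are packed
# into chunks of up to chunk_size.
splitter = CharacterTextSplitter(
    chunk_size=500,    # measured in characters by default, not tokens
    chunk_overlap=50   # characters shared between neighboring chunks
)

chunks = splitter.split_text(document)  # document: the raw text as a str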
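To make the mechanics concrete, here is a minimal sketch of fixed-size splitting without any library. The function name `fixed_size_chunks` and the sample sentence are illustrative assumptions, and sizes are counted in characters rather than tokens.

```python
def fixed_size_chunks(text: str, chunk_size: int) -> list[str]:
    """Cut text into consecutive pieces of at most chunk_size characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    # Step through the text in strides of chunk_size; the last piece may be shorter.
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


sample = "Chunking splits long documents into pieces."
chunks = fixed_size_chunks(sample, chunk_size=16)
for chunk in chunks:
    print(repr(chunk))
```

With this input the second boundary falls inside the word "into", which is exactly the kind of mid-word or mid-sentence cut listed under the cons of this strategy.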

Pros

Easy to implement
Fast processing
Predictable chunk sizes

Cons

Ignores document structure
May cut sentences mid-way
Can reduce semantic meaning

Best For

quick prototypes
small datasets
simple RAG systems

2. Recursive Chunking

A smarter version of fixed-size chunking.

Instead of splitting blindly, it attempts to preserve structure.

Typical hierarchy:

Paragraph
Sentence
Word

Only if a larger section exceeds size limits does it split further.

Workflow

Paragraph too large?
        ↓
Split into sentences
        ↓
Sentence too large?
        ↓
Split into words

Example

LangChain Recursive Splitter:

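from langchain.text_splitter import RecursiveCharacterTextSplitter

# By default the splitter tries the separators "\n\n", "\n", " " and "" in
# that order, moving to a finer one only when a piece is still too large.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50
)

chunks = splitter.split_text(document)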
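The recursion itself is easy to sketch by hand. The version below is a simplified illustration, not LangChain's implementation: the name `recursive_split` and the separator list (paragraph break, sentence end, space) are assumptions chosen to mirror the hierarchy described above.

```python
def recursive_split(text, chunk_size, separators=("\n\n", ". ", " ")):
    """Split on the coarsest separator first; recurse only into oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        # No separator left: fall back to a hard cut.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep packing pieces into the current chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # This piece alone is too large: retry with the next, finer separator.
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


text = (
    "Short intro.\n\n"
    "This paragraph is long. It has several sentences. Each one is short."
)
chunks = recursive_split(text, chunk_size=40)
```

Here the short first paragraph survives intact, while the long second paragraph is broken at sentence boundaries. Note that a separator is dropped wherever a chunk boundary lands on it, so a production splitter would need to decide whether to keep it.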

Pros

Preserves meaning
Better retrieval quality
Handles mixed documents

Cons

Slightly slower
May still ignore domain-specific structure

Best For

Most RAG systems.

This is often the default recommendation.

3. Sentence-Based Chunking

This strategy keeps chunks aligned with sentence boundaries.

Instead of arbitrary token counts:

Chunk = Complete Sentences

Example

Document:

AI systems rely on retrieval.
Chunking improves retrieval quality.
Poor chunking hurts accuracy.

Possible chunks:

Chunk 1:
AI systems rely on retrieval.

Chunk 2:
Chunking improves retrieval quality.

Chunk 3:
Poor chunking hurts accuracy.

Python Example

Using NLTK:

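import nltk
from nltk.tokenize import sent_tokenize

# sent_tokenize needs the Punkt models; recent NLTK releases name the
# resource "punkt_tab", older ones "punkt".
nltk.download("punkt_tab")

sentences = sent_tokenize(document)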
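One sentence per chunk is often too fine-grained, so a common variation is to pack whole sentences into a chunk until a size limit is reached. The sketch below assumes a simple regex as a stand-in for a real sentence tokenizer such as NLTK's, and the name `sentence_chunks` is hypothetical.

```python
import re


def sentence_chunks(text: str, max_chars: int) -> list[str]:
    """Group complete sentences into chunks of at most max_chars characters."""
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)   # close the chunk at a sentence boundary
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


document = (
    "AI systems rely on retrieval. "
    "Chunking improves retrieval quality. "
    "Poor chunking hurts accuracy."
)
chunks = sentence_chunks(document, max_chars=70)
```

With a 70-character limit the first two sentences share a chunk and the third gets its own. A single sentence longer than `max_chars` still becomes one oversized chunk, which is the limitation noted in the cons below.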

Pros

Natural language boundaries
Cleaner embeddings
Improved semantic integrity

Cons

Uneven chunk sizes
Large sentences may exceed limits

Best For

conversational data
articles
QA systems

4. Paragraph-Based Chunking

Paragraphs usually contain a coherent idea.

This makes them useful chunk boundaries.

Example

Paragraph 1 → Chunk 1
Paragraph 2 → Chunk 2
Paragraph 3 → Chunk 3
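A minimal sketch of this mapping, assuming paragraphs are separated by one or more blank lines (the function name `paragraph_chunks` is illustrative):

```python
import re


def paragraph_chunks(text: str) -> list[str]:
    """Return one chunk per blank-line-separated paragraph."""
    paragraphs = re.split(r"\n\s*\n", text)
    # Collapse internal line breaks and drop empty paragraphs.
    return [" ".join(p.split()) for p in paragraphs if p.strip()]


text = "First idea,\nstill first.\n\n\nSecond idea.\n \nThird idea."
chunks = paragraph_chunks(text)
```

Real documents often mark paragraphs differently (HTML tags, indentation, PDF layout), so the separator pattern is the part most likely to need adjusting.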

Pros

High semantic coherence
Human-readable chunks
Works well for blogs and docs

Cons

Paragraph length varies
Large paragraphs can overflow

Best For

blogs
documentation
research papers

5. Overlapping Chunking

One major issue with chunking is context loss at chunk boundaries.

Example:

Chunk 1:

The API authentication uses JWT...

Chunk 2:

...tokens for secure communication.

Important meaning spans both chunks.

Overlap solves this.

How Overlap Works

Chunk 1
──────────────
AAAA BBBB CCCC

Chunk 2
          CCCC DDDD EEEE

Notice:

CCCC

appears in both chunks.

Code Example

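from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=100  # each chunk repeats up to 100 characters of its predecessor
)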
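The diagram above can be reproduced with a plain sliding window. This is a sketch under simple assumptions (character-based sizes, a hypothetical function name `overlapping_chunks`), not how any particular library implements overlap.

```python
def overlapping_chunks(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Slide a window of chunk_size over text, advancing by chunk_size - overlap."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


chunks = overlapping_chunks("AAAABBBBCCCCDDDDEEEE", chunk_size=12, overlap=4)
print(chunks)  # ['AAAABBBBCCCC', 'CCCCDDDDEEEE']
```

The shared `CCCC` block is what lets a query about content near the boundary match either chunk.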

Pros

Better retrieval continuity
Reduces boundary problems
Higher answer accuracy

Cons

More embeddings
Larger vector storage
Increased retrieval cost

Best For

Nearly all production RAG systems.

Typical overlap:

10–20% of the chunk size
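The storage cost of overlap is easy to estimate. For a simple sliding window, each chunk after the first advances by `chunk_size - overlap`, so the count grows as the overlap grows. The helper below is a back-of-the-envelope sketch; its name and the example sizes are assumptions for illustration.

```python
import math


def estimated_chunk_count(doc_len: int, chunk_size: int, overlap: int) -> int:
    """Estimate how many sliding-window chunks a document of doc_len units yields."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    if doc_len <= chunk_size:
        return 1
    step = chunk_size - overlap
    # One initial chunk, then one more per step needed to cover the remainder.
    return 1 + math.ceil((doc_len - chunk_size) / step)


for overlap in (0, 50, 100):
    print(overlap, estimated_chunk_count(100_000, 500, overlap))
```

For a 100,000-character document with 500-character chunks, this gives 200 chunks without overlap, 223 with 50 characters of overlap, and 250 with 100. In other words, 20% overlap costs about 25% more embeddings.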

6. Semantic Chunking

Semantic chunking uses meaning instead of size.

The document is split where topic changes occur.

Boundaries follow the content itself rather than an arbitrary length limit.

Concept

Instead of:

Every 500 tokens

we split by:

Meaning shift

Example

Document:

Section A → Databases
Section B → Kubernetes
Section C → Security

Semantic chunking creates:

Chunk 1 → Database topic
Chunk 2 → Kubernetes topic
Chunk 3 → Security topic

High-Level Pipeline

Text
 ↓
Sentence embeddings
 ↓
Similarity comparison
 ↓
Topic boundary detection
 ↓
Chunks

Python Example (Conceptual)

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

Sentence similarity determines where to split.
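The pipeline can be sketched end to end with a pluggable embedding function. To keep the example self-contained, a toy bag-of-words vector stands in for a real embedding model. This is an assumption for illustration only; in practice you would pass in vectors from a model such as one loaded with sentence_transformers. The names `toy_embed`, `semantic_chunks`, and the 0.2 threshold are hypothetical.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def toy_embed(sentence: str) -> Counter:
    """Stand-in for a real embedding model: lowercase word counts."""
    return Counter(word.strip(".,").lower() for word in sentence.split())


def semantic_chunks(sentences, embed=toy_embed, threshold=0.2):
    """Start a new chunk when adjacent sentences are less similar than threshold."""
    if not sentences:
        return []
    groups = [[sentences[0]]]
    previous = embed(sentences[0])
    for sentence in sentences[1:]:
        vector = embed(sentence)
        if cosine(previous, vector) < threshold:
            groups.append([])  # similarity dropped: treat it as a topic boundary
        groups[-1].append(sentence)
        previous = vector
    return [" ".join(group) for group in groups]


sentences = [
    "The database stores user records.",
    "Each database table stores records by key.",
    "Kubernetes schedules pods on nodes.",
    "Pods restart when nodes fail.",
]
chunks = semantic_chunks(sentences)
```

On this input the two database sentences end up in one chunk and the two Kubernetes sentences in another. The threshold is the main tuning knob: too high and every sentence becomes its own chunk, too low and topics merge.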

Pros

Excellent retrieval quality
Topic-aware
Strong contextual relevance

Cons

Computationally expensive
More implementation effort

Best For

enterprise search
legal documents
knowledge bases

7. Structure-Aware Chunking

Some documents already contain structure.

Examples:

HTML headings
Markdown sections
PDFs with titles
Code files

Instead of ignoring this, we use it.

Example

Markdown:

# Authentication
JWT details...

# Rate Limiting
API throttling...

Chunks:

Authentication section
Rate Limiting section

Code Example

Markdown Header Splitter:

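from langchain.text_splitter import MarkdownHeaderTextSplitter

headers = [
    ("#", "Header1"),
    ("##", "Header2")
]

# Each returned chunk carries its heading text as metadata under these names.
# markdown_document is assumed to hold the Markdown source as a str.
splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers)
chunks = splitter.split_text(markdown_document)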
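For readers who want to see what a header splitter does under the hood, here is a dependency-free sketch that reproduces the Authentication / Rate Limiting example. The function name `split_by_headings` and the output shape are assumptions, and the simple regex would also match `#` lines inside fenced code blocks, which a real splitter has to handle.

```python
import re


def split_by_headings(markdown: str) -> list[dict]:
    """Return one {'heading', 'text'} dict per Markdown section."""
    sections = []
    heading, lines = None, []

    def flush():
        # Keep a section if it has a heading or any non-blank text.
        if heading is not None or any(line.strip() for line in lines):
            sections.append({"heading": heading, "text": "\n".join(lines).strip()})

    for line in markdown.splitlines():
        match = re.match(r"^#{1,6}\s+(.*)$", line)
        if match:
            flush()  # close the previous section before starting a new one
            heading, lines = match.group(1).strip(), []
        else:
            lines.append(line)
    flush()
    return sections


markdown = "# Authentication\nJWT details...\n\n# Rate Limiting\nAPI throttling...\n"
chunks = split_by_headings(markdown)
```

Keeping the heading alongside each chunk is useful later: it can be stored as metadata or prepended to the text before embedding.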

Pros

High semantic consistency
Uses author intent
Excellent for documentation

Cons

Depends on clean formatting
Less effective on raw text

Best For

developer docs
wikis
technical manuals

8. Code Chunking

Source code needs special handling.

Splitting every 500 characters can break logic.

Instead, split by:

function
class
module
AST nodes

Bad Chunk

def login():
    ...

cut halfway.

Better Chunk

Entire login() function

Example Using Tree-sitter

import tree_sitter

AST-based parsing preserves syntax.
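Tree-sitter covers many languages, but the same idea can be shown for Python alone with the standard-library `ast` module. The sketch below swaps in `ast` for Tree-sitter purely for illustration. The function name `code_chunks` is an assumption, and it only looks at top-level definitions.

```python
import ast


def code_chunks(source: str) -> list[tuple[str, str]]:
    """Return (name, source_text) for each top-level function or class."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # get_source_segment returns the exact text spanned by the node
            # (decorators, if any, are not included).
            chunks.append((node.name, ast.get_source_segment(source, node)))
    return chunks


source = '''
def login(user, password):
    token = issue_token(user)
    return token


class Session:
    def close(self):
        pass
'''
chunks = code_chunks(source)
for name, text in chunks:
    print(name, len(text.splitlines()), "lines")
```

Each chunk is a complete definition, so `login()` is never cut halfway. Module-level statements outside functions and classes are skipped here and would need their own handling.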

Pros

Maintains logical structure
Better code retrieval
Strong for AI coding assistants

Cons

Language-specific tooling

Best For

code copilots
repository search
software documentation

Comparing Chunking Strategies

Strategy	Quality	Complexity	Best Use
Fixed Size	Low	Low	Prototypes
Recursive	High	Low	General RAG
Sentence	Medium	Low	QA
Paragraph	Medium	Low	Articles
Overlap	High	Low	Production RAG
Semantic	Very High	High	Enterprise
Structure-Aware	High	Medium	Docs
Code Chunking	Very High	High	Code AI

A Practical Chunking Strategy

Many successful RAG systems use a hybrid approach.

Example:

Structure-aware
        +
Recursive splitting
        +
10–20% overlap

Pipeline:

Document
   ↓
Heading Split
   ↓
Recursive Chunking
   ↓
Overlap
   ↓
Embeddings
   ↓
Vector DB

This usually offers the best balance between:

relevance
cost
simplicity
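The pipeline above can be sketched in a few lines. This is a simplified illustration under stated assumptions: sections are detected with a basic Markdown heading regex, a plain word window stands in for the recursive splitter, and sizes are counted in words. The names `split_sections`, `window`, and `hybrid_chunks` are hypothetical.

```python
import re


def split_sections(markdown: str):
    """Yield (heading, body) pairs, one per Markdown section."""
    heading, body = None, []
    for line in markdown.splitlines():
        match = re.match(r"^#{1,6}\s+(.*)$", line)
        if match:
            if heading is not None or any(l.strip() for l in body):
                yield heading, "\n".join(body).strip()
            heading, body = match.group(1).strip(), []
        else:
            body.append(line)
    if heading is not None or any(l.strip() for l in body):
        yield heading, "\n".join(body).strip()


def window(words: list[str], size: int, overlap: int):
    """Yield overlapping word windows; assumes 0 <= overlap < size."""
    step = size - overlap
    for start in range(0, len(words), step):
        yield words[start:start + size]
        if start + size >= len(words):
            break


def hybrid_chunks(markdown: str, size: int = 50, overlap: int = 10) -> list[dict]:
    """Split by heading first, then window each section with overlap."""
    chunks = []
    for heading, body in split_sections(markdown):
        for piece in window(body.split(), size, overlap):
            # Overlap never crosses a heading, so sections stay separate.
            chunks.append({"heading": heading, "text": " ".join(piece)})
    return chunks


markdown = (
    "# Authentication\n"
    + " ".join(f"a{i}" for i in range(12))
    + "\n\n# Rate Limiting\nshort section\n"
)
chunks = hybrid_chunks(markdown, size=8, overlap=2)
```

Each resulting dict is ready to embed, with the heading available as metadata for filtering or for prepending to the chunk text.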

Final Thoughts

Chunking is not just preprocessing.

It directly influences:

retrieval precision
embedding quality
hallucination rate
user experience

There is no universal best strategy.

A good rule:

Start with recursive + overlap
Move to semantic or structure-aware chunking as complexity grows
Use code-aware chunking for engineering systems

In many cases, improving chunking can yield larger gains than switching to a bigger LLM, so it is worth tuning before upgrading the model.