RAG Chunking Strategies: Semantic Chunking, Overlapping, Recursive Splitting
Introduction
Document chunking is the foundation of any RAG system. How you split documents into chunks directly determines retrieval quality: chunks that are too small lose context, chunks that are too large dilute relevance, and naive splits break semantic units mid-thought. This article covers the major chunking strategies and when to use each.
Naive Fixed-Size Chunking
The simplest approach splits text every N characters or tokens:
```python
def fixed_size_chunks(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        # Step back by `overlap` characters so adjacent chunks share context.
        start = end - overlap
    return chunks
```
Fixed-size chunking is fast and predictable. However, it frequently splits in the middle of sentences, paragraphs, or code blocks, producing chunks that are semantically incomplete. Use it only for homogeneous text or quick prototypes where retrieval quality is not critical.
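Character counts only approximate what the embedding model actually sees, so if your pipeline budgets by tokens, the same approach can run over token IDs instead. Here is a minimal sketch using the tiktoken library; the `cl100k_base` encoding is an assumption, so match it to your embedding model's tokenizer:

```python
import tiktoken

def fixed_size_token_chunks(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    enc = tiktoken.get_encoding("cl100k_base")  # assumption: pick your model's encoding
    tokens = enc.encode(text)
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(enc.decode(tokens[start:start + chunk_size]))
        start += chunk_size - overlap  # advance, keeping `overlap` tokens of shared context
    return chunks
```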
Recursive Character Text Splitter
LangChain's RecursiveCharacterTextSplitter tries to split on natural boundaries first, falling back to smaller separators:
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=64,
    separators=["\n\n", "\n", ".", " ", ""],
    keep_separator=True,
)
chunks = splitter.split_text(long_document)
```
The algorithm tries each separator in order. It first attempts to split on paragraph boundaries (\n\n). If a paragraph exceeds the chunk size, it splits on line breaks, then sentences, then spaces. This preserves as much natural structure as possible.
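If you measure chunk size in tokens rather than characters, LangChain also ships a tiktoken-backed constructor for the same splitter. A sketch, with the encoding name as an assumption; verify the exact signature against your installed version:

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",  # assumption: match your embedding model
    chunk_size=512,
    chunk_overlap=64,
)
chunks = splitter.split_text(long_document)
```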
Semantic Chunking
Semantic chunking uses embedding similarity to detect natural boundaries:
```python
import re

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def split_into_sentences(text: str) -> list[str]:
    # Naive sentence splitter; swap in nltk or spacy for production use.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def semantic_chunk(text: str, threshold: float = 0.7) -> list[str]:
    sentences = split_into_sentences(text)
    if not sentences:
        return []
    chunks = []
    current_chunk = [sentences[0]]
    for i in range(1, len(sentences)):
        # Compare the running chunk's tail (last 3 sentences) to the next sentence.
        emb_current = model.encode(" ".join(current_chunk[-3:]))
        emb_next = model.encode(sentences[i])
        similarity = cosine_similarity(emb_current, emb_next)
        if similarity < threshold or len(" ".join(current_chunk)) > 1000:
            chunks.append(" ".join(current_chunk))
            current_chunk = [sentences[i]]
        else:
            current_chunk.append(sentences[i])
    if current_chunk:
        chunks.append(" ".join(current_chunk))
    return chunks
```
Semantic chunking produces chunks that are internally coherent: each chunk discusses a single topic. The threshold controls chunk granularity. Lower values create larger chunks with more context; higher values create smaller, tighter chunks.
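One practical drawback of the loop above is that it calls the model once per sentence as it goes. A batched variant embeds every sentence in a single forward pass; this is a sketch that reuses `model` and `split_into_sentences` from the listing above, and simplifies by comparing each sentence only to its immediate predecessor rather than a trailing window:

```python
import numpy as np

def semantic_chunk_batched(text: str, threshold: float = 0.7, max_chars: int = 1000) -> list[str]:
    sentences = split_into_sentences(text)
    if not sentences:
        return []
    # One batched forward pass; normalized embeddings make dot product == cosine.
    embs = model.encode(sentences, normalize_embeddings=True)
    sims = np.sum(embs[:-1] * embs[1:], axis=1)  # similarity of each adjacent pair
    chunks, current = [], [sentences[0]]
    for i, sim in enumerate(sims, start=1):
        if sim < threshold or len(" ".join(current)) > max_chars:
            chunks.append(" ".join(current))
            current = [sentences[i]]
        else:
            current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks
```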
Chunking by Document Structure
When documents have known structures (headings, sections), use the structure to define chunks:
```python
import re

def structure_aware_chunk(markdown_text: str) -> list[dict]:
    chunks = []
    # Text before the first heading falls into a default section.
    current_section = {"heading": "Introduction", "level": 1, "content": []}
    for line in markdown_text.split("\n"):
        heading_match = re.match(r"^(#{1,3})\s+(.+)$", line)
        if heading_match:
            if current_section["content"]:
                chunks.append(current_section)
            current_section = {
                "heading": heading_match.group(2),
                "level": len(heading_match.group(1)),  # heading depth = number of '#'
                "content": [],
            }
        else:
            current_section["content"].append(line)
    if current_section["content"]:
        chunks.append(current_section)
    return chunks
```
Structure-aware chunking preserves document hierarchy. Each chunk retains a heading reference, enabling richer retrieval context and more accurate citation.
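To feed these sections into a retriever, a common next step is to flatten each one back into text with its heading prepended, so the embedding carries the section context. A minimal sketch over the output above, where `markdown_text` stands in for your document:

```python
sections = structure_aware_chunk(markdown_text)
texts = [
    "#" * section["level"] + f" {section['heading']}\n" + "\n".join(section["content"])
    for section in sections
]
# Each chunk now starts with its heading, so retrieval hits can cite their section.
```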
Sliding Window with Overlap
Overlap between adjacent chunks prevents information loss at boundaries:
```python
def sliding_window_chunks(text: str, window: int = 512, stride: int = 384) -> list[str]:
    if len(text) <= window:
        return [text]
    chunks = []
    for i in range(0, len(text) - window + 1, stride):
        chunks.append(text[i:i + window])
    # Keep the tail so text past the last full window is not dropped.
    if (len(text) - window) % stride != 0:
        chunks.append(text[-window:])
    return chunks
```
A 512-token window with a 384-token stride means each adjacent pair overlaps by 128 tokens. This ensures that no query misses context that spans a chunk boundary. The trade-off is increased storage and more chunks to search. Note that the sketch above slices characters; a token-accurate variant follows.
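Here is the same window logic over token IDs, again a sketch using tiktoken with the encoding name as an assumption:

```python
import tiktoken

def sliding_window_token_chunks(text: str, window: int = 512, stride: int = 384) -> list[str]:
    enc = tiktoken.get_encoding("cl100k_base")  # assumption: match your model's tokenizer
    tokens = enc.encode(text)
    if len(tokens) <= window:
        return [text]
    chunks = []
    for i in range(0, len(tokens) - window + 1, stride):
        chunks.append(enc.decode(tokens[i:i + window]))
    if (len(tokens) - window) % stride != 0:
        chunks.append(enc.decode(tokens[-window:]))  # keep the uncovered tail
    return chunks
```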
Choosing the Right Strategy
| Strategy | Best For | Pros | Cons |
|----------|----------|------|------|
| Fixed-size | Simple docs, testing | Fast, predictable | Breaks sentences |
| Recursive | General prose, markdown | Respects natural boundaries | Still size-driven |
| Semantic | Topically varied documents | Internally coherent chunks | Slower; per-sentence embedding cost |
| Structure-aware | Docs with headings | Preserves hierarchy, citable sections | Requires known structure |
| Sliding window | Boundary-sensitive retrieval | No context lost at boundaries | More storage, more chunks to search |