Technical Acronyms:
- RAG: Retrieval-Augmented Generation—enhancing LLM responses with retrieved context
- NLP: Natural Language Processing—computational text analysis
- NLTK: Natural Language Toolkit—Python library for NLP tasks
- AST: Abstract Syntax Tree—hierarchical code representation
Statistical & Mathematical Terms:
- Cosine Similarity: Cosine of the angle between two vectors; ranges from -1 to 1 (0 to 1 for vectors with non-negative components)
- Percentile: Value below which a percentage of observations fall
- Token: Sub-word unit (~0.75 words in English on average)
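That last heuristic is easy to check with tiktoken, using the cl100k_base encoding that appears later in this article; a minimal sketch:
import tiktoken

# Rough check of the words-per-token ratio for an English sentence
encoding = tiktoken.get_encoding("cl100k_base")
sample = "Your chunking strategy determines whether retrieval returns gold or garbage."
tokens = encoding.encode(sample)
words = sample.split()
print(f"{len(words)} words -> {len(tokens)} tokens "
      f"(~{len(words) / len(tokens):.2f} words per token)")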
Introduction: Why Chunking Is Your RAG System's Foundation
Imagine you're a librarian organizing a massive archive. Someone asks about "renewable energy policies in Europe." You have three choices:
- Hand them entire books: They'll find the answer eventually, but waste hours sifting through irrelevant chapters
- Give them individual sentences: They'll get fragments without context—"The policy was enacted" means nothing alone
- Provide well-selected paragraphs: Enough context to understand, focused enough to be useful
This is the chunking problem in RAG systems. Your chunking strategy determines whether retrieval returns gold or garbage.
Here's another way to think about it: Chunking is like cutting pizza. Cut slices too big, and people can't eat them properly. Cut too small, and you have crumbs nobody can pick up. The perfect slice fits in hand and delivers a complete bite.
A third analogy: Chunking is database indexing for unstructured data. Just as poor index design cripples query performance, poor chunking cripples retrieval quality. You wouldn't index a customer table on a randomly generated column—so why chunk documents without considering their semantic structure?
The Chunk Data Structure
Every chunking strategy produces the same output format:
from typing import List, Dict, Any
from dataclasses import dataclass
import hashlib
@dataclass
class Chunk:
"""Represents a document chunk with metadata."""
content: str
chunk_id: str
source_doc: str
start_char: int
end_char: int
chunk_index: int
metadata: Dict[str, Any]
def __post_init__(self):
if not self.chunk_id:
content_hash = hashlib.md5(self.content.encode()).hexdigest()[:12]
self.chunk_id = f"{self.source_doc}_{self.chunk_index}_{content_hash}"
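A quick sanity check of the auto-generated ID (the source name here is invented for illustration):
content = "Renewable energy policies in Europe expanded rapidly after 2020."
chunk = Chunk(
    content=content,
    chunk_id="",              # empty -> generated in __post_init__
    source_doc="policy.txt",  # hypothetical source name
    start_char=0,
    end_char=len(content),
    chunk_index=0,
    metadata={"strategy": "manual"}
)
print(chunk.chunk_id)  # policy.txt_0_<first 12 hex chars of the content's MD5>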
Fixed-Size Chunking: The Baseline Approach
Fixed-size chunking is the "Hello World" of document splitting—simple, predictable, and often good enough for homogeneous text.
import tiktoken
from typing import List
class FixedSizeChunker:
"""
Fixed-size chunking with configurable overlap.
Supports both character and token-based sizing.
"""
def __init__(
self,
chunk_size: int = 1000,
overlap: int = 200,
min_chunk_size: int = 100,
unit: str = "chars" # "chars" or "tokens"
):
if overlap >= chunk_size:
raise ValueError(f"Overlap ({overlap}) must be less than chunk_size ({chunk_size})")
self.chunk_size = chunk_size
self.overlap = overlap
self.min_chunk_size = min_chunk_size
self.unit = unit
self.step_size = chunk_size - overlap
if unit == "tokens":
self.encoding = tiktoken.get_encoding("cl100k_base")
def _get_units(self, text: str) -> List:
"""Convert text to units (chars or tokens)."""
if self.unit == "tokens":
return self.encoding.encode(text)
return list(text)
def _units_to_text(self, units: List) -> str:
"""Convert units back to text."""
if self.unit == "tokens":
return self.encoding.decode(units)
return "".join(units)
def chunk(self, text: str, source_doc: str = "unknown") -> List[Chunk]:
"""Split text into fixed-size chunks with overlap."""
if not text or not text.strip():
return []
units = self._get_units(text)
chunks = []
start = 0
chunk_index = 0
while start < len(units):
end = min(start + self.chunk_size, len(units))
chunk_units = units[start:end]
chunk_content = self._units_to_text(chunk_units)
            if len(chunk_units) < self.min_chunk_size and chunks:
                # Merge the small trailing chunk into the previous one,
                # re-slicing from the previous chunk's start so the overlap
                # region is not duplicated in the merged content
                prev = chunks[-1]
                prev_start = start - self.step_size
                chunks[-1] = Chunk(
                    content=self._units_to_text(units[prev_start:end]),
                    chunk_id="",
                    source_doc=source_doc,
                    start_char=prev.start_char,
                    end_char=end if self.unit == "chars" else -1,
                    chunk_index=prev.chunk_index,
                    metadata={"strategy": "fixed_size", "merged_trailing": True}
                )
                break
chunks.append(Chunk(
content=chunk_content,
chunk_id="",
source_doc=source_doc,
start_char=start if self.unit == "chars" else -1,
end_char=end if self.unit == "chars" else -1,
chunk_index=chunk_index,
metadata={"strategy": "fixed_size", "unit": self.unit}
))
start += self.step_size
chunk_index += 1
return chunks
def estimate_cost(self, text: str, price_per_million: float = 0.02) -> Dict:
"""Estimate embedding costs including overlap overhead."""
chunks = self.chunk(text)
total_chars = sum(len(c.content) for c in chunks)
estimated_tokens = total_chars / 4 # Approximate
return {
"chunk_count": len(chunks),
"total_chars": total_chars,
"estimated_tokens": int(estimated_tokens),
"estimated_cost": (estimated_tokens / 1_000_000) * price_per_million,
"overhead_ratio": total_chars / len(text) if text else 0
}
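A usage sketch with synthetic text (the document name is invented for illustration):
# Synthetic corpus: 200 short sentences
text = " ".join(f"Sentence number {i} about renewable energy policy." for i in range(200))

chunker = FixedSizeChunker(chunk_size=500, overlap=100, unit="chars")
chunks = chunker.chunk(text, source_doc="demo.txt")

print(f"{len(chunks)} chunks; first spans chars {chunks[0].start_char}-{chunks[0].end_char}")
print(chunker.estimate_cost(text))  # chunk_count, estimated_tokens, estimated_cost, overhead_ratio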
Sentence-Based Chunking: Respecting Natural Boundaries
Fixed-size chunking's flaw: it cuts mid-sentence. "The quarterly revenue was" in one chunk and "$4.2 billion" in another means neither captures the complete thought.
import nltk
import re
from typing import List
class SentenceChunker:
"""
Sentence-aware chunking that groups sentences up to size limits.
Never breaks mid-sentence, preserving semantic units.
"""
def __init__(
self,
max_chunk_size: int = 1000,
min_chunk_size: int = 200,
overlap_sentences: int = 1
):
self.max_chunk_size = max_chunk_size
self.min_chunk_size = min_chunk_size
self.overlap_sentences = overlap_sentences
        try:
            self.sent_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
        except LookupError:
            nltk.download('punkt')
            self.sent_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
    def _split_sentences(self, text: str) -> List[str]:
        """Split text into sentences, handling common abbreviations."""
        # Shield abbreviation periods behind a placeholder so the tokenizer
        # doesn't treat them as sentence boundaries, then restore them
        text = re.sub(r'\b(Mr|Mrs|Dr|Prof|Inc|Ltd|Jr|Sr)\.\s', r'\1<DOT> ', text)
        sentences = self.sent_tokenizer.tokenize(text)
        sentences = [s.replace('<DOT>', '.') for s in sentences]
        return [s.strip() for s in sentences if s.strip()]
def chunk(self, text: str, source_doc: str = "unknown") -> List[Chunk]:
"""Group sentences into chunks respecting size limits."""
sentences = self._split_sentences(text)
if not sentences:
return []
chunks = []
current_sentences = []
current_size = 0
chunk_index = 0
for sentence in sentences:
sentence_size = len(sentence)
# Handle oversized sentences
if sentence_size > self.max_chunk_size:
if current_sentences:
chunks.append(self._create_chunk(current_sentences, chunk_index, source_doc))
chunk_index += 1
current_sentences = []
current_size = 0
chunks.append(Chunk(
content=sentence,
chunk_id="",
source_doc=source_doc,
start_char=-1, end_char=-1,
chunk_index=chunk_index,
metadata={"strategy": "sentence", "oversized": True}
))
chunk_index += 1
continue
if current_size + sentence_size > self.max_chunk_size and current_sentences:
chunks.append(self._create_chunk(current_sentences, chunk_index, source_doc))
chunk_index += 1
# Keep overlap sentences
overlap_start = max(0, len(current_sentences) - self.overlap_sentences)
current_sentences = current_sentences[overlap_start:]
current_size = sum(len(s) for s in current_sentences)
current_sentences.append(sentence)
current_size += sentence_size
if current_sentences:
chunks.append(self._create_chunk(current_sentences, chunk_index, source_doc))
return chunks
def _create_chunk(self, sentences: List[str], index: int, source_doc: str) -> Chunk:
content = " ".join(sentences)
return Chunk(
content=content,
chunk_id="",
source_doc=source_doc,
start_char=-1, end_char=-1,
chunk_index=index,
metadata={"strategy": "sentence", "sentence_count": len(sentences)}
)
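A usage sketch (the constructor downloads the NLTK punkt model on first run if it's missing; the source name is invented):
chunker = SentenceChunker(max_chunk_size=120, min_chunk_size=50, overlap_sentences=1)
text = ("The quarterly revenue was $4.2 billion. That beat analyst estimates. "
        "Dr. Smith presented the results. Growth was driven by the cloud unit. "
        "Guidance for next year remains unchanged.")
for c in chunker.chunk(text, source_doc="earnings.txt"):
    print(c.chunk_index, c.metadata["sentence_count"], repr(c.content))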
Semantic Chunking: Splitting by Meaning
Semantic chunking identifies where topics shift and splits there—like a skilled editor who knows exactly where one idea ends and another begins.
import numpy as np
from typing import List, Tuple, Callable
from dataclasses import dataclass
@dataclass
class SemanticBoundary:
position: int
similarity_drop: float
class SemanticChunker:
"""
Semantic chunking using embedding similarity to detect topic boundaries.
Process:
1. Split into sentences
2. Embed sentence windows
3. Find similarity drops between adjacent windows
4. Split at significant drops
"""
def __init__(
self,
embedding_fn: Callable[[List[str]], np.ndarray],
min_chunk_sentences: int = 3,
max_chunk_sentences: int = 20,
window_size: int = 2,
breakpoint_percentile: float = 25
):
self.embedding_fn = embedding_fn
self.min_chunk_sentences = min_chunk_sentences
self.max_chunk_sentences = max_chunk_sentences
self.window_size = window_size
self.breakpoint_percentile = breakpoint_percentile
def _cosine_similarity(self, a: np.ndarray, b: np.ndarray) -> float:
norm_a, norm_b = np.linalg.norm(a), np.linalg.norm(b)
if norm_a == 0 or norm_b == 0:
return 0.0
return np.dot(a, b) / (norm_a * norm_b)
def _split_sentences(self, text: str) -> List[str]:
import re
sentences = re.split(r'(?<=[.!?])\s+', text)
return [s.strip() for s in sentences if s.strip()]
def chunk(self, text: str, source_doc: str = "unknown") -> Tuple[List[Chunk], List[SemanticBoundary]]:
sentences = self._split_sentences(text)
if len(sentences) <= self.min_chunk_sentences:
return [Chunk(
content=text, chunk_id="", source_doc=source_doc,
start_char=0, end_char=len(text), chunk_index=0,
metadata={"strategy": "semantic"}
)], []
# Create windows and embed
windows = [" ".join(sentences[i:i+self.window_size])
for i in range(0, len(sentences), self.window_size)]
if len(windows) < 2:
return [Chunk(
content=text, chunk_id="", source_doc=source_doc,
start_char=0, end_char=len(text), chunk_index=0,
metadata={"strategy": "semantic"}
)], []
embeddings = self.embedding_fn(windows)
# Calculate similarities
similarities = [self._cosine_similarity(embeddings[i], embeddings[i+1])
for i in range(len(embeddings)-1)]
# Find boundaries at low similarity points
threshold = np.percentile(similarities, self.breakpoint_percentile)
boundaries = [SemanticBoundary((i+1) * self.window_size, 1-sim)
for i, sim in enumerate(similarities) if sim < threshold]
# Create chunks
chunks = []
positions = [0] + [b.position for b in boundaries] + [len(sentences)]
for i in range(len(positions) - 1):
start, end = positions[i], min(positions[i+1], len(sentences))
chunk_sentences = sentences[start:end]
# Enforce max size
while len(chunk_sentences) > self.max_chunk_sentences:
chunks.append(Chunk(
content=" ".join(chunk_sentences[:self.max_chunk_sentences]),
chunk_id="", source_doc=source_doc,
start_char=-1, end_char=-1, chunk_index=len(chunks),
metadata={"strategy": "semantic", "forced_split": True}
))
chunk_sentences = chunk_sentences[self.max_chunk_sentences:]
if chunk_sentences:
chunks.append(Chunk(
content=" ".join(chunk_sentences),
chunk_id="", source_doc=source_doc,
start_char=-1, end_char=-1, chunk_index=len(chunks),
metadata={"strategy": "semantic"}
))
return chunks, boundaries
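Here's a runnable sketch using a toy bag-of-words embedding as a stand-in for a real model (in practice you'd pass your embedding client's batch-encode function); with this setup, the shift from energy to football usually registers as a boundary:
import numpy as np

def toy_embed(texts):
    """Toy bag-of-words embedding; a stand-in for a real embedding model."""
    vocab = sorted({w for t in texts for w in t.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}
    vectors = np.zeros((len(texts), len(vocab)))
    for row, t in enumerate(texts):
        for w in t.lower().split():
            vectors[row, index[w]] += 1.0
    return vectors

text = ("Solar power subsidies grew across Europe. "
        "Solar panel installations doubled in Europe. "
        "Wind power expanded offshore in Europe. "
        "Meanwhile, the football season opened this week. "
        "Fans filled the football stadiums this week.")
chunker = SemanticChunker(embedding_fn=toy_embed, min_chunk_sentences=1, window_size=1)
chunks, boundaries = chunker.chunk(text, source_doc="mixed.txt")
for c in chunks:
    print(c.chunk_index, repr(c.content))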
Document-Specific Chunking: One Size Does Not Fit All
Different document types require specialized strategies. A Python file shouldn't be chunked like a novel.
Markdown Chunker
import re
from typing import List
class MarkdownChunker:
"""Markdown-aware chunking that respects document structure."""
HEADER_PATTERN = re.compile(r'^(#{1,6})\s+(.+)$', re.MULTILINE)
    CODE_BLOCK_PATTERN = re.compile(r'```[\s\S]*?```', re.MULTILINE)
def __init__(self, max_chunk_size: int = 1500, split_on_headers: List[int] = [1, 2]):
self.max_chunk_size = max_chunk_size
self.split_on_headers = split_on_headers
def chunk(self, content: str, source_doc: str = "unknown") -> List[Chunk]:
# Protect code blocks from splitting
code_blocks = {}
def protect(match):
key = f"__CODE_{len(code_blocks)}__"
code_blocks[key] = match.group(0)
return key
protected = self.CODE_BLOCK_PATTERN.sub(protect, content)
# Split on headers
sections = []
current_pos = 0
for match in self.HEADER_PATTERN.finditer(protected):
if len(match.group(1)) in self.split_on_headers:
if current_pos < match.start():
sections.append(protected[current_pos:match.start()].strip())
current_pos = match.start()
if current_pos < len(protected):
sections.append(protected[current_pos:].strip())
# Restore code blocks and create chunks
chunks = []
for section in sections:
for key, code in code_blocks.items():
section = section.replace(key, code)
if len(section) <= self.max_chunk_size:
chunks.append(Chunk(
content=section, chunk_id="", source_doc=source_doc,
start_char=-1, end_char=-1, chunk_index=len(chunks),
metadata={"strategy": "markdown"}
))
else:
# Split large sections by paragraphs
for para in section.split('\n\n'):
if para.strip():
chunks.append(Chunk(
content=para.strip(), chunk_id="", source_doc=source_doc,
start_char=-1, end_char=-1, chunk_index=len(chunks),
metadata={"strategy": "markdown", "split": True}
))
return chunks
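A usage sketch; the fenced code block survives the split intact (the file name is invented):
doc = (
    "# Setup\n\nInstall the package.\n\n"
    "```\npip install example\n```\n\n"
    "## Usage\n\nCall the API with your key.\n"
)
chunker = MarkdownChunker(max_chunk_size=500)
for c in chunker.chunk(doc, source_doc="readme.md"):
    print(c.chunk_index, repr(c.content))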
Code Chunker
import re
from typing import List
class CodeChunker:
"""Code-aware chunking that preserves functions and classes."""
PATTERNS = {
"python": {
"function": re.compile(r'^(async\s+)?def\s+\w+\s*\([^)]*\):', re.MULTILINE),
"class": re.compile(r'^class\s+\w+\s*(?:\([^)]*\))?:', re.MULTILINE),
},
"javascript": {
"function": re.compile(r'^(?:async\s+)?function\s+\w+\s*\([^)]*\)', re.MULTILINE),
"class": re.compile(r'^class\s+\w+', re.MULTILINE),
}
}
def __init__(self, max_chunk_size: int = 2000, language: str = "python"):
self.max_chunk_size = max_chunk_size
self.language = language
self.patterns = self.PATTERNS.get(language, self.PATTERNS["python"])
def chunk(self, content: str, source_doc: str = "unknown") -> List[Chunk]:
# Find all function/class boundaries
blocks = []
for block_type, pattern in self.patterns.items():
for match in pattern.finditer(content):
blocks.append({"type": block_type, "start": match.start()})
if not blocks:
return FixedSizeChunker(self.max_chunk_size, 200).chunk(content, source_doc)
        blocks.sort(key=lambda x: x["start"])
        # Keep module-level code (imports, docstrings) that precedes the first block
        if blocks[0]["start"] > 0:
            blocks.insert(0, {"type": "preamble", "start": 0})
chunks = []
for i, block in enumerate(blocks):
start = block["start"]
end = blocks[i+1]["start"] if i+1 < len(blocks) else len(content)
block_content = content[start:end].strip()
if len(block_content) <= self.max_chunk_size:
chunks.append(Chunk(
content=block_content, chunk_id="", source_doc=source_doc,
start_char=start, end_char=end, chunk_index=len(chunks),
metadata={"strategy": "code", "block_type": block["type"]}
))
else:
# Split large blocks
lines = block_content.split('\n')
current = []
for line in lines:
if sum(len(l) for l in current) + len(line) > self.max_chunk_size and current:
chunks.append(Chunk(
content='\n'.join(current), chunk_id="", source_doc=source_doc,
start_char=-1, end_char=-1, chunk_index=len(chunks),
metadata={"strategy": "code", "block_type": f"{block['type']}_part"}
))
current = current[-3:] # Keep context
current.append(line)
if current:
chunks.append(Chunk(
content='\n'.join(current), chunk_id="", source_doc=source_doc,
start_char=-1, end_char=-1, chunk_index=len(chunks),
metadata={"strategy": "code"}
))
return chunks
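A usage sketch; with the preamble handling above, module-level imports land in their own chunk rather than being dropped (the file name is invented):
source = (
    "import math\n\n"
    "def area(r):\n"
    "    return math.pi * r ** 2\n\n"
    "def circumference(r):\n"
    "    return 2 * math.pi * r\n"
)
chunker = CodeChunker(max_chunk_size=500, language="python")
for c in chunker.chunk(source, source_doc="geometry.py"):
    print(c.metadata.get("block_type"), repr(c.content))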
Universal Chunker
from pathlib import Path
from typing import List
class UniversalChunker:
"""Auto-detect document type and apply appropriate strategy."""
def __init__(self, default_chunk_size: int = 1000):
self.chunk_size = default_chunk_size
self.markdown = MarkdownChunker(max_chunk_size=default_chunk_size)
self.python = CodeChunker(max_chunk_size=default_chunk_size, language="python")
self.javascript = CodeChunker(max_chunk_size=default_chunk_size, language="javascript")
self.fallback = SentenceChunker(max_chunk_size=default_chunk_size)
def chunk(self, content: str, source: str) -> List[Chunk]:
ext = Path(source).suffix.lower()
if ext in ['.md', '.markdown']:
return self.markdown.chunk(content, source)
elif ext == '.py':
return self.python.chunk(content, source)
elif ext in ['.js', '.jsx', '.ts', '.tsx']:
return self.javascript.chunk(content, source)
else:
return self.fallback.chunk(content, source)
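A quick check that the extension-based dispatch works as intended (constructing the fallback SentenceChunker may trigger a one-time NLTK download):
chunker = UniversalChunker(default_chunk_size=800)
md = chunker.chunk("# Title\n\nSome prose about chunking.", source="notes.md")
py = chunker.chunk("def f():\n    return 1\n", source="util.py")
print(md[0].metadata["strategy"], py[0].metadata["strategy"])  # markdown code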
Strategy Selection Guide
def recommend_strategy(doc_type: str, query_complexity: str, budget_priority: bool = False) -> dict:
"""
Get chunking recommendation based on requirements.
Returns: {"strategy": str, "chunk_size": int, "overlap": int, "reasoning": str}
"""
if budget_priority:
return {
"strategy": "fixed_size",
"chunk_size": 1000,
"overlap": 50,
"reasoning": "Budget priority: minimal overhead"
}
if doc_type == "code":
return {
"strategy": "code_aware",
"chunk_size": 1500,
"overlap": 200,
"reasoning": "Code requires structure-aware chunking"
}
if doc_type == "markdown":
return {
"strategy": "markdown_aware",
"chunk_size": 1000,
"overlap": 100,
"reasoning": "Markdown structure should guide splits"
}
if query_complexity == "complex":
return {
"strategy": "semantic",
"chunk_size": 700,
"overlap": 100,
"reasoning": "Complex queries need semantic coherence"
}
return {
"strategy": "sentence",
"chunk_size": 800,
"overlap": 100,
"reasoning": "Best quality/cost balance for general text"
}
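For example, a markdown knowledge base serving simple lookup queries:
rec = recommend_strategy(doc_type="markdown", query_complexity="simple")
print(rec["strategy"], rec["chunk_size"], rec["overlap"])
# markdown_aware 1000 100
print(rec["reasoning"])
# Markdown structure should guide splits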
Data Engineer's ROI Lens: The Business Impact of Chunking
def analyze_chunking_roi(
monthly_documents: int,
avg_tokens_per_doc: int,
chunk_size: int = 512,
overlap_pct: float = 0.15,
embedding_price: float = 0.02, # per 1M tokens
support_ticket_cost: float = 25.0
) -> dict:
"""Calculate ROI impact of chunking decisions."""
    # Illustrative quality multipliers by strategy; modeling assumptions
    # for this ROI sketch, not measured industry benchmarks
    quality_scores = {
        "fixed_size": 0.70,
        "sentence": 0.82,
        "semantic": 0.93,
        "document_aware": 0.88
    }
step_size = int(chunk_size * (1 - overlap_pct))
chunks_per_doc = max(1, avg_tokens_per_doc // step_size)
total_chunks = monthly_documents * chunks_per_doc
# Embedding cost
embedded_tokens = total_chunks * chunk_size
embedding_cost = (embedded_tokens / 1_000_000) * embedding_price
results = {}
for strategy, quality in quality_scores.items():
        # Assumption: 5% of documents generate a support ticket at baseline;
        # better retrieval quality avoids a proportional share of them
        baseline_tickets = monthly_documents * 0.05
tickets_avoided = baseline_tickets * (quality - 0.5) * 2
savings = tickets_avoided * support_ticket_cost
results[strategy] = {
"monthly_embedding_cost": round(embedding_cost, 2),
"quality_score": quality,
"tickets_avoided": int(tickets_avoided),
"monthly_savings": round(savings, 2),
"net_benefit": round(savings - embedding_cost, 2)
}
return results
# Example analysis
if __name__ == "__main__":
roi = analyze_chunking_roi(
monthly_documents=10000,
avg_tokens_per_doc=2000
)
print("=== Monthly ROI by Strategy (10K docs, 2K tokens/doc) ===\n")
print(f"{'Strategy':<18} {'Cost':<10} {'Quality':<10} {'Savings':<12} {'Net Benefit'}")
print("-" * 60)
for strategy, data in sorted(roi.items(), key=lambda x: -x[1]['net_benefit']):
print(f"{strategy:<18} ${data['monthly_embedding_cost']:<9} "
f"{data['quality_score']:<10.0%} ${data['monthly_savings']:<11} "
f"${data['net_benefit']}")
Sample Output:
Strategy           Cost       Quality    Savings      Net Benefit
------------------------------------------------------------
semantic           $0.41      93%        $10750.0     $10749.59
document_aware     $0.41      88%        $9500.0      $9499.59
sentence           $0.41      82%        $8000.0      $7999.59
fixed_size         $0.41      70%        $5000.0      $4999.59
Key ROI Insights:
At 10,000 documents/month, upgrading from fixed-size to semantic chunking adds roughly $5,750 in monthly savings from reduced support tickets (about $69,000 annually) for essentially the same embedding cost, under this model's assumptions.
Key Takeaways
- Start with sentence-based chunking for most text content—it's the best quality/cost balance
- Use document-aware chunking for structured content (code, markdown)—structure matters
- Reserve semantic chunking for high-value use cases where retrieval quality directly impacts revenue
- Overlap is cheap insurance—10-20% adds minimal cost but prevents context loss
- Measure everything—track chunk sizes, retrieval accuracy, and costs to optimize over time
- Match chunk size to query patterns—detailed questions need smaller chunks; broad questions need larger ones
Your chunking strategy is the foundation of your RAG system. Get it right, and everything downstream works better.