Why "How You Cut" Matters as Much as "What You Cut"
In the first three articles, we built a working RAG pipeline and tuned the core parameters. But if you look closely at the retrieval results, you may notice a strange phenomenon:
The answer is clearly in the document, yet the Retriever can't find it. Or it finds it, but the answer is cut in half — the LLM only sees the first half of the sentence.
The problem usually lies in the chunking step.
Chunking is essentially an information splitting strategy — how you divide a 500-page book, how large each piece is, and where you make the cuts directly determines whether the reader (here, the Retriever) can quickly find what they need.
In this article, we'll process the same technical document with four different strategies so you can see the dramatic differences that "how you cut" makes.
📎 Source Code: All experiment code is open-sourced at llm-in-action/04-chunking-strategies. Clone it to reproduce the results.
Four Chunking Strategies at a Glance
Before diving in, here's a quick reference table to build intuition:
| Strategy | Core Idea | Pros | Cons |
|---|---|---|---|
| Fixed Size | Cut at fixed character intervals, like scissors cutting paper | Simple, uniform chunk sizes | May cut through sentences, poor semantic integrity |
| Recursive Character | Try separators in priority order: paragraph → line → sentence → word | Balances semantics and uniformity | Limited Chinese support (uses English punctuation) |
| Semantic Chunking | Compute semantic similarity between adjacent sentences, cut where similarity drops | Highly semantically coherent chunks | Requires Embedding API, higher cost |
| Document Structure | Split by Markdown/HTML heading hierarchy | Preserves document structure, retrieved chunks carry chapter context | Only works for structured documents |
Experimental Design
Test Document and Source Code
The full runnable code is available at llm-in-action/04-chunking-strategies, including:
- chunking_compare.py — the 4-strategy comparison script
- data/sample-tech-doc.md — a sample Markdown technical document
- .env.example — environment variable template (SemanticChunker requires an Embedding API)
Test Document
We'll use a ~5,400-character Markdown technical document titled "Microservices Architecture Design Guide," containing 7 top-level chapters with multiple level-2 and level-3 headings, covering service decomposition, communication protocols, data consistency, observability, security, and deployment.
Strategy Configurations
| Strategy | Key Configuration |
|---|---|
| Fixed Size | CharacterTextSplitter(chunk_size=512, chunk_overlap=50) |
| Recursive Character | RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=50, separators=["\n\n", "\n", ". ", " ", ""]) |
| Semantic | SemanticChunker(embeddings, breakpoint_threshold_type="percentile", breakpoint_threshold_amount=85, sentence_split_regex=r"(?<=[。!?.?!])\s+", buffer_size=0) |
| Document Structure | MarkdownHeaderTextSplitter(headers_to_split_on=[("#", "Header 1"), ("##", "Header 2"), ("###", "Header 3")]) |
About buffer_size=0: SemanticChunker defaults to concatenating neighboring sentences before computing embeddings (buffer_size=1 means 1 sentence on each side). But SiliconFlow's BGE model limits single inputs to fewer than 512 tokens, so concatenated inputs often exceed it. Setting buffer_size to 0 makes each sentence independent — we lose some context, but it runs stably.
Strategy 1: Fixed-Size Chunking
The Principle
The most brute-force approach: cut at fixed-length intervals regardless of content.
Imagine using scissors to snip every 512 characters. Simple and efficient, but you might cut right through the middle of a sentence.
The Code
from langchain_text_splitters import CharacterTextSplitter
splitter = CharacterTextSplitter(
chunk_size=512,
chunk_overlap=50,
length_function=len,
separator="\n", # Prefer line breaks; hard-cut if none exist
)
chunks = splitter.split_documents(documents)
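All the snippets in this article assume a documents list has already been loaded. Here's a minimal sketch using LangChain's TextLoader and the sample file from the repo (the UTF-8 encoding is our assumption):

```python
from langchain_community.document_loaders import TextLoader

# Load the sample document from the repo; encoding assumed to be UTF-8
loader = TextLoader("data/sample-tech-doc.md", encoding="utf-8")
documents = loader.load()
```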
Results
| Metric | Value |
|---|---|
| Chunk count | 12 |
| Average length | 453.5 chars |
| Max length | 506 chars |
| Min length | 128 chars |
First 3 chunks:
Chunk 1 (489 chars):
# Microservices Architecture Design Guide This article covers...
Chunk 2 (504 chars):
- **Read Service vs Write Service**: In read-heavy scenarios...
Chunk 3 (457 chars):
**gRPC** is based on HTTP/2 and Protocol Buffers. Advantages:...
The Problem:
Notice how Chunk 2 starts: - **Read Service vs Write Service**... — this is the middle of a list item. Fixed-size chunking brutally cut off the list at the end of the previous chunk, so Chunk 2 starts with an incomplete list item. If the user asks "What are the advantages of read-write separation?", the Retriever might return this chunk, but the LLM sees incomplete information.
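You can spot these mid-structure cuts programmatically. Here's a rough heuristic sketch (the "looks truncated" test is our own invention, purely illustrative):

```python
# Flag chunks whose first line looks like a continuation: a bare list marker
# or a lowercase first letter often means the previous chunk was cut short.
def looks_truncated(text: str) -> bool:
    first_line = text.lstrip().splitlines()[0] if text.strip() else ""
    return first_line.startswith(("-", "*")) or first_line[:1].islower()

for i, chunk in enumerate(chunks, start=1):
    if looks_truncated(chunk.page_content):
        print(f"Chunk {i} may start mid-structure: {chunk.page_content[:50]!r}")
```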
Strategy 2: Recursive Character Chunking
The Principle
Slightly smarter than fixed-size: it has a priority list of separators and tries them in order — first by paragraph (\n\n), then by line (\n), then by sentence (". "), and finally by word (a single space).
Like an experienced editor: prefer cutting at paragraph boundaries, fall back to sentence boundaries if necessary, and never cut in the middle of a word.
The Code
from langchain_text_splitters import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(
chunk_size=512,
chunk_overlap=50,
length_function=len,
separators=["\n\n", "\n", ". ", " ", ""],
)
chunks = splitter.split_documents(documents)
Results
| Metric | Value |
|---|---|
| Chunk count | 13 |
| Average length | 431.5 chars |
| Max length | 507 chars |
| Min length | 88 chars |
First 3 chunks:
Chunk 1 (441 chars):
# Microservices Architecture Design Guide This article covers...
Chunk 2 (452 chars):
### 1.2 Split by Technical Characteristics Besides business boundaries...
Chunk 3 (457 chars):
The most common synchronous communication methods between microservices are...
Improvement over fixed-size:
Chunk 2 now starts with ### 1.2 Split by Technical Characteristics — a complete heading. Recursive character chunking successfully cut at a heading boundary instead of slicing through a list item.
But note that the separators list uses ". " (an English period followed by a space), so for Chinese documents it won't split on Chinese full stops (。). Its behavior on Chinese text is therefore close to fixed-size, relying mainly on \n\n and \n.
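If your corpus is Chinese, a common workaround is to add Chinese punctuation ahead of the English separators. A sketch (the exact separator order is a judgment call):

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Chinese sentence-ending punctuation comes before the English ". "
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
    separators=["\n\n", "\n", "。", "！", "？", ". ", " ", ""],
)
```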
Strategy 3: Semantic Chunking
The Principle
The previous two strategies cut by length. Semantic chunking cuts by meaning.
Here's how it works:
- Split the document into sentences
- Compute each sentence's embedding (semantic vector)
- Compare semantic similarity between adjacent sentences
- If similarity suddenly drops (below the threshold), cut there
Imagine watching a movie where the scene suddenly shifts from an office to a beach — that's a semantic boundary. Semantic chunking recognizes these "scene changes" and ensures each chunk discusses one coherent topic.
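To make "cut where similarity drops" concrete, here is an illustrative sketch of the percentile breakpoint rule (not SemanticChunker's actual internals, just the idea behind breakpoint_threshold_type="percentile"):

```python
import numpy as np

def find_breakpoints(embeddings: np.ndarray, percentile: float = 85.0) -> list:
    """Return indices i where a cut goes between sentence i and sentence i+1."""
    # Cosine similarity between each pair of adjacent sentence embeddings
    norms = np.linalg.norm(embeddings, axis=1)
    sims = np.sum(embeddings[:-1] * embeddings[1:], axis=1) / (norms[:-1] * norms[1:])
    # Cut wherever the cosine distance exceeds the 85th percentile of all distances
    distances = 1.0 - sims
    threshold = np.percentile(distances, percentile)
    return [i for i, d in enumerate(distances) if d > threshold]
```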
The Code
import os

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings(
model="BAAI/bge-large-zh-v1.5",
api_key=os.getenv("EMBEDDING_API_KEY"),
base_url=os.getenv("EMBEDDING_API_BASE", "https://api.siliconflow.cn/v1"),
chunk_size=32, # SiliconFlow limits batch_size to 32
)
# Key: Custom Chinese sentence-splitting regex, or SemanticChunker defaults to English punctuation only
splitter = SemanticChunker(
embeddings=embeddings,
breakpoint_threshold_type="percentile",
breakpoint_threshold_amount=85,
sentence_split_regex=r"(?<=[。!?.?!])\s+",
buffer_size=0, # Avoid exceeding 512-token limit when concatenating sentences
)
chunks = splitter.split_documents(documents)
Pitfalls We Hit
While implementing semantic chunking, we ran into three issues:
Pitfall 1: Batch size exceeded
ValueError: input batch size 1000 > maximum allowed batch size 32
→ Fix: OpenAIEmbeddings(chunk_size=32)
Pitfall 2: Single-input token limit exceeded
Error code: 413 - input must have less than 512 tokens
→ Fix: Set buffer_size=0 to prevent SemanticChunker from concatenating neighboring sentences
Pitfall 3: Empty strings cause 400 errors
Error code: 400 - The parameter is invalid
→ Fix: Subclass SemanticChunker and override _get_single_sentences_list to filter empty strings
import re
from typing import List

class FilteredSemanticChunker(SemanticChunker):
    def _get_single_sentences_list(self, text: str) -> List[str]:
        # Drop whitespace-only sentences before they reach the Embedding API
        sentences = re.split(self.sentence_split_regex, text)
        return [s for s in sentences if s.strip()]
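Instantiate FilteredSemanticChunker with the same arguments as the SemanticChunker above, and the empty strings never reach the API.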
Results
| Metric | Value |
|---|---|
| Chunk count | 9 (fewest) |
| Average length | 590.9 chars |
| Max length | 2047 chars |
| Min length | 17 chars |
Key Finding:
Semantic chunking produces the fewest chunks (9), but with extreme size variation — smallest 17 chars, largest 2047 chars. This confirms it's truly grouping by semantic boundaries: semantically similar sentences are merged into large chunks, while topic transitions become tiny chunks.
For example, the entire "Service Communication" chapter (REST vs gRPC vs message queues) was aggregated into one 1,189-character chunk — because it all discusses the same topic. Transition sentences between chapters became tiny fragments (like a 28-character decision tree snippet).
Strategy 4: Document Structure Chunking (Markdown Header)
The Principle
The first three strategies work blindfolded — they can't see the document's structure and rely purely on text features. Document structure chunking, in contrast, keeps its eyes open: it recognizes Markdown #, ##, ### headings and splits strictly by heading hierarchy.
Each chunk's boundary is a heading boundary: starts at one heading, ends before the next heading at the same or higher level.
The Code
from langchain_text_splitters import MarkdownHeaderTextSplitter
splitter = MarkdownHeaderTextSplitter(
headers_to_split_on=[
("#", "Header 1"),
("##", "Header 2"),
("###", "Header 3"),
],
strip_headers=False, # Keep headings inside chunk content
)
# MarkdownHeaderTextSplitter takes the raw Markdown string, not Document objects
text = documents[0].page_content
chunks = splitter.split_text(text)
Results
| Metric | Value |
|---|---|
| Chunk count | 20 (most) |
| Average length | 266.5 chars |
| Max length | 402 chars |
| Min length | 71 chars |
Key Finding:
Document structure chunking produces the most chunks (20), but each one carries an "ID card" — metadata recording exactly which chapter it came from:
chunk.metadata = {
"Header 1": "Microservices Architecture Design Guide",
"Header 2": "1. Service Decomposition Strategy",
"Header 3": "1.1 Split by Business Boundary (DDD)"
}
This means during retrieval, you get not just the content but also its chapter origin. This is extremely valuable for citation tracing ("The answer comes from Chapter X of the document").
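Here's a minimal sketch of such a citation string, assembled from the metadata shown above (the "A > B > C" join format is our choice):

```python
def provenance(chunk) -> str:
    # Build "Doc > Chapter > Section" from whichever header levels exist
    levels = ("Header 1", "Header 2", "Header 3")
    return " > ".join(chunk.metadata[k] for k in levels if k in chunk.metadata)

# e.g. "Microservices Architecture Design Guide > 1. Service Decomposition
#       Strategy > 1.1 Split by Business Boundary (DDD)"
```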
Side-by-Side Comparison
Statistics Summary
| Strategy | Chunks | Avg Length | Median | Max | Min |
|---|---|---|---|---|---|
| Fixed Size | 12 | 453.5 | 476.5 | 506 | 128 |
| Recursive Character | 13 | 431.5 | 457.0 | 507 | 88 |
| Semantic | 9 | 590.9 | 422.0 | 2047 | 17 |
| Document Structure | 20 | 266.5 | 259.0 | 402 | 71 |
Retrieval Difference for the Same Query
Suppose the user asks: "What are the anti-patterns of microservice decomposition?"
| Strategy | Retrieved Chunk | Issue |
|---|---|---|
| Fixed Size | Chunk 4 (contains partial anti-pattern content, but starts mid-sentence) | List item starts in the middle; LLM lacks full context |
| Recursive Character | Chunk 5 (fully contains "1.3 Common Anti-patterns" section) | Good, but may truncate if the section is long |
| Semantic | Chunk 3 (aggregates anti-patterns + some following content) | May include irrelevant content |
| Document Structure | Chunk 6 (exactly matches "### 1.3 Common Anti-patterns") | Best — precise structural match |
Strategy Selection Decision Matrix
| Scenario | Recommended Strategy | Reasoning |
|---|---|---|
| General technical docs (PDF/Word) | Recursive Character | Most reliable baseline, no special formatting required |
| Markdown / Papers / Books | Document Structure | Preserves chapter structure, retrievable with provenance |
| Terminology-dense docs (legal/medical) | Semantic Chunking | Semantically coherent chunks, reduces cross-topic noise |
| Ultra-high-speed chunking (real-time) | Fixed Size | Zero computation overhead, pure string operations |
| Code documentation | Recursive Character + custom separators | Split by function/class boundaries |
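For the last row (code documentation), you don't have to hand-write the separators: langchain_text_splitters ships per-language presets that prefer class/function boundaries. A sketch for Python sources (chunk sizes illustrative):

```python
from langchain_text_splitters import Language, RecursiveCharacterTextSplitter

# The PYTHON preset splits preferentially at class and def boundaries
code_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=512,
    chunk_overlap=50,
)
```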
Selection Advice
Step 1: Start with recursive character chunking as your baseline
↓
Step 2: If documents are Markdown/HTML, try document structure chunking
↓
Step 3: If retrieval quality is unsatisfactory, upgrade to semantic chunking
(highest cost but best quality)
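Steps 1 and 2 also combine well: split by Markdown headers first, then recursively split any section that is still too long. A sketch reusing the text string from Strategy 4 (chunk sizes illustrative):

```python
from langchain_text_splitters import (
    MarkdownHeaderTextSplitter,
    RecursiveCharacterTextSplitter,
)

header_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "Header 1"), ("##", "Header 2"), ("###", "Header 3")],
    strip_headers=False,
)
char_splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=50)

sections = header_splitter.split_text(text)       # structure-aware first pass
chunks = char_splitter.split_documents(sections)  # cap any oversized section
```

The header metadata survives the second pass, so you keep the provenance benefits while enforcing a size ceiling.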
Summary
This article used the same document and four strategies to show you how "how you cut" affects RAG quality:
- Fixed Size: Simple but brutal. Good for rapid prototyping.
- Recursive Character: The most universal baseline. Sufficient for 80% of scenarios.
- Semantic Chunking: Best quality but highest cost. Use when precision is critical.
- Document Structure: Best choice for structured documents. Retrieved chunks carry built-in context.
Key Takeaway: There is no perfect chunking strategy — only the strategy that fits your document type and business scenario. In real projects, use the comparison script from this article, run it on your own documents, and let the data guide your decision.