RAG Series (4): Document Processing — From Raw Files to High-Quality Chunks

Why "How You Cut" Matters as Much as "What You Cut"

In the first three articles, we built a working RAG pipeline and tuned the core parameters. But if you look closely at the retrieval results, you may notice a strange phenomenon:

The answer is clearly in the document, yet the Retriever can't find it. Or it finds it, but the answer is cut in half — the LLM only sees the first half of the sentence.

The problem usually lies in the chunking step.

Chunking is essentially an information splitting strategy — how you divide a 500-page book, how large each piece is, and where you make the cuts directly determines whether the reader (here, the Retriever) can quickly find what they need.

In this article, we'll process the same technical document with four different strategies so you can see the dramatic differences that "how you cut" makes.

📎 Source Code: All experiment code is open-sourced at llm-in-action/04-chunking-strategies. Clone it to reproduce the results.


Four Chunking Strategies at a Glance

Before diving in, here's a quick reference table to build intuition:

| Strategy | Core Idea | Pros | Cons |
|---|---|---|---|
| Fixed Size | Cut at fixed character intervals, like scissors cutting paper | Simple, uniform chunk sizes | May cut through sentences, poor semantic integrity |
| Recursive Character | Try separators in priority order: paragraph → line → sentence → word | Balances semantics and uniformity | Limited Chinese support (uses English punctuation) |
| Semantic Chunking | Compute semantic similarity between adjacent sentences, cut where similarity drops | Highly semantically coherent chunks | Requires Embedding API, higher cost |
| Document Structure | Split by Markdown/HTML heading hierarchy | Preserves document structure, retrieved chunks carry chapter context | Only works for structured documents |

Experimental Design

Source Code

The full runnable code is available at llm-in-action/04-chunking-strategies, including:

  • chunking_compare.py — The 4-strategy comparison script
  • data/sample-tech-doc.md — Sample Markdown technical document
  • .env.example — Environment variable template (SemanticChunker requires an Embedding API)

Test Document

We'll use a ~5,400-character Markdown technical document titled "Microservices Architecture Design Guide," containing 7 top-level chapters with multiple level-2 and level-3 headings, covering service decomposition, communication protocols, data consistency, observability, security, and deployment.

Strategy Configurations

| Strategy | Key Configuration |
|---|---|
| Fixed Size | `CharacterTextSplitter(chunk_size=512, chunk_overlap=50)` |
| Recursive Character | `RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=50, separators=["\n\n", "\n", ". ", " ", ""])` |
| Semantic | `SemanticChunker(embeddings, breakpoint_threshold_type="percentile", breakpoint_threshold_amount=85, sentence_split_regex=r"(?<=[。！？.?!])\s+", buffer_size=0)` |
| Document Structure | `MarkdownHeaderTextSplitter(headers_to_split_on=[("#", "Header 1"), ("##", "Header 2"), ("###", "Header 3")])` |

About `buffer_size=0`: SemanticChunker defaults to concatenating neighboring sentences before computing embeddings (`buffer_size=1` means one sentence on each side). But SiliconFlow's BGE model limits a single input to fewer than 512 tokens, so the concatenated input often exceeds the limit. Setting it to 0 makes each sentence independent — we lose some context, but it runs stably.


Strategy 1: Fixed-Size Chunking

The Principle

The most brute-force approach: cut at fixed-length intervals regardless of content.

Imagine using scissors to snip every 512 characters. Simple and efficient, but you might cut right through the middle of a sentence.

The Code

```python
from langchain_text_splitters import CharacterTextSplitter

splitter = CharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
    length_function=len,
    separator="\n",  # Prefer line breaks; hard-cut if none exist
)
chunks = splitter.split_documents(documents)
```

Results

| Metric | Value |
|---|---|
| Chunk count | 12 |
| Average length | 453.5 chars |
| Max length | 506 chars |
| Min length | 128 chars |

First 3 chunks:

```text
Chunk 1 (489 chars):
# Microservices Architecture Design Guide This article covers...

Chunk 2 (504 chars):
- **Read Service vs Write Service**: In read-heavy scenarios...

Chunk 3 (457 chars):
**gRPC** is based on HTTP/2 and Protocol Buffers. Advantages:...
```

The Problem:

Notice how Chunk 2 starts: `- **Read Service vs Write Service**...` — this is the middle of a list. Fixed-size chunking sliced the list apart at the end of the previous chunk, so Chunk 2 opens with an orphaned list item. If the user asks "What are the advantages of read-write separation?", the Retriever might return this chunk, but the LLM sees incomplete information.


Strategy 2: Recursive Character Chunking

The Principle

Slightly smarter than fixed-size: it has a priority list of separators and tries them in order — first by paragraph (`\n\n`), then by line (`\n`), then by sentence (`. `), and finally by word (a single space, `" "`).

Like an experienced editor: prefer cutting at paragraph boundaries, fall back to sentence boundaries if necessary, and never cut in the middle of a word.

The Code

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
    length_function=len,
    separators=["\n\n", "\n", ". ", " ", ""],
)
chunks = splitter.split_documents(documents)
```

Results

| Metric | Value |
|---|---|
| Chunk count | 13 |
| Average length | 431.5 chars |
| Max length | 507 chars |
| Min length | 88 chars |

First 3 chunks:

```text
Chunk 1 (441 chars):
# Microservices Architecture Design Guide  This article covers...

Chunk 2 (452 chars):
### 1.2 Split by Technical Characteristics  Besides business boundaries...

Chunk 3 (457 chars):
The most common synchronous communication methods between microservices are...
```

Improvement over fixed-size:

Chunk 2 now starts with `### 1.2 Split by Technical Characteristics` — a complete heading. Recursive character chunking successfully cut at a heading boundary instead of slicing through a list item.

But note that the separators list uses `. ` (English period plus space), so it won't split on the Chinese full stop (。). On Chinese documents its behavior is therefore close to fixed-size, relying mainly on `\n\n` and `\n`.
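If your corpus is Chinese, a hedged workaround is to extend the separator list yourself. The exact list below is an assumption to tune for your own documents, and `keep_separator="end"` (supported in recent versions of langchain-text-splitters) prevents the punctuation itself from being dropped:

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Add fullwidth sentence punctuation ahead of the English fallbacks
zh_splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
    separators=["\n\n", "\n", "。", "！", "？", ". ", " ", ""],
    keep_separator="end",  # Keep the punctuation at the end of each piece
)
chunks = zh_splitter.split_documents(documents)
```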


Strategy 3: Semantic Chunking

The Principle

The previous two strategies cut by length. Semantic chunking cuts by meaning.

Here's how it works:

  1. Split the document into sentences
  2. Compute each sentence's embedding (semantic vector)
  3. Compare semantic similarity between adjacent sentences
  4. If similarity suddenly drops (below the threshold), cut there

Imagine watching a movie where the scene suddenly shifts from an office to a beach — that's a semantic boundary. Semantic chunking recognizes these "scene changes" and ensures each chunk discusses one coherent topic.
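To make the breakpoint idea concrete, here is a minimal, illustrative sketch of steps 2–4 (not SemanticChunker's actual implementation). `embed` is a hypothetical function returning one vector per sentence; the 85th-percentile threshold mirrors the configuration we use below:

```python
import numpy as np

def find_breakpoints(sentences, embed, percentile=85):
    """Return indices after which to cut, based on semantic distance spikes."""
    vectors = np.array([embed(s) for s in sentences])  # one embedding per sentence
    norms = np.linalg.norm(vectors, axis=1)
    # Cosine distance between each sentence and the next
    distances = np.array([
        1 - vectors[i] @ vectors[i + 1] / (norms[i] * norms[i + 1])
        for i in range(len(vectors) - 1)
    ])
    threshold = np.percentile(distances, percentile)
    # A distance above the threshold marks a topic shift -> chunk boundary
    return [i for i, d in enumerate(distances) if d >= threshold]
```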

The Code

```python
import os

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(
    model="BAAI/bge-large-zh-v1.5",
    api_key=os.getenv("EMBEDDING_API_KEY"),
    base_url=os.getenv("EMBEDDING_API_BASE", "https://api.siliconflow.cn/v1"),
    chunk_size=32,  # SiliconFlow limits batch_size to 32
)

# Key: a custom Chinese sentence-splitting regex; by default
# SemanticChunker only recognizes English punctuation
splitter = SemanticChunker(
    embeddings=embeddings,
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=85,
    sentence_split_regex=r"(?<=[。！？.?!])\s+",
    buffer_size=0,  # Avoid exceeding the 512-token limit when concatenating sentences
)
chunks = splitter.split_documents(documents)
```

Pitfalls We Hit

While implementing semantic chunking, we ran into three issues:

Pitfall 1: Batch size exceeded

```text
ValueError: input batch size 1000 > maximum allowed batch size 32
```

→ Fix: `OpenAIEmbeddings(chunk_size=32)`

Pitfall 2: Single-input token limit exceeded

```text
Error code: 413 - input must have less than 512 tokens
```

→ Fix: Set `buffer_size=0` to prevent SemanticChunker from concatenating neighboring sentences

Pitfall 3: Empty strings cause 400 errors

```text
Error code: 400 - The parameter is invalid
```

→ Fix: Subclass `SemanticChunker` and override `_get_single_sentences_list` to filter out empty strings:

```python
import re
from typing import List
from langchain_experimental.text_splitter import SemanticChunker

class FilteredSemanticChunker(SemanticChunker):
    def _get_single_sentences_list(self, text: str) -> List[str]:
        # Drop empty/whitespace-only sentences so they never reach the API
        sentences = re.split(self.sentence_split_regex, text)
        return [s for s in sentences if s.strip()]
```

Results

| Metric | Value |
|---|---|
| Chunk count | 9 (fewest) |
| Average length | 590.9 chars |
| Max length | 2047 chars |
| Min length | 17 chars |

Key Finding:

Semantic chunking produces the fewest chunks (9), but with extreme size variation — smallest 17 chars, largest 2047 chars. This confirms it's truly grouping by semantic boundaries: semantically similar sentences are merged into large chunks, while topic transitions become tiny chunks.

For example, the entire "Service Communication" chapter (REST vs gRPC vs message queues) was aggregated into one 1,189-character chunk — because it all discusses the same topic. Transition sentences between chapters became tiny fragments (like a 28-character decision tree snippet).


Strategy 4: Document Structure Chunking (Markdown Header)

The Principle

The first three strategies work blindfolded: they know nothing about the document's structure and rely purely on text features. Document structure chunking, in contrast, keeps its eyes open: it recognizes Markdown `#`, `##`, and `###` headings and splits strictly by heading hierarchy.

Each chunk's boundary is a heading boundary: starts at one heading, ends before the next heading at the same or higher level.

The Code

```python
from langchain_text_splitters import MarkdownHeaderTextSplitter

splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[
        ("#", "Header 1"),
        ("##", "Header 2"),
        ("###", "Header 3"),
    ],
    strip_headers=False,  # Keep headings inside chunk content
)
chunks = splitter.split_text(text)
```

Results

| Metric | Value |
|---|---|
| Chunk count | 20 (most) |
| Average length | 266.5 chars |
| Max length | 402 chars |
| Min length | 71 chars |

Key Finding:

Document structure chunking produces the most chunks (20), but each one carries an "ID card" — metadata recording the heading path it belongs to:

```python
chunk.metadata = {
    "Header 1": "Microservices Architecture Design Guide",
    "Header 2": "1. Service Decomposition Strategy",
    "Header 3": "1.1 Split by Business Boundary (DDD)"
}
```

This means during retrieval, you get not just the content but also its chapter origin. This is extremely valuable for citation tracing ("The answer comes from Chapter X of the document").
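As a sketch of that citation flow, the helper below (hypothetical, not part of LangChain) joins the header metadata into a provenance string; the key names match the `headers_to_split_on` mapping above:

```python
def format_citation(chunk) -> str:
    # Walk the header levels recorded by MarkdownHeaderTextSplitter
    trail = [
        chunk.metadata[key]
        for key in ("Header 1", "Header 2", "Header 3")
        if key in chunk.metadata
    ]
    return "Source: " + " > ".join(trail)

# e.g. "Source: Microservices Architecture Design Guide >
#       1. Service Decomposition Strategy >
#       1.1 Split by Business Boundary (DDD)"
```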


Side-by-Side Comparison

Statistics Summary

| Strategy | Chunks | Avg Length | Median | Max | Min |
|---|---|---|---|---|---|
| Fixed Size | 12 | 453.5 | 476.5 | 506 | 128 |
| Recursive Character | 13 | 431.5 | 457.0 | 507 | 88 |
| Semantic | 9 | 590.9 | 422.0 | 2047 | 17 |
| Document Structure | 20 | 266.5 | 259.0 | 402 | 71 |

Retrieval Difference for the Same Query

Suppose the user asks: "What are the anti-patterns of microservice decomposition?"

| Strategy | Retrieved Chunk | Issue |
|---|---|---|
| Fixed Size | Chunk 4 (contains partial anti-pattern content, but starts mid-sentence) | List item starts in the middle; LLM lacks full context |
| Recursive Character | Chunk 5 (fully contains "1.3 Common Anti-patterns" section) | Good, but may truncate if the section is long |
| Semantic | Chunk 3 (aggregates anti-patterns + some following content) | May include irrelevant content |
| Document Structure | Chunk 6 (exactly matches "### 1.3 Common Anti-patterns") | Best — precise structural match |

Strategy Selection Decision Matrix

| Scenario | Recommended Strategy | Reasoning |
|---|---|---|
| General technical docs (PDF/Word) | Recursive Character | Most reliable baseline, no special formatting required |
| Markdown / Papers / Books | Document Structure | Preserves chapter structure, retrievable with provenance |
| Terminology-dense docs (legal/medical) | Semantic Chunking | Semantically coherent chunks, reduces cross-topic noise |
| Ultra-high-speed chunking (real-time) | Fixed Size | Zero computation overhead, pure string operations |
| Code documentation | Recursive Character + custom separators | Split by function/class boundaries (see the sketch below) |
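For the code-documentation row, one option is RecursiveCharacterTextSplitter's built-in language presets. The sketch below assumes Python sources and reuses the size settings from our experiments:

```python
from langchain_text_splitters import Language, RecursiveCharacterTextSplitter

# from_language preselects language-aware separators (for Python these
# include "\nclass " and "\ndef "), so cuts land on function/class boundaries
code_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=512,
    chunk_overlap=50,
)
chunks = code_splitter.split_documents(documents)
```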

Selection Advice

```text
Step 1: Start with recursive character chunking as your baseline
    ↓
Step 2: If documents are Markdown/HTML, try document structure chunking
    ↓
Step 3: If retrieval quality is unsatisfactory, upgrade to semantic chunking
        (highest cost but best quality)
```

Summary

This article used the same document and four strategies to show you how "how you cut" affects RAG quality:

  • Fixed Size: Simple but brutal. Good for rapid prototyping.
  • Recursive Character: The most universal baseline. Sufficient for 80% of scenarios.
  • Semantic Chunking: Best quality but highest cost. Use when precision is critical.
  • Document Structure: Best choice for structured documents. Retrieved chunks carry built-in context.

Key Takeaway: There is no perfect chunking strategy — only the strategy that fits your document type and business scenario. In real projects, use the comparison script from this article, run it on your own documents, and let the data guide your decision.
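As a starting point for that comparison, here is a minimal sketch that reproduces the statistics reported above; names like `fixed_splitter` and `recursive_splitter` are placeholders for the splitters configured earlier:

```python
def chunk_stats(name: str, chunks) -> None:
    # Mirror the metrics used in this article: count, average, max, min
    lengths = [len(c.page_content) for c in chunks]
    print(f"{name}: {len(lengths)} chunks, "
          f"avg {sum(lengths) / len(lengths):.1f}, "
          f"max {max(lengths)}, min {min(lengths)}")

for name, splitter in [("fixed", fixed_splitter), ("recursive", recursive_splitter)]:
    chunk_stats(name, splitter.split_documents(documents))
```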
