RAG Series (4): Document Processing — From Raw Files to High-Quality Chunks

Why "How You Cut" Matters as Much as "What You Cut"

In the first three articles, we built a working RAG pipeline and tuned the core parameters. But if you look closely at the retrieval results, you may notice a strange phenomenon:

The answer is clearly in the document, yet the Retriever can't find it. Or it finds it, but the answer is cut in half — the LLM only sees the first half of the sentence.

The problem usually lies in the chunking step.

Chunking is essentially an information splitting strategy — how you divide a 500-page book, how large each piece is, and where you make the cuts directly determines whether the reader (here, the Retriever) can quickly find what they need.

In this article, we'll process the same technical document with four different strategies so you can see the dramatic differences that "how you cut" makes.

📎 Source Code: All experiment code is open-sourced at llm-in-action/04-chunking-strategies. Clone it to reproduce the results.


Four Chunking Strategies at a Glance

Before diving in, here's a quick reference table to build intuition:

| Strategy | Core Idea | Pros | Cons |
|---|---|---|---|
| Fixed Size | Cut at fixed character intervals, like scissors cutting paper | Simple, uniform chunk sizes | May cut through sentences, poor semantic integrity |
| Recursive Character | Try separators in priority order: paragraph → line → sentence → word | Balances semantics and uniformity | Limited Chinese support (uses English punctuation) |
| Semantic Chunking | Compute semantic similarity between adjacent sentences, cut where similarity drops | Highly semantically coherent chunks | Requires Embedding API, higher cost |
| Document Structure | Split by Markdown/HTML heading hierarchy | Preserves document structure, retrieved chunks carry chapter context | Only works for structured documents |

Experimental Design

Source Code

The full runnable code is available at llm-in-action/04-chunking-strategies, including:

  • chunking_compare.py — The 4-strategy comparison script
  • data/sample-tech-doc.md — Sample Markdown technical document
  • .env.example — Environment variable template (SemanticChunker requires an Embedding API)

Test Document

We'll use a ~5,400-character Markdown technical document titled "Microservices Architecture Design Guide," containing 7 top-level chapters with multiple level-2 and level-3 headings, covering service decomposition, communication protocols, data consistency, observability, security, and deployment.

Strategy Configurations

| Strategy | Key Configuration |
|---|---|
| Fixed Size | `CharacterTextSplitter(chunk_size=512, chunk_overlap=50)` |
| Recursive Character | `RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=50, separators=["\n\n", "\n", ". ", " ", ""])` |
| Semantic | `SemanticChunker(embeddings, breakpoint_threshold_type="percentile", breakpoint_threshold_amount=85, sentence_split_regex=r"(?<=[。！？.?!])\s+", buffer_size=0)` |
| Document Structure | `MarkdownHeaderTextSplitter(headers_to_split_on=[("#", "Header 1"), ("##", "Header 2"), ("###", "Header 3")])` |

About `buffer_size=0`: SemanticChunker defaults to concatenating neighboring sentences before computing embeddings (`buffer_size=1` means one sentence on each side). But SiliconFlow's BGE model limits a single input to fewer than 512 tokens, so the concatenated input often exceeds the limit. Setting it to 0 makes each sentence independent — we lose some context, but it runs stably.


Strategy 1: Fixed-Size Chunking

The Principle

The most brute-force approach: cut at fixed-length intervals regardless of content.

Imagine using scissors to snip every 512 characters. Simple and efficient, but you might cut right through the middle of a sentence.

The Code

```python
from langchain_text_splitters import CharacterTextSplitter

splitter = CharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
    length_function=len,
    separator="\n",  # Prefer line breaks; hard-cut if none exist
)
chunks = splitter.split_documents(documents)
```

Results

| Metric | Value |
|---|---|
| Chunk count | 12 |
| Average length | 453.5 chars |
| Max length | 506 chars |
| Min length | 128 chars |

First 3 chunks:

```text
Chunk 1 (489 chars):
# Microservices Architecture Design Guide This article covers...

Chunk 2 (504 chars):
- **Read Service vs Write Service**: In read-heavy scenarios...

Chunk 3 (457 chars):
**gRPC** is based on HTTP/2 and Protocol Buffers. Advantages:...
```

The Problem:

Notice how Chunk 2 starts: `- **Read Service vs Write Service**...` — this is the middle of a list. Fixed-size chunking sliced the list apart at the end of the previous chunk, so Chunk 2 opens with an orphaned list item. If the user asks "What are the advantages of read-write separation?", the Retriever might return this chunk, but the LLM sees incomplete information.


Strategy 2: Recursive Character Chunking

The Principle

Slightly smarter than fixed-size: it has a priority list of separators and tries them in order — first by paragraph (`\n\n`), then by line (`\n`), then by sentence (`. `), and finally by word (a single space, `" "`).

Like an experienced editor: prefer cutting at paragraph boundaries, fall back to sentence boundaries if necessary, and never cut in the middle of a word.

The Code

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
    length_function=len,
    separators=["\n\n", "\n", ". ", " ", ""],
)
chunks = splitter.split_documents(documents)
```

Results

| Metric | Value |
|---|---|
| Chunk count | 13 |
| Average length | 431.5 chars |
| Max length | 507 chars |
| Min length | 88 chars |

First 3 chunks:

```text
Chunk 1 (441 chars):
# Microservices Architecture Design Guide  This article covers...

Chunk 2 (452 chars):
### 1.2 Split by Technical Characteristics  Besides business boundaries...

Chunk 3 (457 chars):
The most common synchronous communication methods between microservices are...
```

Improvement over fixed-size:

Chunk 2 now starts with `### 1.2 Split by Technical Characteristics` — a complete heading. Recursive character chunking successfully cut at a heading boundary instead of slicing through a list item.

But note that the separators list uses `. ` (English period plus space), so it won't split on the Chinese full stop (。). On Chinese documents its behavior is therefore close to fixed-size, relying mainly on `\n\n` and `\n`.
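If your corpus is Chinese, a hedged workaround is to extend the separator list yourself. The exact list below is an assumption to tune for your own documents, and `keep_separator="end"` (supported in recent versions of langchain-text-splitters) prevents the punctuation itself from being dropped:

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Add fullwidth sentence punctuation ahead of the English fallbacks
zh_splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
    separators=["\n\n", "\n", "。", "！", "？", ". ", " ", ""],
    keep_separator="end",  # Keep the punctuation at the end of each piece
)
chunks = zh_splitter.split_documents(documents)
```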


Strategy 3: Semantic Chunking

The Principle

The previous two strategies cut by length. Semantic chunking cuts by meaning.

Here's how it works:

  1. Split the document into sentences
  2. Compute each sentence's embedding (semantic vector)
  3. Compare semantic similarity between adjacent sentences
  4. If similarity suddenly drops (below the threshold), cut there

Imagine watching a movie where the scene suddenly shifts from an office to a beach — that's a semantic boundary. Semantic chunking recognizes these "scene changes" and ensures each chunk discusses one coherent topic.
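To make the breakpoint idea concrete, here is a minimal, illustrative sketch of steps 2–4 (not SemanticChunker's actual implementation). `embed` is a hypothetical function returning one vector per sentence; the 85th-percentile threshold mirrors the configuration we use below:

```python
import numpy as np

def find_breakpoints(sentences, embed, percentile=85):
    """Return indices after which to cut, based on semantic distance spikes."""
    vectors = np.array([embed(s) for s in sentences])  # one embedding per sentence
    norms = np.linalg.norm(vectors, axis=1)
    # Cosine distance between each sentence and the next
    distances = np.array([
        1 - vectors[i] @ vectors[i + 1] / (norms[i] * norms[i + 1])
        for i in range(len(vectors) - 1)
    ])
    threshold = np.percentile(distances, percentile)
    # A distance above the threshold marks a topic shift -> chunk boundary
    return [i for i, d in enumerate(distances) if d >= threshold]
```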

The Code

```python
import os

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(
    model="BAAI/bge-large-zh-v1.5",
    api_key=os.getenv("EMBEDDING_API_KEY"),
    base_url=os.getenv("EMBEDDING_API_BASE", "https://api.siliconflow.cn/v1"),
    chunk_size=32,  # SiliconFlow limits batch_size to 32
)

# Key: a custom Chinese sentence-splitting regex; by default
# SemanticChunker only recognizes English punctuation
splitter = SemanticChunker(
    embeddings=embeddings,
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=85,
    sentence_split_regex=r"(?<=[。！？.?!])\s+",
    buffer_size=0,  # Avoid exceeding the 512-token limit when concatenating sentences
)
chunks = splitter.split_documents(documents)
```

Pitfalls We Hit

While implementing semantic chunking, we ran into three issues:

Pitfall 1: Batch size exceeded

```text
ValueError: input batch size 1000 > maximum allowed batch size 32
```

→ Fix: `OpenAIEmbeddings(chunk_size=32)`

Pitfall 2: Single-input token limit exceeded

```text
Error code: 413 - input must have less than 512 tokens
```

→ Fix: Set `buffer_size=0` to prevent SemanticChunker from concatenating neighboring sentences

Pitfall 3: Empty strings cause 400 errors

```text
Error code: 400 - The parameter is invalid
```

→ Fix: Subclass `SemanticChunker` and override `_get_single_sentences_list` to filter out empty strings:

```python
import re
from typing import List
from langchain_experimental.text_splitter import SemanticChunker

class FilteredSemanticChunker(SemanticChunker):
    def _get_single_sentences_list(self, text: str) -> List[str]:
        # Drop empty/whitespace-only sentences so they never reach the API
        sentences = re.split(self.sentence_split_regex, text)
        return [s for s in sentences if s.strip()]
```

Results

| Metric | Value |
|---|---|
| Chunk count | 9 (fewest) |
| Average length | 590.9 chars |
| Max length | 2047 chars |
| Min length | 17 chars |

Key Finding:

Semantic chunking produces the fewest chunks (9), but with extreme size variation — smallest 17 chars, largest 2047 chars. This confirms it's truly grouping by semantic boundaries: semantically similar sentences are merged into large chunks, while topic transitions become tiny chunks.

For example, the entire "Service Communication" chapter (REST vs gRPC vs message queues) was aggregated into one 1,189-character chunk — because it all discusses the same topic. Transition sentences between chapters became tiny fragments (like a 28-character decision tree snippet).


Strategy 4: Document Structure Chunking (Markdown Header)

The Principle

The first three strategies work blindfolded: they know nothing about the document's structure and rely purely on text features. Document structure chunking, in contrast, keeps its eyes open: it recognizes Markdown `#`, `##`, and `###` headings and splits strictly by heading hierarchy.

Each chunk's boundary is a heading boundary: starts at one heading, ends before the next heading at the same or higher level.

The Code

```python
from langchain_text_splitters import MarkdownHeaderTextSplitter

splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[
        ("#", "Header 1"),
        ("##", "Header 2"),
        ("###", "Header 3"),
    ],
    strip_headers=False,  # Keep headings inside chunk content
)
chunks = splitter.split_text(text)
```

Results

| Metric | Value |
|---|---|
| Chunk count | 20 (most) |
| Average length | 266.5 chars |
| Max length | 402 chars |
| Min length | 71 chars |

Key Finding:

Document structure chunking produces the most chunks (20), but each one carries an "ID card" — metadata recording the heading path it belongs to:

```python
chunk.metadata = {
    "Header 1": "Microservices Architecture Design Guide",
    "Header 2": "1. Service Decomposition Strategy",
    "Header 3": "1.1 Split by Business Boundary (DDD)"
}
```

This means during retrieval, you get not just the content but also its chapter origin. This is extremely valuable for citation tracing ("The answer comes from Chapter X of the document").
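As a sketch of that citation flow, the helper below (hypothetical, not part of LangChain) joins the header metadata into a provenance string; the key names match the `headers_to_split_on` mapping above:

```python
def format_citation(chunk) -> str:
    # Walk the header levels recorded by MarkdownHeaderTextSplitter
    trail = [
        chunk.metadata[key]
        for key in ("Header 1", "Header 2", "Header 3")
        if key in chunk.metadata
    ]
    return "Source: " + " > ".join(trail)

# e.g. "Source: Microservices Architecture Design Guide >
#       1. Service Decomposition Strategy >
#       1.1 Split by Business Boundary (DDD)"
```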


Side-by-Side Comparison

Statistics Summary

| Strategy | Chunks | Avg Length | Median | Max | Min |
|---|---|---|---|---|---|
| Fixed Size | 12 | 453.5 | 476.5 | 506 | 128 |
| Recursive Character | 13 | 431.5 | 457.0 | 507 | 88 |
| Semantic | 9 | 590.9 | 422.0 | 2047 | 17 |
| Document Structure | 20 | 266.5 | 259.0 | 402 | 71 |

Retrieval Difference for the Same Query

Suppose the user asks: "What are the anti-patterns of microservice decomposition?"

| Strategy | Retrieved Chunk | Issue |
|---|---|---|
| Fixed Size | Chunk 4 (contains partial anti-pattern content, but starts mid-sentence) | List item starts in the middle; LLM lacks full context |
| Recursive Character | Chunk 5 (fully contains "1.3 Common Anti-patterns" section) | Good, but may truncate if the section is long |
| Semantic | Chunk 3 (aggregates anti-patterns + some following content) | May include irrelevant content |
| Document Structure | Chunk 6 (exactly matches "### 1.3 Common Anti-patterns") | Best — precise structural match |

Strategy Selection Decision Matrix

| Scenario | Recommended Strategy | Reasoning |
|---|---|---|
| General technical docs (PDF/Word) | Recursive Character | Most reliable baseline, no special formatting required |
| Markdown / Papers / Books | Document Structure | Preserves chapter structure, retrievable with provenance |
| Terminology-dense docs (legal/medical) | Semantic Chunking | Semantically coherent chunks, reduces cross-topic noise |
| Ultra-high-speed chunking (real-time) | Fixed Size | Zero computation overhead, pure string operations |
| Code documentation | Recursive Character + custom separators | Split by function/class boundaries (see the sketch below) |
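For the code-documentation row, one option is RecursiveCharacterTextSplitter's built-in language presets. The sketch below assumes Python sources and reuses the size settings from our experiments:

```python
from langchain_text_splitters import Language, RecursiveCharacterTextSplitter

# from_language preselects language-aware separators (for Python these
# include "\nclass " and "\ndef "), so cuts land on function/class boundaries
code_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=512,
    chunk_overlap=50,
)
chunks = code_splitter.split_documents(documents)
```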

Selection Advice

```text
Step 1: Start with recursive character chunking as your baseline
    ↓
Step 2: If documents are Markdown/HTML, try document structure chunking
    ↓
Step 3: If retrieval quality is unsatisfactory, upgrade to semantic chunking
        (highest cost but best quality)
```

Summary

This article used the same document and four strategies to show you how "how you cut" affects RAG quality:

  • Fixed Size: Simple but brutal. Good for rapid prototyping.
  • Recursive Character: The most universal baseline. Sufficient for 80% of scenarios.
  • Semantic Chunking: Best quality but highest cost. Use when precision is critical.
  • Document Structure: Best choice for structured documents. Retrieved chunks carry built-in context.

Key Takeaway: There is no perfect chunking strategy — only the strategy that fits your document type and business scenario. In real projects, use the comparison script from this article, run it on your own documents, and let the data guide your decision.
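As a starting point for that comparison, here is a minimal sketch that reproduces the statistics reported above; names like `fixed_splitter` and `recursive_splitter` are placeholders for the splitters configured earlier:

```python
def chunk_stats(name: str, chunks) -> None:
    # Mirror the metrics used in this article: count, average, max, min
    lengths = [len(c.page_content) for c in chunks]
    print(f"{name}: {len(lengths)} chunks, "
          f"avg {sum(lengths) / len(lengths):.1f}, "
          f"max {max(lengths)}, min {min(lengths)}")

for name, splitter in [("fixed", fixed_splitter), ("recursive", recursive_splitter)]:
    chunk_stats(name, splitter.split_documents(documents))
```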
