Thousand Miles AI

Chunking Strategies That Actually Work — Why Your RAG App Retrieves Garbage

Fixed-size, recursive, semantic — everyone has an opinion on the 'best' chunking strategy. The 2026 benchmarks are in, and the results will surprise you. Here's what actually works and why.



The most boring part of your RAG pipeline is also the most consequential. Get chunking wrong, and nothing downstream can save you.


The Contract That Lied

Picture this. You're building a legal document assistant. A lawyer asks: "Is the company liable for damages in cases of force majeure?" Your RAG system retrieves a chunk that confidently states: "The company is liable for all damages arising from service interruptions." Clear answer, right?

Except... the original document said: "The company is liable for all damages arising from service interruptions, except in cases of force majeure as defined in Section 12." Your chunker, set to split every 500 tokens, sliced the sentence right between "interruptions" and "except." The exception — the most important part — ended up in the next chunk. That chunk didn't get retrieved because the query was about liability, not about Section 12 definitions.

One bad split. A completely wrong answer. And the user has no idea because the retrieved chunk looked perfectly valid.

This isn't a contrived example. This pattern plays out constantly in production RAG systems. Tables split in half. Lists separated from their headers. Paragraphs that say "as mentioned above" — but "above" is in a different chunk that wasn't retrieved. Chunking errors are silent, invisible, and devastating.

Why Should You Care?

Chunking is the first real decision in any RAG pipeline, and it's the one most developers spend the least time on. Everyone obsesses over embedding models, vector databases, and LLM selection — but the quality of your chunks puts a hard ceiling on everything else. You can't retrieve what you've destroyed.

And here's what makes it interesting: the 2026 benchmarks flipped a lot of assumptions on their head. The fanciest chunking methods? They're not winning. Understanding why requires understanding what each strategy actually does — and that's knowledge that shows up in both system design interviews and production debugging.

Let Me Back Up — What Is Chunking and Why Do We Need It?

Before your documents go into a vector database, they need to be broken into smaller pieces. There are two reasons for this.

First, embedding models have token limits. Most models cap out at 512 or 8,192 tokens. You can't embed a 50-page PDF as a single unit.

Second — and this is the less obvious reason — you want precision in retrieval. If your entire document is one big chunk, any query about any topic in that document will retrieve the whole thing. The LLM then has to find the needle in a haystack. Small, focused chunks mean the retriever can surface exactly the paragraph that answers the question.

But "small and focused" creates its own problem: the smaller the chunk, the more context it loses. A chunk that says "this approach" without telling you what "this" refers to is useless. The art of chunking is finding the sweet spot between precision and context.


The chunking step sits between your raw documents and the vector database. It determines the quality of everything downstream.

The Three Strategies — Explained Like You're Pair-Programming

Strategy 1: Fixed-Size Chunking

This is the "I don't want to think about it" approach. You pick a number — say 500 tokens — and split the text every 500 tokens. Done.

It's dead simple to implement, fast to run, and produces predictable chunk sizes. Most tutorials use this as the default, which is why most beginners start here.

The problem? It has zero awareness of your text's structure. It will split mid-sentence, mid-paragraph, mid-table. That liability clause we talked about? Fixed-size chunking is exactly how it gets destroyed.

The one saving grace is overlap. By repeating the last 50–100 tokens of each chunk at the beginning of the next one, you create a buffer zone where boundary information isn't completely lost. It's a patch, not a fix — but it helps more than you'd expect.
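The mechanics are easy to sketch. Here's a minimal version that works over a pre-tokenized list (a whitespace split stands in for a real tokenizer; the function name and defaults are illustrative, not from any library):

```python
def fixed_size_chunks(tokens, size=500, overlap=50):
    """Split a token list into fixed-size chunks, repeating the last
    `overlap` tokens of each chunk at the start of the next one."""
    chunks, step = [], size - overlap
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the final chunk already reaches the end of the text
    return chunks

# Whitespace split stands in for a real tokenizer here.
tokens = "The company is liable for all damages arising from interruptions".split()
```

With `size=4` and `overlap=1`, consecutive chunks share one token, so a sentence cut at a boundary still has a foothold in both chunks. That shared buffer is the whole trick.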

When it works: Flat, unstructured text like logs, transcripts, or scraped web content where there's no meaningful structure to preserve.

Strategy 2: Recursive Character Splitting

This is the strategy that actually won the 2026 benchmarks — and it's not even that complicated.

Instead of blindly splitting every N tokens, recursive splitting tries a hierarchy of separators. First, it tries to split on double newlines (paragraph breaks). If the resulting chunks are still too large, it splits on single newlines. Still too large? Sentences. Still too large? Words.

The key insight: it respects natural boundaries first and only gets more aggressive when it has to. A 500-token paragraph stays intact. A 2,000-token section gets split at paragraph boundaries, not mid-sentence.

Think of it like cutting a pizza. Fixed-size is a grid pattern — equal pieces but you cut through toppings. Recursive is cutting along the natural slice lines first, and only cutting slices in half if they're too big.

Most frameworks ship this out of the box; in LangChain it's called RecursiveCharacterTextSplitter, and LlamaIndex's default splitters work on the same principle. The default separator hierarchy is ["\n\n", "\n", " ", ""]: paragraphs, lines, words, characters.
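The core algorithm fits in a few lines. This is a simplified sketch of the idea, not LangChain's actual implementation (the real splitter also merges small adjacent pieces back up toward the chunk size, which this version skips):

```python
def recursive_split(text, max_len=512, separators=("\n\n", "\n", " ", "")):
    """Split text at the coarsest separator that yields pieces under
    max_len, recursing with finer separators only when needed."""
    if len(text) <= max_len:
        return [text]
    sep, *rest = separators
    if sep == "":
        # Last resort: hard character split, just like fixed-size.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]
    chunks = []
    for piece in text.split(sep):
        if len(piece) <= max_len:
            chunks.append(piece)  # natural boundary respected
        else:
            chunks.extend(recursive_split(piece, max_len, rest))
    return chunks
```

Notice that a paragraph under the limit is never touched; the finer separators only come into play for oversized pieces. That's the "cut along the slice lines first" behavior.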

When it works: Almost everything. The 2026 FloTorch benchmark tested seven strategies across thousands of documents, and recursive splitting at 512 tokens achieved the highest answer accuracy and retrieval F1 scores.

Strategy 3: Semantic Chunking

Okay, here's where it gets interesting — and controversial.

Semantic chunking uses embeddings to determine where to split. It embeds each sentence individually, then measures the similarity between consecutive sentences. When the similarity drops below a threshold — meaning the topic changed — it places a split there.

The idea is elegant: instead of splitting based on character count, you split based on meaning. Each chunk should be a coherent unit about one topic.

The problem? It's expensive and surprisingly inconsistent. You need to generate embeddings for every sentence just to decide where to split. For a large corpus, that means thousands of API calls or significant local compute before you've even started indexing.

And the 2026 benchmarks showed something counterintuitive: semantic chunking often produced worse retrieval than recursive splitting. Why? Because semantic chunks vary wildly in size. Some end up with 50 tokens (too small for meaningful embedding), others with 2,000+ tokens (too large for precise retrieval). The inconsistent size makes it harder for the retriever to compare chunks fairly.
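The split-on-similarity-drop loop itself is simple; the cost lives in the embedding calls. Here's a minimal sketch with a toy two-dimensional "embedder" standing in for a real embedding model, and `threshold` as the knob you'd have to tune per corpus:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def semantic_chunks(sentences, embed, threshold=0.7):
    """Group consecutive sentences; start a new chunk whenever the
    similarity between neighbouring embeddings drops below the
    threshold (i.e. the topic appears to have shifted)."""
    if not sentences:
        return []
    vectors = [embed(s) for s in sentences]
    chunks, current = [], [sentences[0]]
    for prev, vec, sent in zip(vectors, vectors[1:], sentences[1:]):
        if cosine(prev, vec) < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sent)
    chunks.append(" ".join(current))
    return chunks
```

In production, `embed` would be a call to an embedding model for every single sentence, which is exactly where the cost multiplier comes from. And nothing in this loop bounds chunk size, which is why the output varies so wildly.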


The same legal clause, chunked three ways. Fixed-size breaks it. Recursive preserves it. Semantic groups it by topic but may bundle too much.

When it works: Multi-topic narrative documents where topics shift unpredictably — research papers, long blog posts, interview transcripts. But only if you can afford the compute and are willing to tune the similarity threshold.

The 2026 Surprise — Why Simpler Is Winning

Here's what caught everyone off guard. When comprehensive benchmarks tested all these strategies head-to-head, the ranking was:

  1. Recursive splitting (512 tokens) — highest accuracy, highest retrieval F1
  2. Fixed-size (512 tokens with overlap) — close second
  3. Semantic chunking — middle of the pack
  4. Proposition-based chunking (using an LLM to rewrite text into atomic, self-contained statements) — expensive, and only marginally better on some tasks

The reason simpler methods won isn't that they're inherently superior — it's that they produce consistent, predictable chunk sizes. Embedding models and retrievers are optimized for chunks in the 256–512 token range. When your chunks are consistently in that range, the entire pipeline works more predictably.

Semantic and proposition-based methods also create 3–5x more chunks for the same corpus. More chunks means more embeddings, more storage, more compute, and — counterintuitively — more noise in retrieval. The cost multiplier compounds at every layer.

Even when semantic chunking does eke out a few points of accuracy on a particular corpus, does that justify 10x the processing cost? For most applications, no.

Mistakes That Bite — The Chunking Errors Nobody Talks About

"I'll just use the default 1,000-token chunks." This is the most common error. Most tutorials and framework defaults use 1,000 tokens. But most embedding models are optimized for 256–512 tokens. Larger chunks dilute the embedding — instead of representing one specific idea, they represent a fuzzy average of several ideas. Drop to 512 with 50-token overlap and measure the difference.

"Chunking is a one-time decision." Different document types need different strategies. Your API docs might work perfectly with recursive splitting, while your meeting transcripts might need semantic chunking. Don't apply one strategy to your entire corpus blindly. Profile your document types and test each one.

"Tables and code can be chunked like prose." They absolutely cannot. A table split in half is worse than useless — it's misleading. Code split mid-function is syntactically invalid. Extract tables and code blocks as separate units, preserve their structure, and add surrounding context (the header before the table, the function name, the paragraph that references it).

Now Go Break Something — Where to Go from Here

Here's a weekend experiment that'll teach you more about chunking than any article:

  • Take a document you know well — your project docs, college notes, anything where you can verify the answers.
  • Chunk it three ways using LangChain's splitters: CharacterTextSplitter (fixed), RecursiveCharacterTextSplitter (recursive), and SemanticChunker from langchain-experimental.
  • Ask the same 10 questions to a RAG pipeline using each chunking strategy.
  • Compare the retrieved chunks side by side. You'll immediately see where fixed-size destroys context, where recursive preserves it, and where semantic produces inconsistent sizes.
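If you want numbers rather than eyeballing, a tiny harness that reports chunk count and size spread makes the consistency argument visible. Pure Python, with two toy chunkers standing in for the real LangChain splitters:

```python
def compare_chunkers(text, chunkers):
    """Run several chunking functions over the same text and report
    chunk count plus min/avg/max length, so size consistency is visible."""
    report = {}
    for name, chunk_fn in chunkers.items():
        lengths = [len(c) for c in chunk_fn(text)]
        report[name] = {
            "chunks": len(lengths),
            "min": min(lengths),
            "avg": sum(lengths) // len(lengths),
            "max": max(lengths),
        }
    return report

# Toy stand-ins; swap in LangChain's splitters for the real experiment.
chunkers = {
    "fixed": lambda t: [t[i:i + 500] for i in range(0, len(t), 500)],
    "paragraph": lambda t: [p for p in t.split("\n\n") if p],
}
```

Run it over the same document with all three strategies and watch the min/max spread: the strategies with the tightest spread are usually the ones that retrieve best, which is exactly what the benchmarks found.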

For deeper exploration:

  • Search for "FloTorch RAG chunking benchmark 2026" — the full benchmark results with methodology
  • The LangChain docs have a great comparison page for all their text splitters
  • Check out the Weaviate blog's chunking guide — it has practical examples for different document types
  • For advanced work, look into "late chunking" — a newer approach where you embed the full document first and then chunk the embeddings, preserving long-range context

That legal document assistant that said "the company is liable for all damages"? It wasn't lying — it was reading the only chunk it had, and that chunk had been amputated mid-sentence. Swap to recursive splitting at 512 tokens with overlap, and the full clause — exception included — stays intact. The fix wasn't a better model or a smarter prompt. It was a better cut.


Author: Shibin
