DEV Community

Navas Herbert

Not all chunking is equal.

The chunking strategy you pick at ingestion quietly sets a ceiling on your RAG retrieval quality - no matter how good your embedding model is.

Six strategies, ranked from baseline to production-grade:

𝗙𝗶𝘅𝗲𝗱-𝗦𝗶𝘇𝗲 - split every N characters, regardless of meaning. Fastest to implement. Cuts mid-word, mid-sentence. Baseline only.
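To see why this is baseline only, here is a minimal sketch of fixed-size chunking. The function name `fixed_size_chunks` and the sample sentence are made up for illustration, not taken from any library.

```python
def fixed_size_chunks(text: str, size: int) -> list[str]:
    """Split text every `size` characters, ignoring words and sentences."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [text[i:i + size] for i in range(0, len(text), size)]


text = "Chunking decides what retrieval can find."
chunks = fixed_size_chunks(text, 10)
for chunk in chunks:
    # repr() makes the mid-word cuts and stray spaces visible
    print(repr(chunk))
```

The first chunk ends in the middle of "decides", which is exactly the kind of cut the embedding model then has to make sense of.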

𝗙𝗶𝘅𝗲𝗱 + 𝗢𝘃𝗲𝗿𝗹𝗮𝗽 - same as above, but consecutive chunks share a repeated tail. Reduces hard cuts. Still not semantically aware. A quick upgrade, not a real fix.
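A rough sketch of the overlap variant, assuming character-based sizes. `overlapping_chunks` and its parameters are illustrative names; real splitters often count tokens instead of characters.

```python
def overlapping_chunks(text: str, size: int, overlap: int) -> list[str]:
    """Fixed-size chunks where each chunk starts with the tail of the previous one."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    if not text:
        return []
    step = size - overlap
    # Stop early enough that the last chunk is never just a repeated tail.
    last_start = max(len(text) - overlap, 1)
    return [text[i:i + size] for i in range(0, last_start, step)]


text = "Chunking decides what retrieval can find."
chunks = overlapping_chunks(text, size=10, overlap=3)
for chunk in chunks:
    print(repr(chunk))
```

Each chunk repeats the last 3 characters of the one before it, so a phrase cut at one boundary has a chance of surviving whole in the neighbour. The cuts themselves are still blind to meaning.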

𝗥𝗲𝗰𝘂𝗿𝘀𝗶𝘃𝗲 𝗦𝗽𝗹𝗶𝘁 - tries natural separators in order: paragraph → sentence → word → character. Respects language structure. The default choice for most general-purpose RAG.
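The separator cascade can be sketched in plain Python. This is a simplified version written for this post, not the implementation from any specific library: the separator list is an assumption, and separators that fall on a chunk boundary are dropped rather than kept.

```python
# Assumed separator order: paragraph, sentence, word, then raw characters.
SEPARATORS = ("\n\n", ". ", " ", "")


def recursive_split(text: str, max_len: int, separators=SEPARATORS) -> list[str]:
    """Split on the coarsest separator first; recurse into pieces that are still too long."""
    if len(text) <= max_len:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard character cuts.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for part in text.split(sep):
        candidate = part if not current else current + sep + part
        if len(candidate) <= max_len:
            current = candidate  # keep merging small pieces
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(part) <= max_len:
            current = part
        else:
            # This piece alone is too long: try the next, finer separator.
            chunks.extend(recursive_split(part, max_len, finer))
    if current:
        chunks.append(current)
    return chunks


doc = (
    "First paragraph is short.\n\n"
    "Second paragraph is a bit longer. It has two sentences."
)
chunks = recursive_split(doc, max_len=40)
for chunk in chunks:
    print(repr(chunk))
```

The short paragraph stays whole, and the long one is split at its sentence boundary instead of at character 40.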

𝗠𝗮𝗿𝗸𝗱𝗼𝘄𝗻-𝗔𝘄𝗮𝗿𝗲 - splits on document headers and sections. Each chunk is one logical topic. Free section metadata for filtering. Requires structured documents.
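A minimal sketch of header-based splitting, assuming ATX-style `#` headers. The `markdown_chunks` function and its dictionary keys are hypothetical, and the sketch does not handle `#` lines inside fenced code blocks.

```python
import re

# Matches "# Title" through "###### Title".
HEADER = re.compile(r"^(#{1,6})\s+(.*)$")


def markdown_chunks(markdown: str) -> list[dict]:
    """Return one chunk per header section, with the header kept as metadata."""
    # Text before the first header goes into a section with no title.
    sections = [{"section": None, "level": 0, "lines": []}]
    for line in markdown.splitlines():
        match = HEADER.match(line)
        if match:
            sections.append({
                "section": match.group(2).strip(),
                "level": len(match.group(1)),
                "lines": [],
            })
        else:
            sections[-1]["lines"].append(line)

    chunks = []
    for s in sections:
        text = "\n".join(s["lines"]).strip()
        if text:  # skip headers with no body
            chunks.append({"section": s["section"], "level": s["level"], "text": text})
    return chunks


doc = "# Setup\nInstall the package.\n\n## Usage\nCall the API.\n"
chunks = markdown_chunks(doc)
for chunk in chunks:
    print(chunk)
```

The `section` field is the "free metadata" part: it can be stored alongside the embedding and used as a filter at query time.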

𝗦𝗲𝗺𝗮𝗻𝘁𝗶𝗰 - embeds every sentence, measures similarity between consecutive ones, cuts where similarity drops sharply. Chunks align with actual meaning. Costs more at ingest time.
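Here is a toy sketch of the idea. To keep it runnable without a model, `toy_embed` is a stand-in bag-of-words vector, not a real embedding, and the fixed `threshold` is a simplification of "drops sharply" (a percentile over all similarity gaps is a common alternative). All names are illustrative.

```python
import math
import re
from collections import Counter


def toy_embed(sentence: str) -> Counter:
    """Stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def semantic_chunks(sentences: list[str], threshold: float) -> list[str]:
    """Start a new chunk whenever similarity to the previous sentence falls below threshold."""
    if not sentences:
        return []
    vectors = [toy_embed(s) for s in sentences]
    groups = [[sentences[0]]]
    for prev, cur, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine(prev, cur) < threshold:
            groups.append([sentence])  # topic shift: cut here
        else:
            groups[-1].append(sentence)
    return [" ".join(group) for group in groups]


sentences = [
    "Cats purr when they are happy.",
    "Happy cats purr loudly.",
    "Interest rates rose again.",
    "Banks raised interest rates.",
]
chunks = semantic_chunks(sentences, threshold=0.3)
for chunk in chunks:
    print(repr(chunk))
```

With a real model, every sentence costs one embedding call at ingest time, which is where the extra cost comes from.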

𝗔𝗴𝗲𝗻𝘁𝗶𝗰 / 𝗣𝗿𝗼𝗽𝗼𝘀𝗶𝘁𝗶𝗼𝗻 - an LLM rewrites each piece into a self-contained atomic fact. No dangling pronouns, no lost context. Typically the highest retrieval quality of the six. Also the most expensive, since every piece needs an LLM call at ingest time.
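The shape of this pipeline can be sketched without committing to a provider. Everything here is an assumption for illustration: the prompt wording, the `proposition_chunks` name, and especially `fake_llm`, which returns a canned answer so the example runs offline. In practice you would pass a function that calls your actual model.

```python
from typing import Callable

# Hypothetical prompt; real prompts are usually longer and include examples.
PROMPT = (
    "Rewrite the passage as a list of standalone facts, one per line. "
    "Replace every pronoun with the thing it refers to.\n\n"
    "Passage:\n{passage}"
)


def proposition_chunks(passage: str, llm: Callable[[str], str]) -> list[str]:
    """Ask an LLM for atomic facts and return one chunk per fact."""
    response = llm(PROMPT.format(passage=passage))
    # Strip list markers and blank lines from the model's answer.
    lines = (line.strip().lstrip("-*• ").strip() for line in response.splitlines())
    return [line for line in lines if line]


def fake_llm(prompt: str) -> str:
    """Offline stand-in for a real model call; returns a canned response."""
    return (
        "- Marie Curie won the Nobel Prize in Physics in 1903.\n"
        "- Marie Curie won the Nobel Prize in Chemistry in 1911.\n"
    )


passage = (
    "Marie Curie won the Nobel Prize in Physics in 1903. "
    "She won it again in 1911, this time in Chemistry."
)
propositions = proposition_chunks(passage, fake_llm)
for fact in propositions:
    print(fact)
```

Notice that the second fact no longer says "She" or "it", so it can be retrieved on its own without the first sentence for context.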

The one thing worth remembering across all of them: overlap is a band-aid, not a fix. It reduces boundary cuts but doesn't make chunks semantically coherent.

Choose your strategy based on what you're optimizing for - ingestion speed, retrieval quality, or cost.

Sharing what I'm learning in public.

#RAG #AI #Python #LLM #LearningInPublic
