The chunking strategy you pick at ingestion quietly sets a ceiling on how good your RAG retrieval can be, no matter how good your embedding model is.
Six strategies, ranked from baseline to production-grade:
𝗙𝗶𝘅𝗲𝗱-𝗦𝗶𝘇𝗲 - split every N characters, regardless of meaning. Fastest to implement. Cuts mid-word, mid-sentence. Baseline only.
𝗙𝗶𝘅𝗲𝗱 + 𝗢𝘃𝗲𝗿𝗹𝗮𝗽 - same as above, but consecutive chunks share a repeated tail. Reduces hard cuts. Still not semantically aware. A quick upgrade, not a real fix.
𝗥𝗲𝗰𝘂𝗿𝘀𝗶𝘃𝗲 𝗦𝗽𝗹𝗶𝘁 - tries natural separators in order: paragraph → sentence → word → character. Respects language structure. The default choice for most general-purpose RAG.
𝗠𝗮𝗿𝗸𝗱𝗼𝘄𝗻-𝗔𝘄𝗮𝗿𝗲 - splits on document headers and sections. Each chunk is one logical topic. Free section metadata for filtering. Requires structured documents.
𝗦𝗲𝗺𝗮𝗻𝘁𝗶𝗰 - embeds every sentence, measures similarity between consecutive ones, cuts where similarity drops sharply. Chunks align with actual meaning. Costs more at ingest time.
𝗔𝗴𝗲𝗻𝘁𝗶𝗰 / 𝗣𝗿𝗼𝗽𝗼𝘀𝗶𝘁𝗶𝗼𝗻 - an LLM rewrites each piece into self-contained atomic facts. No dangling pronouns, no lost context. Usually the strongest retrieval quality of the six. Most expensive, since every chunk costs an LLM call at ingest time.
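To make two of these concrete, here is a minimal sketch of the recursive and semantic strategies in plain Python. Treat it as an illustration under assumptions, not a reference implementation: the separator list, the `max_len` and `threshold` parameters, and the `toy_embed` word-count function are all hypothetical stand-ins. A real pipeline would call an actual embedding model and would probably pick the cut-off from the distribution of similarities rather than a fixed number.

```python
import math

# Assumed separator order: paragraph -> sentence -> word -> character.
SEPARATORS = ["\n\n", ". ", " ", ""]


def recursive_split(text, max_len, separators=SEPARATORS):
    """Split text into chunks of at most max_len characters,
    trying the coarsest separator first and falling back to finer ones."""
    if len(text) <= max_len:
        return [text] if text else []
    sep, finer = separators[0], separators[1:]
    if sep == "":
        # Last resort: hard character cut.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= max_len:
            current = candidate  # keep merging small pieces
            continue
        if current:
            chunks.append(current)  # the separator at this boundary is dropped
        if len(piece) <= max_len:
            current = piece
        else:
            # Piece is still too long: retry it with the finer separators.
            chunks.extend(recursive_split(piece, max_len, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def semantic_chunks(sentences, embed, threshold=0.3):
    """Group consecutive sentences, starting a new chunk whenever the
    similarity between neighbours falls below threshold."""
    if not sentences:
        return []
    vectors = [embed(s) for s in sentences]
    groups = [[sentences[0]]]
    for prev, cur, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine(prev, cur) < threshold:
            groups.append([])  # similarity dropped: cut here
        groups[-1].append(sentence)
    return [" ".join(group) for group in groups]


def toy_embed(sentence, vocab=("cat", "dog", "pet", "tax", "invoice", "payment")):
    """Hypothetical stand-in for an embedding model: counts of a few fixed words."""
    words = sentence.lower().replace(".", "").split()
    return [float(words.count(w)) for w in vocab]


if __name__ == "__main__":
    doc = "Alpha beta. Gamma delta.\n\nEpsilon zeta eta theta."
    print(recursive_split(doc, 12))
    print(semantic_chunks(
        ["The cat is a pet.", "The dog is a pet.",
         "The invoice lists the tax.", "The payment covers the tax."],
        toy_embed,
    ))
```

The recursive splitter only falls back to a finer separator for pieces that are still too long, which is why it keeps paragraphs and sentences whole whenever they fit. The semantic splitter needs one embedding per sentence, which is where the extra ingest cost comes from.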
The one thing worth remembering across all of them: overlap is a band-aid, not a fix. It reduces boundary cuts but doesn't make chunks semantically coherent.
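A small sketch makes the point visible. The function below is a hypothetical helper (the name `fixed_size_chunks` and its `size` and `overlap` parameters are mine, not from any library) that does plain fixed-size splitting, with an optional repeated tail between consecutive chunks.

```python
def fixed_size_chunks(text, size, overlap=0):
    """Split text every `size` characters; consecutive chunks share
    `overlap` characters. No awareness of words or sentences."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # this chunk already reaches the end of the text
    return chunks


if __name__ == "__main__":
    text = "Chunking decides retrieval quality."
    for chunk in fixed_size_chunks(text, size=10, overlap=0):
        print(repr(chunk))
    print("--- with overlap ---")
    for chunk in fixed_size_chunks(text, size=10, overlap=3):
        print(repr(chunk))
```

With overlap, each chunk starts with the last three characters of the previous one, so text near a boundary appears twice. The boundaries themselves still land in the middle of words, which is exactly the limitation: you get redundancy, not coherence.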
Choose your strategy based on what you're optimizing for: ingestion speed, retrieval quality, or cost.
Sharing what I'm learning in public.
