
Alex Cloudstar

Originally published at alexcloudstar.com

RAG Chunking Strategies In Production 2026: What Actually Survives Real Documents And Real Queries

The first RAG system I shipped chunked every document at 512 tokens with a 50 token overlap, because that was the example in the tutorial I was reading at three in the morning. It worked well enough to ship. It worked poorly enough that two weeks later a customer support engineer pinged me with a screenshot of the assistant confidently citing a policy document, except the cited paragraph was the second half of one policy glued to the first half of an unrelated one. The model had retrieved a chunk that crossed a section boundary, and the chunk read like a single coherent rule that did not exist anywhere in the source. Fixing that one bug took longer than building the original retriever.

That was a few years ago. The pattern has not changed. Teams still ship RAG systems where the LLM is sophisticated, the embedding model is fine, the vector store is overkill for the data volume, and the chunker is a one-line call to a default splitter that tears documents apart at arbitrary character offsets. The retrieval looks like it is working in the demo, because the demo uses clean Wikipedia paragraphs. It stops working the moment the documents are real, which means messy, inconsistent, structurally meaningful, and full of edge cases the default chunker has never seen.

By 2026 the production patterns for chunking have settled. They are not glamorous. They are mostly about respecting the structure the document already has, sizing chunks to match how the embedding model thinks, and making the retrieval shape match the queries you actually expect. This post is what I would tell my past self before that 3 a.m. tutorial, and what I would build into any retrieval pipeline before its first real user.

Chunking Is The Hidden Half Of RAG

The framing most teams start with is that RAG is about retrieval and generation, with chunking somewhere in the wiring. That framing is wrong. The chunker decides what answers can possibly be found, because the unit of retrieval is the chunk. If the right answer lives in a span the chunker split in half, the retriever cannot return it intact, and the model cannot cite it. Every other component in the pipeline is downstream of the chunking choice.

This is the same lesson I keep relearning in every retrieval project. You can change the embedding model, swap the vector store, tune the top-k, add a reranker, and you are still bottlenecked by whether the chunks contain the answers the user asks about. A great LLM cannot answer from a chunk that does not contain the relevant information. A great embedding model cannot match a query to a chunk where the answer is split across two retrievable units. The chunker is the floor, and most teams ship with that floor lower than they realize.

The reason it stays hidden is that chunking failures are silent. The system returns plausible-looking citations, the model produces fluent answers, and only a careful read of the source documents reveals that the answer is wrong, or partial, or stitched together from the wrong context. Compare that to a pipeline where the embedding model is broken: queries return obvious garbage, on-call gets paged, the bug is fixed in an afternoon. Chunking bugs do not page anyone. They show up as a slow drift in answer quality and an unhappy customer support engineer who does not know how to file the ticket.

Fixed-Size Chunking Is The Default For A Reason, And A Trap For Another

The default everybody starts with is fixed-size chunking. Pick a chunk size, pick an overlap, slide a window across the document. It is one line of code. It works on any document type. It produces predictable chunk counts and predictable storage costs. There is a real reason this pattern is the default, and there is a real reason it stops being good enough the moment the documents have any structure at all.

The strength of fixed-size chunking is that it is uniform. Every chunk is the same size, every chunk has the same overlap with its neighbors, and the embedding model sees inputs in a consistent shape. That uniformity matters more than people give it credit for. Embedding quality is sensitive to chunk size, and a pipeline where chunks vary wildly in length produces vectors that are not directly comparable. A 50-token chunk and a 2000-token chunk live in different parts of the embedding space, even if they describe the same topic, because the model encodes density and breadth differently. Fixed-size chunking sidesteps that problem by pretending everything is the same shape.

The weakness is the part everybody hits within a week of shipping. Fixed-size chunking ignores the structure of the document. It splits in the middle of sentences, in the middle of code blocks, between a heading and the section it introduces, between a question and its answer. The overlap parameter is supposed to paper over this, but overlap is a band-aid. A 50-token overlap on a 512-token chunk gives the next chunk a small lead-in to the previous one, but it does not preserve the boundary that mattered, which was the section heading. The retriever finds the body but loses the title that explained what the body was about.

The pattern that has worked when I am stuck with fixed-size chunking is to preprocess aggressively. Before the splitter runs, I prepend every chunk with the document title and the nearest preceding heading. The chunker still cuts where it cuts, but the chunk now carries enough context that the embedding can place it in the right neighborhood. This is a hack, and it works, and it is almost always worth the small storage hit. The chunk that says "from a document titled X, in a section about Y, the following text..." retrieves better than the chunk that starts mid-paragraph with no signal of where it came from.
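In code, the hack is a few lines. A minimal sketch, splitting on words as a stand-in for a real tokenizer (use your embedding model's tokenizer for accurate sizes); the function name and prefix format are mine, not a library API:

```python
def fixed_size_chunks(text, title, heading, size=512, overlap=50):
    """Sliding-window chunking with document context prepended to every chunk."""
    words = text.split()  # word-based stand-in for token counting
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        body = " ".join(words[start:start + size])
        # The hack: carry the title and nearest heading into every chunk
        # so the embedding lands in the right neighborhood.
        chunks.append(f"Document: {title}\nSection: {heading}\n\n{body}")
        if start + size >= len(words):
            break
    return chunks
```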

Structure-Aware Chunking Is Where Production Lives

The next step up, and the one most production systems should be at, is to chunk along the structure the document already carries. Markdown documents have headings. HTML has tags. PDFs have pages and, with the right parser, sections. Code has functions and classes. Notion pages, Confluence pages, and most internal documentation systems expose a structural tree if you ask nicely. Use it.

The pattern is to split at structural boundaries first, then post-process to merge or further split based on size constraints. A markdown document becomes a tree of sections, each section becomes a candidate chunk, and any section that exceeds the embedding model's effective context gets recursively split along sub-headings. Sections that are too small get merged with their neighbors, but only their structural neighbors, never across a top-level heading. The output is chunks that respect the author's intent: each chunk is a thing the author wrote as a unit, not a slice of arbitrary text.
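A sketch of that loop for markdown, with sizes in words where production code would count tokens. This is deliberately simplified: a real parser has to handle the edge cases the next section warns about, like headings inside code fences, which this regex does not.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)", re.MULTILINE)

def markdown_sections(doc):
    """Yield (level, heading, body) for each heading-delimited section."""
    matches = list(HEADING.finditer(doc))
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(doc)
        yield len(m.group(1)), m.group(2), doc[m.end():end].strip()

def chunk_markdown(doc, min_words=50, max_words=500):
    chunks, buf, buf_words = [], [], 0

    def flush():
        nonlocal buf, buf_words
        if buf:
            chunks.append("\n\n".join(buf))
        buf, buf_words = [], 0

    for level, heading, body in markdown_sections(doc):
        if level == 1:
            flush()  # never merge small sections across a top-level heading
        words = len(body.split())
        if buf_words and buf_words + words > max_words:
            flush()
        buf.append(f"{'#' * level} {heading}\n{body}")
        buf_words += words
        if buf_words >= min_words:
            flush()  # big enough to stand alone; an oversize section would
                     # additionally be re-split along its sub-headings here
    flush()
    return chunks
```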

The benefit shows up in retrieval quality, but it also shows up in citation quality. When a structural chunk is retrieved, the model can cite the section heading directly. The user can see "this answer comes from Section 4.2 of the Refunds Policy" instead of "this answer comes from chunk 137." That is a product feature. Users trust citations they can verify. Citations that point to recognizable structural units are easier to verify than citations that point to opaque ranges.

The trap with structure-aware chunking is that the structural parser has to be good. A bad markdown parser will mistake a code block for a heading and chunk wrong. A bad PDF parser will fail to find sections in a document where the section breaks are visual rather than semantic, which is most real PDFs. Investing in the parser is the unglamorous part of this work. The right move is to spend a day looking at how your parser actually splits a representative sample of your documents, and to fix the cases where it is wrong. The fixes pay back for the lifetime of the index.

Semantic Chunking Sounds Smart, And Mostly Is Not

There is a class of chunking strategies marketed as "semantic" that try to use embeddings or a small model to find natural break points in the text. The pitch is that the chunker reads the document, notices where the topic shifts, and cuts there. The pitch is correct in theory. In practice, semantic chunking works well on a narrow set of documents and poorly on most of the rest, and the cost is high enough that the trade is rarely good.

Where it works is on flowing prose without explicit structure. Long-form articles, transcripts, books. The structural signals are absent, the topic shifts are real, and a semantic chunker can find a cut point that a fixed-size chunker would miss. If the entire corpus is documents like this, semantic chunking is worth the engineering cost.

Where it fails is everything else. On structured documents the semantic chunker fights with the structure. The headings already mark topic shifts, and the embedding-based detector is noisy enough to put cuts in places where the author did not intend cuts. On code, on logs, on FAQs, on transactional documents, semantic chunking adds latency and cost without measurable retrieval improvements. The teams I have seen ship semantic chunking and keep it are the ones whose corpus is dominated by long prose. Everybody else has either ripped it out or quietly downgraded to structure-aware with semantic-style heuristics for the rare cases where it matters.

The compromise that works is to use a semantic detector only as a fallback. If a structural chunk is too long to fit the embedding model's window, use a semantic detector to find the best cut point inside it. That keeps the cost bounded and the benefit targeted at the cases where structure has run out.
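A sketch of that fallback, with `embed` as a placeholder for whatever sentence-embedding call your stack provides: score every adjacent sentence pair, cut at the weakest link, and keep the cut away from the edges so neither half is a sliver.

```python
import numpy as np

def best_cut(sentences, embed, window=0.25):
    """Split an oversize span at the weakest adjacent-sentence link."""
    vecs = [embed(s) for s in sentences]
    n = len(sentences)
    lo = max(1, int(n * window))              # keep the cut away from the edges
    hi = max(lo + 1, int(n * (1 - window)))

    def sim(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

    # The topic-shift signal: the adjacent pair whose embeddings are
    # least similar is the best candidate boundary.
    i = min(range(lo, hi), key=lambda j: sim(vecs[j - 1], vecs[j]))
    return sentences[:i], sentences[i:]
```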

Hierarchical Chunking And The Parent-Child Pattern

The pattern that has earned its place in production over the last two years is hierarchical chunking, sometimes called the parent-child or small-to-big pattern. The idea is to chunk at two granularities. Small chunks, sized for retrieval, are what the embedding model and the vector store see. Large chunks, sized for context, are what the LLM sees when a small chunk is retrieved. The retrieval index points from the small chunk to its parent.

The reason this works is that retrieval and generation have different sweet spots. Retrieval works best on chunks small enough that the embedding represents a single coherent idea. The vector for a 200-token chunk about how to issue a refund is sharp. The vector for a 2000-token chunk that contains that same idea plus four other ideas is blurred, because the embedding has to average over all of them. Generation, on the other hand, works best with more context, because the model needs the surrounding details to produce a complete answer.

The hierarchical pattern lets you have both. The retriever finds the precise small chunk that matches the query. The pipeline then expands to the parent, which is the section or the page or the document, and sends that to the LLM. The model gets the precision of the small chunk's match and the context of the parent's surroundings. The cost is a little extra storage for the parent text, which is rounding error in any production vector store.

The discipline is to set the parent boundary at a level that means something. Parents that are entire documents are usually too big. Parents that are paragraphs are usually too small. The right level is almost always the structural level: a section in a markdown doc, a page in a PDF, a function in a code file. The parent is the unit a human would point to when asked "where did this come from."
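Wired up, the pattern is small. The store API below is hypothetical, standing in for whatever vector database you use; the essential part is that every small chunk carries a parent id, and retrieval expands through it.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    chunk_id: str
    parent_id: str   # the section, page, or function this chunk came from
    text: str

def retrieve_with_parents(query, store, parents, k=10, max_parents=4):
    """Match on small chunks, send full parents to the LLM."""
    hits = store.search(query, top_k=k)   # hypothetical store: [(Chunk, score)]
    seen, context = set(), []
    for chunk, _score in hits:
        if chunk.parent_id in seen:
            continue   # several small hits often collapse into one parent
        seen.add(chunk.parent_id)
        context.append(parents[chunk.parent_id])   # full parent text
        if len(context) == max_parents:
            break
    return context
```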

The same discipline I covered in RAG vs long context applies here, because hierarchical chunking is partly an answer to the question of how much context to send. The retrieval narrows the search. The parent expansion gives the model enough surrounding text to produce a grounded answer. Tuning the small-chunk size and the parent size independently is one of the highest-leverage tuning operations in a RAG pipeline.

Chunk Size: The Number Everyone Asks About And The Wrong One To Optimize First

The first question every team asks is what chunk size to use. The honest answer is that it depends on the embedding model, the document type, and the query shape, and the fastest way to get to a good number is to start at 256 to 512 tokens and adjust by measuring. Anchoring to a number before measuring is how teams end up with a confidently wrong setting.

Embedding models have an effective context that is shorter than their advertised maximum. A model with an 8192-token context window does not produce equally good embeddings for 8192-token chunks as it does for 512-token chunks. The longer the input, the more the embedding has to compress, and the more semantic detail gets lost in the averaging. The advertised context is the limit, not the recommendation. The recommendation is usually a few hundred tokens, sometimes up to a thousand for newer models. Check the model card. Then verify on your own data, because model cards are written for a benchmark and not for your corpus.

Document type matters because chunk size interacts with information density. Technical documentation packs ideas tightly: a 256-token chunk of API reference can contain three or four distinct facts. Narrative content is sparser: a 256-token chunk of a blog post might contain half of a single argument. The right chunk size for the dense corpus is smaller, because the embedding can capture the multi-fact density at smaller sizes. The right chunk size for the sparse corpus is larger, because cutting too small leaves the chunks without enough signal to retrieve.

Query shape matters because the chunk has to answer the kind of question users ask. If the queries are precise lookups ("what is the refund window for product X"), small chunks win, because the answer is a single fact and small chunks isolate facts. If the queries are exploratory ("how does our refund process work"), larger chunks win, because the answer needs context the user is implicitly asking the system to assemble. Most production systems get a mix of both, and the right move is hierarchical chunking, which sidesteps the choice.

Overlap: The Knob That Matters Less Than You Think

The other parameter every tutorial mentions is overlap. The standard advice is to overlap chunks by 10 to 20 percent. The standard advice is fine and almost never the difference between a working system and a broken one. Overlap is a small lever, and tuning it is one of the last things to do.

The reason overlap exists is to handle the case where the answer to a query straddles a chunk boundary. With no overlap, the answer is split between two chunks, and neither chunk is a great match for the query. With overlap, one of the two chunks contains the full answer, and the retriever can find it. This is real, and overlap helps, and the help is bounded.

The case where overlap stops helping is when the chunk boundaries are wrong in the first place. Adding overlap to a fixed-size chunker that splits in the middle of sentences does not produce chunks that respect sentence boundaries. It produces chunks that share a few sentences with their neighbors and still split mid-sentence at the start and end. The fix is not more overlap. The fix is structure-aware chunking that does not split mid-sentence.

The other case where overlap is wasted is when the chunk size is already large enough that boundary-straddling answers are rare. A 2000-token chunk almost never has its answer split across the boundary, because almost any answer fits inside it. Spending storage on overlap at that size is paying for an edge case that does not happen.

The pattern I default to is small overlap, around 10 percent, on smallish chunks, around 256 to 512 tokens. It is a sensible setting that does not need tuning unless something else in the pipeline forces it. If the retrieval quality is bad, do not start by tuning overlap. Start by looking at whether the chunks themselves make sense.

Metadata Is The Multiplier

The chunk text is not the only thing you store. Every chunk should carry metadata that lets the retriever filter, the reranker reason, and the LLM cite. Document title. Section heading. Source URL. Author. Publication date. Document type. Tags. Whatever your system has that distinguishes documents from each other.

Metadata pays back in three places. First, in retrieval, where filters cut the search space and improve precision. A query about a 2024 policy should not return a chunk from a 2020 policy, no matter how semantically similar the text is. A metadata filter on date solves that without any embedding-side work. Second, in reranking, where the metadata becomes additional features the reranker can weight. Recent documents, authoritative sources, official policies score higher. Third, in citation, where the metadata is what the LLM uses to tell the user where the answer came from. A citation is only as good as the metadata behind it.

The pattern that has worked is to over-collect metadata at chunking time and decide later what to use. Storage is cheap. Re-chunking the corpus to add a missing field is expensive. If the source has it, capture it. The first time you need to filter by something you did not capture is the day you regret not capturing it.
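The shape of the record is simple. The field names here are illustrative, not a schema; the point is that every field the source exposes gets a slot, whether or not anything consumes it yet.

```python
from dataclasses import dataclass, field

@dataclass
class ChunkRecord:
    text: str
    doc_title: str
    section_heading: str
    source_url: str
    author: str | None = None
    published_at: str | None = None   # ISO date; enables recency filters later
    doc_type: str | None = None       # "policy", "runbook", "faq", ...
    tags: list[str] = field(default_factory=list)
```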

Tables, Code, And Other Things That Break Default Chunkers

Default chunkers handle prose. They do not handle tables, code blocks, lists with structural meaning, or multi-column PDFs. Each of these requires a different strategy, and each of them shows up in real corpora, and each of them silently degrades retrieval if you do not address them.

Tables are the worst offender. A table chunked by character count loses its row structure and becomes a stream of cells the embedding model cannot interpret. The fix is to detect tables before chunking and serialize them in a format that preserves structure. Markdown tables, JSON arrays of row objects, or natural-language summaries of the table contents all work, with different trade-offs. The summary approach is the highest quality and the highest cost, because it requires running the table through a small model. The markdown approach is cheaper and works for most queries that ask about the table's contents.
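A sketch of the markdown approach, assuming the table has already been parsed into a list of headers and a list of rows. The trick is repeating the header row in every chunk, so a large table can be split across chunks without orphaning cells from their columns.

```python
def table_to_markdown_chunks(headers, rows, rows_per_chunk=20):
    head = "| " + " | ".join(headers) + " |"
    sep = "| " + " | ".join("---" for _ in headers) + " |"
    chunks = []
    for i in range(0, len(rows), rows_per_chunk):
        body = "\n".join(
            "| " + " | ".join(map(str, row)) + " |"
            for row in rows[i:i + rows_per_chunk]
        )
        # Repeat the header in every chunk so each batch of rows
        # stays interpretable on its own.
        chunks.append("\n".join([head, sep, body]))
    return chunks
```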

Code blocks should be chunked by the structure of the code, not by line count. A function or class is the natural unit. Chunking in the middle of a function produces chunks that have neither the signature nor the implementation, and the embedding represents nothing useful. Most languages have AST parsers that can extract function-level chunks cleanly. The investment pays back in code-search quality, which is otherwise terrible.
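For Python source the standard library is enough; other languages would need tree-sitter or a language-specific parser. A sketch, chunking at top-level functions and classes:

```python
import ast

def python_code_chunks(source, path):
    """One chunk per top-level function or class in a Python file."""
    chunks = []
    for node in ast.parse(source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            segment = ast.get_source_segment(source, node)
            # Prefix with the file path: the same locating context
            # a heading gives a prose chunk.
            chunks.append(f"# {path}\n{segment}")
    return chunks
```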

Multi-column PDFs are the failure mode that catches every team that ships RAG against scanned documents. The default text extractor reads top-to-bottom, left-to-right, which produces a stream where the first sentence of column one is followed by the first sentence of column two. The chunks are gibberish. The fix is a layout-aware extractor that respects columns, of which there are several open and commercial options as of 2026. Pick one, evaluate on your corpus, switch.

How To Know Your Chunking Is Wrong

The hardest part of chunking is that the failure signal is buried in answer quality, which is hard to measure and slow to surface. The discipline is to build a small evaluation set early, before the chunker is locked in, and to run it on every chunking change.

The eval set is a list of representative queries with known correct answers and known correct source spans in the corpus. For each query, the eval measures whether the retrieval returned the chunk containing the correct span, and whether the LLM produced an answer matching the expected one. This is the same evals discipline I covered in AI evals for solo developers, applied to the retrieval-and-generation pipeline as a unit.

The chunking-specific signal to watch is recall at k. If the correct chunk is in the top 10 results most of the time, the chunker is doing its job. If the correct chunk is missing from the top 10 even when the embedding model is solid and the query is clear, the chunker has split the answer in a way that breaks retrieval. That signal is much faster to act on than answer quality, because it points directly at the chunking step.
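The metric is a few lines once the eval set exists. Here `retrieve` stands in for your pipeline's retrieval entry point, and the gold span is the text the correct chunk must contain verbatim:

```python
def recall_at_k(eval_set, retrieve, k=10):
    """eval_set: (query, gold_span) pairs with known correct source spans."""
    hits = sum(
        1 for query, gold_span in eval_set
        if any(gold_span in chunk.text for chunk in retrieve(query, top_k=k))
    )
    return hits / len(eval_set)
```

Run it on every chunking change. A drop in this number points at the chunker long before answer quality has a chance to drift.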

The other signal is qualitative. Read the chunks. Take a sample of fifty chunks at random and read them as if you were the embedding model. Do they make sense as standalone units? Do they cut off mid-thought? Do they have enough context to be retrievable? Five minutes of reading chunks beats five hours of tuning hyperparameters, every time, and most teams skip it because it does not feel like engineering. It is the most engineering thing you can do at this layer.
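The reading exercise needs no tooling beyond this, assuming `all_chunks` is whatever collection your indexer produced:

```python
import random

# Read the chunks the way the embedding model sees them: alone,
# with no surrounding document to lean on.
for chunk in random.sample(all_chunks, k=50):
    print(chunk.text)
    print("-" * 60)
```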

What I Would Build From Scratch In 2026

If I were starting a RAG pipeline today, the chunker would be structure-aware, hierarchical, with metadata enrichment, with a small overlap, with special handling for tables and code, and with an eval set running on every change. The chunk size would be a few hundred tokens for retrieval, with parents at the section or page level for generation. The fixed-size fallback would only kick in for unstructured prose, and even then with title and heading prepended to every chunk. The semantic chunker would be a fallback inside the structural chunker, used only when a structural unit was too large to embed cleanly.

That stack is not novel. It is the stack the production teams I trust have converged on, and it is unglamorous in the same way the guardrails layer is unglamorous and the observability layer is unglamorous. The interesting work is at the LLM, the visible improvements are at the LLM, and the actual quality ceiling sits at the chunker. Most of the wins in a RAG system over the next year are going to come from teams realizing this and putting an engineer on the chunking layer for a week instead of swapping models for the third time.

If your RAG system is producing answers that look right but feel slightly off, the answer is almost never the LLM. It is almost always the chunker, doing exactly what you told it to do, on documents that did not deserve to be cut where they got cut. Fixing that is the highest-leverage thing you can do in retrieval, and it is sitting there, waiting for somebody to read fifty chunks and notice.
