I assumed chunking was a solved problem. Pick a text splitter, set 512 tokens, add some overlap, move on. After running structured experiments across three different data types, that assumption collapsed. The best chunker for markdown documentation actively hurt performance on code. The winner changed completely depending on what I was chunking.
TL;DR
| Data type | Winner | Headline metric |
|---|---|---|
| Markdown docs | HeadingAwareChunker | MRR 0.755 vs SlidingWindow 0.687 |
| PDFs | RecursiveChar (512 tok) | Context Recall 0.9250, RAGAS SUM 3.4249 |
| GitHub code | CodeBlockAwareChunker | RAGAS SUM 3.5680 — highest across all experiments |
RecursiveChar won on PDFs. The same chunker scored 0.5690 Context Precision on code; roughly half the retrieved chunks were irrelevant. There is no universal best chunker. The data type decides.
What I was building
A RAG system that ingests documentation sites, PDFs, and GitHub repositories for multiple tenants, then answers developer questions with citations. Before embedding anything, I had to decide how to chunk each source type.
The standard advice is "use a recursive text splitter." Every tutorial does this. But markdown docs have headings, PDFs have paragraphs, and code has functions. A function is a complete semantic unit; split it at token 256 and you've lost the return type, the error handling, and the docstring. None of that is recoverable at query time.
So I ran experiments, changing one variable per experiment: the chunker.
The embedding model, retrieval method, reranker, LLM, and eval set stayed fixed.
RAGAS scored every pipeline on the same frozen question set.
Three data types, three experiments. Here's what happened.
The full implementation, experiment notebooks, and eval sets are on GitHub.
Experiment 1: Documentation (.md / .mdx)
Corpus: FastAPI and Supabase documentation, 78 QA pairs generated by GPT-4o, frozen after generation
Chunkers tested: HeadingAwareChunker (HAC), SlidingWindow-128, RecursiveChar, SemanticBlock
Key metric: MRR (Mean Reciprocal Rank). Recall@5 tells you whether the answer is somewhere in the top 5; MRR tells you whether it's at rank 1, i.e. whether the right chunk comes first, not just eventually.
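MRR is simple to compute: take the reciprocal of the rank at which the first relevant chunk appears, and average over all queries. A minimal sketch (the function name and inputs are illustrative, not from the project's code):

```python
def mrr(ranked_results, relevant_ids):
    """Mean Reciprocal Rank: average of 1/rank of the first relevant chunk.

    ranked_results: per-query lists of retrieved chunk ids, best first.
    relevant_ids:   per-query id of the gold chunk.
    """
    total = 0.0
    for ranking, relevant in zip(ranked_results, relevant_ids):
        for rank, chunk_id in enumerate(ranking, start=1):
            if chunk_id == relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_results)

# Two queries: the gold chunk lands at rank 1, then at rank 3.
# (1.0 + 1/3) / 2 ≈ 0.667
print(mrr([["a", "b"], ["x", "y", "z"]], ["a", "z"]))
```

A chunker that always surfaces the right chunk somewhere in the top 5 but rarely at rank 1 can tie on Recall@5 and still lose badly on MRR, which is exactly the HAC-vs-SlidingWindow pattern below.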
| Chunker | MRR (no reranker) | Chunks produced |
|---|---|---|
| HeadingAwareChunker | 0.755 | 127 |
| SlidingWindow-128 | 0.687 | 259 |
HAC produced the same Recall@5 as SlidingWindow (~0.82) but with significantly better MRR. The right answer appeared at rank 1 more often. And HAC did it with 127 chunks versus SlidingWindow's 259: half the chunks, better ranking, cheaper retrieval.
Why? Markdown documentation is already structured by headings. Each section covers one concept, one API endpoint, one configuration option. HAC splits exactly at those heading boundaries. SlidingWindow ignores them entirely; it cuts at token count, which means a chunk might start halfway through one concept and end halfway through the next.
The embedding model then has to encode a chunk that mixes two ideas. The resulting vector is somewhere between them, and retrieval becomes imprecise.
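The heading-boundary idea fits in a few lines. A minimal sketch, assuming the real HeadingAwareChunker also tracks hierarchy and enforces token limits (none of that is shown here; the function name is illustrative):

```python
import re

def heading_aware_chunks(markdown_text, max_level=3):
    """Split markdown into one chunk per heading section (levels 1..max_level)."""
    pattern = re.compile(rf"^#{{1,{max_level}}}\s", re.MULTILINE)
    starts = [m.start() for m in pattern.finditer(markdown_text)]
    if not starts:
        return [markdown_text]
    chunks = []
    if starts[0] > 0:  # keep any preamble before the first heading
        chunks.append(markdown_text[: starts[0]])
    for begin, end in zip(starts, starts[1:] + [len(markdown_text)]):
        chunks.append(markdown_text[begin:end])
    return [c.strip() for c in chunks if c.strip()]

doc = "# Auth\nHow to authenticate.\n\n## Tokens\nUse bearer tokens.\n"
print(heading_aware_chunks(doc))  # one chunk per section, heading included
```

Each chunk carries its own heading, so the embedded vector encodes one concept plus its label rather than a token-count slice of two adjacent concepts.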
Winner: HeadingAwareChunker.
Experiment 2: PDFs
Corpus: 5 technical PDFs covering FastAPI concepts, Kubernetes architecture, React patterns, the Stripe API reference, and an AWS overview, along with 40 QA pairs.
Chunkers tested: SlidingWindow-128, SemanticBlock, RecursiveChar (512 tokens, 50 overlap). HeadingAwareChunker was not included here: pymupdf4llm extracts PDFs to Markdown, but the heading hierarchy in PDFs is inconsistent across documents, and font-size-based heading detection is fragile enough that HAC's boundaries would be unreliable. The experiment focused on chunkers that work on paragraph-level structure, which is what the extraction reliably produces.
| Chunker | Context Recall | RAGAS SUM |
|---|---|---|
| RecursiveChar | 0.9250 | 3.4249 |
| SlidingWindow-128 | 0.8750 | 3.3691 |
| SemanticBlock | 0.8167 | 3.2627 |
RecursiveChar won by a clear margin. Context Recall 0.9250 versus SlidingWindow's 0.8750.
The reason is specific to how I extracted the PDFs. I used pymupdf4llm, which converts PDFs to Markdown. The output is clean paragraphs with heading markers. RecursiveChar's default split points (double newlines first, then single newlines) align naturally with those paragraph boundaries. It didn't need to classify blocks or detect headings. The structure was already there; RC just respected it.
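The recursive idea is: try the coarsest separator first, and only fall back to finer ones for pieces that are still too large. A simplified sketch (real splitters such as LangChain's RecursiveCharacterTextSplitter also merge small pieces and add overlap, which this omits):

```python
def recursive_split(text, max_chars=512, separators=("\n\n", "\n", " ")):
    """Split on the coarsest separator; recurse with finer ones for oversized pieces."""
    if len(text) <= max_chars or not separators:
        return [text]
    sep, rest = separators[0], separators[1:]
    chunks = []
    for piece in text.split(sep):
        if len(piece) <= max_chars:
            chunks.append(piece)
        else:
            chunks.extend(recursive_split(piece, max_chars, rest))
    return [c for c in chunks if c.strip()]

# Clean paragraph-structured text splits exactly at the paragraph break.
text = ("A" * 300) + "\n\n" + ("B" * 300)
print(len(recursive_split(text, max_chars=400)))  # 2
```

On pymupdf4llm output, the `"\n\n"` pass usually succeeds on its own, which is why RC behaved like a paragraph splitter here.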
SemanticBlock failed on the Stripe API PDF. That document's navigation sidebar produced 12-token noise chunks, fragment after fragment of menu items. Those wasted retrieval slots on every single query.
Winner: RecursiveChar.
Note what just happened: HAC won on docs, RC won on PDFs. Two different data types, two different winners, and the experiments are only half done.
Experiment 3: GitHub code
Corpus: the encode/httpx repository, 90 files (60 Python, 29 Markdown, 1 text), with 50 QA pairs focused on function behavior, parameters, and return values.
Chunkers tested: CodeBlockAwareChunker (CBAC), RecursiveChar, SlidingWindow-128
| Chunker | Ctx Precision | Ctx Recall | RAGAS SUM |
|---|---|---|---|
| CodeBlockAwareChunker | 0.7812 | 0.9700 | 3.5680 |
| SlidingWindow-128 | 0.8278 | 0.9150 | 3.4957 |
| RecursiveChar | 0.5690 | 0.9400 | 3.2856 |
RecursiveChar scored 0.5690 on Context Precision; roughly half of the retrieved chunks were irrelevant to the question. The same chunker that won on PDFs failed on code.
The failure mode is direct. Python code is full of blank lines between a function's docstring and its body, between logical sections inside a method, between a guard clause and the main logic. RecursiveChar splits at blank lines. So it routinely bundled two or three unrelated functions into a single chunk, averaging 457 tokens. When someone asks "what does Client.send() return," the retrieved chunk contains send() plus get() plus the __init__ method. Everything but a focused answer.
CBAC doesn't use blank lines. For Python files, it uses the ast module to find the exact byte offset of every function and class definition in the syntax tree, then extracts each one as a separate chunk. Zero false splits. The average chunk was 120 tokens: one complete function.
SlidingWindow-128 had the best Context Precision (0.8278); small windows avoid the bundling problem. But it split functions mid-body, so a function's return value might land in the next window. That killed Recall: 0.9150 versus CBAC's 0.9700.
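The mid-body splits are inherent to the mechanism. A sketch of fixed-size windows with overlap (illustrative, not the project's implementation) makes it obvious that boundaries fall wherever the step size says, structure or not:

```python
def sliding_window(tokens, window=128, overlap=32):
    """Fixed-size token windows; each window overlaps the previous by `overlap`."""
    step = window - overlap
    return [tokens[i : i + window]
            for i in range(0, max(len(tokens) - overlap, 1), step)]

# A 300-token file becomes three windows starting at tokens 0, 96, 192.
# A function spanning tokens 100-150 is split across the first two windows.
windows = sliding_window(list(range(300)))
print([w[0] for w in windows])  # [0, 96, 192]
```

Small windows keep each chunk topically tight (good precision), but any semantic unit longer than the step size risks being cut (bad recall), which is the trade-off the table shows.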
CBAC with a full reranker pipeline achieved RAGAS SUM 3.7079, the highest score across all experiments in this project; the PDF best was 3.4843.
Winner: CodeBlockAwareChunker.
Why the results differ and why they shouldn't surprise you
Each experiment picked a different chunker, but every result points at the same question: what is the natural semantic unit of this data?
For markdown documentation, it's the section under a heading. That's a discrete concept, authored that way intentionally.
For PDFs extracted to Markdown, it's the paragraph. The extraction tool already produces those boundaries. The chunker just has to respect them.
For code, it's the function or class. A function is the smallest unit of behavior that makes sense alone. Split it and the chunk becomes meaningless without the surrounding context.
Text splitters, recursive or sliding window, don't know any of this. They operate on character counts, token counts, or blank lines. None of those correspond to semantic boundaries in code. That's the root cause of RecursiveChar's 0.5690 Context Precision. It wasn't a hyperparameter problem. It was a conceptual mismatch.
There's also a second effect worth naming: chunk count matters. HAC's 127 chunks versus SlidingWindow's 259 on the same corpus is not a coincidence. Fewer chunks means fewer candidates for noise to enter the retrieval pool. The embedding space is less diluted and rank 1 is cleaner.
What I learned
- The optimal chunker is determined by the data type, not by chunk size or overlap settings
- RecursiveChar's blank-line heuristic is a real liability for code; 0.5690 Context Precision proves it
- Smaller average chunks (120 tokens) outperformed larger ones (457 tokens) on code by a significant margin; chunk size is a symptom, not a cause
- Visual inspection of actual chunks before running RAGAS catches structural bugs that aggregate scores smooth over; I caught CBAC producing 8KB chunks on Go files before the experiment ran
- Freezing the eval set before the first experiment is non-negotiable: regenerating it mid-experiment would invalidate every comparison
The practical takeaway
There is no universal best chunker
For markdown documentation: split at heading boundaries
For PDFs: convert to Markdown first, then split at paragraph boundaries
For code: use an AST parser
A generic 512-token splitter will technically work on all three. It will not be optimal on any of them. And on code specifically, the degradation is not marginal: it's a near-halving of retrieval precision.
Pick the chunker that matches the semantic structure of the data, not the one that's easiest to configure.
The harder version of this problem is mixed content: a PDF with embedded code blocks, a GitHub repo where half the files are Python and half are Markdown. Each file type still needs its own chunking strategy, which means the chunker has to detect content type at the file level and route accordingly. That's what the connector layer in this project handles, but it's a separate problem worth its own post.
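The file-level routing itself can be as simple as a suffix table. A sketch only: the chunker names mirror the ones in this post, but the actual connector-layer API isn't shown here, so treat the names and structure as hypothetical.

```python
from pathlib import Path

# Illustrative routing table, keyed by file extension.
CHUNKER_BY_SUFFIX = {
    ".md": "HeadingAwareChunker",
    ".mdx": "HeadingAwareChunker",
    ".pdf": "RecursiveChar",          # after pymupdf4llm extraction to Markdown
    ".py": "CodeBlockAwareChunker",   # AST-based, Python only
}

def pick_chunker(path):
    """Route a file to a chunking strategy by extension; fall back to RecursiveChar."""
    return CHUNKER_BY_SUFFIX.get(Path(path).suffix.lower(), "RecursiveChar")

print(pick_chunker("docs/index.mdx"))  # HeadingAwareChunker
print(pick_chunker("src/client.py"))   # CodeBlockAwareChunker
```

Extension-based routing breaks down for mixed content inside a single file (code fences in a PDF), which is where the harder detection work lives.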
I'm building a production RAG system that ingests multiple source types with per-source-type chunking strategies. Future posts cover the reranker experiments, eval methodology, and the CI pipeline I built around RAGAS scores.