- Book: RAG Pocket Guide: Retrieval, Chunking, and Reranking Patterns for Production
- My project: Hermes IDE | GitHub — an IDE for developers who ship with Claude Code and other AI coding tools
- Me: xgabriel.com | GitHub
Picture a support engineer asking a doc bot why a webhook retry is failing. The retriever returns half of a markdown table: the header row, two body rows, then a hard cut at exactly 512 tokens. The remaining four rows, including the one that lists 429 as the retry trigger, go to the next chunk. The generator answers confidently and wrong. A better embedder will not save you. A fancier reranker will not save you. The fix is a splitter that knows what a table is.
Fixed-size token chunking is the default in most RAG tutorials because it is one line of code and the chunk count is predictable. It is also the reason your retrieval looks fine on prose and falls apart on anything with structure. Tables get bisected, code blocks lose their opening fence, and a bulleted list will happily keep its first three items while orphaning the last four under a heading the retriever never sees again.
What follows is a tokenizer-aware splitter for markdown that respects structure and stays under a soft token budget. Python, tiktoken, no external chunking library.
The failure mode in one snippet
Here is the canonical fixed-size splitter every tutorial ships:
import tiktoken
enc = tiktoken.encoding_for_model("gpt-4o")
def fixed_chunks(text: str, size: int = 512) -> list[str]:
tokens = enc.encode(text)
return [
enc.decode(tokens[i : i + size])
for i in range(0, len(tokens), size)
]
It cuts at token 512 regardless of what is there. If token 512 lands inside a | col1 | col2 | row, half the table goes to chunk N and half to chunk N+1. The retriever embeds two halves of one idea. Neither half answers the question on its own.
A February 2026 benchmark over 50 academic papers put recursive 512-token splitting at 69% retrieval accuracy and pure semantic chunking at 54%. The recursive strategy wins because it cuts at boundaries the document already provides. The splitter below extends that idea to markdown specifically, and counts tokens with the model's real tokenizer instead of characters.
Detect the structure first
Before splitting, parse the markdown into a flat list of blocks: headings, paragraphs, fenced code, tables, lists. Each block carries its level (for headings) and its raw text.
import re
from dataclasses import dataclass
@dataclass
class Block:
kind: str # h1..h6, para, code, table, list
level: int # heading level, else 0
text: str
Then a parser that walks line by line. It is small on purpose; it covers the markdown features that actually break naive splitters.
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")
FENCE = re.compile(r"^`{3}")
TABLE_ROW = re.compile(r"^\s*\|.*\|\s*$")
LIST_ITEM = re.compile(r"^\s*([-*+]|\d+\.)\s+")
def parse(md: str) -> list[Block]:
lines = md.splitlines()
blocks: list[Block] = []
i = 0
while i < len(lines):
ln = lines[i]
if m := HEADING.match(ln):
lvl = len(m.group(1))
blocks.append(Block(f"h{lvl}", lvl, ln))
i += 1
elif FENCE.match(ln):
i = _consume_code(lines, i, blocks)
elif TABLE_ROW.match(ln):
i = _consume_table(lines, i, blocks)
elif LIST_ITEM.match(ln):
i = _consume_list(lines, i, blocks)
elif ln.strip() == "":
i += 1
else:
i = _consume_paragraph(lines, i, blocks)
return blocks
The _consume_* helpers each accumulate their kind of block until the run ends, then append a single Block and return the next index. The point: a table is one atomic block, a fenced code section is one atomic block, and a list is one atomic block. The splitter never gets the chance to cut inside them.
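As a concrete example, the table and code consumers might look like this; a sketch only, following that same accumulate-then-append contract (the list and paragraph consumers are analogous):
def _consume_table(lines: list[str], i: int, blocks: list[Block]) -> int:
    start = i
    # a table is an unbroken run of |-delimited rows, kept atomic
    while i < len(lines) and TABLE_ROW.match(lines[i]):
        i += 1
    blocks.append(Block("table", 0, "\n".join(lines[start:i])))
    return i

def _consume_code(lines: list[str], i: int, blocks: list[Block]) -> int:
    start = i
    i += 1  # step past the opening fence
    while i < len(lines) and not FENCE.match(lines[i]):
        i += 1
    i = min(i + 1, len(lines))  # include the closing fence if present
    blocks.append(Block("code", 0, "\n".join(lines[start:i])))
    return i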
Now wrap tiktoken so you can ask any block for its token count once and cache it.
import tiktoken
ENC = tiktoken.encoding_for_model("gpt-4o")
def n_tokens(text: str) -> int:
return len(ENC.encode(text))
For a real corpus, cache n_tokens(block.text) on the block when you parse. Re-encoding the same string repeatedly is the slow part of every naive token splitter.
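One cheap way to get that is to memoize the function itself rather than hang the count off the Block; a sketch using functools.lru_cache with the ENC defined above:
from functools import lru_cache

@lru_cache(maxsize=None)
def n_tokens(text: str) -> int:
    # each distinct block string is encoded exactly once per process
    return len(ENC.encode(text))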
A soft budget means: aim for target tokens per chunk, allow up to hard_max, never go past hard_max. For embedding models with 512-token windows, target=480 and hard_max=512 works. For long-context embedders like text-embedding-3-large (8191-token input limit), target=800 and hard_max=1024 is a common production choice.
Greedy pack with priority splits
The packing loop is the heart of it. Iterate blocks, adding each to the current chunk while it fits the target. If a single block is larger than hard_max, recursively split that block by priority: paragraph, then sentence, then raw token window. Headings always stick to the next block so a section title never strands itself; that rule lives in a small variant of the flush helper, shown below.
def pack(
blocks: list[Block],
target: int = 480,
hard_max: int = 512,
) -> list[str]:
chunks: list[str] = []
buf: list[Block] = []
buf_tokens = 0
def flush():
nonlocal buf, buf_tokens
if buf:
chunks.append("\n\n".join(b.text for b in buf))
buf, buf_tokens = [], 0
for blk in blocks:
bt = n_tokens(blk.text)
if bt > hard_max:
flush()
for piece in split_oversize(blk, hard_max):
chunks.append(piece)
continue
if buf_tokens + bt > target and buf_tokens > 0:
flush()
buf.append(blk)
buf_tokens += bt
flush()
return chunks
Two things matter here. First, oversize blocks bypass the buffer entirely; they own their own chunks. Second, the target check fires only when the buffer is non-empty, so a block exactly at target still lands in a chunk on its own rather than triggering an empty flush.
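The heading rule is not in the loop itself; it lives in flush. Hold back a trailing heading instead of emitting it, so the title opens the next chunk. A minimal variant of the nested flush above (a sketch; it closes over the same chunks, buf, and buf_tokens):
def flush():
    nonlocal buf, buf_tokens
    if not buf:
        return
    carry: list[Block] = []
    # a heading at the end of the buffer would strand the section title;
    # hold it back so it leads the next chunk instead
    while len(buf) > 1 and buf[-1].kind.startswith("h"):
        carry.insert(0, buf.pop())
    chunks.append("\n\n".join(b.text for b in buf))
    buf = carry
    buf_tokens = sum(n_tokens(b.text) for b in buf)
One corner case: if a carried heading is immediately followed by an oversize block, that block's pieces get emitted first and the heading ends up leading the chunk after them. Prepending the carried heading to the first piece is a small tweak if your corpus hits that case often.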
A 700-token paragraph still needs to fit somewhere. Walk the priority ladder: split by paragraph (\n\n), then by sentence (a regex on [.!?]\s+), then by a raw token window as a last resort; headings never enter this path because the parser already emits them as their own blocks. Stop at the first ladder rung that produces pieces under hard_max.
SENTENCE = re.compile(r"(?<=[.!?])\s+")
def split_oversize(blk: Block, hard_max: int) -> list[str]:
if blk.kind == "code":
return _split_code(blk.text, hard_max)
parts = blk.text.split("\n\n")
if max(n_tokens(p) for p in parts) <= hard_max:
return _greedy_join(parts, hard_max)
parts = SENTENCE.split(blk.text)
if max(n_tokens(p) for p in parts) <= hard_max:
return _greedy_join(parts, hard_max)
return _split_by_tokens(blk.text, hard_max)
Code blocks deserve their own splitter because you want to keep the fence and language tag on each piece. _split_code re-emits the opening fence, language tag included, at the top of every fragment so the chunk is still valid markdown when the retriever returns it. _split_by_tokens is the last-resort tiktoken-windowed fallback.
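Sketches of both under those constraints; the fence handling assumes the block text starts with its opening fence line and may or may not end with a closing one:
def _split_code(text: str, hard_max: int) -> list[str]:
    lines = text.splitlines()
    fence = lines[0]  # e.g. ```python, re-emitted on every fragment
    body = lines[1:-1] if lines[-1].startswith("```") else lines[1:]
    pieces: list[str] = []
    cur = [fence]
    for ln in body:
        # flush when adding this line (plus a closing fence) would bust the cap
        if len(cur) > 1 and n_tokens("\n".join(cur + [ln, "```"])) > hard_max:
            pieces.append("\n".join(cur + ["```"]))
            cur = [fence]
        cur.append(ln)
    pieces.append("\n".join(cur + ["```"]))
    return pieces

def _split_by_tokens(text: str, hard_max: int) -> list[str]:
    # last resort: the same plain token window as the naive splitter
    toks = ENC.encode(text)
    return [
        ENC.decode(toks[i : i + hard_max])
        for i in range(0, len(toks), hard_max)
    ]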
The _greedy_join helper packs sub-pieces back up to hard_max so you do not over-fragment. A paragraph that splits into eight sentences of 50 tokens each becomes one chunk of 400 tokens, not eight chunks of 50.
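A matching sketch of _greedy_join; the sep parameter is an added convenience (the two-argument calls above use its default), and separator tokens are not counted, so leave a little slack if you run right at the cap:
def _greedy_join(parts: list[str], hard_max: int, sep: str = "\n\n") -> list[str]:
    out: list[str] = []
    cur: list[str] = []
    cur_tok = 0
    for p in parts:
        pt = n_tokens(p)
        # start a new piece once adding this part would exceed the hard cap
        if cur and cur_tok + pt > hard_max:
            out.append(sep.join(cur))
            cur, cur_tok = [], 0
        cur.append(p)
        cur_tok += pt
    if cur:
        out.append(sep.join(cur))
    return out
Pass sep=" " from the sentence rung if you prefer rejoined sentences to read as a single paragraph.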
Boundary overlap, not token overlap
Most splitters overlap by N tokens. That copies a slice of arbitrary mid-sentence text into the next chunk. With structured blocks, overlap by boundary instead: the last heading and the last full sentence of chunk N become the lead of chunk N+1.
def with_overlap(chunks: list[str]) -> list[str]:
    if not chunks:
        return []
    out = [chunks[0]]
for i in range(1, len(chunks)):
prev_tail = _last_heading_and_sentence(chunks[i - 1])
out.append(f"{prev_tail}\n\n{chunks[i]}")
return out
_last_heading_and_sentence scans the previous chunk for the most recent line starting with # and the last sentence in the trailing paragraph, joins them, caps the overlap at ~50 tokens, and returns the string. The retriever now gets a reminder of what section a chunk came from without copying noise.
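A sketch of that helper, reusing the SENTENCE regex from above; OVERLAP_CAP is just the ~50-token cap made explicit:
OVERLAP_CAP = 50

def _last_heading_and_sentence(chunk: str) -> str:
    heading = ""
    for ln in chunk.splitlines():
        if ln.lstrip().startswith("#"):
            heading = ln.strip()  # keep the most recent heading seen
    tail = chunk.strip().rsplit("\n\n", 1)[-1]  # trailing paragraph
    sentences = [s for s in SENTENCE.split(tail) if s.strip()]
    last = sentences[-1].strip() if sentences else ""
    lead = "\n\n".join(p for p in (heading, last) if p)
    if n_tokens(lead) > OVERLAP_CAP:
        # cap the overlap so it stays a reminder, not a second copy
        lead = ENC.decode(ENC.encode(lead)[:OVERLAP_CAP])
    return lead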
This is the move that recovers the table-half failure from the opener. Even if a giant table forces a split, chunk N+1 starts with the H2 it sits under and the table's header row, so the embedding still says "this chunk is about webhook retries."
Putting it together
The full pipeline, end to end:
def chunk_markdown(
md: str,
target: int = 480,
hard_max: int = 512,
) -> list[str]:
blocks = parse(md)
raw = pack(blocks, target, hard_max)
return with_overlap(raw)
Three calls. Parse, pack, overlap. Run it on a doc set and inspect the boundaries: every chunk should start at a heading, a paragraph break, or an overlap line carrying the previous heading. None should start mid-sentence. None should contain half a table. None should leave a code fence unclosed.
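A quick way to eyeball those boundaries, assuming a local docs/ folder of markdown files (the path and the print format are placeholders):
from pathlib import Path

for path in sorted(Path("docs").glob("**/*.md")):
    for i, chunk in enumerate(chunk_markdown(path.read_text())):
        first = chunk.splitlines()[0]
        print(f"{path.name} #{i:03d}  {n_tokens(chunk):>4} tok  {first[:60]}")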
A small sanity check pays off in production:
FENCE_MARK = "`" * 3
def audit(chunks: list[str]) -> None:
for i, c in enumerate(chunks):
if c.count(FENCE_MARK) % 2 != 0:
raise ValueError(f"chunk {i}: unbalanced fence")
toks = n_tokens(c)
        # hard_max plus the ~50-token boundary overlap, with a little slack
        if toks > 600:
raise ValueError(f"chunk {i}: budget {toks}")
Add this to your ingest pipeline. Catching the unbalanced-fence case once will save you a confused weekend later.
Fixed-size token splitting only looks good on benchmarks built from research-paper PDFs without tables, code, or nested lists, and real corpora rarely cooperate. Once the splitter respects H2/H3, paragraphs, sentences, and atomic blocks under a soft token budget with boundary overlap, the retrieval numbers move on their own and the per-query cost stays at zero. Drop it into your ingest pipeline next sprint, run your eval set against the new chunks, and spend the rest of the quarter on the parts of the stack that actually need an LLM.
If this was useful
The chunking chapter of the RAG Pocket Guide goes deeper on token-aware splitters, including the budget-vs-recall tradeoff per embedding model, how to handle PDFs with extracted layout, and the eval rig for catching chunking regressions before they hit prod. If you are tuning a real corpus, it is the chapter to start at.
