Free contextual chunk headers: heading-aware chunking for hybrid retrieval

#python #llm #rag #postgres

In September 2024, Anthropic published Contextual Retrieval. The trick: have an LLM generate a short context (typically 50-100 tokens) for each chunk and prepend it to the chunk before embedding and BM25 indexing. In their evaluation, combining contextual embeddings with contextual BM25 cuts the top-20 retrieval failure rate from 5.7% to 2.9% (a 49% reduction). Add a reranker and it falls to 1.9% (a 67% reduction). Their published cost is around $1.02 per million document tokens, with prompt caching applied.

If your source documents have a clean heading hierarchy, the document itself gives you a usable prefix for free. No LLM call per chunk. This post shows how that path looks in production, in the itrstats tax assistant, where the knowledge base lives in markdown and the retriever is hybrid pgvector + Postgres tsvector.

.md file              ┌─────────┐    ┌──────────┐    ┌─────────┐    ┌────────────┐
or scraped URL ──────▶│ cleaner │───▶│ splitter │───▶│ chunker │───▶│  embedder  │
                      └─────────┘    └──────────┘    └────┬────┘    └─────┬──────┘
                                                          │               │
                                                          ▼               ▼
                                                    chunk text       1536-d vec
                                                          │               │
                                                          └───────┬───────┘
                                                                  ▼
                                                          ┌──────────────┐
                                                          │  Postgres    │
                                                          │  + pgvector  │
                                                          │  + tsvector  │
                                                          └──────────────┘

Prepend the breadcrumb into the chunk text

Most chunking tutorials use LangChain's RecursiveCharacterTextSplitter on plain text extracted from the source. For unstructured prose, that's fine. For markdown documents with heading hierarchy, it throws away the orientation cue the author put there on purpose. Strip the headings, chunk the body, embed: the chunk says "the additional tax cannot exceed the additional income above the threshold" without saying which threshold, in which regime, for which year.

The fix is a few lines of Python. After splitting on headings, prepend the heading path back into each chunk's text content before embedding:

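result = []
for section in sections:
    # Heading path, e.g. "Income Tax Slabs FY 2025-26 > Marginal Relief"
    prefix = format_breadcrumb(doc_title, section.breadcrumb)
    prefix_tokens = count_tokens(prefix)
    # The prefix counts against the chunk size, so the body gets the remainder.
    body_budget = CHUNK_SIZE_TOKENS - prefix_tokens - BREADCRUMB_SAFETY_TOKENS

    body_chunks = chunk_text(
        section.body,
        chunk_size=body_budget,
        overlap=OVERLAP_TOKENS,
        min_chunk_size=1,
    )
    for body in body_chunks:
        # Every chunk of the section carries the same breadcrumb.
        result.append(f"{prefix}\n\n{body}")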

For a paragraph from our knowledge base, the output looks like this (with CHUNK_SIZE_TOKENS = 512, BREADCRUMB_SAFETY_TOKENS = 8):

Income Tax Slabs FY 2025-26 > Marginal Relief

Marginal relief applies at the 87A rebate boundary and at each surcharge
threshold. When income slightly exceeds a threshold, the additional tax
cannot exceed the additional income above the threshold.

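The breadcrumb steals tokens from the body budget; the body splitter gets the remainder. That's the only arithmetic that matters. Skip this subtraction and every chunk comes out larger than CHUNK_SIZE_TOKENS by the length of its prefix, with deeply nested sections overshooting the most. If your chunk size sits close to the embedding model's input limit, that silent overshoot is what pushes chunks past it.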
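The budget arithmetic can be sketched in isolation. The bodies of format_breadcrumb and count_tokens below are assumptions (the loop above names them but does not show them), body_budget is a hypothetical helper, and the whitespace counter is only a crude stand-in for a real tokenizer:

```python
CHUNK_SIZE_TOKENS = 512
BREADCRUMB_SAFETY_TOKENS = 8


def format_breadcrumb(doc_title: str, breadcrumb: list[str]) -> str:
    """Join title and heading path. The ' > ' separator is an assumed format."""
    return " > ".join([doc_title, *breadcrumb])


def count_tokens(text: str) -> int:
    """Crude stand-in: whitespace-separated words. Swap in a real tokenizer."""
    return len(text.split())


def body_budget(doc_title: str, breadcrumb: list[str]) -> int:
    """Tokens left for body text after reserving the prefix and a safety margin."""
    prefix = format_breadcrumb(doc_title, breadcrumb)
    budget = CHUNK_SIZE_TOKENS - count_tokens(prefix) - BREADCRUMB_SAFETY_TOKENS
    if budget <= 0:
        # A very long heading path would otherwise produce a zero or negative size.
        raise ValueError(f"breadcrumb leaves no room for body text: {prefix!r}")
    return budget


prefix = format_breadcrumb("Income Tax Slabs FY 2025-26", ["Marginal Relief"])
budget = body_budget("Income Tax Slabs FY 2025-26", ["Marginal Relief"])
```

Failing loudly on a non-positive budget is a design choice in this sketch; truncating the breadcrumb would be another reasonable option.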

Headings are free supervision. The author put them in to tell future readers what a paragraph is about. Throwing them away before chunking is throwing away that signal.
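The loop above assumes sections that already carry their heading path. A minimal sketch of a splitter that produces them is below. It is not our production splitter: Section and split_by_headings are illustrative names, only ATX-style (#) headings are handled, and the H1 is kept inside the breadcrumb rather than passed separately as doc_title.

```python
import re
from dataclasses import dataclass

# ATX headings only: one to six '#' characters, whitespace, then the title.
HEADING_RE = re.compile(r"^(#{1,6})\s+(.+?)\s*$")


@dataclass
class Section:
    breadcrumb: list[str]  # heading path from the top level down to this section
    body: str


def split_by_headings(markdown: str) -> list[Section]:
    sections: list[Section] = []
    stack: list[tuple[int, str]] = []  # (heading level, title) of open headings
    body_lines: list[str] = []
    in_fence = False

    def flush() -> None:
        # Emit the text collected under the current heading path, if any.
        body = "\n".join(body_lines).strip()
        if body:
            sections.append(Section([title for _, title in stack], body))
        body_lines.clear()

    for line in markdown.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence  # '#' lines inside code fences are not headings
        match = None if in_fence else HEADING_RE.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # Close headings at the same or a deeper level before opening this one.
            while stack and stack[-1][0] >= level:
                stack.pop()
            stack.append((level, match.group(2)))
        else:
            body_lines.append(line)
    flush()
    return sections


SAMPLE = """# Income Tax Slabs FY 2025-26

Intro text.

## Marginal Relief

Marginal relief applies.

## Surcharge

### Thresholds

Rates here.
"""
sections = split_by_headings(SAMPLE)
```

Headings with no body of their own (Surcharge in the sample) emit no section; they only show up in the breadcrumbs of their children.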

Why it works for hybrid retrieval

Our retriever runs vector ANN over text-embedding-3-small embeddings (1536 dims) and BM25-style lexical scoring over a Postgres tsvector (ts_rank, which fills the BM25 role here but is not a true BM25 implementation), merged via Reciprocal Rank Fusion (k=60). Because the breadcrumb lives in the chunk's text, both retrievers see it:

chunk text: "Income Tax Slabs FY 2025-26 > Marginal Relief\n\nMarginal..."
                                    │
                ┌───────────────────┴───────────────────┐
                ▼                                       ▼
       text-embedding-3-small                 tsvector + ts_rank
       (1536-d vector)                        (BM25 over tokens)
                │                                       │
                └────────────────┬──────────────────────┘
                                 ▼
                       Reciprocal Rank Fusion
                       Σ 1 / (60 + rank)

For a query like "marginal relief new regime 87A rebate threshold", the vector side picks up the topic the breadcrumb names, and the BM25 side matches the exact tokens "Marginal Relief" even though the body prose doesn't repeat that phrase. Both retrievers rank the breadcrumb-prepended chunk higher than the naive version.
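The fusion step itself is small. Here is a minimal sketch of Reciprocal Rank Fusion with k=60, assuming each retriever returns chunk ids ordered best first; the function name and the chunk ids are made up for illustration:

```python
from collections import defaultdict


def reciprocal_rank_fusion(
    rankings: list[list[str]], k: int = 60
) -> list[tuple[str, float]]:
    """Fuse ranked id lists: each list adds 1 / (k + rank) to an id's score."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):  # ranks start at 1
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the order is deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


vector_hits = ["c7", "c2", "c9"]   # hypothetical ANN results, best first
lexical_hits = ["c2", "c4", "c7"]  # hypothetical tsvector results, best first
fused = reciprocal_rank_fusion([vector_hits, lexical_hits])
fused_ids = [chunk_id for chunk_id, _ in fused]
```

A chunk that ranks well in both lists (c2 here) beats one that tops a single list, which is why getting the breadcrumb into both retrievers pays off twice.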

Now consider the alternative. If you store the breadcrumb in a separate metadata column instead of inside the chunk text, you can still attach it as a citation at answer time. But the retriever scoring sees nothing of it. You paid the storage cost without getting any retrieval benefit back.
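A rough way to see the lexical half of that difference is to compare query-term overlap for the indexed text with and without the breadcrumb. This is plain set intersection, not a model of how Postgres normalises or ranks lexemes, and the terms helper is hypothetical:

```python
import re


def terms(text: str) -> set[str]:
    """Lowercased alphanumeric tokens; a stand-in for real lexeme normalisation."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


BREADCRUMB = "Income Tax Slabs FY 2025-26 > Marginal Relief"
BODY = (
    "When income slightly exceeds a threshold, the additional tax "
    "cannot exceed the additional income above the threshold."
)
QUERY = "marginal relief new regime 87A rebate threshold"

# Breadcrumb kept in a metadata column: only the body text is indexed.
overlap_metadata_only = terms(QUERY) & terms(BODY)

# Breadcrumb prepended to the chunk text: its tokens are indexed as well.
overlap_in_text = terms(QUERY) & terms(f"{BREADCRUMB}\n\n{BODY}")
```

With the breadcrumb in the text, the query's "marginal" and "relief" have something to match; with metadata-only storage, "threshold" is the only shared term.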

The LangChain default goes the other way

LangChain's MarkdownHeaderTextSplitter defaults to strip_headers=True. It removes the heading line from chunk content and puts each level of the heading path into the chunk's metadata. That's a reasonable default for pipelines without hybrid retrieval, where the chunk text is purely for embedding and headers are referenced at answer time only. For a hybrid setup, it's suboptimal: the BM25 side never sees the section name, and neither does the embedding.

The fix is one keyword argument (strip_headers=False) and a post-step to format the metadata path as a breadcrumb prefix. Or roll your own splitter that does both in one pass, which is what we did. Either way, get the breadcrumb back into the chunk text before embedding.
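The post-step can be sketched without importing LangChain at all, since it only needs a chunk's text and its metadata dict. The function name is hypothetical, and the keys "h1", "h2", "h3" are assumptions: they stand for whatever labels you pass in headers_to_split_on.

```python
def prepend_breadcrumb(
    page_content: str,
    metadata: dict[str, str],
    header_keys: tuple[str, ...] = ("h1", "h2", "h3"),  # assumed metadata labels
) -> str:
    """Build 'H1 > H2 > H3' from the metadata and put it in front of the text."""
    path = [metadata[key] for key in header_keys if key in metadata]
    if not path:
        return page_content  # chunk sits above the first heading: nothing to add
    return " > ".join(path) + "\n\n" + page_content


# With LangChain this would be applied to each Document returned by split_text:
#   doc.page_content = prepend_breadcrumb(doc.page_content, doc.metadata)
content = "## Marginal Relief\nWhen income slightly exceeds a threshold, ..."
meta = {"h1": "Income Tax Slabs FY 2025-26", "h2": "Marginal Relief"}
chunk = prepend_breadcrumb(content, meta)
```

Because strip_headers=False leaves the heading line in the content, the section name appears twice in this chunk; whether to drop the inline heading line is a judgment call.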

Three things to steal

1. Contextual chunk headers are a known win; the deterministic flavour costs nothing. Anthropic's Contextual Retrieval is the LLM-generated cousin: same idea (prefix the chunk with context before embedding), different source of the prefix (a generated sentence vs. the document's own heading hierarchy). If your docs have headings, you don't need the LLM call.

2. Put the prefix in the chunk text, not in a metadata column. Storing the breadcrumb as text rides it into both the embedding and the BM25 tsvector. Storing it as metadata gets you only the citation pass-through. That's half the benefit.

3. Subtract the prefix from your chunk-size budget. Treat the breadcrumb as part of the chunk's content for token counting. Skip this and deeply nested sections produce chunks that overshoot your chunk size, and the embedding model's input limit if the two are close.

The user asking our tax bot "is marginal relief relevant at ₹15 lakh under the new regime?" gets the right chunk surfaced not because the embedding model is smart, but because the chunk carries its own topic anchor. The author wrote the heading. The pipeline kept it. The retriever read it. Nothing about that sequence required a model call we didn't already need.