DEV Community: kartikey rajvaidya

Free contextual chunk headers: heading-aware chunking for hybrid retrieval

kartikey rajvaidya — Mon, 01 Jun 2026 11:10:36 +0000

In September 2024, Anthropic published Contextual Retrieval. The trick: generate a one-sentence context per chunk with an LLM and prepend it to the chunk before embedding. On their hybrid vector + BM25 setup, the top-20 retrieval failure rate drops from 5.7% to 2.9% (a 49% reduction). Add a reranker and it falls to 1.9% (67% reduction). Their published cost is around $1.02 per million document tokens, with prompt caching applied.

If your source documents have a clean heading hierarchy, the document itself gives you a usable prefix for free. No LLM call per chunk. This post shows how that path looks in production, in the itrstats tax assistant, where the knowledge base lives in markdown and the retriever is hybrid pgvector + Postgres tsvector.

.md file              ┌─────────┐    ┌──────────┐    ┌─────────┐    ┌────────────┐
or scraped URL ──────▶│ cleaner │───▶│ splitter │───▶│ chunker │───▶│  embedder  │
                      └─────────┘    └──────────┘    └────┬────┘    └─────┬──────┘
                                                          │               │
                                                          ▼               ▼
                                                    chunk text       1536-d vec
                                                          │               │
                                                          └───────┬───────┘
                                                                  ▼
                                                          ┌──────────────┐
                                                          │  Postgres    │
                                                          │  + pgvector  │
                                                          │  + tsvector  │
                                                          └──────────────┘

Prepend the breadcrumb into the chunk text

Most chunking tutorials use LangChain's RecursiveCharacterTextSplitter on plain text extracted from the source. For unstructured prose, that's fine. For markdown documents with heading hierarchy, it throws away the orientation cue the author put there on purpose. Strip the headings, chunk the body, embed: the chunk says "the additional tax cannot exceed the additional income above the threshold" without saying which threshold, in which regime, for which year.

The fix is a few lines of Python. After splitting on headings, prepend the heading path back into each chunk's text content before embedding:

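result: list[str] = []
for section in sections:
    # Breadcrumb prefix built from the document title and the heading path.
    prefix = format_breadcrumb(doc_title, section.breadcrumb)
    prefix_tokens = count_tokens(prefix)
    # The prefix counts against the chunk budget; the body gets the remainder.
    body_budget = CHUNK_SIZE_TOKENS - prefix_tokens - BREADCRUMB_SAFETY_TOKENS

    body_chunks = chunk_text(
        section.body,
        chunk_size=body_budget,
        overlap=OVERLAP_TOKENS,
        min_chunk_size=1,
    )
    # Every body chunk of the section carries the same prefix.
    for body in body_chunks:
        result.append(f"{prefix}\n\n{body}")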

For a paragraph from our knowledge base, the output looks like this (with CHUNK_SIZE_TOKENS = 512, BREADCRUMB_SAFETY_TOKENS = 8):

Income Tax Slabs FY 2025-26 > Marginal Relief

Marginal relief applies at the 87A rebate boundary and at each surcharge
threshold. When income slightly exceeds a threshold, the additional tax
cannot exceed the additional income above the threshold.

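The breadcrumb steals tokens from the body budget; the body splitter gets the remainder. That's the only arithmetic that matters. Skip this subtraction and every chunk overshoots CHUNK_SIZE_TOKENS by the length of its breadcrumb, with deeply nested sections overshooting the most. If the budget is set close to the embedding model's input limit, that overshoot silently breaches it.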
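The budget arithmetic above can be sketched in isolation. This is a minimal sketch, not the production code: the bodies of format_breadcrumb and count_tokens are assumptions (the " > " separator is inferred from the sample output, and the whitespace tokenizer is a stand-in for whatever tokenizer matches your embedding model), and body_budget is a hypothetical helper name.

```python
CHUNK_SIZE_TOKENS = 512
BREADCRUMB_SAFETY_TOKENS = 8


def format_breadcrumb(doc_title: str, heading_path: list[str]) -> str:
    """Join the document title and heading path into one prefix line."""
    parts = [doc_title, *heading_path]
    return " > ".join(p.strip() for p in parts if p.strip())


def count_tokens(text: str) -> int:
    """Stand-in tokenizer (whitespace split). Swap in the embedder's tokenizer."""
    return len(text.split())


def body_budget(prefix: str) -> int:
    """Tokens left for the body once the prefix and safety margin are paid for."""
    budget = CHUNK_SIZE_TOKENS - count_tokens(prefix) - BREADCRUMB_SAFETY_TOKENS
    if budget <= 0:
        raise ValueError("breadcrumb leaves no room for body text")
    return budget


prefix = format_breadcrumb("Income Tax Slabs FY 2025-26", ["Marginal Relief"])
print(prefix)               # the prefix line that gets prepended to each chunk
print(body_budget(prefix))  # tokens the body splitter is allowed to use
```

Raising on a non-positive budget is a design choice in this sketch: a breadcrumb long enough to eat the whole chunk is better caught at ingest time than discovered later as empty chunks.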

Headings are free supervision. The author put them in to tell future readers what a paragraph is about. Throwing them away before chunking is throwing away that signal.

Why it works for hybrid retrieval

Our retriever runs vector ANN against text-embedding-3-small (1536 dims) and BM25-style keyword scoring over a Postgres tsvector (ts_rank, which ranks by term frequency rather than true BM25), merged via Reciprocal Rank Fusion (k=60). Because the breadcrumb lives in the chunk's text, both retrievers see it:

chunk text: "Income Tax Slabs FY 2025-26 > Marginal Relief\n\nMarginal..."
                                    │
                ┌───────────────────┴───────────────────┐
                ▼                                       ▼
       text-embedding-3-small                 tsvector + ts_rank
       (1536-d vector)                        (BM25 over tokens)
                │                                       │
                └────────────────┬──────────────────────┘
                                 ▼
                       Reciprocal Rank Fusion
                       Σ 1 / (60 + rank)

For a query like "marginal relief new regime 87A rebate threshold", the vector picks up the topic the breadcrumb names, and BM25 matches the tokens "Marginal Relief" in the breadcrumb, even for chunks whose body prose doesn't repeat that phrase. Both retrievers rank the breadcrumb-prepended chunk higher than the naive version.

Now consider the alternative. If you store the breadcrumb in a separate metadata column instead of inside the chunk text, you can still attach it as a citation at answer time. But the retriever scoring sees nothing of it. You paid the storage cost without earning the retrieval benefit back.

The LangChain default goes the other way

LangChain's MarkdownHeaderTextSplitter defaults to strip_headers=True. It removes the heading line from chunk content and puts the heading path into a metadata field. That's a reasonable default for pipelines without hybrid retrieval, where the chunk text is purely for embedding and headers are referenced at answer time only. For a hybrid setup, it's suboptimal: the BM25 side never sees the section name.

The fix is one keyword argument (strip_headers=False) and a post-step to format the metadata path as a breadcrumb prefix. Or roll your own splitter that does both in one pass, which is what we did. Either way, get the breadcrumb back into the chunk text before embedding.
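The post-step can be as small as the sketch below. It works on plain (content, metadata) pairs shaped like a header splitter's output rather than importing LangChain, so it stays self-contained. The metadata keys h1/h2/h3 are hypothetical (use whatever names you configured for your header levels), and with_breadcrumb is an illustrative name, not a library function.

```python
# Hypothetical metadata keys, outermost heading level first.
HEADER_KEYS = ("h1", "h2", "h3")


def with_breadcrumb(content: str, metadata: dict[str, str]) -> str:
    """Prepend the heading path stored in metadata to the chunk text."""
    path = [metadata[key] for key in HEADER_KEYS if metadata.get(key)]
    if not path:
        # No headings recorded for this chunk: leave the text untouched.
        return content
    return " > ".join(path) + "\n\n" + content


chunk = with_breadcrumb(
    "Marginal relief applies at the 87A rebate boundary.",
    {"h1": "Income Tax Slabs FY 2025-26", "h2": "Marginal Relief"},
)
print(chunk)
```

Run this before both the embedding call and the tsvector write, so the two indexes are built from the same text.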

Three things to steal

1. Contextual chunk headers are a known win; the deterministic flavour costs nothing. Anthropic's Contextual Retrieval is the LLM-generated cousin: same idea (prefix the chunk with context before embedding), different source of the prefix (a generated sentence vs. the document's own heading hierarchy). If your docs have headings, you don't need the LLM call.

2. Put the prefix in the chunk text, not in a metadata column. Storing the breadcrumb as text rides it into both the embedding and the BM25 tsvector. Storing it as metadata gets you only the citation pass-through. That's half the benefit.

3. Subtract the prefix from your chunk-size budget. Treat the breadcrumb as part of the chunk's content for token counting. Skip this and deeply nested sections produce chunks that overshoot the chunk-size budget and, when that budget sits near the model's limit, the embedding model's context window.

The user asking our tax bot "is marginal relief relevant at ₹15 lakh under the new regime?" gets the right chunk surfaced not because the embedding model is smart, but because the chunk carries its own topic anchor. The author wrote the heading. The pipeline kept it. The retriever read it. Nothing about that sequence required a model call we didn't already need.

How the itrstats tax assistant works: one query, every layer

kartikey rajvaidya — Mon, 18 May 2026 21:47:48 +0000

This post walks through how the itrstats (https://itrstats.in) tax assistant handles a single compound user question, end to end through every layer of the backend.

A user types this in:

What's tax on ₹15 lakh in new regime, what percentile am I in, and is the marginal relief rule relevant here?

The response that comes back:

In the new regime, tax on ₹15 lakh is ₹97,500 for FY 2025-26.

Under the old regime, the same income would be ₹2,57,400, so the new regime is cheaper by ₹1,59,900.

You are in roughly the top 17.42% of Indian taxpayers, with about 82.58% earning less.

Marginal relief is not relevant here because it applies around the ₹12 lakh 87A rebate threshold and surcharge thresholds, not at ₹15 lakh.

Behind that response, three tools fired, the model made two passes, and a composer made a final validate-and-strip pass before anything left the server. The whole thing finishes in about four seconds. The model did not compute a single number in the entire trace. It picked tools, narrated the result, and was kept on the rails by a Pydantic-enforced output schema. This post follows that one query through every layer of the system.

┌──────┐   POST /v1/assistant/query   ┌──────────┐   ┌─────────────┐
│ user │ ───────────────────────────▶ │  route   │──▶│   action    │
└──────┘                              └──────────┘   └──────┬──────┘
   ▲                                                        │
   │                                                        ▼
   │                                              ┌─────────────────┐
   │                                              │   agent loop    │
   │                                              │  (MAX_HOPS=3)   │
   │                                              └────────┬────────┘
   │                                                       │
   │                              ┌────────────────────────┼────────────────────────┐
   │                              ▼                        ▼                        ▼
   │                  ┌────────────────────┐   ┌─────────────────────┐   ┌──────────────────┐
   │                  │ compute_income_tax │   │ compute_income_     │   │ retrieve_        │
   │                  │ (pure Python)      │   │ percentile (Python) │   │ knowledge (RAG)  │
   │                  └─────────┬──────────┘   └──────────┬──────────┘   └────────┬─────────┘
   │                            │                         │                       │
   │                            └─────────────────────────┴───────────────────────┘
   │                                                      │
   │                                                      ▼
   │                                            ┌──────────────────┐
   │                                            │ agent (hop 2):   │
   │                                            │ narrate answer   │
   │                                            └────────┬─────────┘
   │                                                     ▼
   │                                            ┌──────────────────┐
   └──── public response  ◀──────────────────── │    composer      │
                                                └──────────────────┘

Hop zero: the request arrives

The POST lands at /v1/assistant/query. The route is intentionally thin: validate, rate-limit, call one action, return. Orchestration lives in the action layer, which does the boring-but-necessary plumbing before the agent ever runs:

Resolves a conversation ID, generating a new one if needed.
Persists the request row to Postgres before the model is called, so a downstream crash still leaves a record of what was asked.
Loads recent conversation turns and replays them into the agent's input as alternating user/assistant messages. There is no server-side thread state in the model.
Hands off to run_agent, then persists the final response on the way back out.

That's the route → action layer. The interesting work starts inside run_agent.

Hop one: the agent decides what to call

The agent seeds its input list with system prompt, replayed turns, and the current query:

def _build_input_messages(ctx: AgentContext) -> list[dict[str, Any]]:
    messages = [{"role": "system", "content": build_knowledge_system_prompt()}]
    for turn in ctx.conversation_turns:
        messages.append({"role": "user", "content": turn.question})
        messages.append({"role": "assistant", "content": turn.answer})
    messages.append({"role": "user", "content": ctx.assistant_query_data.question.strip()})
    return messages

The model call uses OpenAI's Responses API with three knobs that matter:

response = await client.responses.parse(
    model=AssistantConfig.OPENAI_ANSWER_MODEL,
    input=input_list,
    tools=tools,                         # JSON Schema for each registered tool
    tool_choice="auto",
    parallel_tool_calls=True,
    text_format=AssistantAnswerSchema,   # Pydantic-enforced output
)

tools= is the JSON-Schema list of registered tools the model can call. parallel_tool_calls=True lets the model request several tool calls in one response instead of one at a time. text_format=AssistantAnswerSchema constrains the final answer to a Pydantic schema, so once the model stops calling tools it cannot return free-form text.

For our ₹15 lakh query, hop 1's output is three function calls in one response:

hop=1 model output:
  function_call: compute_income_tax
    arguments: { "gross_income": 1500000, "fy": "2025-26",
                 "regime": "both", "age_category": "general",
                 "deductions": { all six keys, all zero } }
  function_call: compute_income_percentile_forward
    arguments: { "income": 1500000 }
  function_call: retrieve_knowledge
    arguments: { "query": "marginal relief new regime 87A rebate threshold",
                 "decompose": false }

No final answer this hop. Just three calls. The loop body executes them and re-runs the model:

for hop in range(MAX_HOPS):
    response = await client.responses.parse(...)
    function_calls = _detect_function_calls(response)
    if not function_calls:
        return response.output_parsed   # final answer

    input_list.extend(_serialize_response_item(item) for item in response.output)
    for call in function_calls:
        output_str = await _execute_tool_call(call, ctx)
        input_list.append({
            "type": "function_call_output",
            "call_id": call.call_id,
            "output": output_str,
        })

MAX_HOPS = 3. Past the third hop, a forced final-pass message strips the tool registry and demands an answer. The loop cannot run away.

   seed messages: [system_prompt, ...history, current_query]
                                │
                                ▼
   ┌──────────[Responses API call with tools]────────────┐
   │                            │                         │
   ▼                            ▼                         │
function_calls?            final answer?                  │
   │                            │                         │
   │ yes:                       │ yes: return             │
   │  execute tools,            │                         │
   │  append outputs                                      │
   │                                                      │
   └──────── hop++ < MAX_HOPS=3 ──────────────────────────┘
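The hop cap and the forced final pass can be sketched with a stub in place of the real model. Everything here is an assumption made for illustration: run_loop, the dict-shaped replies, the call_model signature, and the wording of FORCED_FINAL_MESSAGE are hypothetical, and the sketch skips re-appending the model's own output items that the real loop does.

```python
from typing import Callable

MAX_HOPS = 3

# Hypothetical wording for the forced final-pass message.
FORCED_FINAL_MESSAGE = {
    "role": "user",
    "content": "Tool budget exhausted. Answer now from the results above.",
}


def run_loop(
    call_model: Callable[[list, bool], dict],
    run_tool: Callable[[dict], str],
    messages: list,
) -> str:
    """Run up to MAX_HOPS tool hops, then force an answer with tools withheld."""
    for _ in range(MAX_HOPS):
        reply = call_model(messages, True)  # tools enabled
        calls = reply.get("function_calls", [])
        if not calls:
            return reply["answer"]
        for call in calls:
            messages.append({
                "type": "function_call_output",
                "call_id": call["call_id"],
                "output": run_tool(call),
            })
    # Hop budget spent: one last call with the tool registry stripped.
    messages.append(FORCED_FINAL_MESSAGE)
    return call_model(messages, False)["answer"]


def greedy_model(messages: list, tools_enabled: bool) -> dict:
    """Stub model that asks for a tool whenever it is allowed to."""
    if tools_enabled:
        return {"function_calls": [{"call_id": f"c{len(messages)}", "name": "noop"}]}
    return {"answer": "final"}


calls_made: list[str] = []


def record_tool(call: dict) -> str:
    calls_made.append(call["call_id"])
    return "{}"


history = [{"role": "user", "content": "hi"}]
result = run_loop(greedy_model, record_tool, history)
print(result, len(calls_made))
```

Even against a stub that never stops requesting tools, the loop makes exactly MAX_HOPS tool rounds and then returns an answer.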

The LLM is the dispatcher, not the calculator. A model that hallucinates tax math is a liability. A model that picks the right tool and forwards the user's numbers is a feature.

The three tools fire

Each registered tool is a ToolSpec with four fields: a name, a description, a JSON Schema for its parameters, and an async handler. The handler takes an AgentContext (a per-turn struct carrying request metadata, prior turns, and a collected_chunks field that retrieval tools deposit into) plus the model's arguments, and returns a dict. The model sees the first three fields; the handler runs server-side.

For the query we're tracing, the model exercises three tools: compute_income_tax, compute_income_percentile (called as compute_income_percentile_forward in the trace above), and retrieve_knowledge. The first two are deterministic Python; the third is stochastic retrieval. That split, between tools whose output you can be confident in and tools whose output you cannot, is the architecturally interesting part of the registry, and we follow each in turn.
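A registry entry with those four fields might look like the sketch below. The field names follow the description above, but the class bodies, the to_model_schema helper, and the echo tool are assumptions made for illustration, not the project's actual definitions.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Any, Awaitable, Callable


@dataclass
class AgentContext:
    """Per-turn state shared by every tool call (fields are illustrative)."""
    request_id: str
    conversation_turns: list = field(default_factory=list)
    collected_chunks: list = field(default_factory=list)


@dataclass(frozen=True)
class ToolSpec:
    name: str
    description: str
    parameters: dict[str, Any]  # JSON Schema for the arguments
    handler: Callable[[AgentContext, dict[str, Any]], Awaitable[dict[str, Any]]]

    def to_model_schema(self) -> dict[str, Any]:
        """Expose only name, description and parameters; the handler stays server-side."""
        return {
            "type": "function",
            "name": self.name,
            "description": self.description,
            "parameters": self.parameters,
        }


async def _echo(ctx: AgentContext, args: dict[str, Any]) -> dict[str, Any]:
    # A retrieval-style tool would deposit its chunks on the context like this.
    ctx.collected_chunks.append(args["text"])
    return {"echo": args["text"]}


ECHO = ToolSpec(
    name="echo",
    description="Hypothetical demo tool that returns its input.",
    parameters={
        "type": "object",
        "properties": {"text": {"type": "string"}},
        "required": ["text"],
    },
    handler=_echo,
)

ctx = AgentContext(request_id="req-1")
result = asyncio.run(ECHO.handler(ctx, {"text": "hello"}))
print(result, ctx.collected_chunks)
```

Keeping the handler out of the schema sent to the model puts the split between what the model sees and what runs server-side into the type itself.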

compute_income_tax

The tool wrapper at core/orchestrator/tools/compute_income_tax.py exists for one reason, and its description states it plainly:

ALWAYS call this tool for any tax-computation request. Do NOT compute slab tax yourself. … Copy the returned formatted_summary VERBATIM into your answer. Do not paraphrase numbers.

The calculator computes; the model transcribes.

Two layers sit behind that contract. The wrapper validates inputs and forwards to a pure-Python calculator at core/tools/calculators/income_tax.py: slab tables for FY 2025-26, 87A rebate, surcharge bands, 4% cess, two flavours of marginal relief (rebate-boundary and surcharge-threshold).

For our ₹15 lakh query, the model emitted:

INPUT  → compute_income_tax({
  "gross_income": 1500000,
  "fy": "2025-26",
  "age_category": "general",
  "regime": "both",
  "deductions": { "sec_80c": 0, "sec_80d": 0, "hra_exempt": 0,
                  "home_loan_interest": 0, "nps_80ccd_1b": 0,
                  "nps_80ccd_2_employer": 0 }
})

OUTPUT → {
  "new": { "taxable_income": 1425000, "base_tax": 93750,
           "cess": 3750, "total_tax": 97500,
           "effective_rate_pct": 6.5 },
  "old": { "total_tax": 257400 },
  "cheaper_regime": "new",
  "savings_under_cheaper": 159900,
  "formatted_summary": "Under new regime, tax on ₹15,00,000 is..."
}

Two things to notice. The model passed gross_income: 1500000 (a literal integer in rupees, not the string "15 LPA") because the tool description spells out the conversion. The tool returned 97500 after applying the default ₹75,000 standard deduction (1500000 - 75000 = 1425000 taxable). The model did neither computation. Both happened inside the calculator.
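The arithmetic behind those numbers is short enough to check by hand. Below is a minimal sketch of the new-regime slab computation for this one case. It is not the calculator at core/tools/calculators/income_tax.py: the function and constant names are made up here, and it leaves out the 87A rebate, surcharge, and marginal relief, none of which applies at this income.

```python
# FY 2025-26 new-regime slabs as (upper bound in rupees, rate in percent).
# None marks the open-ended top slab.
NEW_REGIME_SLABS = [
    (400_000, 0),
    (800_000, 5),
    (1_200_000, 10),
    (1_600_000, 15),
    (2_000_000, 20),
    (2_400_000, 25),
    (None, 30),
]
STANDARD_DEDUCTION_NEW = 75_000
CESS_PCT = 4


def slab_tax(taxable: int) -> int:
    """Sum the tax owed in each slab that the taxable income reaches."""
    tax = 0
    lower = 0
    for upper, rate_pct in NEW_REGIME_SLABS:
        if taxable <= lower:
            break
        top = taxable if upper is None else min(taxable, upper)
        tax += (top - lower) * rate_pct // 100
        if upper is None:
            break
        lower = upper
    return tax


gross = 1_500_000
taxable = gross - STANDARD_DEDUCTION_NEW
base_tax = slab_tax(taxable)
cess = base_tax * CESS_PCT // 100
total_tax = base_tax + cess
effective_rate_pct = round(total_tax * 100 / gross, 2)
print(taxable, base_tax, cess, total_tax, effective_rate_pct)
```

The slab contributions are 0 + 20,000 + 40,000 + 33,750 = 93,750, and adding 4% cess gives the 97,500 in the tool output.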

The calculator returns a typed IncomeTaxRegimeBreakdown:

class IncomeTaxRegimeBreakdown(BaseModel):
    regime: Literal["old", "new"]
    standard_deduction: int
    taxable_income: int
    base_tax: int
    rebate_87a: int
    surcharge: int
    cess: int
    marginal_relief_rebate: int
    marginal_relief_surcharge: int
    total_tax: int
    effective_rate_pct: float

The canonical case is pinned in a golden test:

def test_15L_new_regime_no_deductions():
    r = compute_income_tax(gross_income=1500000, regime="new")
    assert r.new.total_tax == 97500

Moving arithmetic out of the model is only useful if the calculator stays correct over time. These tests run in under a second and each pins a canonical case to a number.

compute_income_percentile

The percentile tool follows the same shape (a thin wrapper around a pure-Python function) and reads from a real ITR filing-statistics dataset. For our query, the model called it with the user's income:

INPUT  → compute_income_percentile_forward({ "income": 1500000 })

OUTPUT → {
  "top_percent": 17.42,
  "percentile_from_bottom": 82.58,
  "financial_year": "2024-25",
  "formatted_summary": "₹15 lakh puts you in the top 17.42%..."
}

The 17.42% and 82.58% in the user-visible answer are these two fields, copied through.

retrieve_knowledge

The third tool that fired is the only one that isn't a pure calculator. retrieve_knowledge wraps retrieve_relevant_chunks() in core/services/retrieval.py, which runs two retrievals in parallel and merges them:

query ──┬──▶ embed (text-embedding-3-small, 1536d) ──▶ pgvector ANN ─┐
        │                                                             ├─▶ RRF ─▶ top-K
        └──▶ plainto_tsquery('english', ...)  ─▶ tsvector / ts_rank ──┘
                                                          1 / (60 + rank)

Vector search nails semantic paraphrase ("set off" vs. "offset"). BM25 over the Postgres tsvector nails rare exact identifiers the embedding flattens ("Section 115BBH", "194A"). Reciprocal Rank Fusion lets both win without either dominating: each chunk scores Σ 1 / (60 + rank) across the two lists, top-K wins.
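The fusion step fits in a few lines. This is a minimal sketch of Reciprocal Rank Fusion over two ranked lists of chunk IDs. The IDs and the function name are hypothetical, and breaking ties alphabetically by ID is a choice made here only to keep the output deterministic.

```python
from collections import defaultdict

RRF_K = 60


def reciprocal_rank_fusion(
    rankings: list[list[str]], k: int = RRF_K, top_k: int = 5
) -> list[tuple[str, float]]:
    """Score each chunk as the sum of 1 / (k + rank) over every list it appears in."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by chunk ID.
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ordered[:top_k]


vector_hits = ["slabs#marginal-relief", "rebate#87a", "surcharge#bands"]
keyword_hits = ["slabs#marginal-relief", "surcharge#bands", "cess#overview"]

fused = reciprocal_rank_fusion([vector_hits, keyword_hits], top_k=3)
for chunk_id, score in fused:
    print(f"{chunk_id}  {score:.5f}")
```

A chunk that ranks first in both lists scores 2/61. A chunk that appears in only one list needs a much better rank there to beat one that shows up, even mid-list, in both.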

Chunking is heading-aware: every chunk has its breadcrumb prepended before embedding, so the heading context rides into both the embedding and the tsvector. For our marginal-relief leg, the chunk that ranked first came from the slabs document, retrieved with its breadcrumb intact:

Income Tax Slabs FY 2025-26 > Marginal Relief

Marginal relief applies at the 87A rebate boundary and at each surcharge
threshold. When income slightly exceeds a threshold, the additional tax
cannot exceed the additional income above the threshold.

The breadcrumb (Income Tax Slabs FY 2025-26 > Marginal Relief) is the difference between a paragraph that ranks first and one that sits unranked in embedding space. It is also what gave the model the "₹12 lakh 87A rebate threshold and surcharge thresholds" framing it used in its final answer.

Hop two: the agent narrates the answer

Tool outputs are JSON-serialized and appended to the message list as function_call_output items, each tagged with the call_id of the function call it answers. The loop re-runs client.responses.parse(...) against the enriched message list. This time, no tool calls. The model emits a final answer constrained to the AssistantAnswerSchema: an answer string, a reasoning_summary, a confidence value, a next_actions list, an optional disclaimer, and an internal topic field that the composer uses to dispatch.

For our query, hop 2's parsed output:

hop=2 final output (parsed against AssistantAnswerSchema):
  confidence: "high"
  answer: "- In the new regime, tax on **₹15 lakh** is **₹97,500** for
           FY 2025-26.\n- Under the old regime..."
  reasoning_summary: "- Used the income-tax calculator for FY 2025-26
           and the percentile tool for ₹15 lakh income..."
  next_actions: ["Compare old vs new for my deductions",
                 "What income is top 10%?",
                 "How is 87A rebate applied?"]
  disclaimer: null
  topic: <internal value that maps to the answered branch>

Every number in answer traces back to a tool output: 97500, 257400, 159900 from compute_income_tax; 17.42 and 82.58 from compute_income_percentile_forward; the "₹12 lakh 87A rebate threshold and surcharge thresholds" framing from a retrieved chunk. The model wrote the prose; it did not invent the figures. disclaimer: null is fine. The composer injects it based on which tools fired, not on what the model thinks. And if Pydantic parsing fails (e.g., next_actions has four items), the API errors and the loop catches it. The model never delivers an unstructured response downstream.

The composer ships it

The model's AssistantAnswerSchema is internal. The public response (what the user's browser ultimately receives) is AssistantResponseSchema: stricter rules, a coarser four-value status enum, and no topic field at all.

The translation happens in _compose_response:

def _compose_response(
    ctx: AgentContext,
    answer: AssistantAnswerSchema,
) -> AssistantResponseSchema:
    if answer.topic == "refused":
        return build_orchestrator_refused_response(...)
    if answer.topic == "out_of_scope":
        return build_orchestrator_out_of_scope_response(...)
    if answer.topic == "needs_clarification":
        return build_orchestrator_needs_clarification_response(...)
    return build_orchestrator_answered_response(
        request_id=ctx.request_id,
        conversation_id=ctx.conversation_id,
        answer=answer,
        retrieved_chunks=ctx.collected_chunks,
        tools_called=ctx.tools_called,
    )

Four branches, dispatched on topic. For refused and out_of_scope, the composer discards the model's answer text and substitutes hardcoded copy. The model is welcome to refuse, but it is not welcome to write the refusal. Hardcoded text means legal review happens once, not every release. For needs_clarification, the model's text passes through (the clarifying question must be context-specific) but citations and disclaimer are forced empty.

For our query, the topic falls through to the answered branch. The composer passes the model's prose through, strips the internal topic field, looks at tools_called to inject the right citation (calculator answers get the income-tax-calculator CTA), and injects the FY-scoped CA disclaimer. The public schema then enforces hard constraints at the boundary: next_actions non-blank, max three. The model already agreed to this shape in the structured output; the composer re-checks at the seam anyway.

        topic enum (model output)
                  │
   ┌──────────────┼───────────────┬───────────────┐
   ▼              ▼               ▼               ▼
answered     needs_clarif.    out_of_scope     refused
   │              │               │               │
   │              │               ▼               ▼
   │              │       (discard model      (discard model
   │              │        text, hardcode      text, hardcode
   │              │        answer)             answer)
   │              │               │               │
   │              ▼               │               │
   │     (force citations=[],     │               │
   │      disclaimer=null)        │               │
   │              │               │               │
   ▼              ▼               ▼               ▼
        strip internal field (topic)
                       │
                       ▼
              AssistantResponseSchema  ──▶  public JSON
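The four branches and the boundary re-check can be sketched with standard-library dataclasses standing in for the two Pydantic schemas. Everything named here is an assumption for illustration: the hardcoded copy strings, the status values (the post only says the public enum has four values), the citation ID, and the disclaimer text.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical hardcoded copy; the real strings are whatever legal signed off on.
REFUSED_COPY = "Sorry, I can't help with that request."
OUT_OF_SCOPE_COPY = "I can only help with Indian income-tax questions."
CA_DISCLAIMER = "Figures are for FY 2025-26. Confirm with a CA before filing."
MAX_NEXT_ACTIONS = 3


@dataclass
class ModelAnswer:
    """Stand-in for the internal AssistantAnswerSchema."""
    answer: str
    topic: str
    next_actions: list[str]


@dataclass
class PublicResponse:
    """Stand-in for the public AssistantResponseSchema: no topic field."""
    status: str
    answer: str
    next_actions: list[str]
    citations: list[str]
    disclaimer: Optional[str]

    def __post_init__(self) -> None:
        # Re-check the hard constraints at the seam, whatever the model promised.
        if len(self.next_actions) > MAX_NEXT_ACTIONS:
            raise ValueError("at most three next_actions")
        if any(not action.strip() for action in self.next_actions):
            raise ValueError("next_actions must be non-blank")


def compose(model: ModelAnswer, tools_called: list[str]) -> PublicResponse:
    if model.topic == "refused":
        return PublicResponse("refused", REFUSED_COPY, [], [], None)
    if model.topic == "out_of_scope":
        return PublicResponse("out_of_scope", OUT_OF_SCOPE_COPY, [], [], None)
    if model.topic == "needs_clarification":
        # Model text passes through; citations and disclaimer are forced empty.
        return PublicResponse(
            "needs_clarification", model.answer, model.next_actions, [], None
        )
    citations = ["income-tax-calculator"] if "compute_income_tax" in tools_called else []
    return PublicResponse(
        "answered", model.answer, model.next_actions, citations, CA_DISCLAIMER
    )


refused = compose(ModelAnswer("model-written refusal", "refused", []), [])
answered = compose(
    ModelAnswer("Tax is ₹97,500.", "tax", ["What income is top 10%?"]),
    ["compute_income_tax"],
)
print(refused.answer)
print(answered.citations, answered.disclaimer)
```

Putting the constraint check in the public type's constructor means no branch can build a response that skips it.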

Structured output is a guardrail, not a convenience. If any model output can leak straight to users, you've outsourced your domain correctness to a stochastic process.

How we know it's not lying

The whole post argues that the bot doesn't compute its own numbers. That claim needs to be verified, not just asserted.

Two layers do the work.

Calculators are pinned with golden tests:

def test_5L_new_regime_full_rebate():
    r = compute_income_tax(gross_income=500000, regime="new")
    assert r.new.total_tax == 0
    assert r.new.taxable_income == 425000

def test_12L_new_regime_at_rebate_threshold():
    r = compute_income_tax(gross_income=1200000, regime="new")
    assert r.new.total_tax == 0

def test_marginal_relief_above_12L_new_regime():
    """Taxable just over 12L → total tax capped at the excess income."""
    ...

These run in under a second. Each names a canonical scenario and pins it to a number. The calculator is the source of truth for the assistant; the tests are the guardrail that keeps the calculator honest as slabs and thresholds change year to year.

Retrieval is checked by a small golden-query eval. tests/retrieval/golden_queries.jsonl carries cases like:

{
  "id": "vda_loss_setoff_salary",
  "query": "I lost 50000 on Bitcoin, can I set off against my salary income?",
  "expected_text_substr": [
    "loss from transfer of virtual digital asset",
    "vda losses cannot be set off",
    "115bbh"
  ],
  "failure_mode": "TRUE_FAIL"
}

scripts/retrieval_eval.py runs each case through retrieve_relevant_chunks() and reports two numbers:

Hit@K: did any chunk in the top K contain at least one of the expected substrings (case-insensitive).
MRR: mean reciprocal rank of the first hit, averaged across cases.

Each case carries multiple substring variants; a chunk counts as a hit if any one matches. That trades precision for resilience: a single brittle needle would mean a rephrasing of the source paragraph fails the test for no real reason.
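Both metrics can be computed as in the sketch below. This is not scripts/retrieval_eval.py: the function names, the fake retriever, and the choice to score a miss as zero in the MRR average are assumptions made here.

```python
from typing import Callable, Optional


def first_hit_rank(chunks: list[str], needles: list[str], k: int) -> Optional[int]:
    """1-based rank of the first top-k chunk containing any needle, else None."""
    lowered = [needle.lower() for needle in needles]
    for rank, chunk in enumerate(chunks[:k], start=1):
        text = chunk.lower()
        if any(needle in text for needle in lowered):
            return rank
    return None


def evaluate(
    cases: list[dict], retrieve: Callable[[str], list[str]], k: int = 5
) -> dict[str, float]:
    ranks = [
        first_hit_rank(retrieve(case["query"]), case["expected_text_substr"], k)
        for case in cases
    ]
    hit_at_k = sum(rank is not None for rank in ranks) / len(ranks)
    # A miss contributes 0 to the reciprocal-rank average.
    mrr = sum(1.0 / rank for rank in ranks if rank is not None) / len(ranks)
    return {"hit_at_k": hit_at_k, "mrr": mrr}


# Fake retriever over canned results, standing in for retrieve_relevant_chunks().
CANNED = {
    "vda loss": ["Crypto is taxed at 30%.", "VDA losses cannot be set off."],
    "hra rule": ["Unrelated paragraph.", "Another unrelated paragraph."],
}
cases = [
    {"query": "vda loss", "expected_text_substr": ["vda losses cannot be set off", "115bbh"]},
    {"query": "hra rule", "expected_text_substr": ["house rent allowance"]},
]

report = evaluate(cases, lambda query: CANNED[query], k=5)
print(report)
```

One hit at rank 2 and one miss give Hit@K = 0.5 and MRR = 0.25. The two numbers diverge as soon as hits slide down the list, which is the ranking shift this eval exists to catch.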

This set is intentionally small. It's a smoke detector, not a quality measurement. But cheap evals catch regressions unit tests cannot. The unit tests verify code; the eval verifies behaviour. They run in CI and they have caught real shifts in chunk ranking that nothing else would have.

What this shape buys you

Three transferable ideas, stated plainly.

1. The model is a dispatcher, not the source of truth. Anything that has to be deterministic (numbers, business rules, contracts) belongs in a tool. The model picks tools and narrates results. The line between "this can be wrong" and "this must be right" is the boundary between the model and the calculators, and a lot of decisions follow from it.

2. Structured output is a guardrail. Pydantic-enforced output schemas turn the model's free-text into a typed value the rest of your application can rely on. The composer is then where domain rules live, not the prompt. Telling the model what to do is brittle; making it impossible for the model to return the wrong shape is not.

3. A small golden eval is the cheapest production-confidence tool you have. A small set of retrieval goldens catches regressions a unit-test suite cannot see. A small set of calculator goldens catches what a retrieval eval cannot. Neither is a substitute for the other, and both are cheap enough to never not have.

The user who asked our anchor query got a correct answer not because the model is smart, but because the architecture refuses to let the model be wrong about anything that has a right answer. The calculator wrote the numbers. The retrieved chunk wrote the framing. The composer wrote the contract. The model wrote the prose. Each layer did one job, and only the prose layer was allowed to be creative.