This post walks through how the itrstats (https://itrstats.in) tax assistant handles a single compound user question, end to end through every layer of the backend.
A user types this in:
What's tax on ₹15 lakh in new regime, what percentile am I in, and is the marginal relief rule relevant here?
The response that comes back:
- In the new regime, tax on ₹15 lakh is ₹97,500 for FY 2025-26.
- Under the old regime, the same income would be ₹2,57,400, so the new regime is cheaper by ₹1,59,900.
- You are in roughly the top 17.42% of Indian taxpayers, with about 82.58% earning less.
- Marginal relief is not relevant here because it applies around the ₹12 lakh 87A rebate threshold and surcharge thresholds, not at ₹15 lakh.
Behind that response, three tools fired, the model made two passes, and a composer did a final validation strip before anything left the server. The whole thing finishes in about four seconds. The model did not compute a single number in the entire trace. It picked tools, narrated the result, and was kept on the rails by a Pydantic-enforced output schema. This post follows that one query through every layer of the system.
┌──────┐ POST /v1/assistant/query  ┌──────────┐   ┌─────────────┐
│ user │ ────────────────────────▶ │  route   │──▶│   action    │
└──────┘                           └──────────┘   └──────┬──────┘
   ▲                                                     │
   │                                                     ▼
   │                                            ┌─────────────────┐
   │                                            │   agent loop    │
   │                                            │  (MAX_HOPS=3)   │
   │                                            └────────┬────────┘
   │                                                     │
   │                             ┌───────────────────────┼─────────────────────┐
   │                             ▼                       ▼                     ▼
   │                   ┌────────────────────┐ ┌─────────────────────┐ ┌──────────────────┐
   │                   │ compute_income_tax │ │ compute_income_     │ │ retrieve_        │
   │                   │ (pure Python)      │ │ percentile (Python) │ │ knowledge (RAG)  │
   │                   └─────────┬──────────┘ └──────────┬──────────┘ └────────┬─────────┘
   │                             │                       │                     │
   │                             └───────────────────────┴─────────────────────┘
   │                                                     │
   │                                                     ▼
   │                                            ┌──────────────────┐
   │                                            │ agent (hop 2):   │
   │                                            │ narrate answer   │
   │                                            └────────┬─────────┘
   │                                                     ▼
   │                                            ┌──────────────────┐
   └──── public response ◀───────────────────── │     composer     │
                                                └──────────────────┘
Hop zero: the request arrives
The POST lands at /v1/assistant/query. The route is intentionally thin: validate, rate-limit, call one action, return. Orchestration lives in the action layer, which does the boring-but-necessary plumbing before the agent ever runs:
- Resolves a conversation ID, generating a new one if needed.
- Persists the request row to Postgres before the model is called, so a downstream crash still leaves a record of what was asked.
- Loads recent conversation turns and replays them into the agent's input as alternating user/assistant messages. There is no server-side thread state in the model.
- Hands off to run_agent, then persists the final response on the way back out.
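A minimal sketch of that plumbing, assuming a synchronous handler and an in-memory SQLite table standing in for Postgres. The names handle_query and make_db, the requests table, and its columns are hypothetical and not the production code; the point is only the ordering, where the request row is committed before the agent runs.

```python
import sqlite3
import uuid


def make_db():
    """In-memory stand-in for the Postgres table (schema is hypothetical)."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE requests ("
        "id TEXT PRIMARY KEY, conversation_id TEXT, question TEXT, answer TEXT)"
    )
    return db


def handle_query(db, question, run_agent, conversation_id=None):
    """Persist the request, replay prior turns, call the agent, persist the answer."""
    conversation_id = conversation_id or str(uuid.uuid4())
    request_id = str(uuid.uuid4())

    # Commit the request row first: a crash inside run_agent still leaves a record.
    db.execute(
        "INSERT INTO requests (id, conversation_id, question, answer) "
        "VALUES (?, ?, ?, NULL)",
        (request_id, conversation_id, question),
    )
    db.commit()

    # Completed turns only; they are replayed into the agent's input every time.
    turns = db.execute(
        "SELECT question, answer FROM requests "
        "WHERE conversation_id = ? AND answer IS NOT NULL ORDER BY rowid",
        (conversation_id,),
    ).fetchall()

    answer = run_agent(question, turns)

    db.execute("UPDATE requests SET answer = ? WHERE id = ?", (answer, request_id))
    db.commit()
    return conversation_id, answer
```

If run_agent raises, the row with a NULL answer is still there to inspect afterwards.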
That's the route → action layer. The interesting work starts inside run_agent.
Hop one: the agent decides what to call
The agent seeds its input list with system prompt, replayed turns, and the current query:
from typing import Any


def _build_input_messages(ctx: AgentContext) -> list[dict[str, Any]]:
    messages = [{"role": "system", "content": build_knowledge_system_prompt()}]
    # Replay prior turns as alternating user/assistant messages.
    for turn in ctx.conversation_turns:
        messages.append({"role": "user", "content": turn.question})
        messages.append({"role": "assistant", "content": turn.answer})
    messages.append({"role": "user", "content": ctx.assistant_query_data.question.strip()})
    return messages
The model call uses OpenAI's Responses API with three knobs that matter:
response = await client.responses.parse(
    model=AssistantConfig.OPENAI_ANSWER_MODEL,
    input=input_list,
    tools=tools,                        # JSON Schema for each registered tool
    tool_choice="auto",
    parallel_tool_calls=True,
    text_format=AssistantAnswerSchema,  # Pydantic-enforced output
)
tools= is the JSON-Schema list of registered tools the model can call. parallel_tool_calls=True lets the model request several tool calls in one response instead of one at a time. text_format=AssistantAnswerSchema constrains the final answer to a Pydantic schema, so once the model stops calling tools it cannot return free-form text.
For our ₹15 lakh query, hop 1's output is three function calls in one response:
hop=1 model output:
  function_call: compute_income_tax
    arguments: { "gross_income": 1500000, "fy": "2025-26",
                 "regime": "both", "age_category": "general",
                 "deductions": { all six keys, all zero } }
  function_call: compute_income_percentile_forward
    arguments: { "income": 1500000 }
  function_call: retrieve_knowledge
    arguments: { "query": "marginal relief new regime 87A rebate threshold",
                 "decompose": false }
No final answer this hop. Just three calls. The loop body executes them and re-runs the model:
for hop in range(MAX_HOPS):
    response = await client.responses.parse(...)
    function_calls = _detect_function_calls(response)
    if not function_calls:
        return response.output_parsed  # final answer
    input_list.extend(_serialize_response_item(item) for item in response.output)
    for call in function_calls:
        output_str = await _execute_tool_call(call, ctx)
        input_list.append({
            "type": "function_call_output",
            "call_id": call.call_id,
            "output": output_str,
        })
MAX_HOPS = 3. Past the third hop, a forced final-pass message strips the tool registry and demands an answer. The loop cannot run away.
 seed messages: [system_prompt, ...history, current_query]
                             │
                             ▼
  ┌──────────[Responses API call with tools]────────────┐
  │                          │                          │
  ▼                          ▼                          │
function_calls?        final answer?                    │
  │                          │                          │
  │ yes:                     │ yes: return              │
  │ execute tools,           │                          │
  │ append outputs                                      │
  │                                                     │
  └──────── hop++ < MAX_HOPS=3 ─────────────────────────┘
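The bounded loop and its forced final pass can be sketched roughly as follows. This is an illustration under assumptions: call_model and execute_tool are hypothetical stand-ins for the Responses API call and the tool dispatcher, the dict shapes are simplified, and the wording and role of the final-pass message are invented.

```python
import asyncio
import json

MAX_HOPS = 3


async def run_loop(call_model, execute_tool, input_list, tools):
    """Bounded tool loop. After MAX_HOPS tool rounds, force a tool-free answer."""
    for _ in range(MAX_HOPS):
        reply = await call_model(input_list, tools)
        calls = reply.get("function_calls", [])
        if not calls:
            return reply["answer"]  # the model chose to answer
        for call in calls:
            # Echo the call back, then append its output under the same call_id.
            input_list.append({"type": "function_call", **call})
            output = await execute_tool(call)
            input_list.append({
                "type": "function_call_output",
                "call_id": call["call_id"],
                "output": json.dumps(output),
            })

    # Hop budget exhausted: one last call with an empty tool list, so the
    # only thing the model can do is answer. (Message text is hypothetical.)
    input_list.append({"role": "user", "content": "Answer now using the tool results above."})
    final = await call_model(input_list, [])
    return final["answer"]
```

Even a model that asks for tools on every hop gets exactly MAX_HOPS tool rounds plus one tool-free call, so the number of model calls per request is capped at four.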
The LLM is the dispatcher, not the calculator. A model that hallucinates tax math is a liability. A model that picks the right tool and forwards the user's numbers is a feature.
The three tools fire
Each registered tool is a ToolSpec with four fields: a name, a description, a JSON Schema for its parameters, and an async handler that takes an AgentContext (a per-turn struct carrying request metadata, prior turns, and a collected_chunks field that retrieval tools deposit into) plus the model's arguments and returns a dict. The model sees the first three; the handler runs server-side. For the query we're tracing, the model exercises three tools: compute_income_tax, compute_income_percentile, and retrieve_knowledge. The first two are deterministic Python; the third is stochastic retrieval. That split, between tools whose output you can be confident in and tools whose output you cannot, is the architecturally interesting part of the registry, and we follow each in turn.
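As a rough illustration of that shape, here is a sketch of a ToolSpec and a tiny registry. The four fields follow the description above, but to_model_schema, REGISTRY, dispatch, and the echo tool are hypothetical, and the flat function-tool layout is an assumption about what is sent to the Responses API.

```python
import asyncio
from dataclasses import dataclass
from typing import Any, Awaitable, Callable


@dataclass(frozen=True)
class ToolSpec:
    name: str
    description: str
    parameters: dict[str, Any]  # JSON Schema for the arguments
    handler: Callable[[Any, dict[str, Any]], Awaitable[dict[str, Any]]]

    def to_model_schema(self) -> dict[str, Any]:
        """What the model sees: everything except the server-side handler."""
        return {
            "type": "function",
            "name": self.name,
            "description": self.description,
            "parameters": self.parameters,
        }


async def _echo_handler(ctx: Any, args: dict[str, Any]) -> dict[str, Any]:
    # A real handler would validate args and call a calculator or retriever.
    return {"echo": args["text"]}


ECHO = ToolSpec(
    name="echo",
    description="Return the given text unchanged (illustrative only).",
    parameters={
        "type": "object",
        "properties": {"text": {"type": "string"}},
        "required": ["text"],
    },
    handler=_echo_handler,
)

REGISTRY = {ECHO.name: ECHO}


async def dispatch(name: str, ctx: Any, args: dict[str, Any]) -> dict[str, Any]:
    """Look the tool up by the name the model emitted and run its handler."""
    return await REGISTRY[name].handler(ctx, args)
```

Keeping the handler out of the schema is what lets the same registry serve both sides: the list of to_model_schema() dicts goes to the model, and dispatch runs on the server.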
compute_income_tax
The tool wrapper at core/orchestrator/tools/compute_income_tax.py exists for one reason, and its description states it plainly:
ALWAYS call this tool for any tax-computation request. Do NOT compute slab tax yourself. … Copy the returned formatted_summary VERBATIM into your answer. Do not paraphrase numbers.
The calculator computes; the model transcribes.
Two layers sit behind that contract. The wrapper validates inputs and forwards to a pure-Python calculator at core/tools/calculators/income_tax.py: slab tables for FY 2025-26, 87A rebate, surcharge bands, 4% cess, two flavours of marginal relief (rebate-boundary and surcharge-threshold).
For our ₹15 lakh query, the model emitted:
INPUT → compute_income_tax({
  "gross_income": 1500000,
  "fy": "2025-26",
  "age_category": "general",
  "regime": "both",
  "deductions": { "sec_80c": 0, "sec_80d": 0, "hra_exempt": 0,
                  "home_loan_interest": 0, "nps_80ccd_1b": 0,
                  "nps_80ccd_2_employer": 0 }
})
OUTPUT → {
  "new": { "taxable_income": 1425000, "base_tax": 93750,
           "cess": 3750, "total_tax": 97500,
           "effective_rate_pct": 6.5 },
  "old": { "total_tax": 257400 },
  "cheaper_regime": "new",
  "savings_under_cheaper": 159900,
  "formatted_summary": "Under new regime, tax on ₹15,00,000 is..."
}
Two things to notice. First, the model passed gross_income: 1500000 (a literal integer in rupees, not the string "15 LPA") because the tool description spells out the conversion. Second, the tool returned 97500 after applying the default ₹75,000 standard deduction (1500000 - 75000 = 1425000 taxable). The model performed neither the deduction nor the slab arithmetic; both happened inside the calculator.
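To see where 93750 and 97500 come from, here is a minimal sketch of the new-regime arithmetic. It assumes the FY 2025-26 new-regime slabs, the ₹75,000 standard deduction, the ₹12 lakh 87A rebate limit, and 4% cess. It is not the calculator in core/tools/calculators/income_tax.py: surcharge and the old regime are left out, rounding is plain integer division, and every name here is illustrative.

```python
# (lower bound, upper bound, rate in percent); None means "no upper bound".
NEW_REGIME_SLABS = [
    (0, 400_000, 0),
    (400_000, 800_000, 5),
    (800_000, 1_200_000, 10),
    (1_200_000, 1_600_000, 15),
    (1_600_000, 2_000_000, 20),
    (2_000_000, 2_400_000, 25),
    (2_400_000, None, 30),
]
STANDARD_DEDUCTION = 75_000
REBATE_LIMIT = 1_200_000  # 87A: no tax if taxable income is at or below this
CESS_PCT = 4


def slab_tax(taxable: int) -> int:
    """Sum the tax owed in each slab that the taxable income reaches."""
    tax = 0
    for lower, upper, rate_pct in NEW_REGIME_SLABS:
        if taxable > lower:
            top = taxable if upper is None else min(taxable, upper)
            tax += (top - lower) * rate_pct // 100
    return tax


def new_regime_tax(gross_income: int) -> dict:
    taxable = max(gross_income - STANDARD_DEDUCTION, 0)
    base = slab_tax(taxable)
    if taxable <= REBATE_LIMIT:
        base = 0  # full 87A rebate
    else:
        # Marginal relief at the rebate boundary: tax cannot exceed the
        # income earned above the threshold.
        base = min(base, taxable - REBATE_LIMIT)
    cess = base * CESS_PCT // 100
    return {
        "taxable_income": taxable,
        "base_tax": base,
        "cess": cess,
        "total_tax": base + cess,
    }
```

For the ₹15 lakh query the slabs contribute 20,000 + 40,000 + 33,750 (15% of the ₹2,25,000 above ₹12 lakh), which is the 93750 in the tool output; 4% cess adds 3750. Marginal relief does not bind, because 93750 is well below the ₹2,25,000 excess over the threshold.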
The calculator returns a typed IncomeTaxRegimeBreakdown:
from typing import Literal

from pydantic import BaseModel


class IncomeTaxRegimeBreakdown(BaseModel):
    regime: Literal["old", "new"]
    standard_deduction: int
    taxable_income: int
    base_tax: int
    rebate_87a: int
    surcharge: int
    cess: int
    marginal_relief_rebate: int
    marginal_relief_surcharge: int
    total_tax: int
    effective_rate_pct: float
The canonical case is pinned in a golden test:
def test_15L_new_regime_no_deductions():
    r = compute_income_tax(gross_income=1500000, regime="new")
    assert r.new.total_tax == 97500
Moving arithmetic out of the model is only useful if the calculator stays correct over time. These tests run in under a second and each pins a canonical case to a number.
compute_income_percentile
The percentile tool follows the same shape (a thin wrapper around a pure-Python function) and reads from a real ITR filing-statistics dataset. For our query, the model called it with the user's income:
INPUT → compute_income_percentile_forward({ "income": 1500000 })
OUTPUT → {
  "top_percent": 17.42,
  "percentile_from_bottom": 82.58,
  "financial_year": "2024-25",
  "formatted_summary": "₹15 lakh puts you in the top 17.42%..."
}
The 17.42% and 82.58% in the user-visible answer are these two fields, copied through.
retrieve_knowledge
The third tool that fired is the only one that isn't a pure calculator. retrieve_knowledge wraps retrieve_relevant_chunks() in core/services/retrieval.py, which runs two retrievals in parallel and merges them:
query ──┬──▶ embed (text-embedding-3-small, 1536d) ──▶ pgvector ANN ─┐
        │                                                            ├─▶ RRF ─▶ top-K
        └──▶ plainto_tsquery('english', ...) ─▶ tsvector / ts_rank ──┘
                                                                       1 / (60 + rank)
Vector search nails semantic paraphrase ("set off" vs. "offset"). Keyword search over the Postgres tsvector, ranked with ts_rank (a lexical scorer playing the role BM25 usually plays, though it is not BM25 itself), nails rare proper nouns the embedding flattens ("Section 115BBH", "194A"). Reciprocal Rank Fusion lets both win without either dominating: each chunk scores Σ 1 / (60 + rank) across the two lists, and the top-K by fused score win.
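The fusion step is small enough to sketch in full. This is a minimal version assuming each retriever returns an ordered list of chunk IDs; the function name, the tie-break on chunk ID, and the default top_k are illustrative choices, not taken from core/services/retrieval.py.

```python
def rrf_merge(ranked_lists, k=60, top_k=5):
    """Reciprocal Rank Fusion: score(chunk) = sum over lists of 1 / (k + rank)."""
    scores = {}
    for ranking in ranked_lists:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by chunk ID so output is stable.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))[:top_k]


vector_hits = ["a", "b", "c"]   # order returned by the ANN search
keyword_hits = ["c", "a", "d"]  # order returned by the full-text search
fused = rrf_merge([vector_hits, keyword_hits], top_k=3)
```

Here "a" (ranks 1 and 2) edges out "c" (ranks 3 and 1), and both beat "b" and "d", which appear in only one list. Because only ranks are used, the incomparable raw scores of the two retrievers never need to be normalised against each other.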
Chunking is heading-aware: every chunk has its breadcrumb prepended before embedding, so the heading context rides into both the embedding and the tsvector. For our marginal-relief leg, the chunk that ranked first came from the slabs document, retrieved with its breadcrumb intact:
Income Tax Slabs FY 2025-26 > Marginal Relief
Marginal relief applies at the 87A rebate boundary and at each surcharge
threshold. When income slightly exceeds a threshold, the additional tax
cannot exceed the additional income above the threshold.
The breadcrumb (Income Tax Slabs FY 2025-26 > Marginal Relief) is the difference between a paragraph that ranks first and one that sits unranked in embedding space. It is also what gave the model the "₹12 lakh 87A rebate threshold and surcharge thresholds" framing it used in its final answer.
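A heading-aware chunker along these lines can be sketched as below. It assumes the knowledge documents use Markdown-style # headings and emits one chunk per section; the real chunker's input format, size limits, and function names are not shown in this post, so treat all of them as hypothetical.

```python
def chunk_with_breadcrumbs(markdown_text):
    """Split on headings and prepend 'H1 > H2 > ...' to each chunk's text."""
    heading_stack = []  # (level, title) pairs for the current heading path
    body = []           # lines collected under the current heading
    chunks = []

    def flush():
        text = " ".join(line.strip() for line in body if line.strip())
        if text:
            crumb = " > ".join(title for _, title in heading_stack)
            chunks.append(f"{crumb}\n{text}" if crumb else text)
        body.clear()

    for line in markdown_text.splitlines():
        if line.startswith("#"):
            flush()  # close the section that just ended
            level = len(line) - len(line.lstrip("#"))
            title = line.lstrip("#").strip()
            # Drop headings at the same or deeper level before descending.
            while heading_stack and heading_stack[-1][0] >= level:
                heading_stack.pop()
            heading_stack.append((level, title))
        else:
            body.append(line)
    flush()
    return chunks
```

The string that gets embedded and indexed is the whole chunk, breadcrumb included, so a query mentioning "marginal relief" can match the heading even when the paragraph body uses different wording.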
Hop two: the agent narrates the answer
Tool outputs are JSON-serialized and appended to the message list as function_call_output items, each tagged with the call_id of the function call it answers. The loop re-runs client.responses.parse(...) against the enriched message list. This time, no tool calls. The model emits a final answer constrained to the AssistantAnswerSchema: an answer string, a reasoning_summary, a confidence value, a next_actions list, an optional disclaimer, and an internal topic field that the composer uses to dispatch.
For our query, hop 2's parsed output:
hop=2 final output (parsed against AssistantAnswerSchema):
  confidence: "high"
  answer: "- In the new regime, tax on **₹15 lakh** is **₹97,500** for
           FY 2025-26.\n- Under the old regime..."
  reasoning_summary: "- Used the income-tax calculator for FY 2025-26
                      and the percentile tool for ₹15 lakh income..."
  next_actions: ["Compare old vs new for my deductions",
                 "What income is top 10%?",
                 "How is 87A rebate applied?"]
  disclaimer: null
  topic: <internal value that maps to the answered branch>
Every number in answer traces back to a tool output: 97500, 257400, 159900 from compute_income_tax; 17.42 and 82.58 from compute_income_percentile_forward; the "₹12 lakh 87A rebate threshold and surcharge thresholds" framing from a retrieved chunk. The model wrote the prose; it did not invent the figures. disclaimer: null is fine. The composer injects it based on which tools fired, not on what the model thinks. And if Pydantic parsing fails (e.g., next_actions has four items), the API errors and the loop catches it. The model never delivers an unstructured response downstream.
The composer ships it
The model's AssistantAnswerSchema is internal. The public response (what the user's browser ultimately receives) is AssistantResponseSchema: stricter rules, a coarser four-value status enum, and no topic field at all.
The translation happens in _compose_response:
def _compose_response(
    ctx: AgentContext,
    answer: AssistantAnswerSchema,
) -> AssistantResponseSchema:
    if answer.topic == "refused":
        return build_orchestrator_refused_response(...)
    if answer.topic == "out_of_scope":
        return build_orchestrator_out_of_scope_response(...)
    if answer.topic == "needs_clarification":
        return build_orchestrator_needs_clarification_response(...)
    return build_orchestrator_answered_response(
        request_id=ctx.request_id,
        conversation_id=ctx.conversation_id,
        answer=answer,
        retrieved_chunks=ctx.collected_chunks,
        tools_called=ctx.tools_called,
    )
Four branches, dispatched on topic. For refused and out_of_scope, the composer discards the model's answer text and substitutes hardcoded copy. The model is welcome to refuse, but it is not welcome to write the refusal. Hardcoded text means legal review happens once, not every release. For needs_clarification, the model's text passes through (the clarifying question must be context-specific) but citations and disclaimer are forced empty.
For our query, the topic falls through to the answered branch. The composer passes the model's prose through, strips the internal topic field, looks at tools_called to inject the right citation (calculator answers get the income-tax-calculator CTA), and injects the FY-scoped CA disclaimer. The public schema then enforces hard constraints at the boundary: next_actions non-blank, max three. The model already agreed to this shape in the structured output; the composer re-checks at the seam anyway.
       topic enum (model output)
                   │
    ┌──────────────┼───────────────┬───────────────┐
    ▼              ▼               ▼               ▼
answered     needs_clarif.   out_of_scope       refused
    │              │               │               │
    │              │               ▼               ▼
    │              │        (discard model  (discard model
    │              │        text, hardcode  text, hardcode
    │              │            answer)         answer)
    │              │               │               │
    │              ▼               │               │
    │    (force citations=[],      │               │
    │      disclaimer=null)        │               │
    │              │               │               │
    ▼              ▼               ▼               ▼
              strip internal field (topic)
                           │
                           ▼
                AssistantResponseSchema ──▶ public JSON
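The dispatch rules can be condensed into a small sketch. It uses plain dataclasses as stand-ins for the two Pydantic schemas, and several details are assumptions made for illustration: the hardcoded strings, the disclaimer wording, the status values mirroring the topic names, and the empty next_actions on refusals.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical fixed copy; the real reviewed strings are not shown in this post.
HARDCODED_COPY = {
    "refused": "Sorry, I can't help with that request.",
    "out_of_scope": "I can only help with Indian income-tax questions.",
}
CA_DISCLAIMER = "Figures are for FY 2025-26. Consult a CA before filing."  # hypothetical


@dataclass
class ModelAnswer:  # stand-in for AssistantAnswerSchema
    topic: str
    answer: str
    next_actions: list = field(default_factory=list)


@dataclass
class PublicResponse:  # stand-in for AssistantResponseSchema: no topic field
    status: str
    answer: str
    next_actions: list
    citations: list
    disclaimer: Optional[str]

    def __post_init__(self):
        # Re-check the shape at the boundary, whatever the model agreed to.
        if len(self.next_actions) > 3 or any(not a.strip() for a in self.next_actions):
            raise ValueError("next_actions must be non-blank and at most three")


def compose(model_answer: ModelAnswer, tools_called: list) -> PublicResponse:
    topic = model_answer.topic
    if topic in HARDCODED_COPY:
        # The model's own text is discarded for refusals and out-of-scope.
        return PublicResponse(topic, HARDCODED_COPY[topic], [], [], None)
    if topic == "needs_clarification":
        # Model text passes through; citations and disclaimer are forced empty.
        return PublicResponse(topic, model_answer.answer, model_answer.next_actions, [], None)
    citations = ["income-tax-calculator"] if "compute_income_tax" in tools_called else []
    return PublicResponse(
        "answered", model_answer.answer, model_answer.next_actions, citations, CA_DISCLAIMER
    )
```

Note that the disclaimer and citations are decided from tools_called, a fact the server recorded, rather than from anything the model wrote.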
Structured output is a guardrail, not a convenience. If any model output can leak straight to users, you've outsourced your domain correctness to a stochastic process.
How we know it's not lying
The whole post argues that the bot doesn't compute its own numbers. That claim needs to be verified, not just asserted.
Two layers do the work.
Calculators are pinned with golden tests:
def test_5L_new_regime_full_rebate():
    r = compute_income_tax(gross_income=500000, regime="new")
    assert r.new.total_tax == 0
    assert r.new.taxable_income == 425000

def test_12L_new_regime_at_rebate_threshold():
    r = compute_income_tax(gross_income=1200000, regime="new")
    assert r.new.total_tax == 0

def test_marginal_relief_above_12L_new_regime():
    """Taxable just over 12L → total tax capped at the excess income."""
    ...
These run in under a second. Each names a canonical scenario and pins it to a number. The calculator is the source of truth for the assistant; the tests are the guard rail that keeps the calculator honest as slabs and thresholds change year to year.
Retrieval is checked by a small golden-query eval. tests/retrieval/golden_queries.jsonl carries cases like:
{
  "id": "vda_loss_setoff_salary",
  "query": "I lost 50000 on Bitcoin, can I set off against my salary income?",
  "expected_text_substr": [
    "loss from transfer of virtual digital asset",
    "vda losses cannot be set off",
    "115bbh"
  ],
  "failure_mode": "TRUE_FAIL"
}
scripts/retrieval_eval.py runs each case through retrieve_relevant_chunks() and reports two numbers:
- Hit@K: did any chunk in the top K contain at least one of the expected substrings (case-insensitive).
- MRR: mean reciprocal rank of the first hit, averaged across cases.
Each case carries multiple substring variants; a chunk counts as a hit if any one matches. That trades precision for resilience: a single brittle needle would mean a rephrasing of the source paragraph fails the test for no real reason.
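The two metrics reduce to a few lines. The sketch below assumes a synchronous retrieve callable that returns chunk texts in rank order and cases shaped like the JSON above; it is not scripts/retrieval_eval.py, and the function names and the default k are illustrative.

```python
def first_hit_rank(chunks, expected_substrings, k):
    """1-based rank of the first top-k chunk containing any expected substring."""
    needles = [s.lower() for s in expected_substrings]
    for rank, chunk in enumerate(chunks[:k], start=1):
        text = chunk.lower()  # matching is case-insensitive
        if any(needle in text for needle in needles):
            return rank
    return None


def evaluate(cases, retrieve, k=5):
    """Return Hit@K and MRR over all cases; a miss contributes 0 to both."""
    hits = 0
    reciprocal_rank_sum = 0.0
    for case in cases:
        rank = first_hit_rank(retrieve(case["query"]), case["expected_text_substr"], k)
        if rank is not None:
            hits += 1
            reciprocal_rank_sum += 1.0 / rank
    n = len(cases)
    return {"hit_at_k": hits / n, "mrr": reciprocal_rank_sum / n}
```

Hit@K only says whether the right chunk made the cut; MRR also moves when that chunk slips from rank 1 to rank 3, which is the kind of ranking shift a change to chunking or fusion weights tends to cause.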
This set is intentionally small. It's a smoke detector, not a quality measurement. But cheap evals catch regressions unit tests cannot. The unit tests verify code; the eval verifies behaviour. They run in CI and they have caught real shifts in chunk ranking that nothing else would have.
What this shape buys you
Three transferable ideas, stated plainly.
1. The model is a dispatcher, not the source of truth. Anything that has to be deterministic (numbers, business rules, contracts) belongs in a tool. The model picks tools and narrates results. Drawing the line between "this can be wrong" and "this must be right" is the boundary between the model and the calculators. A lot of decisions follow from that line.
2. Structured output is a guardrail. Pydantic-enforced output schemas turn the model's free text into a typed value the rest of your application can rely on. The composer, not the prompt, is then where domain rules live. Telling the model what to do is brittle; making it impossible for the model to return the wrong shape is not.
3. A small golden eval is the cheapest production-confidence tool you have. A small set of retrieval goldens catches regressions a unit-test suite cannot see. A small set of calculator goldens catches what a retrieval eval cannot. Neither is a substitute for the other, and both are cheap enough to never not have.
The user who asked our anchor query got a correct answer not because the model is smart, but because the architecture refuses to let the model be wrong about anything that has a right answer. The calculator wrote the numbers. The retrieved chunk wrote the framing. The composer wrote the contract. The model wrote the prose. Each layer did one job, and only the prose layer was allowed to be creative.