<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: kartikey rajvaidya</title>
    <description>The latest articles on DEV Community by kartikey rajvaidya (@kartikeyraj).</description>
    <link>https://dev.to/kartikeyraj</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3891952%2F8a2c4d01-723c-4855-9581-98e3a653c45f.png</url>
      <title>DEV Community: kartikey rajvaidya</title>
      <link>https://dev.to/kartikeyraj</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/kartikeyraj"/>
    <language>en</language>
    <item>
      <title>How the itrstats tax assistant works: one query, every layer</title>
      <dc:creator>kartikey rajvaidya</dc:creator>
      <pubDate>Mon, 18 May 2026 21:47:48 +0000</pubDate>
      <link>https://dev.to/kartikeyraj/how-the-itrstats-tax-assistant-works-one-query-every-layer-4efd</link>
      <guid>https://dev.to/kartikeyraj/how-the-itrstats-tax-assistant-works-one-query-every-layer-4efd</guid>
      <description>&lt;p&gt;This post walks through how the &lt;strong&gt;itrstats (&lt;a href="https://itrstats.in" rel="noopener noreferrer"&gt;https://itrstats.in&lt;/a&gt;)&lt;/strong&gt; tax assistant handles a single compound user question, end to end through every layer of the backend.&lt;/p&gt;

&lt;p&gt;A user types this in:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;What's tax on ₹15 lakh in new regime, what percentile am I in, and is the marginal relief rule relevant here?&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The response that comes back:&lt;/p&gt;

&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;In the new regime, tax on &lt;strong&gt;₹15 lakh&lt;/strong&gt; is &lt;strong&gt;₹97,500&lt;/strong&gt; for FY 2025-26.&lt;/li&gt;
&lt;li&gt;Under the old regime, the same income would be &lt;strong&gt;₹2,57,400&lt;/strong&gt;, so the new regime is cheaper by &lt;strong&gt;₹1,59,900&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;You are in roughly the &lt;strong&gt;top 17.42%&lt;/strong&gt; of Indian taxpayers, with about &lt;strong&gt;82.58%&lt;/strong&gt; earning less.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Marginal relief is not relevant here&lt;/strong&gt; because it applies around the &lt;strong&gt;₹12 lakh 87A rebate threshold&lt;/strong&gt; and surcharge thresholds, not at ₹15 lakh.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;Behind that response, three tools fired, the model made two passes, and a composer did a final validation strip before anything left the server. The whole thing finishes in about four seconds. &lt;strong&gt;The model did not compute a single number in the entire trace&lt;/strong&gt;. It picked tools, narrated the result, and was kept on the rails by a Pydantic-enforced output schema. This post follows that one query through every layer of the system.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌──────┐   POST /v1/assistant/query   ┌──────────┐   ┌─────────────┐
│ user │ ───────────────────────────▶ │  route   │──▶│   action    │
└──────┘                              └──────────┘   └──────┬──────┘
   ▲                                                        │
   │                                                        ▼
   │                                              ┌─────────────────┐
   │                                              │   agent loop    │
   │                                              │  (MAX_HOPS=3)   │
   │                                              └────────┬────────┘
   │                                                       │
   │                              ┌────────────────────────┼────────────────────────┐
   │                              ▼                        ▼                        ▼
   │                  ┌────────────────────┐   ┌─────────────────────┐   ┌──────────────────┐
   │                  │ compute_income_tax │   │ compute_income_     │   │ retrieve_        │
   │                  │ (pure Python)      │   │ percentile (Python) │   │ knowledge (RAG)  │
   │                  └─────────┬──────────┘   └──────────┬──────────┘   └────────┬─────────┘
   │                            │                         │                       │
   │                            └─────────────────────────┴───────────────────────┘
   │                                                      │
   │                                                      ▼
   │                                            ┌──────────────────┐
   │                                            │ agent (hop 2):   │
   │                                            │ narrate answer   │
   │                                            └────────┬─────────┘
   │                                                     ▼
   │                                            ┌──────────────────┐
   └──── public response  ◀──────────────────── │    composer      │
                                                └──────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Hop zero: the request arrives
&lt;/h2&gt;

&lt;p&gt;The POST lands at &lt;code&gt;/v1/assistant/query&lt;/code&gt;. The route is intentionally thin: validate, rate-limit, call one action, return. Orchestration lives in the action layer, which does the boring-but-necessary plumbing before the agent ever runs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Resolves a conversation ID, generating a new one if needed.&lt;/li&gt;
&lt;li&gt;Persists the request row to Postgres &lt;em&gt;before&lt;/em&gt; the model is called, so a downstream crash still leaves a record of what was asked.&lt;/li&gt;
&lt;li&gt;Loads recent conversation turns and replays them into the agent's input as alternating user/assistant messages. There is no server-side thread state in the model.&lt;/li&gt;
&lt;li&gt;Hands off to &lt;code&gt;run_agent&lt;/code&gt;, then persists the final response on the way back out.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's the route → action layer. The interesting work starts inside &lt;code&gt;run_agent&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Hop one: the agent decides what to call
&lt;/h2&gt;

&lt;p&gt;The agent seeds its input list with system prompt, replayed turns, and the current query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_build_input_messages&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;AgentContext&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]]:&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;build_knowledge_system_prompt&lt;/span&gt;&lt;span class="p"&gt;()}]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;turn&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conversation_turns&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;turn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;assistant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;turn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assistant_query_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()})&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The model call uses OpenAI's Responses API with three knobs that matter:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;responses&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;AssistantConfig&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;OPENAI_ANSWER_MODEL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;input_list&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                         &lt;span class="c1"&gt;# JSON Schema for each registered tool
&lt;/span&gt;    &lt;span class="n"&gt;tool_choice&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;auto&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;parallel_tool_calls&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;text_format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;AssistantAnswerSchema&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;# Pydantic-enforced output
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;tools=&lt;/code&gt; is the JSON-Schema list of registered tools the model can call. &lt;code&gt;parallel_tool_calls=True&lt;/code&gt; lets the model request several tool calls in one response instead of one at a time. &lt;code&gt;text_format=AssistantAnswerSchema&lt;/code&gt; constrains the final answer to a Pydantic schema, so once the model stops calling tools it cannot return free-form text.&lt;/p&gt;

&lt;p&gt;For our ₹15 lakh query, hop 1's output is three function calls in one response:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;hop=1 model output:
  function_call: compute_income_tax
    arguments: { "gross_income": 1500000, "fy": "2025-26",
                 "regime": "both", "age_category": "general",
                 "deductions": { all six keys, all zero } }
  function_call: compute_income_percentile_forward
    arguments: { "income": 1500000 }
  function_call: retrieve_knowledge
    arguments: { "query": "marginal relief new regime 87A rebate threshold",
                 "decompose": false }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No final answer this hop. Just three calls. The loop body executes them and re-runs the model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;hop&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;MAX_HOPS&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;responses&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(...)&lt;/span&gt;
    &lt;span class="n"&gt;function_calls&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_detect_function_calls&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;function_calls&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;output_parsed&lt;/span&gt;   &lt;span class="c1"&gt;# final answer
&lt;/span&gt;
    &lt;span class="n"&gt;input_list&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;extend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;_serialize_response_item&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;call&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;function_calls&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;output_str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;_execute_tool_call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;call&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;input_list&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;function_call_output&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;call_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;call&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;call_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;output_str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;MAX_HOPS = 3&lt;/code&gt;. Past the third hop, a forced final-pass message strips the tool registry and demands an answer. The loop cannot run away.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;   seed messages: [system_prompt, ...history, current_query]
                                │
                                ▼
   ┌──────────[Responses API call with tools]────────────┐
   │                            │                         │
   ▼                            ▼                         │
function_calls?            final answer?                  │
   │                            │                         │
   │ yes:                       │ yes: return             │
   │  execute tools,            │                         │
   │  append outputs                                      │
   │                                                      │
   └──────── hop++ &amp;lt; MAX_HOPS=3 ──────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The LLM is the dispatcher, not the calculator. A model that hallucinates tax math is a liability. A model that picks the right tool and forwards the user's numbers is a feature.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  The three tools fire
&lt;/h2&gt;

&lt;p&gt;Each registered tool is a &lt;code&gt;ToolSpec&lt;/code&gt; with four fields: a name, a description, a JSON Schema for its parameters, and an async handler that takes an &lt;code&gt;AgentContext&lt;/code&gt; (a per-turn struct carrying request metadata, prior turns, and a &lt;code&gt;collected_chunks&lt;/code&gt; field that retrieval tools deposit into) plus the model's arguments and returns a dict. The model sees the first three; the handler runs server-side. For the query we're tracing, the model exercises three tools: &lt;code&gt;compute_income_tax&lt;/code&gt;, &lt;code&gt;compute_income_percentile&lt;/code&gt;, and &lt;code&gt;retrieve_knowledge&lt;/code&gt;. The first two are deterministic Python; the third is stochastic retrieval. That split, between tools whose output you can be confident in and tools whose output you cannot, is the architecturally interesting part of the registry, and we follow each in turn.&lt;/p&gt;

&lt;h3&gt;
  
  
  compute_income_tax
&lt;/h3&gt;

&lt;p&gt;The tool wrapper at &lt;code&gt;core/orchestrator/tools/compute_income_tax.py&lt;/code&gt; exists for one reason, and its description states it plainly:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;ALWAYS call this tool for any tax-computation request. Do NOT compute slab tax yourself. … Copy the returned &lt;code&gt;formatted_summary&lt;/code&gt; VERBATIM into your answer. Do not paraphrase numbers.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The calculator computes; the model transcribes.&lt;/p&gt;

&lt;p&gt;Two layers sit behind that contract. The wrapper validates inputs and forwards to a pure-Python calculator at &lt;code&gt;core/tools/calculators/income_tax.py&lt;/code&gt;: slab tables for FY 2025-26, 87A rebate, surcharge bands, 4% cess, two flavours of marginal relief (rebate-boundary and surcharge-threshold).&lt;/p&gt;

&lt;p&gt;For our ₹15 lakh query, the model emitted:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;INPUT  → compute_income_tax({
  "gross_income": 1500000,
  "fy": "2025-26",
  "age_category": "general",
  "regime": "both",
  "deductions": { "sec_80c": 0, "sec_80d": 0, "hra_exempt": 0,
                  "home_loan_interest": 0, "nps_80ccd_1b": 0,
                  "nps_80ccd_2_employer": 0 }
})

OUTPUT → {
  "new": { "taxable_income": 1425000, "base_tax": 93750,
           "cess": 3750, "total_tax": 97500,
           "effective_rate_pct": 6.5 },
  "old": { "total_tax": 257400 },
  "cheaper_regime": "new",
  "savings_under_cheaper": 159900,
  "formatted_summary": "Under new regime, tax on ₹15,00,000 is..."
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two things to notice. The model passed &lt;code&gt;gross_income: 1500000&lt;/code&gt; (a literal integer in rupees, not the string &lt;code&gt;"15 LPA"&lt;/code&gt;) because the tool description spells out the conversion. The tool returned &lt;code&gt;97500&lt;/code&gt; after applying the default ₹75,000 standard deduction (&lt;code&gt;1500000 - 75000 = 1425000&lt;/code&gt; taxable). The model did neither computation. Both happened inside the calculator.&lt;/p&gt;

&lt;p&gt;The calculator returns a typed &lt;code&gt;IncomeTaxRegimeBreakdown&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;IncomeTaxRegimeBreakdown&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;regime&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Literal&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;old&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;new&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;standard_deduction&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;
    &lt;span class="n"&gt;taxable_income&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;
    &lt;span class="n"&gt;base_tax&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;
    &lt;span class="n"&gt;rebate_87a&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;
    &lt;span class="n"&gt;surcharge&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;
    &lt;span class="n"&gt;cess&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;
    &lt;span class="n"&gt;marginal_relief_rebate&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;
    &lt;span class="n"&gt;marginal_relief_surcharge&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;
    &lt;span class="n"&gt;total_tax&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;
    &lt;span class="n"&gt;effective_rate_pct&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The canonical case is pinned in a golden test:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_15L_new_regime_no_deductions&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;compute_income_tax&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gross_income&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1500000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;regime&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;new&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;new&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;total_tax&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;97500&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Moving arithmetic out of the model is only useful if the calculator stays correct over time. These tests run in under a second and each pins a canonical case to a number.&lt;/p&gt;

&lt;h3&gt;
  
  
  compute_income_percentile
&lt;/h3&gt;

&lt;p&gt;The percentile tool follows the same shape (a thin wrapper around a pure-Python function) and reads from a real ITR filing-statistics dataset. For our query, the model called it with the user's income:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;INPUT  → compute_income_percentile_forward({ "income": 1500000 })

OUTPUT → {
  "top_percent": 17.42,
  "percentile_from_bottom": 82.58,
  "financial_year": "2024-25",
  "formatted_summary": "₹15 lakh puts you in the top 17.42%..."
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;17.42%&lt;/code&gt; and &lt;code&gt;82.58%&lt;/code&gt; in the user-visible answer are these two fields, copied through.&lt;/p&gt;

&lt;h3&gt;
  
  
  retrieve_knowledge
&lt;/h3&gt;

&lt;p&gt;The third tool that fired is the only one that isn't a pure calculator. &lt;code&gt;retrieve_knowledge&lt;/code&gt; wraps &lt;code&gt;retrieve_relevant_chunks()&lt;/code&gt; in &lt;code&gt;core/services/retrieval.py&lt;/code&gt;, which runs two retrievals in parallel and merges them:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;query ──┬──▶ embed (text-embedding-3-small, 1536d) ──▶ pgvector ANN ─┐
        │                                                             ├─▶ RRF ─▶ top-K
        └──▶ plainto_tsquery('english', ...)  ─▶ tsvector / ts_rank ──┘
                                                          1 / (60 + rank)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Vector search nails semantic paraphrase (&lt;em&gt;"set off"&lt;/em&gt; vs. &lt;em&gt;"offset"&lt;/em&gt;). BM25 over the Postgres &lt;code&gt;tsvector&lt;/code&gt; nails rare proper nouns the embedding flattens (&lt;em&gt;"Section 115BBH"&lt;/em&gt;, &lt;em&gt;"194A"&lt;/em&gt;). Reciprocal Rank Fusion lets both win without either dominating: each chunk scores &lt;code&gt;Σ 1 / (60 + rank)&lt;/code&gt; across the two lists, top-K wins.&lt;/p&gt;

&lt;p&gt;Chunking is heading-aware: every chunk has its breadcrumb prepended before embedding, so the heading context rides into both the embedding and the tsvector. For our marginal-relief leg, the chunk that ranked first came from the slabs document, retrieved with its breadcrumb intact:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Income Tax Slabs FY 2025-26 &amp;gt; Marginal Relief

Marginal relief applies at the 87A rebate boundary and at each surcharge
threshold. When income slightly exceeds a threshold, the additional tax
cannot exceed the additional income above the threshold.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The breadcrumb (&lt;code&gt;Income Tax Slabs FY 2025-26 &amp;gt; Marginal Relief&lt;/code&gt;) is the difference between a paragraph that ranks first and one that sits unranked in embedding space. It is also what gave the model the &lt;em&gt;"₹12 lakh 87A rebate threshold and surcharge thresholds"&lt;/em&gt; framing it used in its final answer.&lt;/p&gt;

&lt;h2&gt;
  
  
  Hop two: the agent narrates the answer
&lt;/h2&gt;

&lt;p&gt;Tool outputs are JSON-serialized and appended to the message list as &lt;code&gt;function_call_output&lt;/code&gt; items, each tagged with the &lt;code&gt;call_id&lt;/code&gt; of the function call it answers. The loop re-runs &lt;code&gt;client.responses.parse(...)&lt;/code&gt; against the enriched message list. This time, no tool calls. The model emits a final answer constrained to the &lt;code&gt;AssistantAnswerSchema&lt;/code&gt;: an &lt;code&gt;answer&lt;/code&gt; string, a &lt;code&gt;reasoning_summary&lt;/code&gt;, a &lt;code&gt;confidence&lt;/code&gt; value, a &lt;code&gt;next_actions&lt;/code&gt; list, an optional &lt;code&gt;disclaimer&lt;/code&gt;, and an internal &lt;code&gt;topic&lt;/code&gt; field that the composer uses to dispatch.&lt;/p&gt;

&lt;p&gt;For our query, hop 2's parsed output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;hop=2 final output (parsed against AssistantAnswerSchema):
  confidence: "high"
  answer: "- In the new regime, tax on **₹15 lakh** is **₹97,500** for
           FY 2025-26.\n- Under the old regime..."
  reasoning_summary: "- Used the income-tax calculator for FY 2025-26
           and the percentile tool for ₹15 lakh income..."
  next_actions: ["Compare old vs new for my deductions",
                 "What income is top 10%?",
                 "How is 87A rebate applied?"]
  disclaimer: null
  topic: &amp;lt;internal value that maps to the answered branch&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every number in &lt;code&gt;answer&lt;/code&gt; traces back to a tool output: &lt;code&gt;97500&lt;/code&gt;, &lt;code&gt;257400&lt;/code&gt;, &lt;code&gt;159900&lt;/code&gt; from &lt;code&gt;compute_income_tax&lt;/code&gt;; &lt;code&gt;17.42&lt;/code&gt; and &lt;code&gt;82.58&lt;/code&gt; from &lt;code&gt;compute_income_percentile_forward&lt;/code&gt;; the &lt;em&gt;"₹12 lakh 87A rebate threshold and surcharge thresholds"&lt;/em&gt; framing from a retrieved chunk. The model wrote the prose; it did not invent the figures. &lt;code&gt;disclaimer: null&lt;/code&gt; is fine. The composer injects it based on which tools fired, not on what the model thinks. And if Pydantic parsing fails (e.g., &lt;code&gt;next_actions&lt;/code&gt; has four items), the API errors and the loop catches it. The model never delivers an unstructured response downstream.&lt;/p&gt;

&lt;h2&gt;
  
  
  The composer ships it
&lt;/h2&gt;

&lt;p&gt;The model's &lt;code&gt;AssistantAnswerSchema&lt;/code&gt; is internal. The public response (what the user's browser ultimately receives) is &lt;code&gt;AssistantResponseSchema&lt;/code&gt;: stricter rules, a coarser four-value status enum, and no &lt;code&gt;topic&lt;/code&gt; field at all.&lt;/p&gt;

&lt;p&gt;The translation happens in &lt;code&gt;_compose_response&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_compose_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;AgentContext&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;AssistantAnswerSchema&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;AssistantResponseSchema&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;topic&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;refused&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;build_orchestrator_refused_response&lt;/span&gt;&lt;span class="p"&gt;(...)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;topic&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;out_of_scope&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;build_orchestrator_out_of_scope_response&lt;/span&gt;&lt;span class="p"&gt;(...)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;topic&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;needs_clarification&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;build_orchestrator_needs_clarification_response&lt;/span&gt;&lt;span class="p"&gt;(...)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;build_orchestrator_answered_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;request_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;conversation_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conversation_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;retrieved_chunks&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;collected_chunks&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;tools_called&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tools_called&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Four branches, dispatched on &lt;code&gt;topic&lt;/code&gt;. For &lt;code&gt;refused&lt;/code&gt; and &lt;code&gt;out_of_scope&lt;/code&gt;, the composer &lt;strong&gt;discards the model's &lt;code&gt;answer&lt;/code&gt; text&lt;/strong&gt; and substitutes hardcoded copy. The model is welcome to refuse, but it is not welcome to write the refusal. Hardcoded text means legal review happens once, not every release. For &lt;code&gt;needs_clarification&lt;/code&gt;, the model's text passes through (the clarifying question must be context-specific) but &lt;code&gt;citations&lt;/code&gt; and &lt;code&gt;disclaimer&lt;/code&gt; are forced empty.&lt;/p&gt;

&lt;p&gt;For our query, the topic falls through to the &lt;code&gt;answered&lt;/code&gt; branch. The composer passes the model's prose through, strips the internal &lt;code&gt;topic&lt;/code&gt; field, looks at &lt;code&gt;tools_called&lt;/code&gt; to inject the right citation (calculator answers get the income-tax-calculator CTA), and injects the FY-scoped CA disclaimer. The public schema then enforces hard constraints at the boundary: &lt;code&gt;next_actions&lt;/code&gt; non-blank, max three. The model already agreed to this shape in the structured output; the composer re-checks at the seam anyway.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;        topic enum (model output)
                  │
   ┌──────────────┼───────────────┬───────────────┐
   ▼              ▼               ▼               ▼
answered     needs_clarif.    out_of_scope     refused
   │              │               │               │
   │              │               ▼               ▼
   │              │       (discard model      (discard model
   │              │        text, hardcode      text, hardcode
   │              │        answer)             answer)
   │              │               │               │
   │              ▼               │               │
   │     (force citations=[],     │               │
   │      disclaimer=null)        │               │
   │              │               │               │
   ▼              ▼               ▼               ▼
        strip internal field (topic)
                       │
                       ▼
              AssistantResponseSchema  ──▶  public JSON
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Structured output is a guardrail, not a convenience. If any model output can leak straight to users, you've outsourced your domain correctness to a stochastic process.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  How we know it's not lying
&lt;/h2&gt;

&lt;p&gt;The whole post argues that the bot doesn't compute its own numbers. That claim needs to be &lt;em&gt;verified&lt;/em&gt;, not just asserted.&lt;/p&gt;

&lt;p&gt;Two layers do the work.&lt;/p&gt;

&lt;p&gt;Calculators are pinned with golden tests:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_5L_new_regime_full_rebate&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;compute_income_tax&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gross_income&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;500000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;regime&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;new&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;new&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;total_tax&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;new&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;taxable_income&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;425000&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_12L_new_regime_at_rebate_threshold&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;compute_income_tax&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gross_income&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1200000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;regime&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;new&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;new&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;total_tax&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_marginal_relief_above_12L_new_regime&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Taxable just over 12L → total tax capped at the excess income.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="bp"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These run in under a second. Each names a canonical scenario and pins it to a number. The calculator is the source of truth for the assistant; the tests are the guard rail that keep the calculator honest as slabs and thresholds change year to year.&lt;/p&gt;

&lt;p&gt;Retrieval is checked by a small golden-query eval. &lt;code&gt;tests/retrieval/golden_queries.jsonl&lt;/code&gt; carries cases like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"vda_loss_setoff_salary"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"query"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"I lost 50000 on Bitcoin, can I set off against my salary income?"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"expected_text_substr"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"loss from transfer of virtual digital asset"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"vda losses cannot be set off"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"115bbh"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"failure_mode"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"TRUE_FAIL"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;scripts/retrieval_eval.py&lt;/code&gt; runs each case through &lt;code&gt;retrieve_relevant_chunks()&lt;/code&gt; and reports two numbers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Hit@K&lt;/strong&gt;: did any chunk in the top K contain at least one of the expected substrings (case-insensitive).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MRR&lt;/strong&gt;: mean reciprocal rank of the first hit, averaged across cases.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each case carries multiple substring variants; a chunk counts as a hit if any one matches. That trades precision for resilience: a single brittle needle would mean a rephrasing of the source paragraph fails the test for no real reason.&lt;/p&gt;

&lt;p&gt;This set is intentionally small. It's a smoke detector, not a quality measurement. But cheap evals catch regressions unit tests cannot. The unit tests verify code; the eval verifies behaviour. They run in CI and they have caught real shifts in chunk ranking that nothing else would have.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this shape buys you
&lt;/h2&gt;

&lt;p&gt;Three transferable ideas, stated plainly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. The model is a dispatcher, not the source of truth.&lt;/strong&gt; Anything that has to be deterministic (numbers, business rules, contracts) belongs in a tool. The model picks tools and narrates results. Drawing the line between &lt;em&gt;"this can be wrong"&lt;/em&gt; and &lt;em&gt;"this must be right"&lt;/em&gt; is the boundary between the model and the calculators. A lot of decisions follow from that line.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Structured output is a guardrail.&lt;/strong&gt; Pydantic-enforced output schemas turn the model's free-text into a typed value the rest of your application can rely on. The composer is then where domain rules live, not the prompt. &lt;em&gt;Tell the model what to do&lt;/em&gt; is brittle; &lt;em&gt;make it impossible for the model to return the wrong shape&lt;/em&gt; is not.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. A small golden eval is the cheapest production-confidence tool you have.&lt;/strong&gt; A small set of retrieval goldens catches regressions a unit-test suite cannot see. A small set of calculator goldens catches what a retrieval eval cannot. Neither is a substitute for the other, and both are cheap enough to never not have.&lt;/p&gt;

&lt;p&gt;The user who asked our anchor query got a correct answer not because the model is smart, but because the architecture refuses to let the model be wrong about anything that has a right answer. The calculator wrote the numbers. The retrieved chunk wrote the framing. The composer wrote the contract. The model wrote the prose. Each layer did one job, and only the prose layer was allowed to be creative.&lt;/p&gt;

</description>
      <category>python</category>
      <category>llm</category>
      <category>agents</category>
      <category>rag</category>
    </item>
  </channel>
</rss>
