<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Ayush Shekhar</title>
    <description>The latest articles on DEV Community by Ayush Shekhar (@ayushh0110).</description>
    <link>https://dev.to/ayushh0110</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3898782%2Fad3df0e7-5f5e-45e7-93c4-495fd4566407.jpeg</url>
      <title>DEV Community: Ayush Shekhar</title>
      <link>https://dev.to/ayushh0110</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ayushh0110"/>
    <language>en</language>
    <item>
      <title>From Heuristics to Fine-Tuning: Teaching a Model to Use Tools</title>
      <dc:creator>Ayush Shekhar</dc:creator>
      <pubDate>Sun, 26 Apr 2026 13:28:09 +0000</pubDate>
      <link>https://dev.to/ayushh0110/from-heuristics-to-fine-tuning-teaching-a-model-to-use-tools-1c9g</link>
      <guid>https://dev.to/ayushh0110/from-heuristics-to-fine-tuning-teaching-a-model-to-use-tools-1c9g</guid>
      <description>&lt;p&gt;&lt;em&gt;How I replaced 200 lines of regex with a fine-tuned 7B model — and why it was worth it.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;I built an &lt;a href="https://github.com/ayushh0110/autonomous-agent" rel="noopener noreferrer"&gt;autonomous AI agent&lt;/a&gt; with 9 tools: web search, calculator, weather, Wikipedia, translation, and more. For every request, the first question the agent must answer is deceptively simple:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Which tool should I use?&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;My first solution was a heuristic classifier — a function called &lt;code&gt;classify_query()&lt;/code&gt; that uses regex patterns to detect intent:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# 200+ lines of patterns like this:
&lt;/span&gt;&lt;span class="n"&gt;_SEARCH_INDICATORS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;\b(latest|current|news|today|recent|who won|score|price|&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stock|update|happening|trending|release|launched)\b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;IGNORECASE&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;_KNOWLEDGE_INDICATORS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;\b(explain|what is|how does|define|difference between|&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;why do|concept of|overview|meaning of|works)\b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;IGNORECASE&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It worked. About 75% of the time.&lt;/p&gt;

&lt;p&gt;The remaining 25% was a graveyard of edge cases: "say hello in Japanese" (needs &lt;code&gt;translate&lt;/code&gt;, matched nothing), "what's 15% of 2850" (needs &lt;code&gt;calculator&lt;/code&gt;, matched &lt;code&gt;what's&lt;/code&gt; → routed to search), "compare React vs Vue" (needs autonomous executor, matched &lt;code&gt;compare&lt;/code&gt; → routed to direct answer).&lt;/p&gt;
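
&lt;p&gt;That first miss is easy to reproduce with the two patterns shown above: the query trips neither indicator, so the router has nowhere to send it.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Neither indicator fires, so the translate tool is never even considered.
query = "say hello in Japanese"
print(_SEARCH_INDICATORS.search(query))     # None
print(_KNOWLEDGE_INDICATORS.search(query))  # None
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;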

&lt;p&gt;Every fix introduced new regressions. &lt;strong&gt;Regex-based routing doesn't scale.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Idea
&lt;/h2&gt;

&lt;p&gt;What if the model itself could learn the routing? Not a giant foundation model — a small, fast 7B model fine-tuned specifically for this task. The hypothesis:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A QLoRA-adapted 7B model trained on 1K high-quality tool-call traces should outperform hand-crafted regex, with comparable latency.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This became &lt;a href="https://github.com/ayushh0110/toolforge" rel="noopener noreferrer"&gt;&lt;strong&gt;ToolForge&lt;/strong&gt;&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 1: Generating Training Data (The Hard Part)
&lt;/h2&gt;

&lt;p&gt;I had 9 tools but no labeled dataset, and creating one manually would take weeks. Instead, I used &lt;strong&gt;teacher distillation&lt;/strong&gt; — prompting a stronger model (Gemini 2.5 Flash) to generate high-quality training examples.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Distillation Pipeline
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User queries (generated) → Gemini 2.5 Flash → Structured tool-call traces → Filtered dataset
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
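
&lt;p&gt;One generation step looks roughly like this (a minimal sketch; the prompt wording and the &lt;code&gt;teacher_call&lt;/code&gt; helper are illustrative, not the actual ToolForge code):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json

TOOLS = ["web_search", "calculator", "weather", "translate", "wikipedia",
         "dictionary", "datetime", "unit_converter", "no_tool"]

PROMPT = (
    "For the user query below, reply with ONLY a JSON object of the form "
    '{"tool": "...", "args": {...}}. Known tools: '
    + ", ".join(TOOLS) + "\nQuery: "
)

def distill_one(teacher_call, query):
    """Ask the teacher for one structured tool-call trace."""
    raw = teacher_call(PROMPT + query)  # teacher_call wraps the Gemini API (hypothetical)
    trace = json.loads(raw)             # bad JSON raises here and gets filtered out
    return {"query": query, "tool": trace["tool"], "args": trace.get("args", {})}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;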



&lt;p&gt;The trick was &lt;strong&gt;diversity&lt;/strong&gt;. I needed queries covering:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Single-tool requests ("What's the weather in Tokyo?")&lt;/li&gt;
&lt;li&gt;Multi-tool chains ("What's the weather in Tokyo and convert 25°C to Fahrenheit?")&lt;/li&gt;
&lt;li&gt;No-tool queries ("Explain recursion")&lt;/li&gt;
&lt;li&gt;Ambiguous queries ("Tell me about Python" — search or direct answer?)&lt;/li&gt;
&lt;li&gt;Edge cases ("sqrt of 44567" — calculator, not search)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I built a &lt;code&gt;ClientPool&lt;/code&gt; that rotates across 6 free-tier Gemini API keys to avoid rate limits:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ClientPool&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Round-robin pool of (key, model) slots for maximum throughput.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;next_client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# Pick the slot that has rested the longest
&lt;/span&gt;        &lt;span class="n"&gt;best&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_slots&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;last_used&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;elapsed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;best&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;last_used&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;elapsed&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_min_gap&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_min_gap&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;elapsed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;best&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After filtering for quality (valid JSON, correct schema, no hallucinated tools), I had &lt;strong&gt;1,173 clean examples&lt;/strong&gt; — enough for fine-tuning.&lt;/p&gt;
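
&lt;p&gt;The quality filter is the boring-but-critical piece. A minimal version of the checks described above (illustrative, assuming &lt;code&gt;raw_traces&lt;/code&gt; holds the parsed teacher outputs from the sketch earlier):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def is_clean(trace):
    """Keep traces with a valid schema, a real tool name, and a non-empty query."""
    return (
        isinstance(trace, dict)
        and trace.get("tool") in TOOLS           # no hallucinated tools
        and isinstance(trace.get("args"), dict)  # correct schema
        and bool(trace.get("query"))
    )

dataset = [t for t in raw_traces if is_clean(t)]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;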

&lt;h3&gt;
  
  
  Dataset Distribution
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Count&lt;/th&gt;
&lt;th&gt;%&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;web_search&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;287&lt;/td&gt;
&lt;td&gt;24%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;calculator&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;156&lt;/td&gt;
&lt;td&gt;13%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;weather&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;143&lt;/td&gt;
&lt;td&gt;12%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;translate&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;132&lt;/td&gt;
&lt;td&gt;11%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;wikipedia&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;128&lt;/td&gt;
&lt;td&gt;11%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;no_tool&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;119&lt;/td&gt;
&lt;td&gt;10%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;dictionary&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;78&lt;/td&gt;
&lt;td&gt;7%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;datetime&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;68&lt;/td&gt;
&lt;td&gt;6%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;unit_converter&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;62&lt;/td&gt;
&lt;td&gt;5%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The distribution is intentionally skewed toward &lt;code&gt;web_search&lt;/code&gt; — mirroring real-world query patterns.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 2: Training with QLoRA
&lt;/h2&gt;

&lt;p&gt;I trained on a Kaggle T4 GPU (free tier). The key insight: &lt;strong&gt;you don't need an A100 for fine-tuning.&lt;/strong&gt; QLoRA with 4-bit NF4 quantization fits a 7B model in ~6GB VRAM.&lt;/p&gt;
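
&lt;p&gt;The back-of-envelope math (rough, not measured): 7B parameters at 4 bits is about 3.5GB of frozen base weights; quantization constants, the LoRA adapters and their optimizer state, and activations sit on top, which is how the footprint lands around 6GB.&lt;/p&gt;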

&lt;h3&gt;
  
  
  Configuration
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;bnb_config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BitsAndBytesConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;load_in_4bit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;bnb_4bit_quant_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;nf4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;bnb_4bit_compute_dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bfloat16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;bnb_4bit_use_double_quant&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Double quantization saves ~0.4GB
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;lora_config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LoraConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                    &lt;span class="c1"&gt;# LoRA rank
&lt;/span&gt;    &lt;span class="n"&gt;lora_alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;          &lt;span class="c1"&gt;# Scaling factor (alpha/r = 2)
&lt;/span&gt;    &lt;span class="n"&gt;target_modules&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;q_proj&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;k_proj&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;v_proj&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;o_proj&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                     &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gate_proj&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;up_proj&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;down_proj&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;lora_dropout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.05&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Why these choices?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;r=64&lt;/strong&gt;: Higher rank = more parameters = more capacity to learn tool routing patterns. I tested r=16 (too small) and r=64 (sweet spot).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;All attention + MLP layers&lt;/strong&gt;: Tool routing requires understanding query intent (attention) AND mapping it to structured output (MLP). Targeting only attention heads wasn't enough.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;alpha=128 (2×r)&lt;/strong&gt;: A common convention that pins the effective LoRA scaling factor (alpha/r) at 2, keeping update magnitudes stable as you change the rank.&lt;/li&gt;
&lt;/ul&gt;
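
&lt;p&gt;Wiring the two configs above together looks roughly like this (a sketch of the standard transformers + peft recipe, with the imports the config blocks rely on; not the exact training notebook):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct",
    quantization_config=bnb_config,   # the 4-bit NF4 config above
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)  # fp32 norms, input grads
model = get_peft_model(model, lora_config)      # attach the trainable adapters
model.print_trainable_parameters()  # adapters are a tiny fraction of the 7B
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;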




&lt;h2&gt;
  
  
  Step 3: The Ablation Study
&lt;/h2&gt;

&lt;p&gt;This is where the project goes from "I fine-tuned a model" to "I systematically evaluated design choices." I ran 4 experiments:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Run&lt;/th&gt;
&lt;th&gt;Base Model&lt;/th&gt;
&lt;th&gt;LoRA Rank&lt;/th&gt;
&lt;th&gt;LR&lt;/th&gt;
&lt;th&gt;Result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Mistral-7B-Instruct-v0.3&lt;/td&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;2e-4&lt;/td&gt;
&lt;td&gt;78.4%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Mistral-7B-Instruct-v0.3&lt;/td&gt;
&lt;td&gt;64&lt;/td&gt;
&lt;td&gt;2e-4&lt;/td&gt;
&lt;td&gt;81.7%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Qwen2.5-7B-Instruct&lt;/td&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;2e-4&lt;/td&gt;
&lt;td&gt;83.1%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;4&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Qwen2.5-7B-Instruct&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;64&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2e-4&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;86.2%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;All tracked on &lt;a href="https://wandb.ai/shekharayush56-cognizant/toolforge" rel="noopener noreferrer"&gt;Weights &amp;amp; Biases&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Key Findings
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. Qwen &amp;gt; Mistral for tool routing (+4.5%)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Qwen2.5-7B-Instruct has stronger structured output capabilities out of the box. Its chat template naturally handles tool-call JSON, while Mistral required more prompt engineering to produce valid output.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. r=64 &amp;gt; r=16 for both models (+3%)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The routing task isn't trivial — the model needs to learn mappings between natural language patterns and 9 discrete tool categories plus argument extraction. r=16 underfits.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Eval loss converges by epoch 2&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;All runs showed minimal improvement after epoch 2, with some showing slight overfitting in epoch 3. &lt;code&gt;load_best_model_at_end=True&lt;/code&gt; was essential.&lt;/p&gt;
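
&lt;p&gt;Concretely, that finding reduces to per-epoch evaluation plus one flag. Something like the following (the learning rate and epoch count come from the runs above; the rest are assumed defaults, not the exact script):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="toolforge-qwen-r64",
    num_train_epochs=3,
    learning_rate=2e-4,
    eval_strategy="epoch",            # "evaluation_strategy" on older versions
    save_strategy="epoch",
    load_best_model_at_end=True,      # keep the epoch-2 checkpoint, not epoch 3
    metric_for_best_model="eval_loss",
    report_to="wandb",
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;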




&lt;h2&gt;
  
  
  Step 4: Integration
&lt;/h2&gt;

&lt;p&gt;The integration into the autonomous agent was designed as a &lt;strong&gt;feature flag&lt;/strong&gt; — zero behavior change in production unless explicitly enabled:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# In executor.py
&lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;is_toolforge_available&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;decision&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;toolforge_classify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;memory_hits&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;has_memory&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;router_source&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;toolforge&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;decision&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;decision&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;classify_query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;memory_hits&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;has_memory&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# heuristic fallback
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;toolforge_classify()&lt;/code&gt; function (sketched after this list):&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Loads the LoRA adapter lazily on first query&lt;/li&gt;
&lt;li&gt;Runs inference with greedy decoding (deterministic routing)&lt;/li&gt;
&lt;li&gt;Parses the model's tool-call output&lt;/li&gt;
&lt;li&gt;Maps specific tools to the agent's decision types (&lt;code&gt;web_search&lt;/code&gt; → &lt;code&gt;needs_search&lt;/code&gt;, no tool → &lt;code&gt;direct_answer&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Returns &lt;code&gt;None&lt;/code&gt; on any failure → heuristic takes over&lt;/li&gt;
&lt;/ol&gt;
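
&lt;p&gt;A condensed sketch of steps 2-5 (prompt handling and names are assumptions; the real implementation lives in the repo):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json

_TOOL_TO_DECISION = {"web_search": "needs_search", "no_tool": "direct_answer"}

def toolforge_classify_sketch(model, tokenizer, query):
    inputs = tokenizer.apply_chat_template(
        [{"role": "user", "content": query}],
        add_generation_prompt=True, return_tensors="pt",
    ).to(model.device)
    out = model.generate(inputs, max_new_tokens=64, do_sample=False)  # greedy
    text = tokenizer.decode(out[0, inputs.shape[-1]:], skip_special_tokens=True)
    try:
        call = json.loads(text)  # parse the model's tool-call JSON
        return _TOOL_TO_DECISION.get(call["tool"], call["tool"])
    except (ValueError, KeyError, TypeError):
        return None              # any failure: heuristic takes over
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;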

&lt;p&gt;This means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Production (HF Spaces, CPU)&lt;/strong&gt;: heuristic runs as before&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GPU-enabled environments&lt;/strong&gt;: ToolForge model handles routing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The code is always visible&lt;/strong&gt;: interviewers can see the integration pattern&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Heuristic (Regex)&lt;/th&gt;
&lt;th&gt;ToolForge (QLoRA)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Overall Accuracy&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~75%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;86.2%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Approach&lt;/td&gt;
&lt;td&gt;200 lines of regex&lt;/td&gt;
&lt;td&gt;Fine-tuned Qwen2.5-7B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Latency&lt;/td&gt;
&lt;td&gt;&amp;lt;1ms (regex)&lt;/td&gt;
&lt;td&gt;~200ms (GPU)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Handles edge cases&lt;/td&gt;
&lt;td&gt;❌ Constant regressions&lt;/td&gt;
&lt;td&gt;✅ Learned from data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Maintenance cost&lt;/td&gt;
&lt;td&gt;High (new regex per bug)&lt;/td&gt;
&lt;td&gt;Low (retrain on new data)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The ~11-point accuracy improvement (a 15% relative gain) isn't just a number — it means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"Say hello in Japanese" → correctly routes to &lt;code&gt;translate&lt;/code&gt; (was: missed entirely)&lt;/li&gt;
&lt;li&gt;"sqrt(44567)" → correctly routes to &lt;code&gt;calculator&lt;/code&gt; (was: matched "what" → search)&lt;/li&gt;
&lt;li&gt;"Compare React vs Vue for 2026" → correctly routes to &lt;code&gt;autonomous_task&lt;/code&gt; (was: partial match → direct answer)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  What I'd Do Differently
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;More data&lt;/strong&gt;: 1.1K examples is enough for proof-of-concept, but 5K+ would likely push accuracy above 90%. The distillation pipeline can scale — I just ran out of free API quota.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Argument extraction evaluation&lt;/strong&gt;: I evaluated tool &lt;em&gt;selection&lt;/em&gt; accuracy but didn't formally measure argument &lt;em&gt;extraction&lt;/em&gt; quality (e.g., did the model extract "Tokyo" from "weather in Tokyo?"). The traces show it works, but a proper F1 metric would be stronger.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;GGUF quantization for CPU inference&lt;/strong&gt;: The current serving path requires GPU. Converting to GGUF and using llama.cpp would enable CPU inference at ~1-2s latency — viable for production on free-tier hosting. (The adapter-merge step that conversion starts from is sketched after this list.)&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
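
&lt;p&gt;That path would begin by folding the adapter into the base weights. A sketch using the standard peft merge API (paths are placeholders):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct", torch_dtype=torch.bfloat16
)
merged = PeftModel.from_pretrained(base, "path/to/toolforge-adapter")
merged = merged.merge_and_unload()           # fold LoRA deltas into the base
merged.save_pretrained("toolforge-merged")   # then llama.cpp: convert + quantize
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;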




&lt;h2&gt;
  
  
  The Story
&lt;/h2&gt;

&lt;p&gt;This project isn't about fine-tuning. Fine-tuning is a technique — anyone can run &lt;code&gt;SFTTrainer&lt;/code&gt;. The story is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;I built an agent&lt;/strong&gt; with hand-crafted routing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;I measured where it failed&lt;/strong&gt; (75% accuracy, constant regex regressions)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;I generated training data&lt;/strong&gt; using teacher distillation from my own pipeline&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;I trained and compared models&lt;/strong&gt; with systematic ablation studies&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;I proved it works&lt;/strong&gt; with quantitative evaluation (86.2% accuracy)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;I integrated it&lt;/strong&gt; as a production-ready feature flag&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That's not a tutorial project. That's the ML engineering loop — identify problem → collect data → train → evaluate → deploy.&lt;/p&gt;




&lt;h2&gt;
  
  
  Links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;ToolForge repo&lt;/strong&gt;: &lt;a href="https://github.com/ayushh0110/toolforge" rel="noopener noreferrer"&gt;github.com/ayushh0110/toolforge&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Autonomous Agent&lt;/strong&gt;: &lt;a href="https://github.com/ayushh0110/autonomous-agent" rel="noopener noreferrer"&gt;github.com/ayushh0110/autonomous-agent&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;W&amp;amp;B Dashboard&lt;/strong&gt;: &lt;a href="https://wandb.ai/shekharayush56-cognizant/toolforge" rel="noopener noreferrer"&gt;wandb.ai/shekharayush56-cognizant/toolforge&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Live Agent Demo&lt;/strong&gt;: &lt;a href="https://autonomous-agent-one.vercel.app" rel="noopener noreferrer"&gt;autonomous-agent-one.vercel.app&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Built by &lt;a href="https://github.com/ayushh0110" rel="noopener noreferrer"&gt;Ayush Shekhar&lt;/a&gt;. If you're working on tool-use fine-tuning, I'd love to hear what approach you're taking — reach out on &lt;a href="https://linkedin.com/in/ayush-shekhar" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>opensource</category>
      <category>productivity</category>
    </item>
  </channel>
</rss>
