From Heuristics to Fine-Tuning: Teaching a Model to Use Tools

How I replaced 200 lines of regex with a fine-tuned 7B model — and why it was worth it.


The Problem

I built an autonomous AI agent with 9 tools: web search, calculator, weather, Wikipedia, translation, and more. The first question every request must answer is deceptively simple:

Which tool should I use?

My first solution was a heuristic classifier — a function called classify_query() that uses regex patterns to detect intent:

import re

# 200+ lines of patterns like this:
_SEARCH_INDICATORS = re.compile(
    r"\b(latest|current|news|today|recent|who won|score|price|"
    r"stock|update|happening|trending|release|launched)\b", re.IGNORECASE
)

_KNOWLEDGE_INDICATORS = re.compile(
    r"\b(explain|what is|how does|define|difference between|"
    r"why do|concept of|overview|meaning of|works)\b", re.IGNORECASE
)

It worked. About 75% of the time.

The remaining 25% was a graveyard of edge cases:

  • "Say hello in Japanese" → needs translate, matched nothing
  • "What's 15% of 2850" → needs calculator, matched "what's" and got routed to search
  • "Compare React vs Vue" → needs the autonomous executor, matched "compare" and got routed to a direct answer

Every fix introduced new regressions. Regex-based routing doesn't scale.


The Idea

What if the model itself could learn the routing? Not a giant foundation model — a small, fast 7B model fine-tuned specifically for this task. The hypothesis:

A QLoRA-adapted 7B model trained on 1K high-quality tool-call traces should outperform hand-crafted regex, with comparable latency.

This became ToolForge.


Step 1: Generating Training Data (The Hard Part)

I had 9 tools but no labeled dataset. Creating one manually would take weeks. Instead, I used teacher distillation: prompting a stronger model (Gemini 2.5 Flash) to generate high-quality training examples.

The Distillation Pipeline

User queries (generated) → Gemini 2.5 Flash → Structured tool-call traces → Filtered dataset
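To make that pipeline concrete, here's a minimal sketch of the generation step. It assumes the google-generativeai SDK; the TOOL_SCHEMA prompt and the trace format are my placeholders, not the project's actual code.

import json
import google.generativeai as genai

# Abbreviated placeholder; the real prompt describes all 9 tools and their arguments.
TOOL_SCHEMA = (
    "Tools: web_search, calculator, weather, translate, wikipedia, "
    "dictionary, datetime, unit_converter, no_tool."
)

genai.configure(api_key="YOUR_KEY")                  # one of the pooled free-tier keys
teacher = genai.GenerativeModel("gemini-2.5-flash")

def distill_trace(query: str) -> dict | None:
    """Ask the teacher model for a structured tool-call trace for one query."""
    prompt = (
        f"{TOOL_SCHEMA}\n\nQuery: {query}\n"
        'Reply with one JSON object: {"tool": "...", "arguments": {...}}'
    )
    response = teacher.generate_content(prompt)
    try:
        return {"query": query, "trace": json.loads(response.text)}
    except json.JSONDecodeError:
        return None                                  # dropped later by the quality filter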

The trick was diversity. I needed queries covering:

  • Single-tool requests ("What's the weather in Tokyo?")
  • Multi-tool chains ("What's the weather in Tokyo and convert 25°C to Fahrenheit?")
  • No-tool queries ("Explain recursion")
  • Ambiguous queries ("Tell me about Python" — search or direct answer?)
  • Edge cases ("sqrt of 44567" — calculator, not search)
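For reference, a single cleaned example ends up looking roughly like this (the field names are my illustration, not the project's exact schema):

example = {
    "messages": [
        {"role": "user", "content": "What's the weather in Tokyo?"},
        {"role": "assistant",
         "content": '{"tool": "weather", "arguments": {"location": "Tokyo"}}'},
    ]
}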

I built a ClientPool that rotates across 6 free-tier Gemini API keys to avoid rate limits:

import time


class ClientPool:
    """Round-robin pool of (key, model) slots for maximum throughput."""

    def __init__(self, slots, min_gap: float = 1.0):
        self._slots = slots        # each slot holds an API key, a model handle, and a last_used timestamp
        self._min_gap = min_gap    # minimum seconds between calls on the same key (default illustrative)

    def next_client(self):
        # Pick the slot that has rested the longest
        best = min(self._slots, key=lambda s: s.last_used)
        elapsed = time.time() - best.last_used
        if elapsed < self._min_gap:
            time.sleep(self._min_gap - elapsed)
        best.last_used = time.time()  # mark the slot as used so the pool keeps rotating
        return best

After filtering for quality (valid JSON, correct schema, no hallucinated tools), I had 1,173 clean examples — enough for fine-tuning.
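The quality filter itself is short. A rough sketch of those three checks (field names assumed, tool names taken from the table below):

import json

VALID_TOOLS = {"web_search", "calculator", "weather", "translate", "wikipedia",
               "dictionary", "datetime", "unit_converter", "no_tool"}

def is_clean(raw_trace: str) -> bool:
    try:
        call = json.loads(raw_trace)                          # 1. must be valid JSON
    except json.JSONDecodeError:
        return False
    if not isinstance(call, dict) or not {"tool", "arguments"} <= call.keys():
        return False                                          # 2. must match the expected schema
    return call["tool"] in VALID_TOOLS                        # 3. no hallucinated tool names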

Dataset Distribution

| Tool | Count | % |
| --- | ---: | ---: |
| web_search | 287 | 24% |
| calculator | 156 | 13% |
| weather | 143 | 12% |
| translate | 132 | 11% |
| wikipedia | 128 | 11% |
| no_tool | 119 | 10% |
| dictionary | 78 | 7% |
| datetime | 68 | 6% |
| unit_converter | 62 | 5% |

The distribution is intentionally skewed toward web_search — mirroring real-world query patterns.


Step 2: Training with QLoRA

I trained on a Kaggle T4 GPU (free tier). The key insight: you don't need an A100 for fine-tuning. QLoRA with 4-bit NF4 quantization fits a 7B model in ~6GB VRAM.

Configuration

import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,  # Double quantization saves ~0.4GB
)

lora_config = LoraConfig(
    r=64,                    # LoRA rank
    lora_alpha=128,          # Scaling factor (alpha/r = 2)
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
)

Why these choices?

  • r=64: Higher rank = more parameters = more capacity to learn tool routing patterns. I tested r=16 (too small) and r=64 (sweet spot).
  • All attention + MLP layers: Tool routing requires understanding query intent (attention) AND mapping it to structured output (MLP). Targeting only attention heads wasn't enough.
  • alpha=128 (2×r): Standard scaling that prevents gradient instability.
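For completeness, here is roughly how the two configs plug together with transformers and peft (a sketch; the real training notebook has more setup):

from transformers import AutoModelForCausalLM
from peft import get_peft_model, prepare_model_for_kbit_training

base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct",
    quantization_config=bnb_config,          # 4-bit NF4 weights from above
    device_map="auto",
)
base = prepare_model_for_kbit_training(base)  # standard prep for k-bit training (fp32 norms, input grads)
model = get_peft_model(base, lora_config)     # attach the rank-64 LoRA adapters
model.print_trainable_parameters()            # only the adapter weights are trainable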

Step 3: The Ablation Study

This is where the project goes from "I fine-tuned a model" to "I systematically evaluated design choices." I ran 4 experiments:

| Run | Base Model | LoRA Rank | LR | Result |
| --- | --- | --- | --- | --- |
| 1 | Mistral-7B-Instruct-v0.3 | 16 | 2e-4 | 78.4% |
| 2 | Mistral-7B-Instruct-v0.3 | 64 | 2e-4 | 81.7% |
| 3 | Qwen2.5-7B-Instruct | 16 | 2e-4 | 83.1% |
| 4 | Qwen2.5-7B-Instruct | 64 | 2e-4 | 86.2% |

All tracked on Weights & Biases.

Key Findings

1. Qwen > Mistral for tool routing (+4.5%)

Qwen2.5-7B-Instruct has stronger structured output capabilities out of the box. Its chat template naturally handles tool-call JSON, while Mistral required more prompt engineering to produce valid output.

2. r=64 > r=16 for both models (+3-4%)

The routing task isn't trivial — the model needs to learn mappings between natural language patterns and 9 discrete tool categories plus argument extraction. r=16 underfits.

3. Eval loss converges by epoch 2

All runs showed minimal improvement after epoch 2, with some showing slight overfitting in epoch 3. load_best_model_at_end=True was essential.
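In Trainer terms, the relevant knobs looked roughly like this (a sketch consistent with the runs above, not the exact script):

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="toolforge-qwen25-r64",   # illustrative name
    num_train_epochs=3,
    learning_rate=2e-4,
    eval_strategy="epoch",               # per-epoch eval loss is what showed convergence
    save_strategy="epoch",
    load_best_model_at_end=True,         # keep the best (epoch-2) checkpoint, not the overfit one
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    report_to="wandb",                   # runs tracked on Weights & Biases
)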


Step 4: Integration

The integration into the autonomous agent was designed as a feature flag — zero behavior change in production unless explicitly enabled:

# In executor.py
decision = None
if is_toolforge_available():
    decision = toolforge_classify(query, memory_hits, has_memory)
    router_source = "toolforge"

if decision is None:
    decision = classify_query(query, memory_hits, has_memory)  # heuristic fallback
    router_source = "heuristic"

The toolforge_classify() function:

  1. Loads the LoRA adapter lazily on first query
  2. Runs inference with greedy decoding (deterministic routing)
  3. Parses the model's tool-call output
  4. Maps specific tools to the agent's decision types (web_search → needs_search, no tool → direct_answer)
  5. Returns None on any failure → heuristic takes over

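Condensed into a sketch (the adapter path, prompt format, and decision names beyond those above are placeholders; only the control flow mirrors the real function):

import json
from transformers import AutoTokenizer
from peft import AutoPeftModelForCausalLM

_model = _tokenizer = None

def toolforge_classify(query, memory_hits, has_memory):
    global _model, _tokenizer
    try:
        if _model is None:                                      # 1. lazy load on first query
            _model = AutoPeftModelForCausalLM.from_pretrained(
                "toolforge-qwen25-r64", device_map="auto")
            _tokenizer = AutoTokenizer.from_pretrained("toolforge-qwen25-r64")
        prompt = _tokenizer.apply_chat_template(
            [{"role": "user", "content": query}],
            tokenize=False, add_generation_prompt=True)
        inputs = _tokenizer(prompt, return_tensors="pt").to(_model.device)
        out = _model.generate(**inputs, max_new_tokens=128,
                              do_sample=False)                  # 2. greedy decoding
        text = _tokenizer.decode(out[0][inputs["input_ids"].shape[1]:],
                                 skip_special_tokens=True)
        call = json.loads(text)                                 # 3. parse the tool call
        mapping = {"web_search": "needs_search", "no_tool": "direct_answer"}
        return mapping.get(call["tool"], call["tool"])          # 4. map to decision types
    except Exception:
        return None                                             # 5. any failure → heuristic fallback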
This means:

  • Production (HF Spaces, CPU): heuristic runs as before
  • GPU-enabled environments: ToolForge model handles routing
  • The code is always visible: interviewers can see the integration pattern

Results

| Metric | Heuristic (Regex) | ToolForge (QLoRA) |
| --- | --- | --- |
| Overall accuracy | ~75% | 86.2% |
| Approach | 200 lines of regex | Fine-tuned Qwen2.5-7B |
| Latency | 0ms (regex) | ~200ms (GPU) |
| Handles edge cases | ❌ Constant regressions | ✅ Learned from data |
| Maintenance cost | High (new regex per bug) | Low (retrain on new data) |

The jump from ~75% to 86.2% accuracy isn't just a number. It means:

  • "Say hello in Japanese" → correctly routes to translate (was: missed entirely)
  • "sqrt(44567)" → correctly routes to calculator (was: matched "what" → search)
  • "Compare React vs Vue for 2026" → correctly routes to autonomous_task (was: partial match → direct answer)

What I'd Do Differently

  1. More data: 1.1K examples is enough for proof-of-concept, but 5K+ would likely push accuracy above 90%. The distillation pipeline can scale — I just ran out of free API quota.

  2. Argument extraction evaluation: I evaluated tool selection accuracy but didn't formally measure argument extraction quality (e.g., did the model extract "Tokyo" from "weather in Tokyo?"). The traces show it works, but a proper F1 metric (sketched after this list) would be stronger.

  3. GGUF quantization for CPU inference: The current serving path requires GPU. Converting to GGUF and using llama.cpp would enable CPU inference at ~1-2s latency — viable for production on free-tier hosting.
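For point 2, the metric itself would be straightforward: token-level F1 over the extracted argument values, roughly like this (names illustrative):

def arg_f1(pred_args: dict, gold_args: dict) -> float:
    """Token-level F1 between predicted and gold argument values."""
    pred = " ".join(str(v) for v in pred_args.values()).lower().split()
    gold = " ".join(str(v) for v in gold_args.values()).lower().split()
    if not pred or not gold:
        return float(pred == gold)                 # both empty → 1.0, one empty → 0.0
    common = sum(min(pred.count(t), gold.count(t)) for t in set(pred))
    if common == 0:
        return 0.0
    precision, recall = common / len(pred), common / len(gold)
    return 2 * precision * recall / (precision + recall)

# e.g. arg_f1({"location": "Tokyo"}, {"location": "Tokyo"}) == 1.0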


The Story

This project isn't about fine-tuning. Fine-tuning is a technique — anyone can run SFTTrainer. The story is:

  1. I built an agent with hand-crafted routing
  2. I measured where it failed (75% accuracy, constant regex regressions)
  3. I generated training data using teacher distillation from my own pipeline
  4. I trained and compared models with systematic ablation studies
  5. I proved it works with quantitative evaluation (86.2% accuracy)
  6. I integrated it as a production-ready feature flag

That's not a tutorial project. That's the ML engineering loop — identify problem → collect data → train → evaluate → deploy.




Built by Ayush Shekhar. If you're working on tool-use fine-tuning, I'd love to hear what approach you're taking — reach out on LinkedIn.
