From Heuristics to Fine-Tuning: Teaching a Model to Use Tools

How I replaced 200 lines of regex with a fine-tuned 7B model — and why it was worth it.


The Problem

I built an autonomous AI agent with 9 tools: web search, calculator, weather, Wikipedia, translation, and more. The first question every request must answer is deceptively simple:

Which tool should I use?

My first solution was a heuristic classifier — a function called classify_query() that uses regex patterns to detect intent:

import re

# 200+ lines of patterns like this:
_SEARCH_INDICATORS = re.compile(
    r"\b(latest|current|news|today|recent|who won|score|price|"
    r"stock|update|happening|trending|release|launched)\b", re.IGNORECASE
)

_KNOWLEDGE_INDICATORS = re.compile(
    r"\b(explain|what is|how does|define|difference between|"
    r"why do|concept of|overview|meaning of|works)\b", re.IGNORECASE
)

It worked. About 75% of the time.

The remaining 25% was a graveyard of edge cases:

  • "Say hello in Japanese" → needs translate, matched nothing
  • "What's 15% of 2850" → needs calculator, matched "what's" and got routed to search
  • "Compare React vs Vue" → needs the autonomous executor, matched "compare" and got routed to a direct answer

Every fix introduced new regressions. Regex-based routing doesn't scale.


The Idea

What if the model itself could learn the routing? Not a giant foundation model — a small, fast 7B model fine-tuned specifically for this task. The hypothesis:

A QLoRA-adapted 7B model trained on 1K high-quality tool-call traces should outperform hand-crafted regex, with comparable latency.

This became ToolForge.


Step 1: Generating Training Data (The Hard Part)

I had 9 tools but no labeled dataset. Creating one manually would take weeks. Instead, I used teacher distillation: prompting a stronger model (Gemini 2.5 Flash) to generate high-quality training examples.

The Distillation Pipeline

User queries (generated) → Gemini 2.5 Flash → Structured tool-call traces → Filtered dataset
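To make that pipeline concrete, here's a minimal sketch of the generation step. It assumes the google-generativeai SDK; the TOOL_SCHEMA prompt and the trace format are my placeholders, not the project's actual code.

import json
import google.generativeai as genai

# Abbreviated placeholder; the real prompt describes all 9 tools and their arguments.
TOOL_SCHEMA = (
    "Tools: web_search, calculator, weather, translate, wikipedia, "
    "dictionary, datetime, unit_converter, no_tool."
)

genai.configure(api_key="YOUR_KEY")                  # one of the pooled free-tier keys
teacher = genai.GenerativeModel("gemini-2.5-flash")

def distill_trace(query: str) -> dict | None:
    """Ask the teacher model for a structured tool-call trace for one query."""
    prompt = (
        f"{TOOL_SCHEMA}\n\nQuery: {query}\n"
        'Reply with one JSON object: {"tool": "...", "arguments": {...}}'
    )
    response = teacher.generate_content(prompt)
    try:
        return {"query": query, "trace": json.loads(response.text)}
    except json.JSONDecodeError:
        return None                                  # dropped later by the quality filter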

The trick was diversity. I needed queries covering:

  • Single-tool requests ("What's the weather in Tokyo?")
  • Multi-tool chains ("What's the weather in Tokyo and convert 25°C to Fahrenheit?")
  • No-tool queries ("Explain recursion")
  • Ambiguous queries ("Tell me about Python" — search or direct answer?)
  • Edge cases ("sqrt of 44567" — calculator, not search)
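For reference, a single cleaned example ends up looking roughly like this (the field names are my illustration, not the project's exact schema):

example = {
    "messages": [
        {"role": "user", "content": "What's the weather in Tokyo?"},
        {"role": "assistant",
         "content": '{"tool": "weather", "arguments": {"location": "Tokyo"}}'},
    ]
}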

I built a ClientPool that rotates across 6 free-tier Gemini API keys to avoid rate limits:

import time


class ClientPool:
    """Round-robin pool of (key, model) slots for maximum throughput."""

    def __init__(self, slots, min_gap: float = 1.0):
        self._slots = slots        # each slot holds an API key, a model handle, and a last_used timestamp
        self._min_gap = min_gap    # minimum seconds between calls on the same key (default illustrative)

    def next_client(self):
        # Pick the slot that has rested the longest
        best = min(self._slots, key=lambda s: s.last_used)
        elapsed = time.time() - best.last_used
        if elapsed < self._min_gap:
            time.sleep(self._min_gap - elapsed)
        best.last_used = time.time()  # mark the slot as used so the pool keeps rotating
        return best

After filtering for quality (valid JSON, correct schema, no hallucinated tools), I had 1,173 clean examples — enough for fine-tuning.
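The quality filter itself is short. A rough sketch of those three checks (field names assumed, tool names taken from the table below):

import json

VALID_TOOLS = {"web_search", "calculator", "weather", "translate", "wikipedia",
               "dictionary", "datetime", "unit_converter", "no_tool"}

def is_clean(raw_trace: str) -> bool:
    try:
        call = json.loads(raw_trace)                          # 1. must be valid JSON
    except json.JSONDecodeError:
        return False
    if not isinstance(call, dict) or not {"tool", "arguments"} <= call.keys():
        return False                                          # 2. must match the expected schema
    return call["tool"] in VALID_TOOLS                        # 3. no hallucinated tool names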

Dataset Distribution

| Tool | Count | % |
| --- | ---: | ---: |
| web_search | 287 | 24% |
| calculator | 156 | 13% |
| weather | 143 | 12% |
| translate | 132 | 11% |
| wikipedia | 128 | 11% |
| no_tool | 119 | 10% |
| dictionary | 78 | 7% |
| datetime | 68 | 6% |
| unit_converter | 62 | 5% |

The distribution is intentionally skewed toward web_search — mirroring real-world query patterns.


Step 2: Training with QLoRA

I trained on a Kaggle T4 GPU (free tier). The key insight: you don't need an A100 for fine-tuning. QLoRA with 4-bit NF4 quantization fits a 7B model in ~6GB VRAM.

Configuration

import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,  # Double quantization saves ~0.4GB
)

lora_config = LoraConfig(
    r=64,                    # LoRA rank
    lora_alpha=128,          # Scaling factor (alpha/r = 2)
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
)

Why these choices?

  • r=64: Higher rank = more parameters = more capacity to learn tool routing patterns. I tested r=16 (too small) and r=64 (sweet spot).
  • All attention + MLP layers: Tool routing requires understanding query intent (attention) AND mapping it to structured output (MLP). Targeting only attention heads wasn't enough.
  • alpha=128 (2×r): Standard scaling that prevents gradient instability.
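For completeness, here is roughly how the two configs plug together with transformers and peft (a sketch; the real training notebook has more setup):

from transformers import AutoModelForCausalLM
from peft import get_peft_model, prepare_model_for_kbit_training

base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct",
    quantization_config=bnb_config,          # 4-bit NF4 weights from above
    device_map="auto",
)
base = prepare_model_for_kbit_training(base)  # standard prep for k-bit training (fp32 norms, input grads)
model = get_peft_model(base, lora_config)     # attach the rank-64 LoRA adapters
model.print_trainable_parameters()            # only the adapter weights are trainable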

Step 3: The Ablation Study

This is where the project goes from "I fine-tuned a model" to "I systematically evaluated design choices." I ran 4 experiments:

| Run | Base Model | LoRA Rank | LR | Result |
| --- | --- | --- | --- | --- |
| 1 | Mistral-7B-Instruct-v0.3 | 16 | 2e-4 | 78.4% |
| 2 | Mistral-7B-Instruct-v0.3 | 64 | 2e-4 | 81.7% |
| 3 | Qwen2.5-7B-Instruct | 16 | 2e-4 | 83.1% |
| 4 | Qwen2.5-7B-Instruct | 64 | 2e-4 | 86.2% |

All tracked on Weights & Biases.

Key Findings

1. Qwen > Mistral for tool routing (+4.5%)

Qwen2.5-7B-Instruct has stronger structured output capabilities out of the box. Its chat template naturally handles tool-call JSON, while Mistral required more prompt engineering to produce valid output.

2. r=64 > r=16 for both models (+3-4%)

The routing task isn't trivial — the model needs to learn mappings between natural language patterns and 9 discrete tool categories plus argument extraction. r=16 underfits.

3. Eval loss converges by epoch 2

All runs showed minimal improvement after epoch 2, with some showing slight overfitting in epoch 3. load_best_model_at_end=True was essential.
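In Trainer terms, the relevant knobs looked roughly like this (a sketch consistent with the runs above, not the exact script):

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="toolforge-qwen25-r64",   # illustrative name
    num_train_epochs=3,
    learning_rate=2e-4,
    eval_strategy="epoch",               # per-epoch eval loss is what showed convergence
    save_strategy="epoch",
    load_best_model_at_end=True,         # keep the best (epoch-2) checkpoint, not the overfit one
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    report_to="wandb",                   # runs tracked on Weights & Biases
)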


Step 4: Integration

The integration into the autonomous agent was designed as a feature flag — zero behavior change in production unless explicitly enabled:

# In executor.py
decision = None
if is_toolforge_available():
    decision = toolforge_classify(query, memory_hits, has_memory)
    router_source = "toolforge"

if decision is None:
    decision = classify_query(query, memory_hits, has_memory)  # heuristic fallback
    router_source = "heuristic"

The toolforge_classify() function:

  1. Loads the LoRA adapter lazily on first query
  2. Runs inference with greedy decoding (deterministic routing)
  3. Parses the model's tool-call output
  4. Maps specific tools to the agent's decision types (web_search → needs_search, no tool → direct_answer)
  5. Returns None on any failure → heuristic takes over

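Condensed into a sketch (the adapter path, prompt format, and decision names beyond those above are placeholders; only the control flow mirrors the real function):

import json
from transformers import AutoTokenizer
from peft import AutoPeftModelForCausalLM

_model = _tokenizer = None

def toolforge_classify(query, memory_hits, has_memory):
    global _model, _tokenizer
    try:
        if _model is None:                                      # 1. lazy load on first query
            _model = AutoPeftModelForCausalLM.from_pretrained(
                "toolforge-qwen25-r64", device_map="auto")
            _tokenizer = AutoTokenizer.from_pretrained("toolforge-qwen25-r64")
        prompt = _tokenizer.apply_chat_template(
            [{"role": "user", "content": query}],
            tokenize=False, add_generation_prompt=True)
        inputs = _tokenizer(prompt, return_tensors="pt").to(_model.device)
        out = _model.generate(**inputs, max_new_tokens=128,
                              do_sample=False)                  # 2. greedy decoding
        text = _tokenizer.decode(out[0][inputs["input_ids"].shape[1]:],
                                 skip_special_tokens=True)
        call = json.loads(text)                                 # 3. parse the tool call
        mapping = {"web_search": "needs_search", "no_tool": "direct_answer"}
        return mapping.get(call["tool"], call["tool"])          # 4. map to decision types
    except Exception:
        return None                                             # 5. any failure → heuristic fallback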
This means:

  • Production (HF Spaces, CPU): heuristic runs as before
  • GPU-enabled environments: ToolForge model handles routing
  • The code is always visible: interviewers can see the integration pattern

Results

| Metric | Heuristic (Regex) | ToolForge (QLoRA) |
| --- | --- | --- |
| Overall accuracy | ~75% | 86.2% |
| Approach | 200 lines of regex | Fine-tuned Qwen2.5-7B |
| Latency | 0ms (regex) | ~200ms (GPU) |
| Handles edge cases | ❌ Constant regressions | ✅ Learned from data |
| Maintenance cost | High (new regex per bug) | Low (retrain on new data) |

The jump from ~75% to 86.2% accuracy isn't just a number. It means:

  • "Say hello in Japanese" → correctly routes to translate (was: missed entirely)
  • "sqrt(44567)" → correctly routes to calculator (was: matched "what" → search)
  • "Compare React vs Vue for 2026" → correctly routes to autonomous_task (was: partial match → direct answer)

What I'd Do Differently

  1. More data: 1.1K examples is enough for proof-of-concept, but 5K+ would likely push accuracy above 90%. The distillation pipeline can scale — I just ran out of free API quota.

  2. Argument extraction evaluation: I evaluated tool selection accuracy but didn't formally measure argument extraction quality (e.g., did the model extract "Tokyo" from "weather in Tokyo?"). The traces show it works, but a proper F1 metric (sketched after this list) would be stronger.

  3. GGUF quantization for CPU inference: The current serving path requires GPU. Converting to GGUF and using llama.cpp would enable CPU inference at ~1-2s latency — viable for production on free-tier hosting.
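For point 2, the metric itself would be straightforward: token-level F1 over the extracted argument values, roughly like this (names illustrative):

def arg_f1(pred_args: dict, gold_args: dict) -> float:
    """Token-level F1 between predicted and gold argument values."""
    pred = " ".join(str(v) for v in pred_args.values()).lower().split()
    gold = " ".join(str(v) for v in gold_args.values()).lower().split()
    if not pred or not gold:
        return float(pred == gold)                 # both empty → 1.0, one empty → 0.0
    common = sum(min(pred.count(t), gold.count(t)) for t in set(pred))
    if common == 0:
        return 0.0
    precision, recall = common / len(pred), common / len(gold)
    return 2 * precision * recall / (precision + recall)

# e.g. arg_f1({"location": "Tokyo"}, {"location": "Tokyo"}) == 1.0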


The Story

This project isn't about fine-tuning. Fine-tuning is a technique — anyone can run SFTTrainer. The story is:

  1. I built an agent with hand-crafted routing
  2. I measured where it failed (75% accuracy, constant regex regressions)
  3. I generated training data using teacher distillation from my own pipeline
  4. I trained and compared models with systematic ablation studies
  5. I proved it works with quantitative evaluation (86.2% accuracy)
  6. I integrated it as a production-ready feature flag

That's not a tutorial project. That's the ML engineering loop — identify problem → collect data → train → evaluate → deploy.




Built by Ayush Shekhar. If you're working on tool-use fine-tuning, I'd love to hear what approach you're taking — reach out on LinkedIn.
