How I replaced 200 lines of regex with a fine-tuned 7B model — and why it was worth it.
The Problem
I built an autonomous AI agent with 9 tools: web search, calculator, weather, Wikipedia, translation, and more. The first question every request must answer is deceptively simple:
Which tool should I use?
My first solution was a heuristic classifier — a function called classify_query() that uses regex patterns to detect intent:
import re

# 200+ lines of patterns like this:
_SEARCH_INDICATORS = re.compile(
    r"\b(latest|current|news|today|recent|who won|score|price|"
    r"stock|update|happening|trending|release|launched)\b",
    re.IGNORECASE,
)
_KNOWLEDGE_INDICATORS = re.compile(
    r"\b(explain|what is|how does|define|difference between|"
    r"why do|concept of|overview|meaning of|works)\b",
    re.IGNORECASE,
)
It worked. About 75% of the time.
The remaining 25% was a graveyard of edge cases: "say hello in Japanese" (needs translate, matched nothing), "what's 15% of 2850" (needs calculator, matched what's → routed to search), "compare React vs Vue" (needs autonomous executor, matched compare → routed to direct answer).
Every fix introduced new regressions. Regex-based routing doesn't scale.
The Idea
What if the model itself could learn the routing? Not a giant foundation model — a small, fast 7B model fine-tuned specifically for this task. The hypothesis:
A QLoRA-adapted 7B model trained on 1K high-quality tool-call traces should outperform hand-crafted regex, with comparable latency.
This became ToolForge.
Step 1: Generating Training Data (The Hard Part)
I had 9 tools but no labeled dataset. Creating one manually would take weeks. Instead, I used teacher distillation — using a stronger model (Gemini 2.5 Flash) to generate high-quality training examples.
The Distillation Pipeline
User queries (generated) → Gemini 2.5 Flash → Structured tool-call traces → Filtered dataset
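To make that concrete, here is a minimal sketch of what a single distillation call might look like, assuming the google-generativeai SDK. The prompt wording, the distill_one helper, and the exact model string are illustrative; the real pipeline also handles retries, key rotation (see the ClientPool below), and stripping markdown fences before parsing.

```python
import google.generativeai as genai  # assumption: the google-generativeai SDK

genai.configure(api_key="YOUR_GEMINI_KEY")        # one of the free-tier keys
teacher = genai.GenerativeModel("gemini-2.5-flash")

DISTILL_PROMPT = """You are a tool-routing assistant with these tools:
web_search, calculator, weather, translate, wikipedia, dictionary,
datetime, unit_converter, or no_tool for a direct answer.

For the user query below, reply with JSON only:
{{"tool": "<tool_name>", "arguments": {{...}}, "reasoning": "<one sentence>"}}

Query: {query}"""

def distill_one(query: str) -> dict:
    """Ask the teacher model for one structured tool-call trace."""
    resp = teacher.generate_content(DISTILL_PROMPT.format(query=query))
    return {"query": query, "trace": resp.text}   # parsed and validated later
```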
The trick was diversity. I needed queries covering:
- Single-tool requests ("What's the weather in Tokyo?")
- Multi-tool chains ("What's the weather in Tokyo and convert 25°C to Fahrenheit?")
- No-tool queries ("Explain recursion")
- Ambiguous queries ("Tell me about Python" — search or direct answer?)
- Edge cases ("sqrt of 44567" — calculator, not search)
I built a ClientPool that rotates across 6 free-tier Gemini API keys to avoid rate limits:
import time

class ClientPool:
    """Round-robin pool of (key, model) slots for maximum throughput."""

    def next_client(self):
        # Pick the slot that has rested the longest
        best = min(self._slots, key=lambda s: s.last_used)
        elapsed = time.time() - best.last_used
        if elapsed < self._min_gap:
            # Respect the per-key minimum gap so no key trips its rate limit
            time.sleep(self._min_gap - elapsed)
        return best
After filtering for quality (valid JSON, correct schema, no hallucinated tools), I had 1,173 clean examples — enough for fine-tuning.
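The filter itself doesn't need to be clever. A sketch of the checks involved (field names like "tool" and "arguments" are assumptions about the trace schema; the tool list comes from the distribution table below):

```python
import json

VALID_TOOLS = {
    "web_search", "calculator", "weather", "translate", "wikipedia",
    "no_tool", "dictionary", "datetime", "unit_converter",
}

def is_clean(raw_trace: str) -> bool:
    """Keep only traces that are valid JSON, match the schema, and use real tools."""
    try:
        trace = json.loads(raw_trace)            # valid JSON?
    except (json.JSONDecodeError, TypeError):
        return False
    if not isinstance(trace, dict):
        return False
    if "tool" not in trace or "arguments" not in trace:
        return False                             # schema violation
    if trace["tool"] not in VALID_TOOLS:
        return False                             # hallucinated tool name
    return isinstance(trace["arguments"], dict)  # arguments must be an object
```

Running every distilled trace through a check like this is what cut the raw teacher output down to the 1,173 clean examples.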
Dataset Distribution
| Tool | Count | % |
|---|---|---|
| web_search | 287 | 24% |
| calculator | 156 | 13% |
| weather | 143 | 12% |
| translate | 132 | 11% |
| wikipedia | 128 | 11% |
| no_tool | 119 | 10% |
| dictionary | 78 | 7% |
| datetime | 68 | 6% |
| unit_converter | 62 | 5% |
The distribution is intentionally skewed toward web_search — mirroring real-world query patterns.
Step 2: Training with QLoRA
I trained on a Kaggle T4 GPU (free tier). The key insight: you don't need an A100 for fine-tuning. QLoRA with 4-bit NF4 quantization fits a 7B model in ~6GB VRAM.
Configuration
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,  # Double quantization saves ~0.4GB
)

lora_config = LoraConfig(
    r=64,              # LoRA rank
    lora_alpha=128,    # Scaling factor (alpha/r = 2)
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
)
Why these choices?
- r=64: Higher rank = more parameters = more capacity to learn tool routing patterns. I tested r=16 (too small) and r=64 (sweet spot).
- All attention + MLP layers: Tool routing requires understanding query intent (attention) AND mapping it to structured output (MLP). Targeting only attention heads wasn't enough.
- alpha=128 (2×r): Standard scaling that prevents gradient instability.
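With those two configs in place, the training call itself is thin. A rough sketch of how they plug into trl's SFTTrainer (batch size, accumulation steps, and the output directory are illustrative, argument names drift a bit between trl versions, and train_ds / eval_ds stand in for the chat-formatted traces):

```python
from transformers import AutoModelForCausalLM
from peft import prepare_model_for_kbit_training
from trl import SFTConfig, SFTTrainer

BASE = "Qwen/Qwen2.5-7B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    BASE,
    quantization_config=bnb_config,   # 4-bit NF4 config from above (~6GB on a T4)
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

args = SFTConfig(
    output_dir="toolforge-qwen25-r64",
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,    # effective batch size of 16
    learning_rate=2e-4,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,      # eval loss converges by epoch 2
    report_to="wandb",
)

trainer = SFTTrainer(
    model=model,
    args=args,
    train_dataset=train_ds,           # chat-formatted tool-call traces
    eval_dataset=eval_ds,
    peft_config=lora_config,          # r=64 LoRA config from above
)
trainer.train()
```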
Step 3: The Ablation Study
This is where the project goes from "I fine-tuned a model" to "I systematically evaluated design choices." I ran 4 experiments:
| Run | Base Model | LoRA Rank | LR | Result |
|---|---|---|---|---|
| 1 | Mistral-7B-Instruct-v0.3 | 16 | 2e-4 | 78.4% |
| 2 | Mistral-7B-Instruct-v0.3 | 64 | 2e-4 | 81.7% |
| 3 | Qwen2.5-7B-Instruct | 16 | 2e-4 | 83.1% |
| 4 | Qwen2.5-7B-Instruct | 64 | 2e-4 | 86.2% |
All tracked on Weights & Biases.
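Each cell in that table was a separate W&B run, so the comparison is just a sorted table in the dashboard. A sketch of how a run might be registered (project and run names are illustrative; with report_to="wandb" the trainer streams loss curves on its own):

```python
import wandb

run = wandb.init(
    project="toolforge",
    name="qwen2.5-7b-r64-lr2e-4",     # one run per ablation cell
    config={
        "base_model": "Qwen/Qwen2.5-7B-Instruct",
        "lora_rank": 64,
        "lora_alpha": 128,
        "learning_rate": 2e-4,
    },
)
# ... training happens here; the trainer logs train/eval loss every epoch ...
wandb.log({"tool_selection_accuracy": 0.862})  # final routing accuracy on the eval split
run.finish()
```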
Key Findings
1. Qwen > Mistral for tool routing (+4.5%)
Qwen2.5-7B-Instruct has stronger structured output capabilities out of the box. Its chat template naturally handles tool-call JSON, while Mistral required more prompt engineering to produce valid output.
2. r=64 > r=16 for both models (+3-4%)
The routing task isn't trivial — the model needs to learn mappings between natural language patterns and 9 discrete tool categories plus argument extraction. r=16 underfits.
3. Eval loss converges by epoch 2
All runs showed minimal improvement after epoch 2, with some showing slight overfitting in epoch 3. load_best_model_at_end=True was essential.
Step 4: Integration
The integration into the autonomous agent was designed as a feature flag — zero behavior change in production unless explicitly enabled:
# In executor.py
decision = None
if is_toolforge_available():
    decision = toolforge_classify(query, memory_hits, has_memory)
    router_source = "toolforge"
if decision is None:
    decision = classify_query(query, memory_hits, has_memory)  # heuristic fallback
The toolforge_classify() function:
- Loads the LoRA adapter lazily on first query
- Runs inference with greedy decoding (deterministic routing)
- Parses the model's tool-call output
- Maps specific tools to the agent's decision types (web_search → needs_search, no_tool → direct_answer)
- Returns None on any failure → heuristic takes over
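A condensed sketch of how such a function can be wired up (the adapter path, the return shape, and ignoring the memory signals here are simplifications of the real implementation):

```python
import json
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

_model, _tokenizer = None, None                      # loaded lazily on first query
_TOOL_TO_DECISION = {"web_search": "needs_search", "no_tool": "direct_answer"}

def toolforge_classify(query, memory_hits, has_memory):
    global _model, _tokenizer
    try:
        if _model is None:
            _tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
            base = AutoModelForCausalLM.from_pretrained(
                "Qwen/Qwen2.5-7B-Instruct", torch_dtype=torch.bfloat16, device_map="auto"
            )
            _model = PeftModel.from_pretrained(base, "toolforge-qwen25-r64")  # LoRA adapter

        prompt_ids = _tokenizer.apply_chat_template(
            [{"role": "user", "content": query}],
            add_generation_prompt=True, return_tensors="pt",
        ).to(_model.device)
        out = _model.generate(prompt_ids, max_new_tokens=128, do_sample=False)  # greedy decoding
        text = _tokenizer.decode(out[0][prompt_ids.shape[-1]:], skip_special_tokens=True)

        call = json.loads(text)                      # expects {"tool": ..., "arguments": {...}}
        decision = _TOOL_TO_DECISION.get(call["tool"], call["tool"])
        return {"decision": decision, "arguments": call.get("arguments", {})}
    except Exception:
        return None                                  # any failure -> heuristic takes over
```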
This means:
- Production (HF Spaces, CPU): heuristic runs as before
- GPU-enabled environments: ToolForge model handles routing
- The code is always visible: interviewers can see the integration pattern
Results
| Metric | Heuristic (Regex) | ToolForge (QLoRA) |
|---|---|---|
| Overall Accuracy | ~75% | 86.2% |
| Approach | 200 lines of regex | Fine-tuned Qwen2.5-7B |
| Latency | &lt;1ms (regex) | ~200ms (GPU) |
| Handles edge cases | ❌ Constant regressions | ✅ Learned from data |
| Maintenance cost | High (new regex per bug) | Low (retrain on new data) |
The jump from ~75% to 86.2% isn't just a number. It means:
- "Say hello in Japanese" → correctly routes to translate (was: missed entirely)
- "sqrt(44567)" → correctly routes to calculator (was: matched "what" → search)
- "Compare React vs Vue for 2026" → correctly routes to autonomous_task (was: partial match → direct answer)
What I'd Do Differently
More data: 1.1K examples is enough for proof-of-concept, but 5K+ would likely push accuracy above 90%. The distillation pipeline can scale — I just ran out of free API quota.
Argument extraction evaluation: I evaluated tool selection accuracy but didn't formally measure argument extraction quality (e.g., did the model extract "Tokyo" from "weather in Tokyo?"). The traces show it works, but a proper F1 metric would be stronger.
GGUF quantization for CPU inference: The current serving path requires GPU. Converting to GGUF and using llama.cpp would enable CPU inference at ~1-2s latency — viable for production on free-tier hosting.
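For what it's worth, the CPU serving path would be short once a GGUF export exists. A sketch with llama-cpp-python, assuming the adapter has already been merged into the base model and converted with llama.cpp's conversion tooling (file name and quantization level are hypothetical):

```python
from llama_cpp import Llama

# Hypothetical GGUF export of the merged base + LoRA model, 4-bit quantized
llm = Llama(model_path="toolforge-qwen25-merged-q4_k_m.gguf", n_ctx=2048, n_threads=4)

out = llm(
    "Route this query to a tool and reply with JSON only: What's the weather in Tokyo?",
    max_tokens=128,
    temperature=0.0,   # greedy, same deterministic routing as the GPU path
)
print(out["choices"][0]["text"])
```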
The Story
This project isn't about fine-tuning. Fine-tuning is a technique — anyone can run SFTTrainer. The story is:
- I built an agent with hand-crafted routing
- I measured where it failed (75% accuracy, constant regex regressions)
- I generated training data using teacher distillation from my own pipeline
- I trained and compared models with systematic ablation studies
- I proved it works with quantitative evaluation (86.2% accuracy)
- I integrated it as a production-ready feature flag
That's not a tutorial project. That's the ML engineering loop — identify problem → collect data → train → evaluate → deploy.
Links
- ToolForge repo: github.com/ayushh0110/toolforge
- Autonomous Agent: github.com/ayushh0110/autonomous-agent
- W&B Dashboard: wandb.ai/shekharayush56-cognizant/toolforge
- Live Agent Demo: autonomous-agent-one.vercel.app
Built by Ayush Shekhar. If you're working on tool-use fine-tuning, I'd love to hear what approach you're taking — reach out on LinkedIn.