Thanh Lam Hoang

Fine-tuning SmolAgents using Tools with Reinforcement Learning

When running SmolAgents CodeAct for tool calling, we often observe that smaller open-source models struggle with complex tool-use tasks — and sometimes even fail at simple ones. While careful prompt engineering can mitigate this problem, it’s not a sustainable solution, especially in dynamic agentic systems where any workflow change can disrupt tool-calling accuracy.

To address this issue at its core, the ideal approach is to train or fine-tune models to use tools effectively. However, this is a non-trivial task that requires setting up complex machine learning pipelines tightly integrated with the agentic system — something that can be challenging for most developers.

To make this process easier, we’ve developed ToolBrain, a lightweight open-source library (MIT license) that removes the need to build these pipelines from scratch. For more information, see https://github.com/ToolBrain/ToolBrain

✨ Key Features

🤖 Learning algorithms: Supports GRPO, DPO, and supervised learning.
🎯 Flexible rewards: Define your own reward functions or use LLM-as-judge.
🔧 Tool management: Scalable retrieval for managing large tool collections.
📊 Knowledge distillation: Distill large teacher models into smaller student models for efficiency.
🚀 Zero-learn: Automatically generate training tasks.
⚡ Efficient training: Supports FP16 finetuning, LoRA, Unsloth, and BitsAndBytes for resource-efficient training.
🧠 Multiple agent frameworks: Supports SmolAgents and LangChain, with more coming soon.
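The "flexible rewards" feature means a reward is just a Python function you write yourself. As a sketch of the idea, here is a hypothetical custom reward that gives partial credit for numerically close answers — the exact signature ToolBrain expects may differ, and `reward_numeric_tolerance` is our own illustrative name, not part of the library:

```python
# Hypothetical custom reward: 1.0 for an exact match, partial credit
# for numerically close answers. Illustrates the idea of a reward as a
# plain Python function; ToolBrain's actual expected signature may differ.
def reward_numeric_tolerance(prediction: str, gold_answer: str) -> float:
    """Score a predicted answer against the gold answer, in [0.0, 1.0]."""
    try:
        pred = float(prediction.strip())
        gold = float(gold_answer.strip())
    except ValueError:
        # Non-numeric answers fall back to exact string matching
        return 1.0 if prediction.strip() == gold_answer.strip() else 0.0
    if pred == gold:
        return 1.0
    # Partial credit that decays with the relative error
    rel_err = abs(pred - gold) / max(abs(gold), 1.0)
    return max(0.0, 1.0 - rel_err)
```

A function like this could then be passed as the reward instead of the built-in `reward_exact_match` shown in the example below.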

Examples

from smolagents import tool, TransformersModel, CodeAgent
from toolbrain import Brain
from toolbrain.rewards import reward_exact_match

# --- 1. Define Tools and Reward Function (User-defined) ---
@tool
def add(a: int, b: int) -> int:
    """
    Add two integers.

    Args:
        a (int): First addend.
        b (int): Second addend.

    Returns:
        int: Sum of a and b.
    """
    return a + b


# --- 2. Prepare Training Data ---
training_dataset = [
    {
        "query": "Use the add tool to calculate 5 + 7",
        "gold_answer": "12"
    }
]


# --- 3. Create the Agent ---
model = TransformersModel(
    model_id="Qwen/Qwen2.5-0.5B-Instruct",  # use a bigger model for better results
    max_new_tokens=128
)

agent = CodeAgent(
    model=model,
    tools=[add],
    max_steps=1
)

# --- 4. Create the Brain ---
brain = Brain(
    agent,                           # Agent instance
    algorithm="GRPO",                # Algorithm choice
    reward_func=reward_exact_match   # Reward function; any Python function can serve as a custom reward
)

# --- 5. Train the Agent with GRPO ---
brain.train(training_dataset, num_iterations=10)

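The training data in step 2 is just a list of dicts with `query` and `gold_answer` keys. A slightly larger, hypothetical set for the same `add` tool (the extra examples are ours, not from the original post) might look like this:

```python
# Hypothetical extension of the step-2 training set. Each example pairs a
# natural-language query with the expected answer as a string, which
# reward_exact_match compares against the agent's final output.
training_dataset = [
    {"query": "Use the add tool to calculate 5 + 7", "gold_answer": "12"},
    {"query": "Use the add tool to calculate 13 + 29", "gold_answer": "42"},
    {"query": "Use the add tool to calculate 100 + 1", "gold_answer": "101"},
]
```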

Results
The following plot illustrates how ToolBrain enhances the tool usage accuracy of the small Qwen/Qwen2.5-0.5B-Instruct model after just 20 training steps using GRPO.
