DEV Community

张逸群
张逸群

Posted on

1

GRPO踩坑记录2

Image description

这次我更换了任务,结合了MMLUPro、MELD和MATH三个数据集,并统一使用了相同的系统提示,模型仍然采用Qwen2.5-0.5B-Instruct (足够小,且不需要冷启动)。

PROMPT=(
    "A conversation between User and Assistant. The user asks a question, and the Assistant solves it. "
    "The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning "
    "process and answer are enclosed with <think> </think> and <answer> </answer> tags, respectively, i.e.., "
    "<think> reasoning process here </think> <answer> answer here </answer>"
)
Enter fullscreen mode Exit fullscreen mode

奖励函数分为两种:第一种是 format_reward,顾名思义,用于检测输出是否符合指定的格式 <think> </think> <answer> </answer>;第二种是 accuracy_reward,用于检测输出的准确性。
从训练日志中可以清楚地看到,format_reward经历了一次显著的提升(这是否可以被称为一种涌现现象?),而accuracy_reward则一直在大幅波动,没有明显的上升趋势。这表明,模型只是通过“模仿”特定格式来提高得分,而没有通过先思考后行动的方式来显著提升任务性能。
我测试了一道题目,能够看到模型只是在拙劣的模仿这种格式,而没有学会真的思考。

<think> 
First, we need to understand what's happening when we "double" a number. Doubling a number means multiplying it by 2, so if we have a number x, doubling it will be represented as 2x.
</think>

<answer>Let's denote the number as x. According to the problem, doubling the number and then increasing it by 5 gives 15.

According to our formula, it would look like this: 
\[ 2x + 5 = 15 \]

Now we need to solve for x by first subtracting 5 from both sides of the equation:
\[ 2x + 5 - 5 = 15 - 5 \]
\[ 2x = 10 \]

Afterwards, we divide both sides by 2 to solve for x:
\[ \frac{2x}{2} = \frac{10}{2} \]
\[ x = 5 \]

Therefore, the number in question is 5.</answer>
Enter fullscreen mode Exit fullscreen mode

分析了一下,原因可能有两个:(1)模型尺寸太小;(2)奖励函数设置有问题。

Hostinger image

Get n8n VPS hosting 3x cheaper than a cloud solution

Get fast, easy, secure n8n VPS hosting from $4.99/mo at Hostinger. Automate any workflow using a pre-installed n8n application and no-code customization.

Start now

Top comments (0)

AWS Security LIVE!

Join us for AWS Security LIVE!

Discover the future of cloud security. Tune in live for trends, tips, and solutions from AWS and AWS Partners.

Learn More

👋 Kindness is contagious

Please leave a ❤️ or a friendly comment on this post if you found it helpful!

Okay