Zhaopeng Xuan

Posted on Jan 6 • Edited on Jan 9

P1 - A Close Reading of the Curriculum Agent in Agent0

#agents #ai #machinelearning #testing

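Original paper: Agent0: Unleashing Self-Evolving Agents from Zero Data via Tool-Integrated Reasoning

Original code: https://github.com/aiming-lab/Agent0

Goal: from a Quality Engineering perspective, understand how Agent0's curriculum agent is trained, and whether it can be used for automated, dynamic testing of agents and agent workflows.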

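Highlights of the paper

The paper's thesis is that two agents can improve each other's problem-solving ability in an upward spiral (a virtuous cycle) without any human-labeled data (hence "Zero Data"). The curriculum agent generates "questions", and the executor agent generates the corresponding "answers". At step t, the curriculum agent tries its best to find questions that confuse the executor agent from step t-1. The executor agent then tries its best to solve the questions generated at step t, which produces its weights (policy) for step t. The cycle repeats: the curriculum agent keeps locating the edge of the executor agent's capability, and the executor agent keeps expanding its capability range.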

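Highlights of the curriculum agent

As a QA engineer, I mainly care about how to find the executor agent's capability range, and how to generate effective test cases at the edge of that range, or at blind spots inside it, in order to detect quality degradation.

The main workflow of the curriculum agent is:

S1 - Use LLM L1 as the initial model/weights; the code uses Qwen/Qwen3-4B-Base.
S2 - Start reinforcement learning from the current weights of LLM L1.
S3 - Reinforcement learning: using the prompt below, have LLM L1 generate X predicts, each containing a question and a reference answer.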

[
    {
        "role": "system",
        "content": (
            "You are an expert competition-math problem setter.\n"
            "FIRST, in your private scratch-pad, think step-by-step to design a brand-new, non-trivial problem. "
            "The problem could come from any field of mathematics, including but not limited to algebra, geometry, number theory, combinatorics, prealgebra, probability, statistics, and calculus. "
            "Aim for a difficulty such that fewer than 30 % of advanced high-school students could solve it. "
            "Avoid re-using textbook clichés or famous contest problems.\n"
            "THEN, without revealing any of your private thoughts, output **exactly** the following two blocks:\n\n"
            "<question>\n"
            "{The full problem statement on one or more lines}\n"
            "</question>\n\n"
            r"\boxed{final_answer}"
            "\n\n"
            "Do NOT output anything else—no explanations, no extra markup."
        ),
    },
    {
        "role": "user",
        "content": (
            "Generate one new, challenging reasoning question now. "
            "Remember to format the output exactly as instructed."
        ),
    },
]

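S4 - Reinforcement learning: send the newly generated questions (without the reference answers) to the executor agent, which attempts each question N times and returns N answers.
S5 - Reinforcement learning: for each question x and its N answers, compute the reward and the advantage, then update the weights with GRPO.
Repeat.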
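
To make the "compute the advantage" part of S5 concrete, here is a minimal sketch of the group-relative normalization that GRPO is known for: each reward is compared with the mean of its own group and scaled by the group's standard deviation. This is not Agent0's training code (which runs through verl); the function name and the `eps` constant are my own.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own group (GRPO-style sketch)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division safe when every reward in the group is identical
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four questions generated from the same prompt
rewards = [0.0, 0.5, 1.0, 0.5]
advantages = group_relative_advantages(rewards)
```

Questions that score above the group average get a positive advantage and are reinforced; those below the average are pushed down.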

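Reward function of the curriculum agent

R(format): format correctness
R(uncertain): self-consistency, that is, how consistent the multiple answers to the same question are with each other
R(tool_usage): whether external tools are called frequently
-R(repetition): how repetitive the generated questions are (applied as a penalty)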
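
A rough sketch of how these four terms could be combined into one scalar. The weights, the use of the format check as a gate, and the clipping at zero are my assumptions for illustration; check the paper and the repository for the exact combination.

```python
def curriculum_reward(format_ok, r_uncertain, r_tool, r_repetition,
                      w_uncertain=1.0, w_tool=1.0, w_repetition=1.0):
    """Combine the reward terms (hypothetical weights, illustration only)."""
    # Assumption: a badly formatted generation earns nothing at all
    if not format_ok:
        return 0.0
    total = (w_uncertain * r_uncertain
             + w_tool * r_tool
             - w_repetition * r_repetition)
    # Assumption: the combined reward is never negative
    return max(0.0, total)


example = curriculum_reward(True, r_uncertain=0.5, r_tool=0.1, r_repetition=0.2)
```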

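Potential risks of training the curriculum agent (CA) alone, without considering the executor agent (EA)

1. Incorrect reference answer (golden answer)

The code below appears in start_vllm_server_tool.py. During CA training, this script is used to start the EA on vLLM; once the EA has produced answers with its current weights, the script computes the self-consistency score. Note: this issue does not arise during EA training; I raise it only for the case where the CA is used on its own. During EA training, the EA reads the questions generated by the trained CA but does not consult the answers the CA provides. Instead, it generates multiple answers to the same question again and uses a majority vote to produce a pseudo-label for training. EA training is outside the scope of this post.

The curriculum agent must be smarter than the executor agent. In step S3 above, if the generated reference answer (golden answer) is wrong while the executor agent's answer is correct, no valid score can be produced: if the reference answer != majority_ans, the score is set to 0 even though it has been computed. Here is the corresponding code in Agent0: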

    return {
        ...
        'score':    score if grade_answer(majority_ans, golden_answer) and score > 0.1 else 0,
        ...
    }
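
To see the effect of this gate in isolation, here is a small self-contained sketch. The `grade` argument stands in for Agent0's grade_answer and is simplified to plain equality; `gated_score` is a name I made up for this illustration.

```python
from collections import Counter


def gated_score(answers, golden_answer, score, grade=lambda a, b: a == b):
    """Keep the score only if the majority answer matches the golden answer."""
    majority_ans, _ = Counter(answers).most_common(1)[0]
    return score if grade(majority_ans, golden_answer) and score > 0.1 else 0


executor_answers = ["2", "2", "3", "2"]

# Golden answer is correct: the score survives
kept = gated_score(executor_answers, golden_answer="2", score=0.75)

# Golden answer is wrong: the score is zeroed even though the executor was right
dropped = gated_score(executor_answers, golden_answer="3", score=0.75)
```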

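Challenges in extending the curriculum agent

1. Computing self-consistency

One important term in the reward function uses the uncertainty of the answers to locate the edge of the executor agent's capability. It selects questions whose self-consistency falls within [0.3, 0.8]: not too easy, not too hard, but right around what confuses the EA.

In Agent0, this term is computed by finding the majority vote among the answers. Suppose question x1 has the answers [y1, y2, y3, y4]. For example, if x1 is "What is 1 + 1?" and the answers are [2, 2, 3, 2], the majority vote is "2", and the self-consistency and the resulting reward are: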

            self-consistency = 3 / 4 = 0.75
            R(uncertain) = 1 - 2 * | 0.75 - 0.5 | = 0.5
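
The same arithmetic as a short Python sketch. The function names are mine, and exact string matching is used as the simplest possible notion of "same answer".

```python
from collections import Counter


def self_consistency(answers):
    """Share of the answers that agree with the majority vote."""
    if not answers:
        raise ValueError("at least one answer is required")
    _, count = Counter(answers).most_common(1)[0]
    return count / len(answers)


def uncertainty_reward(p_hat):
    """Peaks at 1.0 when p_hat is 0.5 and falls to 0.0 at p_hat of 0 or 1."""
    return 1.0 - 2.0 * abs(p_hat - 0.5)


p_hat = self_consistency(["2", "2", "3", "2"])  # 0.75
reward = uncertainty_reward(p_hat)              # 0.5
```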

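Here is the problem: what if the question is subjective rather than objective? Suppose question x2 is "How is the weather in the Netherlands today?" and the answers are ["nice weather", "rain", "light rain", "cloudy turning to light rain"]. With the current Agent0 code, the self-consistency cannot be computed directly.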
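
A quick way to see why: exact-match voting over the four weather answers (rendered in English here) treats every answer as a different one, even though three of them agree that it is raining. This is only a demonstration of the limitation, not code from Agent0.

```python
from collections import Counter

answers = ["nice weather", "rain", "light rain", "cloudy turning to light rain"]

# Exact-match voting after trivial normalization
votes = Counter(a.strip().lower() for a in answers)
majority_ans, count = votes.most_common(1)[0]
share = count / len(answers)

# Every answer is its own cluster, so the "majority" is just the first one seen
distinct_answers = len(votes)
```

The majority share comes out at 0.25 regardless of how much the answers actually agree in meaning, so the score no longer says anything about the executor's uncertainty.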

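2. Incentive for tool calls

Another important reward term in the CA is the number of tool calls. In the paper, tool calls in the range [0, 4] earn the tool-call reward. The key point is that the count is extracted from the text of the question/task generated by the CA, not from the trajectory the EA produces when executing it. Describing tool calls explicitly in the question/task has one big advantage: the CA does not collapse because of gaps in the EA's abilities. If we extracted tool calls from the EA's trajectory instead, and the EA were unable to use a given tool, the CA would receive no reward. I think a combination of the two approaches might work better.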

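# The following is the implementation in Agent0, used when the CA computes its reward

import re


def calculate_tool_reward(predict: str, weight: float = 0.05, cap: int = 4) -> float:
    # An empty prediction earns no tool reward
    if not predict:
        return 0.0

    # Each output-block marker in the text counts as one tool call
    tool_call_count = len(re.findall(r"```output", predict))

    # Calls beyond `cap` earn nothing more
    capped_calls = min(tool_call_count, cap)

    return capped_calls * weight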
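
As a sketch of the combined approach I have in mind, the reward could blend the count taken from the CA's text with the count taken from the EA's trajectory. Everything here is hypothetical: `blended_tool_reward`, the `alpha` mixing factor, and the assumption that the EA trajectory uses the same output marker.

```python
import re

# Built from single backticks only so this snippet contains no literal fence
OUTPUT_MARKER = "`" * 3 + "output"


def count_tool_calls(text):
    """Count output-block markers in a piece of text."""
    return len(re.findall(re.escape(OUTPUT_MARKER), text or ""))


def blended_tool_reward(predict, trajectory, alpha=0.5, weight=0.05, cap=4):
    """Mix the CA-side and EA-side tool-call counts (hypothetical design)."""
    from_predict = min(count_tool_calls(predict), cap)
    from_trajectory = min(count_tool_calls(trajectory), cap)
    # alpha = 1.0 uses only the CA text; alpha = 0.0 uses only the EA trajectory
    blended = alpha * from_predict + (1 - alpha) * from_trajectory
    return blended * weight


ca_text = f"step 1 {OUTPUT_MARKER} 3 step 2 {OUTPUT_MARKER} 7"
ea_trajectory = "The executor answered without calling any tool."
reward = blended_tool_reward(ca_text, ea_trajectory)
```

With `alpha` above zero, the CA still earns part of the reward when the EA cannot use a tool, which keeps the advantage described above while still letting real EA behaviour influence the signal.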

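3. Multi-turn dialogue

When training the CA, the Agent0 paper recommends having the EA use a special system prompt to carry out a multi-turn dialogue instead of answering directly. The aim is to give the EA better execution results, which in turn pushes the CA to evolve better. During CA training the EA is really just a wrapper around an LLM and has no tool calling of its own, so the prompt below is specially written to force the EA to output the code to be executed whenever it needs a tool (running Python).

# This prompt comes from the Agent0 code. It differs somewhat from the A.2 Prompt in the paper, but serves the same purpose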

SYSTEM_PROMPT = (
    "Solve the following problem step by step. You now have the ability to selectively write executable Python code to enhance your reasoning process.\n"
    "First, provide your reasoning and write a self-contained Python code block wrapped in ```python ... ``` to help you calculate the answer. You must use the `print()` function to output the results.\n"
    "After you write the code block, STOP. I will execute it for you.\n"
    "I will then provide the output under 'Code execution result:'. You must use this result (even if it's an error) to continue your reasoning and provide the final answer.\n"
    "The final answer must be enclosed in \\boxed{...}."
    "Code Format:\n"
    "Each code snippet is wrapped between ```. You need to use print() to output intermediate results.\n"
    "Answer Format:\n"
    "The last part of your response should be: \\boxed{...}"
)
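
The loop that this prompt implies can be sketched as follows: ask the model, extract the Python block, run it, feed the output back under 'Code execution result:', and repeat until the model stops asking for code. This is my own simplified sketch, not the Agent0 implementation; `generate` is a placeholder for whatever calls the LLM, and running model-written code like this should only be done inside a sandbox.

```python
import re
import subprocess
import sys

FENCE = "`" * 3  # built from single backticks to keep this snippet fence-free
CODE_RE = re.compile(FENCE + r"python\n(.*?)" + FENCE, re.DOTALL)


def run_python(code, timeout=10):
    """Run a snippet in a separate interpreter; return stdout or the error text."""
    try:
        proc = subprocess.run([sys.executable, "-c", code],
                              capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return "TimeoutError: execution exceeded the time limit"
    return proc.stdout if proc.returncode == 0 else proc.stderr


def solve_with_tools(generate, question, max_turns=4):
    """Alternate between model turns and code execution (simplified sketch)."""
    transcript = question
    for _ in range(max_turns):
        reply = generate(transcript)
        transcript += "\n" + reply
        match = CODE_RE.search(reply)
        if match is None:
            return transcript  # no code requested, so treat this as the final answer
        transcript += "\nCode execution result:\n" + run_python(match.group(1))
    return transcript


def fake_generate(transcript):
    """Scripted stand-in for the LLM: one code turn, then the final answer."""
    if "Code execution result:" not in transcript:
        return "Let me compute.\n" + FENCE + "python\nprint(6 * 7)\n" + FENCE
    return "The answer is \\boxed{42}."


out = solve_with_tools(fake_generate, "What is 6 * 7?")
```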

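4. Use of vLLM

In the Agent0 example, the executor agent's inference uses verl with vLLM to run batched inference and obtain answers in bulk, and all of that is made possible by vLLM. Suppose the executor agent's base LLM is not served by vLLM but is a SOTA model (for example, Gemini 3), or the executor agent is not a single agent but an agent workflow. We would then bear a considerable time cost (running the executor agent) and monetary cost (token consumption), and could also hit rate limits.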
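
A back-of-the-envelope estimator makes the scale of that cost easier to reason about. All numbers in the example are made up for illustration; they are not measurements and not real prices for any provider.

```python
def estimate_executor_cost(num_questions, samples_per_question,
                           avg_tokens_per_sample, price_per_million_tokens):
    """Rough count of executor calls, tokens, and cost for one CA training step."""
    calls = num_questions * samples_per_question
    tokens = calls * avg_tokens_per_sample
    cost = tokens / 1_000_000 * price_per_million_tokens
    return calls, tokens, cost


# Hypothetical: 1000 questions, 10 answers each, 2000 tokens per answer,
# at a made-up price of 10 currency units per million tokens
calls, tokens, cost = estimate_executor_cost(1000, 10, 2000, 10.0)
```

Because every question is answered N times, the number of executor calls grows with the product of the two, which is also what makes rate limits easy to hit.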

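Experiment

I chose Runpod to train only the curriculum agent. A few things to note:

For both the curriculum agent and the executor agent, I used Qwen/Qwen3-0.6B-Base on 2x32G CUDA devices, isolating each agent on its own CUDA device.
With Qwen/Qwen3-4B-Base, training the curriculum agent needs at least 48G.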
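
One common way to pin each agent to its own device is to give each process a different CUDA_VISIBLE_DEVICES value. The helper below is a minimal sketch with a name of my own choosing; how the resulting environment is passed to the actual Agent0 launch scripts is left out.

```python
import os


def env_for_gpu(gpu_index, base_env=None):
    """Return a copy of the environment that exposes only one CUDA device."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    return env


curriculum_env = env_for_gpu(0)  # curriculum agent sees only device 0
executor_env = env_for_gpu(1)    # executor agent sees only device 1

# Each environment can then be passed to the process that runs that agent,
# for example via the env= argument of subprocess.Popen.
```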
