<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Zhaopeng Xuan</title>
    <description>The latest articles on DEV Community by Zhaopeng Xuan (@robin_xuan_nl).</description>
    <link>https://dev.to/robin_xuan_nl</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2854976%2F4a24425b-e641-4ee1-9719-b08c50cf1d7d.png</url>
      <title>DEV Community: Zhaopeng Xuan</title>
      <link>https://dev.to/robin_xuan_nl</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/robin_xuan_nl"/>
    <language>en</language>
    <item>
      <title>S1 - Comparing the Agent0 curriculum agent, AgentEvolver self-question, and CuES for test-case generation</title>
      <dc:creator>Zhaopeng Xuan</dc:creator>
      <pubDate>Mon, 12 Jan 2026 14:50:21 +0000</pubDate>
      <link>https://dev.to/robin_xuan_nl/s1-agent0-curriculum-agent-agentevolver-self-questioncueszai-ce-shi-yong-li-sheng-cheng-shi-de-bi-jiao-2d4j</link>
      <guid>https://dev.to/robin_xuan_nl/s1-agent0-curriculum-agent-agentevolver-self-questioncueszai-ce-shi-yong-li-sheng-cheng-shi-de-bi-jiao-2d4j</guid>
      <description>&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;Agent0 Curriculum Agent&lt;/th&gt;
&lt;th&gt;AgentEvolver Self-Question&lt;/th&gt;
&lt;th&gt;CuES&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Environment information dependency&lt;/td&gt;
&lt;td&gt;System role definition: a system role must be defined manually, e.g. "You are a math expert; write math problems"&lt;/td&gt;
&lt;td&gt;Environment profile: an environment profile must be defined manually (e.g. listing the entities in the environment), and the environment's seed_task is referenced as an example task (the seed_task is not executed directly)&lt;/td&gt;
&lt;td&gt;Environment concept pool: concepts must be extracted from the environment's seed tasks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Environment anchoring strategy&lt;/td&gt;
&lt;td&gt;n/a&lt;/td&gt;
&lt;td&gt;When building the exploration prompt, env_profile serves as the environment description, and the environment's seed questions serve as example questions during exploration, e.g. "A user might ask: [seed question]"&lt;/td&gt;
&lt;td&gt;When building the exploration prompt, the environment's built-in system prompt serves as the environment description&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Task synthesis path&lt;/td&gt;
&lt;td&gt;A single forward pass of the base model being trained generates the task together with a predicted final answer&lt;/td&gt;
&lt;td&gt;4 stages: &lt;br&gt; - Exploration: env_profile guides the generation of diverse trajectories&lt;br&gt; - Synthesis: synthetic tasks (containing query, confidence, ground_truth/action_sequence) are inferred back from env_profile and the trajectories&lt;br&gt; - Filtering: duplicate tasks, tasks without a predicted answer, and tasks that cannot be executed or fail to execute are removed&lt;br&gt; - Ground_truth rewriting: if a task executes successfully during filtering, its ground_truth is rewritten from the actually executed trajectory&lt;/td&gt;
&lt;td&gt;4 stages:&lt;br&gt;- Exploration: the concept pool and a human-provided exploration direction guide the generation of an (s, a, o) triplet at each step&lt;br&gt;- Synthesis: synthetic tasks (containing query, description, confidence, ground_truth/action_sequence) are inferred back from the multiple triplets [(s1,a1,o1), (s2,a2,o2)] generated under the same env_id&lt;br&gt;- Filtering: tasks that cannot be executed or fail to execute are removed&lt;br&gt;- Expansion: the queries of successfully executed synthetic tasks are rewritten at different difficulty levels with the same semantics, expanding the number of synthetic tasks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Guidance signal&lt;/td&gt;
&lt;td&gt;Four reward signals are computed from the feedback of the agent under test:&lt;br&gt;- reward for inconsistency&lt;br&gt;- reward for correct format&lt;br&gt;- reward for tool_usage&lt;br&gt;- penalty for repetitive questions&lt;br&gt;GRPO updates the curriculum agent's weights (policy) to steer it toward creating new tasks&lt;br&gt;The loop repeats, continually improving task generation based on the rewards&lt;/td&gt;
&lt;td&gt;A high-temperature LLM explores under the guidance of the given env_profile; during exploration, prompts steer between breadth-first and depth-first exploration&lt;br&gt;&lt;br&gt;In the synthesis stage, guidance comes from the trajectory, previously discovered new tasks, and env_profile&lt;/td&gt;
&lt;td&gt;The concept pool and a human-provided exploration direction keep exploration within scope, and exploration_memory prioritizes previously unexplored areas&lt;br&gt;&lt;br&gt;In the synthesis stage, generation is guided by env_description (the sandbox's system prompt), exploration_memory, and the triplets&lt;br&gt;&lt;br&gt;In the expansion stage, prompts drive same-semantics rewrites at different difficulty levels&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Difficulty control&lt;/td&gt;
&lt;td&gt;Controlled by a self-consistency threshold&lt;/td&gt;
&lt;td&gt;Controlled by the exploration principles in the prompt, expanding from breadth to depth&lt;/td&gt;
&lt;td&gt;Controlled by rewriting the queries of successfully executed synthetic tasks at different difficulty levels with the same semantics&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Task evolution logic&lt;/td&gt;
&lt;td&gt;Adversarial RL self-play: GRPO computes the advantage and backpropagates&lt;/td&gt;
&lt;td&gt;Static generation: continually generated trajectories serve as context, supplying experience to learn from&lt;/td&gt;
&lt;td&gt;After static generation, exploration memory and triplets serve as context for self-learning; tasks of different difficulty (with unchanged semantics) are then produced dynamically via query rewriting&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Definition of a valid test&lt;/td&gt;
&lt;td&gt;Test cases that leave the agent under test uncertain&lt;/td&gt;
&lt;td&gt;Physically executable cases whose ground_truth matches the target difficulty&lt;/td&gt;
&lt;td&gt;Physically executable cases whose ground_truth matches the target difficulty&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
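&lt;p&gt;The self-consistency threshold in the Agent0 column can be sketched as follows (a minimal illustration; the example answer lists and the [0.3, 0.8] band are assumptions based on the papers, not the exact Agent0 code):&lt;/p&gt;

```python
from collections import Counter

def self_consistency(answers):
    """Fraction of answers agreeing with the majority answer."""
    majority, count = Counter(answers).most_common(1)[0]
    return count / len(answers)

def is_informative(answers, low=0.3, high=0.8):
    """Keep questions that are neither trivial nor impossible."""
    c = self_consistency(answers)
    return c >= low and high >= c

# ["2", "2", "3", "2"] has consistency 0.75 -- inside the band, so kept
assert is_informative(["2", "2", "3", "2"])
# a unanimously answered question is too easy -- dropped
assert not is_informative(["2", "2", "2", "2"])
```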

</description>
    </item>
    <item>
      <title>P3 - Applying the CuES training-data synthesis approach to agent testing</title>
      <dc:creator>Zhaopeng Xuan</dc:creator>
      <pubDate>Mon, 12 Jan 2026 12:12:10 +0000</pubDate>
      <link>https://dev.to/robin_xuan_nl/p3-ceus-he-cheng-xun-lian-shu-ju-ji-de-fang-shi-zai-agentce-shi-zhong-de-ying-yong-3bbh</link>
      <guid>https://dev.to/robin_xuan_nl/p3-ceus-he-cheng-xun-lian-shu-ju-ji-de-fang-shi-zai-agentce-shi-zhong-de-ying-yong-3bbh</guid>
      <description>&lt;p&gt;Paper: &lt;a href="https://arxiv.org/pdf/2512.01311" rel="noopener noreferrer"&gt;link&lt;/a&gt;&lt;br&gt;
Code: &lt;a href="https://github.com/modelscope/AgentEvolver/tree/main/research/CuES" rel="noopener noreferrer"&gt;link&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Exploration in CuES
&lt;/h3&gt;

&lt;p&gt;CuES expects a human to provide an exploration direction (e.g. a single sentence), while the interactive sandbox supplies a concept pool. In the actual implementation, this concept pool is built by reading 100 seed tasks from the sandbox and letting an LLM extract the nouns/concepts in those seed tasks. This is what CuES calls top-down guidance.&lt;/p&gt;
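&lt;p&gt;The top-down concept pool can be sketched roughly like this (a hedged illustration: the function names and the noun-extraction stand-in are assumptions, not the actual CuES code):&lt;/p&gt;

```python
def build_concept_pool(seed_tasks, llm_extract_nouns):
    """Collect the nouns/concepts an LLM extracts from each seed task."""
    pool = set()
    for task in seed_tasks:
        pool.update(llm_extract_nouns(task))
    return sorted(pool)

# Stand-in for the LLM call, for illustration only: keep capitalized words.
fake_llm = lambda task: [w for w in task.split() if w.istitle()]
print(build_concept_pool(["Book a Flight to Paris", "Cancel a Hotel"], fake_llm))
```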

&lt;p&gt;The other CuES mechanism is bottom-up exploration. The sandbox first provides its system prompt as the initial observation Observation0; then, guided by the concept pool and the exploration direction, the agent explores for up to 20 steps, producing up to 20 triplets. After those 20 consecutive steps, an LLM computes an exploratory memory over all triplets generated under the current env_id/task_id; that memory is later used as guidance in the task-synthesis stage.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="n"&gt;history&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;init_obs&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="c1"&gt;#这里将沙盒提供的system prompt也作为历史
&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="bp"&gt;...&lt;/span&gt;
&lt;span class="n"&gt;obs_str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;step&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;#这一步将推理出的action传入沙盒执行
&lt;/span&gt;&lt;span class="bp"&gt;...&lt;/span&gt;

&lt;span class="n"&gt;save_history&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:]&lt;/span&gt;
&lt;span class="nc"&gt;Triplet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;env_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;env_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;save_history&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;:]),&lt;/span&gt;
    &lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;observation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;obs_str&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Task synthesis in CuES
&lt;/h3&gt;

&lt;p&gt;All triplets generated in the previous step for each task_id/env_id, together with the exploratory memory of that env_id/task_id, are used to synthesize new tasks. Each synthesized task also contains predicted trajectory steps; this predicted path to completing the synthetic task is saved as its ground_truth. &lt;/p&gt;
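&lt;p&gt;The synthesis step can be sketched as follows (a minimal sketch: the synthesize_llm helper and the task fields are assumptions based on the description here, not the actual CuES code):&lt;/p&gt;

```python
from collections import defaultdict

def synthesize_tasks(triplets, memories, synthesize_llm):
    """Group triplets by env_id, then infer a new task per environment."""
    by_env = defaultdict(list)
    for t in triplets:                      # t = (env_id, s, a, o)
        by_env[t[0]].append(t[1:])
    tasks = []
    for env_id, steps in by_env.items():
        prompt = f"memory: {memories.get(env_id, '')}\nsteps: {steps}"
        task = synthesize_llm(prompt)       # returns query, confidence, action_sequence
        task["env_id"] = env_id
        # the predicted action sequence is stored as the ground_truth
        task["ground_truth"] = task.pop("action_sequence")
        tasks.append(task)
    return tasks
```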

&lt;h3&gt;
  
  
  Validating CuES synthetic tasks
&lt;/h3&gt;

&lt;p&gt;Likewise, each synthetic task is executed under its corresponding task_id/env_id, again as a multi-turn execution. The main criterion for validating a synthetic task's executability is to pass the executed steps and the synthetic task to an LLM, i.e. LLM-as-a-judge decides whether the task was completed correctly.&lt;/p&gt;
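&lt;p&gt;The LLM-as-a-judge check can be sketched like this (env.run and judge_llm are hypothetical stand-ins for the sandbox execution and the judging model, not CuES APIs):&lt;/p&gt;

```python
def validate_task(task, env, judge_llm):
    """Execute the synthetic task, then let an LLM judge the outcome."""
    steps = env.run(task["query"])          # multi-turn execution of the task
    verdict = judge_llm(
        f"task: {task['query']}\nexecuted steps: {steps}\n"
        "Was the task completed correctly? Answer yes or no."
    )
    return verdict.strip().lower().startswith("yes")
```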

&lt;h3&gt;
  
  
  Rewriting successfully executed synthetic tasks
&lt;/h3&gt;

&lt;p&gt;Note that this step does not rewrite the ground_truth of previously successful synthetic tasks; it rewrites the query of each successfully executed synthetic task, making it harder or easier and thereby expanding into new synthetic tasks.&lt;/p&gt;
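&lt;p&gt;The expansion step can be sketched like this (rewrite_llm is a hypothetical stand-in; only the query changes, matching the point that the stored ground_truth is not rewritten here):&lt;/p&gt;

```python
def expand_task(task, rewrite_llm, levels=("easier", "harder")):
    """Produce difficulty-varied variants of a successful synthetic task."""
    variants = []
    for level in levels:
        new_task = dict(task)               # the stored ground_truth is untouched
        new_task["query"] = rewrite_llm(
            f"Rewrite this query to be {level}, keeping the same meaning:\n"
            f"{task['query']}"
        )
        variants.append(new_task)
    return variants
```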

&lt;h3&gt;
  
  
  Main implementation flow
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feoxv5f5hl174jvki25nu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feoxv5f5hl174jvki25nu.png" alt=" " width="800" height="519"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffcfz5ws1po4suupqjt00.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffcfz5ws1po4suupqjt00.png" alt=" " width="800" height="1032"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>P2 - A close reading of the self-question code in AgentEvolver</title>
      <dc:creator>Zhaopeng Xuan</dc:creator>
      <pubDate>Fri, 09 Jan 2026 17:21:33 +0000</pubDate>
      <link>https://dev.to/robin_xuan_nl/p2-agentevolver-de-self-question-bu-fen-dai-ma-jing-du-3k4i</link>
      <guid>https://dev.to/robin_xuan_nl/p2-agentevolver-de-self-question-bu-fen-dai-ma-jing-du-3k4i</guid>
      <description>&lt;p&gt;Paper: &lt;a href="https://arxiv.org/pdf/2511.10395" rel="noopener noreferrer"&gt;link&lt;/a&gt;&lt;br&gt;
Code: &lt;a href="https://github.com/modelscope/AgentEvolver" rel="noopener noreferrer"&gt;link&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Goal: from a quality-engineering perspective, understand the implementation of the self-question module in AgentEvolver, and consider whether it could serve as a test-generation method for agent/multi-agent workflows.&lt;/p&gt;

&lt;p&gt;One module of AgentEvolver is self-question. The overall idea: inside an interactive sandbox, given a human-written environment profile and the sandbox's built-in seed tasks, a large SOTA model (qwen-plus) explores the sandbox; the result of exploration is the sandbox's capability range (trajectories). Each seed task and the corresponding trajectories are then passed to an even larger SOTA model (qwen3-235b-a22b-instruct-2507), which infers new synthetic tasks (broader in scope, deeper in difficulty) from that capability range. After two rounds of filtering (duplicate tasks, non-executable tasks), the executable synthetic tasks remain, and their trajectories become the ground_truth. Throughout the self-question stage there is therefore no reward mechanism for self-question itself. &lt;/p&gt;

&lt;p&gt;So the core logic of self-question is to use a large model to generate synthetic tasks whose trajectories serve as ground_truth, which is then used for reinforcement learning on a smaller model (qwen-2.5-14B-Instruct).&lt;/p&gt;
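&lt;p&gt;The pipeline can be sketched end to end as follows (every callable argument is a hypothetical stand-in for the real AgentEvolver component, not its actual API):&lt;/p&gt;

```python
def self_question(env_profile, seed_tasks, explore, synthesize, executable):
    """Sketch of explore -> synthesize -> filter (dedup, executability)."""
    trajectories = [explore(env_profile, seed) for seed in seed_tasks]
    tasks = [synthesize(env_profile, traj) for traj in trajectories]
    # filter 1: drop tasks with duplicate queries
    unique, seen = [], set()
    for t in tasks:
        if t["query"] not in seen:
            seen.add(t["query"])
            unique.append(t)
    # filter 2: keep only tasks that actually execute; the executed
    # trajectory becomes the ground_truth for RL on the smaller model
    kept = []
    for t in unique:
        ok, traj = executable(t)
        if ok:
            t["ground_truth"] = traj
            kept.append(t)
    return kept
```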

&lt;p&gt;Below are the key steps of each self-question stage: &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhimq8a8iebweotzpcs89.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhimq8a8iebweotzpcs89.png" alt="step1" width="800" height="1036"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdl7ept8b8rl1vqkq6jn4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdl7ept8b8rl1vqkq6jn4.png" alt="Step 2" width="800" height="987"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fic0irymsd90cyvf1uhux.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fic0irymsd90cyvf1uhux.png" alt="Step 3" width="800" height="889"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Reflections
&lt;/h3&gt;

&lt;h4&gt;
  
  
  1. The peculiarities and limitations of the interactive sandbox
&lt;/h4&gt;

&lt;p&gt;A sandbox is finite, but the real user world is unbounded. Synthetic tasks are confined to the current sandbox; covering a new world requires building a different interactive sandbox, which raises the cost of testing.&lt;/p&gt;

&lt;p&gt;Beyond that, the core idea of self-question is that on a SOTA LLM base, the trajectory of any task it can complete becomes the ground truth; this is the step that generates ground truth. But if the state machine inside the interactive sandbox itself has bugs, the generated tests may not be entirely correct either. The sandbox's state machine then needs testing too, which adds cost and uncertainty to using the self-question module on its own for test-case generation.&lt;/p&gt;

&lt;h4&gt;
  
  
  2. Can task generation from large-to-small-model reinforcement training be transplanted to QA test-case generation?
&lt;/h4&gt;

&lt;p&gt;Partially. When testing an agent / multi-agent workflow, the tests that self-question currently generates are all tasks a large model can already complete. If a task the large model executes successfully fails on the small model, such a test case exposes the small model's limitations, but one could object: why doesn't my agent/multi-agent workflow under test simply use the large model? This kind of test is useful in certain situations, e.g. when the agent/multi-agent workflow must use a smaller model for cost and efficiency reasons. &lt;/p&gt;

&lt;h4&gt;
  
  
  3. Double cost
&lt;/h4&gt;

&lt;p&gt;Exploring the interactive sandbox requires actually running a high-temperature SOTA model, and in the filtering stage, to generate the ground_truth, we still need to run the SOTA model again to actually execute the synthetic tasks.&lt;/p&gt;

&lt;h4&gt;
  
  
  4. Depth and breadth of task generation
&lt;/h4&gt;

&lt;p&gt;Synthesizing broader and deeper tasks requires a very large LLM base, here qwen3-235b-a22b-instruct-2507. The breadth and depth of task generation are controlled entirely by a static prompt. The upside is that the whole generation process needs no vLLM; the downside is that a SOTA model is used throughout, and the control over generation is one-dimensional.&lt;/p&gt;

&lt;h4&gt;
  
  
  5. Multi-turn interaction during exploration and replay
&lt;/h4&gt;

&lt;p&gt;Both the exploration and the filtering stages use multi-turn interaction, which yields more accurate trajectories, i.e. a more precise ground_truth. &lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>llm</category>
      <category>testing</category>
    </item>
    <item>
      <title>P1 - A close reading of the curriculum agent in Agent0</title>
      <dc:creator>Zhaopeng Xuan</dc:creator>
      <pubDate>Tue, 06 Jan 2026 23:22:20 +0000</pubDate>
      <link>https://dev.to/robin_xuan_nl/p1-agent0-zhong-de-curriculum-agentjing-du-39mi</link>
      <guid>https://dev.to/robin_xuan_nl/p1-agent0-zhong-de-curriculum-agentjing-du-39mi</guid>
      <description>&lt;p&gt;Paper: &lt;a href="https://arxiv.org/pdf/2511.16043" rel="noopener noreferrer"&gt;Agent0: Unleashing Self-Evolving Agents from Zero Data&lt;br&gt;
via Tool-Integrated Reasoning&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Code: &lt;a href="https://github.com/aiming-lab/Agent0" rel="noopener noreferrer"&gt;https://github.com/aiming-lab/Agent0&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Goal: from a quality-engineering perspective, understand how Agent0's curriculum agent is trained, and whether it can be used for automated dynamic testing of agents/agent workflows. &lt;/p&gt;
&lt;h3&gt;
  
  
  Highlights of the paper
&lt;/h3&gt;

&lt;p&gt;The paper's thesis is to set up two agents in a continuously rising spiral (a virtuous cycle) that improves problem-solving ability without any human-labeled data (aka zero data). One agent, the curriculum agent, generates "questions"; the other, the executor agent, produces the corresponding "answers". At time t, the curriculum agent tries its best to find questions that confuse the executor agent of time t-1; the executor agent then tries its best to solve the questions the curriculum agent generated at time t, producing the weights (policy) of time t. Round after round, the curriculum agent keeps locating the executor agent's capability edge, while the executor agent keeps expanding its capability range.&lt;/p&gt;
&lt;h3&gt;
  
  
  Highlights of the curriculum agent
&lt;/h3&gt;

&lt;p&gt;As a QA, I mainly care about how to find the executor agent's capability range, and how to generate effective test cases at the edge of that range, or at blind spots inside it, to uncover quality degradation. &lt;/p&gt;

&lt;p&gt;The curriculum agent's main workflow is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;S1 - Use LLM L1 as the initial model/weights: the code uses Qwen/Qwen3-4B-Base&lt;/li&gt;
&lt;li&gt;S2 - Start reinforcement learning from LLM L1's current weights&lt;/li&gt;
&lt;li&gt;S3 - RL: LLM L1 generates X predictions from the following prompt, each containing a question and a reference answer
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are an expert competition-math problem setter.&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;FIRST, in your private scratch-pad, think step-by-step to design a brand-new, non-trivial problem. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The problem could come from any field of mathematics, including but not limited to algebra, geometry, number theory, combinatorics, prealgebra, probability, statistics, and calculus. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Aim for a difficulty such that fewer than 30 % of advanced high-school students could solve it. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Avoid re-using textbook clichés or famous contest problems.&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;THEN, without revealing any of your private thoughts, output **exactly** the following two blocks:&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;question&amp;gt;&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{The full problem statement on one or more lines}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;/question&amp;gt;&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;\boxed{final_answer}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Do NOT output anything else—no explanations, no extra markup.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Generate one new, challenging reasoning question now. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Remember to format the output exactly as instructed.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ul&gt;
&lt;li&gt;S4 - RL: send the newly generated questions to the executor agent (without reference answers); each question is answered N times, yielding N answers&lt;/li&gt;
&lt;li&gt;S5 - RL: for each question x and its N answers, compute the reward and the advantage, then update the weights with GRPO&lt;/li&gt;
&lt;li&gt;Repeat&lt;/li&gt;
&lt;/ul&gt;
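&lt;p&gt;The group-relative advantage computed in S5 can be sketched as follows (a minimal, dependency-free illustration of GRPO-style normalization, not the verl implementation):&lt;/p&gt;

```python
def grpo_advantages(rewards, eps=1e-6):
    """Normalize each reward against its group's mean and std (GRPO-style)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# rewards above the group mean get a positive advantage, below a negative one
print(grpo_advantages([1.0, 0.5, 0.0]))
```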
&lt;h3&gt;
  
  
  The curriculum agent's reward function
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;R(format): format correctness&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;R(uncertain): self-consistency, i.e. the agreement among multiple answers to the same question&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;R(tool_usage): whether external tools are called frequently&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;-R(repetition): the repetitiveness of the generated questions&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
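&lt;p&gt;One way these four signals could combine into a scalar reward (the uncertainty formula and the tool-reward constants follow the paper and code discussed in this post; the aggregation itself is a simplified assumption, not the exact Agent0 formula):&lt;/p&gt;

```python
def curriculum_reward(fmt_ok, consistency, tool_calls, is_repeat,
                      tool_weight=0.05, tool_cap=4):
    """Combine the four signals into one scalar (simplified aggregation)."""
    r_format = 1.0 if fmt_ok else 0.0
    r_uncertain = 1.0 - 2.0 * abs(consistency - 0.5)   # peaks at 0.5
    r_tool = min(tool_calls, tool_cap) * tool_weight   # capped tool bonus
    r_repetition = 1.0 if is_repeat else 0.0           # repetition penalty
    return r_format + r_uncertain + r_tool - r_repetition
```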
&lt;h3&gt;
  
  
  Potential risks of training the CA in isolation, without the EA
&lt;/h3&gt;
&lt;h4&gt;
  
  
  1. Incorrect reference answers (golden answers)
&lt;/h4&gt;

&lt;blockquote&gt;
&lt;p&gt;The code below appears in start_vllm_server_tool.py. While the CA is being trained, this code launches the EA on vLLM; once the EA produces answers under its current weights, the self-consistency score is computed. Note: this issue does not arise during EA training; it matters only when we want to use the CA on its own. During EA training, the EA reads the questions the trained CA generated, but it does not use the answers the CA provides; instead it generates multiple answers to the same question again and derives a pseudo-label by majority vote to train the EA. EA training is out of scope for this post.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The curriculum agent must be smarter than the executor agent. In step &lt;code&gt;S3&lt;/code&gt; above, if the generated reference answer (golden answer) is incorrect while the executor agent's answer is correct, a correct score cannot be produced (if the reference answer != majority_ans, the score is set to 0 even if one was computed). Here is the corresponding Agent0 code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="bp"&gt;...&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;    &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;grade_answer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;majority_ans&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;golden_answer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mf"&gt;0.1&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="bp"&gt;...&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Challenges in extending the curriculum agent
&lt;/h3&gt;

&lt;h4&gt;
  
  
  1. Computing self-consistency
&lt;/h4&gt;

&lt;p&gt;One important component of the reward function uses answer uncertainty to locate the executor agent's capability edge: it selects questions whose self-consistency lies in [0.3, 0.8] (neither too easy nor too hard, but centered on questions that confuse the EA).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqkkfzpddqsq5ed68tqp8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqkkfzpddqsq5ed68tqp8.png" alt=" " width="676" height="110"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In Agent0 this is computed via a majority vote over the answers. Suppose question x1 gets the answers [y1, y2, y3, y4]; for example, if x1 is "What is 1+1?" and the answers are [2, 2, 3, 2], the majority vote is "2", and the self-consistency is:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                           3 / 4 = 0.75
            R(uncertain) = 1 - 2 * | 0.75 - 0.5 | = 0.5
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
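&lt;p&gt;The calculation above in code (a sketch; Agent0's real implementation additionally normalizes answer strings with grade_answer before voting):&lt;/p&gt;

```python
from collections import Counter

def r_uncertain(answers):
    """Majority-vote self-consistency mapped into the uncertainty reward."""
    majority_count = Counter(answers).most_common(1)[0][1]
    p = majority_count / len(answers)       # 3/4 = 0.75 for ["2","2","3","2"]
    return 1.0 - 2.0 * abs(p - 0.5)

assert r_uncertain(["2", "2", "3", "2"]) == 0.5   # 1 - 2*|0.75 - 0.5| = 0.5
```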

&lt;p&gt;Here is the problem: for subjective rather than objective questions, say x2 is "What is the weather like in the Netherlands today?" with the answers ["nice weather", "rain", "light rain", "cloudy turning to light rain"], the current Agent0 code cannot compute self-consistency directly. &lt;/p&gt;

&lt;h4&gt;
  
  
  2. Tool-call incentives
&lt;/h4&gt;

&lt;p&gt;Another important incentive in the CA is the number of tool calls: the paper rewards between 0 and 4 tool calls. Importantly, the tool-call count is extracted from the text of the question/task the CA generates, not from the trajectory of the EA's execution. Describing tool calls explicitly in the question/task has a big advantage: the CA cannot collapse because of the EA's capability gaps. If we instead extracted tool calls from the EA's executed trajectory and the EA lacked the ability to run some tool, the CA would never be rewarded. I think a combination of the two approaches might work better.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# 以下是Agent0中的实际实现的代码，在CA计算reward的时候
&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;calculate_tool_reward&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;weight&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.05&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cap&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;

    &lt;span class="n"&gt;tool_call_count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;findall&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;\`\`\`output&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="n"&gt;capped_calls&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tool_call_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cap&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;capped_calls&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;weight&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  3. Multi-turn dialogue
&lt;/h4&gt;

&lt;p&gt;While the CA is being trained, the Agent0 paper suggests giving the EA a special system prompt so that it holds a multi-turn dialogue instead of answering directly; the goal is better EA execution results, which in turn lets the CA evolve better. During CA training, the EA is really just an LLM wrapper with no tool calling, so the prompt below handles this specially, forcing the EA to output the code to be executed whenever it needs a tool (running Python).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# 这个prompt是来自Agent0的代码，这个和论文中A.2 Prompt不太一样，但是目的一样
&lt;/span&gt;
&lt;span class="n"&gt;SYSTEM_PROMPT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Solve the following problem step by step. You now have the ability to selectively write executable Python code to enhance your reasoning process.&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;First, provide your reasoning and write a self-contained Python code block wrapped in \`\`\`python ... \`\`\` to help you calculate the answer. You must use the `print()` function to output the results.&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;After you write the code block, STOP. I will execute it for you.&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;I will then provide the output under &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Code execution result:&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;. You must use this result (even if it&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s an error) to continue your reasoning and provide the final answer.&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The final answer must be enclosed in &lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s"&gt;boxed{...}.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Code Format:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Each code snippet is wrapped between \`\`\`. You need to use print() to output intermediate results.&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Answer Format:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The last part of your response should be: &lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s"&gt;boxed{...}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  4. Using vLLM
&lt;/h4&gt;

&lt;p&gt;When the executor agent infers answers, Agent0 uses verl to run batched inference on vLLM, so obtaining answers in bulk is cheap; all of this is thanks to vLLM. Now suppose the executor agent's LLM base is not served by vLLM but is a SOTA model (e.g. Gemini 3), or the executor agent is not a single agent but an agent workflow: then we bear a substantial time cost (running the executor agent) and money cost (token consumption), and may even hit rate limits.&lt;/p&gt;

&lt;h3&gt;
  
  
  Experiments
&lt;/h3&gt;

&lt;p&gt;I chose Runpod to train only the curriculum agent; the following points need attention:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;For both the curriculum agent and the executor agent I used Qwen/Qwen3-0.6B-Base on 2x32G CUDA devices, isolating the two agents on separate CUDA devices&lt;/li&gt;
&lt;li&gt;With Qwen/Qwen3-4B-Base, training the curriculum agent needs at least 48G&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>machinelearning</category>
      <category>testing</category>
    </item>
    <item>
      <title>Say Hello to Your New QA Teammate: E2E Test AI Agent</title>
      <dc:creator>Zhaopeng Xuan</dc:creator>
      <pubDate>Mon, 25 Aug 2025 23:18:20 +0000</pubDate>
      <link>https://dev.to/robin_xuan_nl/5-minutes-of-human-ai-interaction-from-requirements-to-e2e-test-result-1o71</link>
      <guid>https://dev.to/robin_xuan_nl/5-minutes-of-human-ai-interaction-from-requirements-to-e2e-test-result-1o71</guid>
      <description>&lt;p&gt;&lt;em&gt;For some background, see &lt;a href="https://dev.to/robin_xuan_nl/practical-applications-of-ai-in-test-automation-context-demo-with-ui-tars-llm-midscene-part-1-5dbh"&gt;The Past Story about using UI-Tars for AI Testing&lt;/a&gt;, which shows practical AI applications in test automation using UI-Tars and Midscene.js.&lt;/em&gt;&lt;/p&gt;




&lt;blockquote&gt;
&lt;p&gt;Last updated: 2025-09-11&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  What you’ll gain from this blog
&lt;/h2&gt;

&lt;p&gt;1️⃣ &lt;strong&gt;You’ll see how AI can supercharge End-to-End testing — not in theory, but through real-world practical demos.&lt;/strong&gt; I’ll walk you through a real-world demo that solves a few day-to-day use cases and demonstrates the following &lt;strong&gt;KEY features of this E2E Test AI-Agent&lt;/strong&gt;: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Write tests entirely in natural language — no coding required&lt;/li&gt;
&lt;li&gt;Create tests interactively, like pair-testing with a QA engineer&lt;/li&gt;
&lt;li&gt;AI Agent handles all test data preparation (e.g. users, products)&lt;/li&gt;
&lt;li&gt;Generated test can be executed later as part of the existing Regression Test suite&lt;/li&gt;
&lt;li&gt;Automatically heal broken element locators and test steps&lt;/li&gt;
&lt;li&gt;Achieve higher stability with less flakiness than manually written scripts&lt;/li&gt;
&lt;li&gt;Identify elements by image prompts&lt;/li&gt;
&lt;li&gt;Run tests at the same speed as Playwright scripts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;2️⃣ I will introduce &lt;strong&gt;the architecture&lt;/strong&gt; of this E2E Test AI Agent &lt;strong&gt;on top of your own E2E automation framework with &lt;a href="https://midscenejs.com/" rel="noopener noreferrer"&gt;Midscene.js&lt;/a&gt;&lt;/strong&gt;, so that you can build your own Agent.&lt;/p&gt;

&lt;p&gt;3️⃣ &lt;strong&gt;In conclusion&lt;/strong&gt;, I’ll share future steps and upcoming challenges, drawing on insights gathered from a real workshop with a UX designer, a Product Manager, and a QA engineer.&lt;/p&gt;

&lt;p&gt;🚀 &lt;strong&gt;In the end, you can build your own E2E Test AI Agent in your organization and benefit from it as your new teammate, achieving a shift-left, scalable, stable, fast, low-maintenance, and low-cost AI-based E2E test solution.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Let's see the Demo first
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;00:00 Interactively generate a test by reading acceptance criteria&lt;/li&gt;
&lt;li&gt;01:50 Run the generated test from the beginning&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/pWwptoHq-mY"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;




&lt;h2&gt;
  
  
  Use cases demonstrated via the E2E Test AI Agent
&lt;/h2&gt;

&lt;p&gt;These are the goals I would like to achieve by applying LLMs to E2E automation testing.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Make E2E test automation writing and maintenance more efficient&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;from &lt;strong&gt;~30 minutes&lt;/strong&gt; to write 2 or 3 similar E2E tests to &lt;strong&gt;0 minutes&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;from &lt;strong&gt;endless maintenance&lt;/strong&gt; to &lt;strong&gt;a few hours&lt;/strong&gt; of maintenance per week. &lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;Shift E2E automation testing left&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;from &lt;strong&gt;QA drafting test cases&lt;/strong&gt; to &lt;strong&gt;reusing PM's acceptance criteria &amp;amp; UX's design as test cases&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;Unblock Product managers and UX designers in E2E automation testing&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;from &lt;strong&gt;coding skills required&lt;/strong&gt; to &lt;strong&gt;no coding skills required&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;




&lt;h2&gt;
  
  
  Key features with supported use cases
&lt;/h2&gt;

&lt;h4&gt;
  
  
  1. Principles
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F15d86nnvv4j93vbiwwkm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F15d86nnvv4j93vbiwwkm.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Imagine every Quality participant has an E2E Test AI Agent sitting beside them, listening closely as they walk through the acceptance criteria for each feature to validate, almost like pair testing with an AI partner.&lt;/p&gt;

&lt;p&gt;As a human: you describe your acceptance criteria step by step, feature by feature, aligning them with the UX design, while collaborating with the AI Agent.&lt;/p&gt;

&lt;p&gt;As the E2E Test AI Agent: it listens to each request in sequence, generating code along the way and eventually assembling it into a complete, reusable E2E test script that can be executed repeatedly in CI.&lt;/p&gt;

&lt;h4&gt;
  
  
  2. Natural language creates tests interactively
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcyew9xqf9yo17eq3y8z6.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcyew9xqf9yo17eq3y8z6.gif" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The E2E Test AI Agent we developed at CreativeFabrica has knowledge of the application’s context and is able to interpret references such as what “Home page” means.&lt;/p&gt;

&lt;h4&gt;
  
  
  3. Handles Test data preparation
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6hc0hqqw0e8s2crfjf8p.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6hc0hqqw0e8s2crfjf8p.gif" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Unlike typical QA AI agents, the solution we developed integrates with our existing E2E automation framework. As a result, it can go beyond simple browser interactions to perform backend operations, such as preparing test data without relying on the UI.&lt;/p&gt;

&lt;h4&gt;
  
  
  4. Generates less flaky &amp;amp; self-healing test code
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmxb1gw9ozgcz2gsr78wk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmxb1gw9ozgcz2gsr78wk.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Rather than simply translating natural language into &lt;a href="https://midscenejs.com/" rel="noopener noreferrer"&gt;Midscene.js&lt;/a&gt; calls, we designed our code generation to follow testing best practices. This way, we avoid the flaky behavior you’d often see with plain midscene.js, especially around async waits and lazy loading.&lt;/p&gt;

&lt;p&gt;For example, when the human says "Click the 1st Product", the agent generates the following code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;      &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;aiWaitFor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;1st product is visible&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;all&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
        &lt;span class="nx"&gt;cleanPage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;waitForURL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;href&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;includes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;/product/autopub-graphic/&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;}),&lt;/span&gt;
        &lt;span class="nf"&gt;aiTap&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;1st product&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It leverages AI to identify and click the “1st product,” but the action is wrapped in our best-practice code, so the generated test remains stable while still being self-healing.&lt;/p&gt;

&lt;h4&gt;
  
  
  5. Locate elements by screenshot
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0f2i72iu5wuqcmcce901.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0f2i72iu5wuqcmcce901.gif" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Thanks to the new feature in &lt;a href="https://midscenejs.com/" rel="noopener noreferrer"&gt;Midscene.js&lt;/a&gt;, our E2E Test AI Agent can now use images as locators! Instead of relying only on traditional selectors, we can simply take a screenshot of the application or even a UI design and use it directly as a locator. &lt;/p&gt;

&lt;p&gt;This makes it much easier to test complex systems, especially those built with Canvas or even AI-driven interfaces—where conventional locators often fall short.&lt;/p&gt;

&lt;h4&gt;
  
  
  6. Fast test execution &amp;amp; relatively low cost
&lt;/h4&gt;

&lt;p&gt;Thanks to the &lt;a href="https://midscenejs.com/" rel="noopener noreferrer"&gt;Midscene.js&lt;/a&gt; caching mechanism, once a test case is generated, most elements used during AI actions are stored. This means we don’t need to call the LLM on every CI execution (saving budget), only when elements change or when the planned steps are updated. Since the underlying test execution is Playwright-based, you get Playwright’s execution speed.&lt;/p&gt;




&lt;h2&gt;
  
  
  E2E Test AI Agent Architecture
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc8lb9o27qxxvyv1vp6xi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc8lb9o27qxxvyv1vp6xi.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At the heart of our design is a “plug-and-play” philosophy. We treat existing tools like midscene.js functions, Playwright functions, and our in-house E2E automation framework as interchangeable modules that the AI agent can call, almost like snapping together Lego blocks.&lt;/p&gt;

&lt;p&gt;But we didn’t stop there. Both midscene.js and our framework are repackaged with best practices baked right into the code generation process, so the generated code is stable and reliable by design.&lt;/p&gt;

&lt;p&gt;To boost the agent’s reasoning ability, we also added a dual-layer RAG setup:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The first layer helps the agent with semantic understanding of the human's input.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The second layer supports the LLM's action planning of each step, invoked through the &lt;code&gt;aiAction()&lt;/code&gt; function exposed by Midscene.js.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This way, the agent doesn’t just execute blindly, it understands the intent, makes a plan, and then carries it out with solid testing practices behind it.&lt;/p&gt;
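&lt;p&gt;&lt;em&gt;As a rough illustration of the dual-layer idea, here is a minimal sketch in which simple keyword matching stands in for the real retrieval, and the entries are invented examples rather than our actual knowledge base:&lt;/em&gt;&lt;/p&gt;

```typescript
// Minimal sketch of the dual-layer RAG idea. Illustrative only: keyword
// matching stands in for real embedding-based retrieval, and the entries
// below are invented examples.

const semanticLayer: Record<string, string> = {
  // Layer 1: application context that grounds the human's wording.
  "home page": "URL path /, shows the product grid",
};

const planningLayer: Record<string, string> = {
  // Layer 2: best-practice hints consumed during aiAction() planning.
  "click": "wrap aiTap() in Promise.all with waitForURL to avoid flakiness",
};

// Layer 1 lookup: ground a natural-language step in application context.
export function groundInput(humanStep: string): string | undefined {
  const key = Object.keys(semanticLayer).find((k) =>
    humanStep.toLowerCase().includes(k)
  );
  return key ? semanticLayer[key] : undefined;
}

// Layer 2 lookup: fetch the best-practice hint for a planned action.
export function planningHint(action: string): string | undefined {
  return planningLayer[action];
}
```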

&lt;h3&gt;
  
  
  Key Components/Layers
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;- Layer 1: From Human Language to Defined Tools&lt;/strong&gt;&lt;br&gt;
Translates natural-language acceptance criteria into structured, executable actions and assertions, and produces outputs in different formats.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;- Layer 2: Framework Integration&lt;/strong&gt;&lt;br&gt;
Wraps your existing E2E test framework, Midscene.js functions, and native Playwright functions as LLM tools, and ensures all LLM tools share a single Playwright browser context.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;- Layer 3: AI-Driven Planning&lt;/strong&gt;&lt;br&gt;
Midscene.js orchestrates the rest: planning AI steps, interpreting the current screenshot and HTML DOM, and deciding the next best action autonomously.&lt;/p&gt;
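&lt;p&gt;&lt;em&gt;To make Layer 2 concrete, here is a minimal, hypothetical sketch of how an existing framework helper could be registered as an LLM tool. The names (ToolDef, createUserViaApi, sharedContext) are illustrative, not our actual implementation:&lt;/em&gt;&lt;/p&gt;

```typescript
// Hypothetical sketch of Layer 2: repackaging an existing framework helper
// as an LLM tool. All names here are illustrative placeholders.

type ToolDef = {
  name: string;
  description: string; // shown to the LLM so it can choose the right tool
  run: (args: Record<string, string>) => Promise<string>;
};

// Stand-in for the single shared Playwright browser context/session state.
const sharedContext = { baseUrl: "https://example.test" };

// An existing framework helper (backend test-data preparation) as a tool.
const createUserViaApi: ToolDef = {
  name: "create_test_user",
  description: "Prepares a test user via the backend, bypassing the UI.",
  run: async (args) => `created user ${args.email} on ${sharedContext.baseUrl}`,
};

const registry = new Map<string, ToolDef>([[createUserViaApi.name, createUserViaApi]]);

// The agent resolves a tool by name (as chosen by the LLM) and invokes it.
export async function invokeTool(name: string, args: Record<string, string>): Promise<string> {
  const tool = registry.get(name);
  if (!tool) throw new Error(`Unknown tool: ${name}`);
  return tool.run(args);
}
```

&lt;p&gt;The point of the registry is that every tool closes over the same shared context, so backend data preparation and UI steps operate against one browser session.&lt;/p&gt;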

&lt;h3&gt;
  
  
  The test case metadata
&lt;/h3&gt;

&lt;p&gt;In the current paradigm, a test case created and autonomously executed by an E2E Test AI Agent contains three core metadata layers:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Acceptance Criteria (Human Input):&lt;br&gt;
The intent, expressed directly in natural language by humans.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Executable Test Code (Auto-generated):&lt;br&gt;
Playwright-based code that integrates seamlessly with your existing E2E framework.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Element Locator Cache (AI-generated by Midscene.js):&lt;br&gt;
Cached mappings of HTML elements for fast execution. Only when the cache expires or is missing does the agent call the LLM again to resolve new locators.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;With these three, a test case is no longer a static artifact, but a self-adaptive entity that balances human intent, system execution, and AI-assisted resilience.&lt;/p&gt;
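&lt;p&gt;&lt;em&gt;A hypothetical TypeScript shape for these three layers (field names are illustrative, not the actual exported file format):&lt;/em&gt;&lt;/p&gt;

```typescript
// Hypothetical shape of the three metadata layers described above.
// Field names are illustrative, not the actual exported file format.

interface E2ETestCaseMetadata {
  // 1. Human input: the intent in natural language.
  acceptanceCriteria: string[];
  // 2. Auto-generated: Playwright-based executable test code.
  testCode: { filePath: string; framework: "playwright" };
  // 3. AI-generated by Midscene.js: cached element locator mappings.
  locatorCache: { prompt: string; cachedAt: string }[];
}

export const example: E2ETestCaseMetadata = {
  acceptanceCriteria: [
    "Given a logged-in user, when they open the Home page, then the 1st product is visible",
  ],
  testCode: { filePath: "tests/home.spec.ts", framework: "playwright" },
  locatorCache: [{ prompt: "1st product", cachedAt: "2025-09-11" }],
};
```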

&lt;p&gt;Here are the exported files from the demo in the video:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzbearc6xu07ewlktjfcp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzbearc6xu07ewlktjfcp.png" alt="Exported files"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  My Conclusions
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;AI in E2E automation testing is poised to grow rapidly and inevitably expand into many other areas of testing. This is not just a possibility, but a clear and unstoppable trend. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;With the current state of AI, relying solely on LLMs to handle complex automated testing is still challenging. However, when combined with existing E2E test automation frameworks, it can deliver a much better and more practical experience.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;We will undoubtedly see more AI-powered practical testing tools emerging. But I strongly recommend using Midscene.js when it comes to integration with Playwright. We may soon see a new solution called Vibium (which, as I understand, is still under development and has been proposed by the creator of Selenium).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;For QA engineering, this is more than just the development of a new tool, it represents a transformation in how we approach the entire testing process and quality management.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Challenges
&lt;/h2&gt;

&lt;p&gt;I set up a 1-hour workshop session with a QA engineer, a Product Manager, and a UX designer to play with this E2E Test AI Agent, backed by the Gemini 2.5 Flash LLM. I collected the following challenges: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Intelligence: Can the system accurately understand the intent of a test step? To achieve this, we need well-defined acceptance criteria, along with clear guidelines on how to write acceptance criteria in a way that can be interpreted by an LLM.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Accuracy: Can the LLM’s vision correctly determine the coordinates of elements to interact with? When using gemini-2.5-flash, the model can accurately locate around 80% of larger HTML elements, but about 20% have coordinate deviations. For smaller elements, like checkboxes, it often fails to locate them entirely. To address this, we employ different models for different tasks: a smaller model for semantic analysis, a “deep-thinking” model such as DeepSeek-R1 for planning, and a vision-optimized model like UI-Tars for precise element localization.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Step Interactivity, Repeatability, Reversibility, and Stability: Each interaction between the human and the AI should be repeatable, reversible, and stable. We observed that the AI sometimes performs unnecessary actions, which can still generate code. Humans may need to correct previous mistakes, so every step must be atomic to ensure reliability and proper rollback.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
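&lt;p&gt;&lt;em&gt;The per-task model routing mentioned above can be sketched as a simple lookup. The routing table below is hypothetical, not our production configuration, and the model identifiers are placeholders:&lt;/em&gt;&lt;/p&gt;

```typescript
// Illustrative model-routing sketch for the multi-model idea above.
// The routing table is hypothetical; the identifiers are placeholders.

type TaskKind = "semantic-analysis" | "planning" | "element-localization";

const modelByTask: Record<TaskKind, string> = {
  "semantic-analysis": "gemini-2.5-flash", // small, cheap model
  "planning": "deepseek-reasoner",         // deliberate "deep-thinking" model
  "element-localization": "ui-tars-7b-sft", // vision-optimized for coordinates
};

export function pickModel(task: TaskKind): string {
  return modelByTask[task];
}
```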

&lt;h2&gt;
  
  
  I’d love to hear from you!
&lt;/h2&gt;

&lt;p&gt;Feel free to like, comment, or share this blog, your feedback means a lot!&lt;/p&gt;

&lt;h3&gt;
  
  
  Big thanks to these guys:
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@software{Midscene.js,
  author = {Zhou, Xiao and Yu, Tao},
  title = {Midscene.js: Let AI be your browser operator.},
  year = {2025},
  publisher = {GitHub},
  url = {https://github.com/web-infra-dev/midscene}
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>ai</category>
      <category>qa</category>
      <category>playwright</category>
      <category>testing</category>
    </item>
    <item>
      <title>Stage Conclusion: UI-Tars + RAG = E2E/Exploratory AI Test Agent (Part 3)</title>
      <dc:creator>Zhaopeng Xuan</dc:creator>
      <pubDate>Wed, 26 Feb 2025 09:54:27 +0000</pubDate>
      <link>https://dev.to/robin_xuan_nl/stage-conclusion-ui-tars-rag-a-new-approach-for-automated-e2e-exploratory-testing-part-3-p65</link>
      <guid>https://dev.to/robin_xuan_nl/stage-conclusion-ui-tars-rag-a-new-approach-for-automated-e2e-exploratory-testing-part-3-p65</guid>
      <description>&lt;p&gt;&lt;strong&gt;Articles in this series:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dev.to/robin_xuan_nl/practical-applications-of-ai-in-test-automation-context-demo-with-ui-tars-llm-midscene-part-1-5dbh"&gt;Part 1 - Practical Applications of AI in Test Automation — Context, Demo with UI-Tars LLM &amp;amp; Midscene&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/robin_xuan_nl/practical-applications-of-ai-in-test-automation-ui-tars-vs-gpt-4opart-2-in-midscene-3cci"&gt;Part 2 - Data: UI-Tars VS GPT-4o in Midscene&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/robin_xuan_nl/stage-conclusion-ui-tars-rag-a-new-approach-for-automated-e2e-exploratory-testing-part-3-p65"&gt;Part 3 - Stage Conclusion: UI-Tars + RAG = Stage Conclusion: UI-Tars + RAG = E2E/Exploratory AI Test Agent&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the final article in this series. &lt;/p&gt;

&lt;p&gt;I will present an example demonstrating:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How to integrate UI-Tars' system-2 reasoning with our locally built RAG, using Ollama and LangChain, to create a system that understands high-level user instructions and automates execution based on browser screenshots at each stage.&lt;/li&gt;
&lt;li&gt;Verify the capability of &lt;code&gt;system-2 reasoning&lt;/code&gt; after combining &lt;code&gt;UI-Tars&lt;/code&gt; with a local &lt;code&gt;RAG&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  1. End-to-End Demo
&lt;/h2&gt;

&lt;p&gt;This demo uses &lt;a href="//miro.com"&gt;Miro&lt;/a&gt; as an example to demonstrate the capability to handle a non-B2C, complicated system. The AI Agent follows &lt;strong&gt;a user's single instruction: "Create a new board with 2 sticky notes and link these 2 sticky notes by a line."&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;⚠️VERY IMPORTANT⚠️&lt;/strong&gt;: This demo uses Miro's Free Plan, which is open to everyone. The test was executed fewer than 10 times to verify stability. The authentication part uses my personal Miro Free account (hardcoded already to avoid any other risks). I strongly urge any reader who wants to reproduce this test against a customer-facing product: you &lt;strong&gt;should NOT&lt;/strong&gt; impact the normal usage of the product, and you &lt;strong&gt;MUST&lt;/strong&gt; follow the product's policies.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/L-_vJ23O118"&gt;
&lt;/iframe&gt;
&lt;br&gt;
&lt;strong&gt;(👀 Except for authentication, the rest of the actions are planned and executed by the AI Agent after reading a single &lt;code&gt;High-level User instruction&lt;/code&gt;)&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  2. The explanation
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Before the test, we need to deploy &lt;code&gt;UI-Tars-7B-SFT&lt;/code&gt;:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Step 1: Deploy &lt;code&gt;UI-Tars-7B-SFT&lt;/code&gt; to Hugging Face on 1 x L40S GPU; you can follow the steps from &lt;a href="https://juniper-switch-f10.notion.site/UI-TARS-Model-Deployment-Guide-17b5350241e280058e98cea60317de71" rel="noopener noreferrer"&gt;here&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Step 2: Configure the &lt;code&gt;.env&lt;/code&gt; file for &lt;code&gt;Midscene&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;OPENAI_API_KEY="hf_" // this is can be the HF access key
OPENAI_BASE_URL="https://something/v1" // this can be HF Endpoint URL
MIDSCENE_MODEL_NAME="ui-tars-7b-sft" 
MIDSCENE_USE_VLM_UI_TARS=1 // must tell Midscene to switch to UI-Tars

MIDSCENE_LANGSMITH_DEBUG=1 // enable trace send to Langsmith, help us debug and test this AI Agent

LANGSMITH_API_KEY= 
LANGSMITH_ENDPOINT="https://api.smith.langchain.com"
LANGSMITH_TRACING_V2=true
LANGSMITH_PROJECT=

DEBUG=true //This will enable the OpenAI SDK to print Debug logs to the console
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Step 3: Build your own &lt;code&gt;Ollama&lt;/code&gt; + &lt;code&gt;Langchain&lt;/code&gt; environment locally, pull &lt;code&gt;nomic-embed-text&lt;/code&gt; as the &lt;code&gt;embeddings&lt;/code&gt; model, and build a local RAG which contains "very structured" documents. 
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyi7e29xnebtl6xgx596o.png" alt="ollama" width="800" height="273"&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The test did 3 steps:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Step 1: Authenticate and save the authentication state to a local file; this is implemented by writing &lt;code&gt;Playwright&lt;/code&gt; code following the guide from &lt;a href="https://playwright.dev/docs/auth" rel="noopener noreferrer"&gt;Playwright Authentication&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Step 2: Pass &lt;strong&gt;only one High-level user instruction&lt;/strong&gt; to &lt;code&gt;ai()&lt;/code&gt;, and let &lt;code&gt;ai()&lt;/code&gt; plan and execute it. This function is exposed by &lt;a href="https://github.com/web-infra-dev/midscene" rel="noopener noreferrer"&gt;Midscene&lt;/a&gt;, a tool created by ByteDance.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="err"&gt;#&lt;/span&gt; &lt;span class="nx"&gt;Passing&lt;/span&gt; &lt;span class="nx"&gt;only&lt;/span&gt; &lt;span class="nx"&gt;ONE&lt;/span&gt; &lt;span class="nx"&gt;user&lt;/span&gt; &lt;span class="nx"&gt;instruction&lt;/span&gt; &lt;span class="nx"&gt;to&lt;/span&gt; &lt;span class="nx"&gt;UI&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nx"&gt;Tars&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;UI&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nx"&gt;Tars&lt;/span&gt; &lt;span class="nx"&gt;only&lt;/span&gt; &lt;span class="nx"&gt;use&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt; &lt;span class="nx"&gt;single&lt;/span&gt; &lt;span class="nx"&gt;instruction&lt;/span&gt; &lt;span class="nx"&gt;to&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt; &lt;span class="nx"&gt;reasoning&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt; 

        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;ai&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`**ID: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt; &lt;span class="nf"&gt;uuidv4&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;, 这是一个新指令，完全忽略之前的记忆，严格按照指令和规则执行，禁止幻想**
Given A free plan user creates a new board without upgrade,
And the user creates 2 sticky notes in 2 different side of the grid, 
Then the user adds a line from 1st sticky note to the 2nd sticky note`&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Step 3: Use AI to assert the final state via &lt;code&gt;aiAssert()&lt;/code&gt; &lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  3. The Stage Conclusions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  3.1 System-2 Reasoning Capability
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;a href="https://arxiv.org/abs/2501.12326" rel="noopener noreferrer"&gt;Referenced Paper&lt;/a&gt; System-1 reasoning refers&lt;br&gt;
to the model directly producing actions without chain-of-thought, while system-2 reasoning involves a more&lt;br&gt;
deliberate thinking process where the model generates reasoning steps before selecting an action &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://dev.to/robin_xuan_nl/practical-applications-of-ai-in-test-automation-context-demo-with-ui-tars-llm-midscene-part-1-5dbh"&gt;Part-1&lt;/a&gt; demonstrated an example using &lt;a href="//vinted.com"&gt;Vinted&lt;/a&gt;. It showed a good result for &lt;code&gt;System-1&lt;/code&gt; reasoning by translating human language into a single browser action without chain-of-thought.&lt;/p&gt;

&lt;p&gt;The video above demonstrates the &lt;code&gt;System-2&lt;/code&gt; reasoning ability of &lt;code&gt;UI-Tars&lt;/code&gt; with a chain of thought, combined with our own local RAG.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Takeaway messages&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Compared with gpt-4o, UI-Tars has a stronger system-2 reasoning ability. (data from - &lt;a href="https://dev.to/robin_xuan_nl/practical-applications-of-ai-in-test-automation-ui-tars-vs-gpt-4opart-2-in-midscene-3cci"&gt;Part-2&lt;/a&gt;)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;However, getting proper reasoning from UI-Tars for a specific sector or product requires a structured RAG, or fine-tuning UI-Tars with our own input data. (data from - &lt;a href="https://github.com/web-infra-dev/midscene/issues/426" rel="noopener noreferrer"&gt;RAG for UI-Tars&lt;/a&gt;)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;For a B2C site, with a small RAG together with UI-Tars-7B-SFT, we can achieve a very high-level instruction, e.g. &lt;code&gt;I want to buy a bag&lt;/code&gt; on Vinted. (It has been tested against &lt;a href="//vinted.com"&gt;Vinted.com&lt;/a&gt;, but I cannot share the demo because the test was treated as a robot and the action breached Vinted's policy.)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3.2 Applicable products and scenarios
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;This solution can already serve as a supplement to existing regression E2E automated testing for B2C websites, even without building a RAG, as long as only a few system-2 reasoning steps are required. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;This solution can serve as part of automated exploratory testing for B2C applications; other domains require building a RAG or fine-tuning UI-Tars-SFT with their own structured business data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;This solution can perform GUI and OCR checks to replace the current screenshot assertions in your existing Playwright/Puppeteer tests (only if the purpose of your screenshot assertions is NOT comparing the UI style with given pictures).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3.3 Problems &amp;amp; Risk
&lt;/h3&gt;

&lt;p&gt;You might be excited! It can already address real challenges in QA engineering, but it requires cautious use and ongoing development.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;One of the problems in Quality Assurance engineering may shift from "the automated test is flaky/outdated" to "the AI-decided automated test result is not trustworthy". &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;When using RAG together with UI-Tars, it works stably with 7B-SFT but doesn't work well (or at all) with 7B-DPO, so more effort is required to make the solution scalable. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Applying this AI Agent as a virtual QA engineer in an existing SDLC faces challenges in terms of when and how, and of result stability, although it is already quite stable for system-1 reasoning and partially stable for system-2 reasoning.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If one of your goals is high test-execution velocity, it may not work very well for complicated web applications, such as some SaaS applications, when too much System-2 reasoning is used: UI-Tars has its own short-term and long-term memory, and jumping to an "incorrect" cache will lead the result to the Moon...&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>web</category>
      <category>testing</category>
      <category>ai</category>
      <category>uitars</category>
    </item>
    <item>
      <title>Data - UI-Tars VS GPT-4o in Midscene (Part 2)</title>
      <dc:creator>Zhaopeng Xuan</dc:creator>
      <pubDate>Mon, 24 Feb 2025 18:32:36 +0000</pubDate>
      <link>https://dev.to/robin_xuan_nl/practical-applications-of-ai-in-test-automation-ui-tars-vs-gpt-4opart-2-in-midscene-3cci</link>
      <guid>https://dev.to/robin_xuan_nl/practical-applications-of-ai-in-test-automation-ui-tars-vs-gpt-4opart-2-in-midscene-3cci</guid>
      <description>&lt;p&gt;&lt;strong&gt;Articles in this series:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dev.to/robin_xuan_nl/practical-applications-of-ai-in-test-automation-context-demo-with-ui-tars-llm-midscene-part-1-5dbh"&gt;Part 1 - Practical Applications of AI in Test Automation — Context, Demo with UI-Tars LLM &amp;amp; Midscene&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/robin_xuan_nl/practical-applications-of-ai-in-test-automation-ui-tars-vs-gpt-4opart-2-in-midscene-3cci"&gt;Part 2 - Data: UI-Tars VS GPT-4o in Midscene&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/robin_xuan_nl/stage-conclusion-ui-tars-rag-a-new-approach-for-automated-e2e-exploratory-testing-part-3-p65"&gt;Part 3 - Stage Conclusion: UI-Tars + RAG = Stage Conclusion: UI-Tars + RAG = E2E/Exploratory AI Test Agent&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Reading Part 1 first helps with the context for the data below.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is part of a series. In &lt;a href="https://dev.to/robin_xuan_nl/practical-applications-of-ai-in-test-automation-context-demo-with-ui-tars-llm-midscene-part-1-5dbh"&gt;Part 1&lt;/a&gt;, we covered what the &lt;code&gt;UI-Tars&lt;/code&gt; LLM is and how &lt;code&gt;Midscene&lt;/code&gt; orchestrates it. In this article, I want to dig deeper into comparing &lt;code&gt;UI-Tars&lt;/code&gt; and &lt;code&gt;GPT-4o&lt;/code&gt; in AI-powered automation testing, using &lt;code&gt;Midscene&lt;/code&gt;, to identify the differences, pros, and cons.&lt;/p&gt;

&lt;p&gt;As of February 2025, &lt;code&gt;Midscene&lt;/code&gt; supports &lt;code&gt;gpt-4o&lt;/code&gt;, &lt;code&gt;qwen-2.5VL&lt;/code&gt;, and &lt;code&gt;ui-tars&lt;/code&gt; by default.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Comparing GPT-4o and UI-Tars results with the same test case as input
&lt;/h2&gt;

&lt;p&gt;We will analyze and compare different kinds of test steps. &lt;/p&gt;

&lt;h3&gt;
  
  
  1.1 Step 1 - AI Assertion
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;aiWaitFor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;The country selection popup is visible&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;GPT-4o&lt;/th&gt;
&lt;th&gt;UI-Tars:7B&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;return_format&lt;/td&gt;
&lt;td&gt;JSON object&lt;/td&gt;
&lt;td&gt;Plain text (a JSON string)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Steps&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;duration&lt;/td&gt;
&lt;td&gt;3.09s&lt;/td&gt;
&lt;td&gt;2.27s 👍&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;cost&lt;/td&gt;
&lt;td&gt;0.0035775$&lt;/td&gt;
&lt;td&gt;0.0011$ 👍&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;token&lt;/td&gt;
&lt;td&gt;1390&lt;/td&gt;
&lt;td&gt;2924&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;temperature&lt;/td&gt;
&lt;td&gt;0.1&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Message&lt;/td&gt;
&lt;td&gt;Browser Screenshot, Instructions&lt;/td&gt;
&lt;td&gt;Browser Screenshot, Instructions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;result&lt;/td&gt;
&lt;td&gt;{"pass":true, "thought":"The screenshot clearly shows a country selection popup with a list of countries, confirming the assertion." }&lt;/td&gt;
&lt;td&gt;{   "pass": true,   "thought": "The country selection popup is visible on the screen, as indicated by the title 'Where do you live?' and the list of countries with their respective flags." }&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
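&lt;p&gt;As the table shows, GPT-4o replies with a parsed JSON object while UI-Tars replies with plain text that is itself a JSON string, so an orchestrator has to normalize the two formats before acting on the assertion result. A minimal sketch (the helper name is hypothetical, not Midscene's actual API):&lt;/p&gt;

```typescript
// Hypothetical helper to normalize an assertion reply from either model.
// Assumption: GPT-4o returns an already-parsed JSON object, while
// UI-Tars returns plain text that still needs JSON.parse.
interface AssertionResult {
  pass: boolean;
  thought: string;
}

function normalizeAssertion(reply: unknown): AssertionResult {
  if (typeof reply === "string") {
    // UI-Tars case: plain text containing a JSON string
    return JSON.parse(reply) as AssertionResult;
  }
  // GPT-4o case: already a JSON object
  return reply as AssertionResult;
}

// UI-Tars-style reply, shortened from the table above
const uiTarsReply = '{"pass": true, "thought": "The country selection popup is visible."}';
const result = normalizeAssertion(uiTarsReply);
```

&lt;p&gt;Either reply shape now yields the same &lt;code&gt;{ pass, thought }&lt;/code&gt; structure for the test runner to act on.&lt;/p&gt;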

&lt;h3&gt;
  
  
  1.2 Step 2 - Simple Action Step
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;    &lt;span class="nx"&gt;Select&lt;/span&gt; &lt;span class="nx"&gt;France&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nx"&gt;the&lt;/span&gt; &lt;span class="nx"&gt;country&lt;/span&gt; &lt;span class="nx"&gt;that&lt;/span&gt; &lt;span class="nx"&gt;I&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;m living
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;GPT-4o&lt;/th&gt;
&lt;th&gt;UI-Tars:7B&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;return_format&lt;/td&gt;
&lt;td&gt;JSON object&lt;/td&gt;
&lt;td&gt;Plain text, formatted&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Steps&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;duration&lt;/td&gt;
&lt;td&gt;3.22s 👍&lt;/td&gt;
&lt;td&gt;4.54s (2.54s + 1.99s)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;cost&lt;/td&gt;
&lt;td&gt;0.0109475$&lt;/td&gt;
&lt;td&gt;0.0023$ 👍&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;token&lt;/td&gt;
&lt;td&gt;4184&lt;/td&gt;
&lt;td&gt;14987 (5110 + 9877)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;temperature&lt;/td&gt;
&lt;td&gt;0.1&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Message&lt;/td&gt;
&lt;td&gt;Screenshot, Instructions, part of HTML Tree&lt;/td&gt;
&lt;td&gt;Screenshot, Instructions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;result&lt;/td&gt;
&lt;td&gt;{    "actions":[       {          "thought":"..",          "type":"Tap",          "param":null,          "locate":{             "id":"cngod",             "prompt":"..."          }       }    ],    "finish":true,    "log":"...",    "error":null }&lt;/td&gt;
&lt;td&gt;Iterated 2 steps; each LLM call returns "Thought: ... Action: click(...)"&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;To click "France" in the country popup,  GPT-4o only requires 1 LLM call due to GPT-4o returning the &lt;code&gt;action&lt;/code&gt; and &lt;code&gt;finish&lt;/code&gt; in one reply. But UI-Tars requires 2 LLM calls based on its reasoning, the first call returns the action, and then after the &lt;code&gt;action&lt;/code&gt; is executed, the 2nd LLM call sends a screenshot again to *&lt;em&gt;check whether the "user's instruction" is finished. *&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;
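&lt;p&gt;The act-then-verify behavior described above can be sketched as a small loop: each iteration is one LLM call that either proposes the next GUI action or, after re-inspecting a screenshot, declares the instruction finished. The names below are illustrative, not Midscene's real API:&lt;/p&gt;

```typescript
// Illustrative sketch of UI-Tars' act-then-verify loop (hypothetical
// names, not Midscene's real API). Each loop iteration is one LLM call.
type TarsReply = { thought: string; action: string };

function runInstruction(
  replies: TarsReply[],              // scripted model replies, one per LLM call
  execute: (action: string) => void  // browser-side executor (e.g. a click)
): number {
  let llmCalls = 0;
  for (const reply of replies) {
    llmCalls = llmCalls + 1;
    if (reply.action === "finished()") {
      break;                         // final check passed: instruction achieved
    }
    execute(reply.action);           // run the action, then loop with a fresh screenshot
  }
  return llmCalls;
}

// Clicking "France": one action call plus one final-check call = 2 LLM calls
const executed: string[] = [];
const calls = runInstruction(
  [
    { thought: "I should click France", action: "click(France)" },
    { thought: "France is selected", action: "finished()" },
  ],
  (a) => executed.push(a)
);
```

&lt;p&gt;The extra iteration is exactly the final-check call that doubles UI-Tars' call count for simple steps.&lt;/p&gt;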

&lt;h3&gt;
  
  
  1.3 Step 3 - Complicated Step
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nx"&gt;Click&lt;/span&gt; &lt;span class="nx"&gt;Search&lt;/span&gt; &lt;span class="nx"&gt;bar&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;then&lt;/span&gt; &lt;span class="nx"&gt;Search&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Chanel&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;and&lt;/span&gt; &lt;span class="nx"&gt;press&lt;/span&gt; &lt;span class="nx"&gt;the&lt;/span&gt; &lt;span class="nx"&gt;Enter&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;GPT-4o&lt;/th&gt;
&lt;th&gt;UI-Tars:7B&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;return_format&lt;/td&gt;
&lt;td&gt;JSON object&lt;/td&gt;
&lt;td&gt;Plain text, formatted&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Steps&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;duration&lt;/td&gt;
&lt;td&gt;3.52s 👍&lt;/td&gt;
&lt;td&gt;12.16s (4s + 2.77s + 2.36s + 3.03s)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;cost&lt;/td&gt;
&lt;td&gt;0.01268$&lt;/td&gt;
&lt;td&gt;0.00608$ 👍&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;token&lt;/td&gt;
&lt;td&gt;4646&lt;/td&gt;
&lt;td&gt;49123&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;temperature&lt;/td&gt;
&lt;td&gt;0.1&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Message&lt;/td&gt;
&lt;td&gt;Screenshot, Instructions, part of HTML Tree&lt;/td&gt;
&lt;td&gt;Screenshot, Instructions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;result&lt;/td&gt;
&lt;td&gt;Returns 3 actions in one reply {    "actions":[       {...}, {...}    ],    "finish":true,    "log":"...",    "error":null }&lt;/td&gt;
&lt;td&gt;Iterated 4 steps; each LLM call returns "Thought: ... Action: ..."&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;This step is a bit complicated for both models, and both can handle it, but very differently. There is no reasoning in GPT-4o's result: it generated 3 &lt;code&gt;actions&lt;/code&gt; and marked &lt;code&gt;finish&lt;/code&gt; as true before any action started. UI-Tars, on the other hand, is relatively slow due to its reasoning: it performs a single action, reflects on the result, and then decides the next action. In total, it also generates 3 &lt;code&gt;actions&lt;/code&gt;, plus one &lt;code&gt;check&lt;/code&gt; at the end to verify that the current state meets the user's expectation based on the given "user's instruction". &lt;/p&gt;
&lt;/blockquote&gt;
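&lt;p&gt;By contrast, GPT-4o's single reply already contains the whole plan, so the orchestrator just executes the action list with no further LLM calls. A rough sketch, loosely following the &lt;code&gt;actions&lt;/code&gt;/&lt;code&gt;finish&lt;/code&gt; JSON shape shown in the table (any field beyond those two is an assumption for illustration):&lt;/p&gt;

```typescript
// Sketch: GPT-4o returns the whole plan in one reply, so the orchestrator
// executes every action blindly, with no screenshot re-check in between.
interface PlannedAction { type: string; prompt: string }
interface Gpt4oPlan { actions: PlannedAction[]; finish: boolean }

function executePlan(plan: Gpt4oPlan, execute: (a: PlannedAction) => void): number {
  for (const action of plan.actions) {
    execute(action); // no verification between actions
  }
  return 1; // the entire instruction cost a single LLM call
}

const log: string[] = [];
const llmCalls = executePlan(
  {
    actions: [
      { type: "Tap", prompt: "search bar" },
      { type: "Input", prompt: "Chanel" },
      { type: "KeyboardPress", prompt: "Enter" },
    ],
    finish: true, // marked true before anything has actually run
  },
  (a) => log.push(a.type)
);
```

&lt;p&gt;This is why GPT-4o is faster here: three browser actions, one LLM call, but no check that the page ended up in the expected state.&lt;/p&gt;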

&lt;h3&gt;
  
  
  1.4 Step 4 - Scrolling to an unpredictable position
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nx"&gt;Scroll&lt;/span&gt; &lt;span class="nx"&gt;down&lt;/span&gt; &lt;span class="nx"&gt;to&lt;/span&gt; &lt;span class="nx"&gt;the&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="nx"&gt;st&lt;/span&gt; &lt;span class="nx"&gt;product&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;GPT-4o&lt;/th&gt;
&lt;th&gt;UI-Tars:7B&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;return_format&lt;/td&gt;
&lt;td&gt;JSON object&lt;/td&gt;
&lt;td&gt;Plain text, formatted&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Steps&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;duration&lt;/td&gt;
&lt;td&gt;7.58s (3.01s + 4.57s)&lt;/td&gt;
&lt;td&gt;5.47s (3.18+2.29) 👍&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;cost&lt;/td&gt;
&lt;td&gt;0.01268$&lt;/td&gt;
&lt;td&gt;0.002735$ 👍&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;token&lt;/td&gt;
&lt;td&gt;12557&lt;/td&gt;
&lt;td&gt;14994&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;temperature&lt;/td&gt;
&lt;td&gt;0.1&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Message&lt;/td&gt;
&lt;td&gt;Screenshot, Instructions, part of HTML Tree&lt;/td&gt;
&lt;td&gt;Screenshot, Instructions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;result&lt;/td&gt;
&lt;td&gt;{    "actions":[       {...}    ],    "finish":true,    "log":"...",    "error":null }&lt;/td&gt;
&lt;td&gt;Iterated 2 steps; each LLM call returns "Thought: ... Action: ..."&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;Wow! Because GPT-4o does no reasoning, scrolling to an unknown position requires 2 GPT-4o calls: the 1st call generates a &lt;code&gt;scrolling action&lt;/code&gt;, and the 2nd call &lt;code&gt;checks&lt;/code&gt; the screenshot and decides that no further scrolling is needed. &lt;br&gt;
UI-Tars, by contrast, just works as usual: the 1st call decides the action, and the 2nd validates the new screenshot after the browser action is executed.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  1.5 “Vision” Comparison
&lt;/h3&gt;

&lt;p&gt;UI-Tars and GPT-4o each have their own way of "seeing" the page. &lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;GPT-4o&lt;/th&gt;
&lt;th&gt;UI-Tars:7B&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Vision&lt;/td&gt;
&lt;td&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4gspeqgahkswa588qe7j.png" alt="Image description" width="800" height="450"&gt;&lt;/td&gt;
&lt;td&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqpgzs3ht1bpzkur6kcm3.png" alt="Image description" width="800" height="468"&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;GPT-4o&lt;/code&gt; is unable to recognize some small or overlapping elements, whereas UI-Tars:7B achieves almost complete recognition. Unfortunately, &lt;code&gt;gpt-4o&lt;/code&gt; cannot even identify the "login" button in the top banner.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  2. Summary
&lt;/h2&gt;

&lt;h3&gt;
  
  
  2.1 Ability to handle complicated cases
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;GPT-4o:&lt;/strong&gt; Because it has no reasoning (no system-2 reasoning), &lt;code&gt;gpt-4o&lt;/code&gt; can handle straightforward test steps, but for very general steps such as &lt;code&gt;i login&lt;/code&gt;, it essentially cannot handle the case at all. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;UI-Tars:7B:&lt;/strong&gt; Because it has reasoning, it can support up to 15 steps, meaning 14 actions + 1 final check, which is its reasoning limit. &lt;/p&gt;

&lt;h3&gt;
  
  
  2.2 Speed
&lt;/h3&gt;

&lt;p&gt;In most cases, you have to double the LLM calls when using &lt;code&gt;UI-Tars&lt;/code&gt; compared with &lt;code&gt;GPT-4o&lt;/code&gt;, because &lt;code&gt;UI-Tars&lt;/code&gt; requires a final check for each "user instruction". &lt;/p&gt;

&lt;p&gt;But &lt;code&gt;gpt-4o&lt;/code&gt; in most cases doesn't require validation for "user instructions" after it generates actions. &lt;/p&gt;

&lt;p&gt;So &lt;code&gt;gpt-4o&lt;/code&gt; is faster, but this can be risky for the reliability of the test result.&lt;/p&gt;
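&lt;p&gt;From the measurements above, a rough rule of thumb emerges: UI-Tars needs one LLM call per action plus one final check, while GPT-4o usually plans everything in a single call (the scrolling step in 1.4 was the exception). A quick sanity check of that rule against the "Steps" rows in the tables:&lt;/p&gt;

```typescript
// Rough call-count model inferred from the data above (an assumption,
// not a documented Midscene formula): one call per action + 1 final check.
function uiTarsCalls(actions: number): number {
  return actions + 1;
}

const simpleStep = uiTarsCalls(1);      // step 1.2: "Select France"
const complicatedStep = uiTarsCalls(3); // step 1.3: click bar, type, press Enter
```

&lt;p&gt;These predict 2 and 4 calls, which matches the "Steps" values reported for UI-Tars in sections 1.2 and 1.3.&lt;/p&gt;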

&lt;h3&gt;
  
  
  2.3 Level of Trust
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;UI-Tars&lt;/code&gt; always checks whether the previously planned action achieved the "user instruction", so &lt;code&gt;UI-Tars&lt;/code&gt; trades time for accuracy: it only marks &lt;code&gt;finish=true&lt;/code&gt; after checking the screenshot again following the action. &lt;code&gt;gpt-4o&lt;/code&gt;, however, directly returns &lt;code&gt;finish=true&lt;/code&gt; even before the generated actions are executed... &lt;/p&gt;

&lt;h3&gt;
  
  
  2.4 Perception of GUI Element
&lt;/h3&gt;

&lt;p&gt;From Section 1.5, we can clearly see that &lt;code&gt;UI-Tars&lt;/code&gt; can identify even small and overlapping elements on the page, while &lt;code&gt;gpt-4o&lt;/code&gt; failed to do so. &lt;/p&gt;

&lt;h3&gt;
  
  
  2.5 Additional input for LLM
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;UI-Tars&lt;/code&gt; makes decisions based only on the screenshot (mimicking human vision), but &lt;code&gt;gpt-4o&lt;/code&gt; additionally requires building a partial HTML tree, which may slow &lt;code&gt;gpt-4o&lt;/code&gt; down as the size of the HTML grows.&lt;/p&gt;

&lt;h3&gt;
  
  
  2.6 Costs
&lt;/h3&gt;

&lt;p&gt;If we deploy &lt;code&gt;UI-Tars&lt;/code&gt; on our own infrastructure, we can save 50% - 75% of the cost while achieving the same result as &lt;code&gt;gpt-4o&lt;/code&gt;, or an even better one. &lt;/p&gt;
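&lt;p&gt;The 50% - 75% figure can be sanity-checked against the per-step costs measured in Section 1 (GPT-4o vs. self-hosted UI-Tars:7B); the computed fractions land roughly in that range, with two steps slightly above it:&lt;/p&gt;

```typescript
// Fraction of cost saved by UI-Tars relative to GPT-4o, using the
// per-step dollar costs from the tables in Section 1.
function savings(gpt4oCost: number, uiTarsCost: number): number {
  return (gpt4oCost - uiTarsCost) / gpt4oCost;
}

const assertionStep = savings(0.0035775, 0.0011);   // step 1.1, ~69%
const simpleAction = savings(0.0109475, 0.0023);    // step 1.2, ~79%
const complicatedStep = savings(0.01268, 0.00608);  // step 1.3, ~52%
const scrollingStep = savings(0.01268, 0.002735);   // step 1.4, ~78%
```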

&lt;h2&gt;
  
  
  3. Conclusions
&lt;/h2&gt;

&lt;p&gt;Overall, if we plan to apply AI in real day-to-day work, I believe &lt;code&gt;UI-Tars&lt;/code&gt; can do a better job than &lt;code&gt;gpt-4o&lt;/code&gt; in the above context. &lt;/p&gt;

&lt;p&gt;However, speeding up &lt;code&gt;UI-Tars&lt;/code&gt;'s reasoning will be one of the challenges in the near future.  &lt;/p&gt;

</description>
      <category>webdev</category>
      <category>ai</category>
      <category>testing</category>
      <category>uitars</category>
    </item>
    <item>
      <title>Practical Applications of AI in Test Automation - Context, UI-Tars LLM , Midscene (Part 1)</title>
      <dc:creator>Zhaopeng Xuan</dc:creator>
      <pubDate>Mon, 24 Feb 2025 12:30:23 +0000</pubDate>
      <link>https://dev.to/robin_xuan_nl/practical-applications-of-ai-in-test-automation-context-demo-with-ui-tars-llm-midscene-part-1-5dbh</link>
      <guid>https://dev.to/robin_xuan_nl/practical-applications-of-ai-in-test-automation-context-demo-with-ui-tars-llm-midscene-part-1-5dbh</guid>
      <description>&lt;p&gt;&lt;strong&gt;Articles in this series:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dev.to/robin_xuan_nl/practical-applications-of-ai-in-test-automation-context-demo-with-ui-tars-llm-midscene-part-1-5dbh"&gt;Part 1 - Practical Applications of AI in Test Automation — Context, Demo with UI-Tars LLM &amp;amp; Midscene&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/robin_xuan_nl/practical-applications-of-ai-in-test-automation-ui-tars-vs-gpt-4opart-2-in-midscene-3cci"&gt;Part 2 - Data: UI-Tars VS GPT-4o in Midscene&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/robin_xuan_nl/stage-conclusion-ui-tars-rag-a-new-approach-for-automated-e2e-exploratory-testing-part-3-p65"&gt;Part 3 - Stage Conclusion: UI-Tars + RAG = Stage Conclusion: UI-Tars + RAG = E2E/Exploratory AI Test Agent&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This series of articles will provide a clear and practical guide to AI applications in end-to-end test automation. I will use AI to verify a product's end-to-end functionality, ensuring that it meets the required specifications. &lt;/p&gt;

&lt;p&gt;Let's watch a demo first 👀👀, and then I will elaborate on how it works after.&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/-JHyjn6EXvg"&gt;
&lt;/iframe&gt;
&lt;br&gt;
(The video is not accelerated. I use Vinted.com as an example because I heard about it very frequently from my wife...)&lt;/p&gt;

&lt;p&gt;I tell the AI Agent that it must open the home page, search for a product, then open a product detail page, and check the price on the Product page.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The video above demonstrates how the AI Agent perceives the process—autonomously interpreting business-oriented test cases, evaluating the webpage's current state(screenshot), making plans and decisions, and executing the test. It engages in multi-step decision-making, leveraging various types of reasoning to achieve its goal.&lt;/strong&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  1. Reviewing the Role of End-to-End Testing
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F050wh8x3lm77ect9b7dr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F050wh8x3lm77ect9b7dr.png" alt="Image description" width="800" height="421"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It is crucial to emphasize once again that:&lt;/strong&gt; the primary goal of end-to-end testing is to validate that new features and regression functionality match the product requirements and design by simulating the customers' behavior.&lt;/p&gt;

&lt;p&gt;End-to-end testing is a testing approach widely used in regression validation. It can be performed either manually (such as by writing a sanity checklist and executing tests by hand) or through automation by writing test scripts using tools like Playwright or Appium.&lt;/p&gt;

&lt;p&gt;The three key aspects of end-to-end testing are described in the above figure:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Understand How Users Benefit from the Feature&lt;/strong&gt; – Identify the value the feature brings to users.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Design Test Cases from the User's Perspective&lt;/strong&gt; – Create test scenarios that align with real user interactions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Iterate Test Execution During Development&lt;/strong&gt; – Continuously run test cases throughout the development process to verify that the implemented code meets the required functionality.&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  2. In Action – Executing Your Test Cases with an AI Agent
&lt;/h2&gt;

&lt;p&gt;In traditional end-to-end test automation, the typical approach is as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Analyze the functionality&lt;/strong&gt; – Understand the feature and its expected behavior.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Analyze and write test cases&lt;/strong&gt; – Define test scenarios based on user interactions and requirements.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Write automation scripts&lt;/strong&gt; – Implement test cases using automation frameworks.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When writing automated test cases, we usually create Page Object-like classes to represent the HTML tree, allowing the test script to interact with or retrieve elements efficiently. &lt;/p&gt;

&lt;p&gt;Now, let's see how an AI Agent can optimize this process. &lt;/p&gt;
&lt;h3&gt;
  
  
  2.1 - A multi-decision AI Agent can optimize the test process
&lt;/h3&gt;

&lt;p&gt;From the above video, this is the test case that the AI Agent reads:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight gherkin"&gt;&lt;code&gt;&lt;span class="kn"&gt;Scenario&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; Customer can search and open product detail page from a search result

&lt;span class="err"&gt;Go to https&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="err"&gt;//www.vinted.com&lt;/span&gt;
&lt;span class="err"&gt;The&lt;/span&gt; &lt;span class="err"&gt;country&lt;/span&gt; &lt;span class="err"&gt;selection&lt;/span&gt; &lt;span class="err"&gt;popup&lt;/span&gt; &lt;span class="err"&gt;is&lt;/span&gt; &lt;span class="err"&gt;visible&lt;/span&gt;
&lt;span class="err"&gt;Select&lt;/span&gt; &lt;span class="err"&gt;France&lt;/span&gt; &lt;span class="err"&gt;as&lt;/span&gt; &lt;span class="err"&gt;the&lt;/span&gt; &lt;span class="err"&gt;country&lt;/span&gt; &lt;span class="err"&gt;that&lt;/span&gt; &lt;span class="err"&gt;I'm&lt;/span&gt; &lt;span class="err"&gt;living&lt;/span&gt;
&lt;span class="err"&gt;Accept&lt;/span&gt; &lt;span class="err"&gt;all&lt;/span&gt; &lt;span class="err"&gt;privacy&lt;/span&gt; &lt;span class="err"&gt;preferences&lt;/span&gt;
&lt;span class="err"&gt;Search&lt;/span&gt; &lt;span class="err"&gt;'Chanel',&lt;/span&gt; &lt;span class="err"&gt;and&lt;/span&gt; &lt;span class="err"&gt;press&lt;/span&gt; &lt;span class="err"&gt;the&lt;/span&gt; &lt;span class="err"&gt;Enter&lt;/span&gt;
&lt;span class="err"&gt;Scroll&lt;/span&gt; &lt;span class="err"&gt;down&lt;/span&gt; &lt;span class="err"&gt;to&lt;/span&gt; &lt;span class="err"&gt;the&lt;/span&gt; &lt;span class="err"&gt;1st&lt;/span&gt; &lt;span class="err"&gt;product&lt;/span&gt;
&lt;span class="err"&gt;Click&lt;/span&gt; &lt;span class="err"&gt;the&lt;/span&gt; &lt;span class="err"&gt;2nd&lt;/span&gt; &lt;span class="err"&gt;product&lt;/span&gt; &lt;span class="err"&gt;from&lt;/span&gt; &lt;span class="err"&gt;the&lt;/span&gt; &lt;span class="err"&gt;1st&lt;/span&gt; &lt;span class="err"&gt;row&lt;/span&gt; &lt;span class="err"&gt;in&lt;/span&gt; &lt;span class="err"&gt;the&lt;/span&gt; &lt;span class="err"&gt;product&lt;/span&gt; &lt;span class="err"&gt;list&lt;/span&gt;
&lt;span class="err"&gt;Assert&lt;/span&gt; &lt;span class="err"&gt;Price&lt;/span&gt; &lt;span class="err"&gt;is&lt;/span&gt; &lt;span class="err"&gt;visible&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Thus, the AI Agent actually reads business-oriented language, instead of low-level commands like "Click A" / "Type B". The AI Agent itself reasons about and plans the steps.&lt;/p&gt;

&lt;p&gt;To run this test, it requires the following hardware:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Nvidia L40s - 1 x GPU, 48GB GPU Memory, 7 x vCPU, 40GB CPU memory&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2.2 - What problems were solved by this AI Agent
&lt;/h3&gt;

&lt;p&gt;Reflecting on what we mentioned in Section 1, end-to-end testing can be performed in two ways: by writing automated test scripts or executing tests manually.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;With the introduction of an AI Agent, a new approach emerges—simply providing your test cases to the AI Agent without writing test scripts. The AI Agent then replaces manual execution by autonomously carrying out the test cases.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Specifically, it addresses the following problems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Reduces Manual Testing Costs&lt;/strong&gt; – The AI Agent can interpret test cases written by anyone, eliminating the need to write test scripts and allowing tests to be executed at any time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lowers Test Script Maintenance Effort&lt;/strong&gt; – The AI Agent autonomously determines the next browser action, reducing the need to modify tests for minor UI changes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Increases Accessibility and Participation&lt;/strong&gt; – Shifting from traditional QA engineers writing automation scripts to a decentralized model where developers contribute, and now to a stage where anyone proficient in English and familiar with the product can write end-to-end test cases.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2.3 - Embedded into Playwright
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;
  &lt;span class="nf"&gt;test&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;[UI-Tars - Business]a user can search then view a product&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;ai&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;aiAssert&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;aiWaitFor&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;goto&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;https://www.vinted.com&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;aiWaitFor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;The country selection popup is visible&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;ai&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Select France as the country that I'm living&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;waitForURL&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;URL&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;hostname&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;indexOf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;vinted.fr&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;ai&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Accept all privacy preferences&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;ai&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Click Search bar, then Search 'Chanel', and press the Enter&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;ai&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Scroll down to the 1st product&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;ai&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Click the 2nd product from the 1st row in the product list&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;url&lt;/span&gt;&lt;span class="p"&gt;()).&lt;/span&gt;&lt;span class="nf"&gt;toContain&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;/items/&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;aiAssert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Price is visible&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;})&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  3. Introducing the UI-Tars LLM and Midscene
&lt;/h2&gt;

&lt;h3&gt;
  
  
  3.1 UI-Tars
&lt;/h3&gt;

&lt;p&gt;UI-Tars is a native, open-source multimodal GUI LLM, rebuilt on top of qwen-2.5-VL (通义千问2.5 VL). The model can process both &lt;strong&gt;text&lt;/strong&gt; and &lt;strong&gt;GUI images&lt;/strong&gt; simultaneously, and it is provided in two variants, SFT and DPO, trained on a huge amount of GUI screenshots. UI-Tars is specifically designed for interacting with GUIs. &lt;/p&gt;

&lt;p&gt;It performs well in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Browser applications&lt;/li&gt;
&lt;li&gt;Desktop and desktop applications&lt;/li&gt;
&lt;li&gt;Mobile and mobile applications&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It supports prompts in 2 languages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Chinese&lt;/li&gt;
&lt;li&gt;English&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;More details - please read from the &lt;a href="https://github.com/bytedance/UI-TARS/blob/main/UI_TARS_paper.pdf" rel="noopener noreferrer"&gt;Paper&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  3.2 Midscene
&lt;/h3&gt;

&lt;p&gt;Midscene is a state machine: it builds a multi-step-reasoning AI Agent with the provided models. &lt;br&gt;
It supports the following LLMs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;UI-Tars (the main branch doesn't support &lt;code&gt;aiAssert&lt;/code&gt;, &lt;code&gt;aiQuery&lt;/code&gt;, and &lt;code&gt;aiWaitFor&lt;/code&gt;, but you can check my branch)&lt;/li&gt;
&lt;li&gt;Qwen-2.5 VL (通义千问2.5, I really love this name...)&lt;/li&gt;
&lt;li&gt;GPT-4o&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  4 The mechanism between UI-Tars &amp;amp; Midscene
&lt;/h2&gt;

&lt;h3&gt;
  
  
  4.1 Orchestrations and Comparisons
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8d6jid8shah8hououbso.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8d6jid8shah8hououbso.png" alt="Image description" width="800" height="304"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Its core capability is to plan, reason, and execute multiple steps autonomously, just like a human, based on both visual input and instructions—continuing until it determines that the task is complete.&lt;/p&gt;

&lt;p&gt;It possesses 3 key abilities:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multi-Step Planning Across Platforms – Given an instruction, it can plan multiple actions across web browsers, desktop, or mobile applications.&lt;/li&gt;
&lt;li&gt;Tool Utilization for Execution – It can leverage external tools to carry out the planned actions.&lt;/li&gt;
&lt;li&gt;Autonomous Reasoning &amp;amp; Adaptation – It can determine whether the task is complete or take additional actions if necessary.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I compared the most popular solutions on the market as of the end of February 2025: &lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Solutions&lt;/th&gt;
&lt;th&gt;Is it an AI agent?&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;th&gt;Additional input to the LLM&lt;/th&gt;
&lt;th&gt;How elements are located&lt;/th&gt;
&lt;th&gt;Multiple-step decisions &amp;amp; autonomous reasoning&lt;/th&gt;
&lt;th&gt;Playwright integration&lt;/th&gt;
&lt;th&gt;Mobile app support&lt;/th&gt;
&lt;th&gt;Desktop app support&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;UI-Tars(/GPT-4o) + Midscene&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;$1.80/h with UI-Tars-7B, or ~$0.1263875/test with OpenAI GPT-4o&lt;/td&gt;
&lt;td&gt;GUI Screenshot&lt;/td&gt;
&lt;td&gt;GUI Screenshot Processing&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Llama 3.2 + Bound Tools + LangGraph&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;&lt;a href="https://medium.com/@abhyankarharshal22/dynamic-browser-automation-with-langchain-agent-and-playwright-tools-fill-tool-implementation-5a4953d514ac" rel="noopener noreferrer"&gt;0.2$ / tests&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;HTML&lt;/td&gt;
&lt;td&gt;HTML DOM processing&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Not yet&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ZeroStep / auto-playwright&lt;/td&gt;
&lt;td&gt;Kind of&lt;/td&gt;
&lt;td&gt;Unknown&lt;/td&gt;
&lt;td&gt;HTML&lt;/td&gt;
&lt;td&gt;HTML DOM processing&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Unknown&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;StageHand(GPT-4o or Claude 3.5)&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Unknown&lt;/td&gt;
&lt;td&gt;HTML &amp;amp; GUI Screenshot&lt;/td&gt;
&lt;td&gt;HTML DOM Processing&lt;/td&gt;
&lt;td&gt;Not yet&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Not yet&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;To summarise: UI-Tars (or GPT-4o) combined with Midscene appears to be the most applicable and cheapest approach. &lt;/p&gt;

&lt;h3&gt;
  
  
  4.2 Multiple-Step decisions and reasoning
&lt;/h3&gt;

&lt;p&gt;Let's have a look at an actual step from the above example: &lt;code&gt;Search 'Chanel', and press Enter&lt;/code&gt;. &lt;/p&gt;

&lt;h4&gt;
  
  
  4.2.1 Midscene sends a system message to UI-Tars
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feaqpz93b4hmlt6fvfvbu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feaqpz93b4hmlt6fvfvbu.png" alt="Image description" width="800" height="487"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqunt89fz0nm7eim2zw56.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqunt89fz0nm7eim2zw56.png" alt="Image description" width="800" height="467"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Midscene sends the test step as part of the System Message to the LLM, together with the current screenshot. &lt;/p&gt;
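&lt;p&gt;The payload for one reasoning turn can be sketched roughly as follows. The field names follow the common OpenAI-style chat shape and are illustrative, not Midscene's actual internals: the user step travels in the system message, and the screenshot rides along as an image part of the user message.&lt;/p&gt;

```typescript
// Illustrative sketch of the per-turn payload Midscene assembles: the user
// step goes into the system message, the screenshot into the user message.
// Field names follow the common OpenAI-style chat shape, not Midscene's
// actual internals.
type MessagePart =
  | { type: "text"; text: string }
  | { type: "image_url"; url: string };

interface ChatMessage {
  role: "system" | "user";
  content: MessagePart[];
}

function buildTurnMessages(userStep: string, screenshotUrl: string): ChatMessage[] {
  return [
    {
      role: "system",
      content: [{ type: "text", text: "You are a GUI agent. Target step: " + userStep }],
    },
    { role: "user", content: [{ type: "image_url", url: screenshotUrl }] },
  ];
}
```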

&lt;h4&gt;
  
  
  4.2.2 UI-Tars returns the Thought and Action
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6kkk551qfyrwfmbdzq70.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6kkk551qfyrwfmbdzq70.png" alt="Image description" width="800" height="156"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Because this "User step" requires multiple browser actions, like identifying where is the search bar, then click the search bar, then type "channel", and pressing "Enter" at the end. Thus UI-Tars make 1st decision to "click the search bar".&lt;/p&gt;

&lt;h4&gt;
  
  
  4.2.3 UI-Tars iteratively reasons about and plans the next browser actions for the same user step
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffo4zojwhqhogk3enixug.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffo4zojwhqhogk3enixug.png" alt="Image description" width="800" height="151"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu6vc1a432vo6htesdqb4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu6vc1a432vo6htesdqb4.png" alt="Image description" width="800" height="155"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Midscene currently takes a fresh screenshot before each reasoning turn, so UI-Tars always sees the latest state of the browser. In addition, Midscene currently sends the entire chat history back to UI-Tars on every reasoning call. &lt;/p&gt;
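&lt;p&gt;That "fresh screenshot + full history on every call" behaviour can be sketched as below; the shapes are hypothetical, not Midscene's internals.&lt;/p&gt;

```typescript
// Sketch of the "fresh screenshot + full history on every call" behaviour:
// the log replays everything exchanged so far, then appends the newest
// screenshot, so the model always sees the latest browser state.
type Turn = { screenshot: string; reply: string };

class ConversationLog {
  private turns: Turn[] = [];

  addTurn(screenshot: string, reply: string): void {
    this.turns.push({ screenshot, reply });
  }

  // Everything exchanged so far is replayed to the model, followed by a
  // freshly captured screenshot.
  payloadForNextCall(latestScreenshot: string): string[] {
    const history = this.turns.flatMap(t => [t.screenshot, t.reply]);
    return [...history, latestScreenshot];
  }
}
```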

&lt;h4&gt;
  
  
  4.2.4 UI-Tars autonomously checks whether the user step is achieved
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkmgsjd7eliatcrzdrc2t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkmgsjd7eliatcrzdrc2t.png" alt="Image description" width="800" height="139"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Code
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/xuanzhaopeng/midscene-playwright-uitars" rel="noopener noreferrer"&gt;PoC &amp;amp; example tests&lt;/a&gt; &lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/web-infra-dev/midscene/pull/412" rel="noopener noreferrer"&gt;Expand MidScene to support UI-Tars&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  6. The cost for this PoC
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fol64ke66cfz4t7m3qemn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fol64ke66cfz4t7m3qemn.png" alt="Image description" width="800" height="105"&gt;&lt;/a&gt;&lt;br&gt;
(UI-Tars costs)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi9g874k7moiasgpdbcss.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi9g874k7moiasgpdbcss.png" alt="Image description" width="800" height="445"&gt;&lt;/a&gt;&lt;br&gt;
(GPT-4o costs)&lt;/p&gt;
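&lt;p&gt;A quick back-of-envelope comparison of the two pricing models is shown below. Only the $1.8/h GPU figure and the ~$0.1263875/test GPT-4o figure come from the screenshots above; the test run-time and token prices in the example call are hypothetical inputs for illustration.&lt;/p&gt;

```typescript
// Back-of-envelope cost comparison. Only the $1.8/h GPU figure and the
// ~$0.1263875/test GPT-4o figure come from the article; the run-time and
// token prices below are hypothetical inputs for illustration.
function selfHostedCostPerTest(gpuDollarsPerHour: number, testMinutes: number): number {
  return gpuDollarsPerHour * (testMinutes / 60);
}

function apiCostPerTest(
  inputTokens: number,
  outputTokens: number,
  inputPricePerMillion: number,
  outputPricePerMillion: number,
): number {
  return (inputTokens * inputPricePerMillion + outputTokens * outputPricePerMillion) / 1e6;
}

// A 5-minute test on the $1.8/h GPU works out to $0.15:
const hostedCost = selfHostedCostPerTest(1.8, 5); // 0.15
```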

&lt;h2&gt;
  
  
  7. References
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@software{Midscene.js,
  author = {Zhou, Xiao and Yu, Tao},
  title = {Midscene.js: Let AI be your browser operator.},
  year = {2025},
  publisher = {GitHub},
  url = {https://github.com/web-infra-dev/midscene}
}

@article{qin2025ui,
  title={UI-TARS: Pioneering Automated GUI Interaction with Native Agents},
  author={Qin, Yujia and Ye, Yining and Fang, Junjie and Wang, Haoming and Liang, Shihao and Tian, Shizuo and Zhang, Junda and Li, Jiahao and Li, Yunxin and Huang, Shijue and others},
  journal={arXiv preprint arXiv:2501.12326},
  year={2025}
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>webdev</category>
      <category>ai</category>
      <category>testing</category>
      <category>uitars</category>
    </item>
  </channel>
</rss>
