<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Zhaopeng Xuan</title>
    <description>The latest articles on DEV Community by Zhaopeng Xuan (@robin_xuan_nl).</description>
    <link>https://dev.to/robin_xuan_nl</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2854976%2F4a24425b-e641-4ee1-9719-b08c50cf1d7d.png</url>
      <title>DEV Community: Zhaopeng Xuan</title>
      <link>https://dev.to/robin_xuan_nl</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/robin_xuan_nl"/>
    <language>en</language>
    <item>
      <title>S1 - Comparing the Agent0 curriculum agent, AgentEvolver self-question, and CuES for test-case generation</title>
      <dc:creator>Zhaopeng Xuan</dc:creator>
      <pubDate>Mon, 12 Jan 2026 14:50:21 +0000</pubDate>
      <link>https://dev.to/robin_xuan_nl/s1-agent0-curriculum-agent-agentevolver-self-questioncueszai-ce-shi-yong-li-sheng-cheng-shi-de-bi-jiao-2d4j</link>
      <guid>https://dev.to/robin_xuan_nl/s1-agent0-curriculum-agent-agentevolver-self-questioncueszai-ce-shi-yong-li-sheng-cheng-shi-de-bi-jiao-2d4j</guid>
      <description>&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;Agent0 Curriculum Agent&lt;/th&gt;
&lt;th&gt;AgentEvolver Self-Question&lt;/th&gt;
&lt;th&gt;CuES&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Environment information dependency&lt;/td&gt;
&lt;td&gt;System role definition: a system role must be defined manually, e.g. "You are a math expert; write math problems"&lt;/td&gt;
&lt;td&gt;Environment profile: an environment profile must be defined manually (e.g. listing the entities in the environment), and the environment's seed_task is referenced as an example task (the seed_task is not executed directly)&lt;/td&gt;
&lt;td&gt;Environment concept pool: concepts must be extracted from the environment's seed tasks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Environment anchoring strategy&lt;/td&gt;
&lt;td&gt;n/a&lt;/td&gt;
&lt;td&gt;When building the exploration prompt, env_profile serves as the environment description, and the environment's seed questions serve as example questions during exploration, e.g. "A user might ask: [seed question]"&lt;/td&gt;
&lt;td&gt;When building the exploration prompt, the environment's built-in system prompt serves as the environment description&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Task synthesis path&lt;/td&gt;
&lt;td&gt;A single forward pass of the base model being trained generates the task together with a predicted final answer&lt;/td&gt;
&lt;td&gt;4 stages: &lt;br&gt; - Exploration: env_profile guides the generation of diverse trajectories&lt;br&gt; - Synthesis: synthetic tasks (containing query, confidence, ground_truth/action_sequence) are inferred back from env_profile and the trajectories&lt;br&gt; - Filtering: duplicate tasks, tasks without a predicted answer, and tasks that cannot be executed or fail to execute are removed&lt;br&gt; - Ground_truth rewriting: if a task executes successfully during filtering, its ground_truth is rewritten from the actually executed trajectory&lt;/td&gt;
&lt;td&gt;4 stages:&lt;br&gt;- Exploration: the concept pool and a human-provided exploration direction guide the generation of an (s, a, o) triplet at each step&lt;br&gt;- Synthesis: synthetic tasks (containing query, description, confidence, ground_truth/action_sequence) are inferred back from the multiple triplets [(s1,a1,o1), (s2,a2,o2)] generated under the same env_id&lt;br&gt;- Filtering: tasks that cannot be executed or fail to execute are removed&lt;br&gt;- Expansion: the queries of successfully executed synthetic tasks are rewritten at different difficulty levels with the same semantics, expanding the number of synthetic tasks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Guidance signal&lt;/td&gt;
&lt;td&gt;Four reward signals are computed from the feedback of the agent under test:&lt;br&gt;- reward for inconsistency&lt;br&gt;- reward for correct format&lt;br&gt;- reward for tool_usage&lt;br&gt;- penalty for repetitive questions&lt;br&gt;GRPO updates the curriculum agent's weights (policy) to steer it toward creating new tasks&lt;br&gt;The loop repeats, continually improving task generation based on the rewards&lt;/td&gt;
&lt;td&gt;A high-temperature LLM explores under the guidance of the given env_profile; during exploration, prompts steer between breadth-first and depth-first exploration&lt;br&gt;&lt;br&gt;In the synthesis stage, guidance comes from the trajectory, previously discovered new tasks, and env_profile&lt;/td&gt;
&lt;td&gt;The concept pool and a human-provided exploration direction keep exploration within scope, and exploration_memory prioritizes previously unexplored areas&lt;br&gt;&lt;br&gt;In the synthesis stage, generation is guided by env_description (the sandbox's system prompt), exploration_memory, and the triplets&lt;br&gt;&lt;br&gt;In the expansion stage, prompts drive same-semantics rewrites at different difficulty levels&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Difficulty control&lt;/td&gt;
&lt;td&gt;Controlled by a self-consistency threshold&lt;/td&gt;
&lt;td&gt;Controlled by the exploration principles in the prompt, expanding from breadth to depth&lt;/td&gt;
&lt;td&gt;Controlled by rewriting the queries of successfully executed synthetic tasks at different difficulty levels with the same semantics&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Task evolution logic&lt;/td&gt;
&lt;td&gt;Adversarial RL self-play: GRPO computes the advantage and backpropagates&lt;/td&gt;
&lt;td&gt;Static generation: continually generated trajectories serve as context, supplying experience to learn from&lt;/td&gt;
&lt;td&gt;After static generation, exploration memory and triplets serve as context for self-learning; tasks of different difficulty (with unchanged semantics) are then produced dynamically via query rewriting&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Definition of a valid test&lt;/td&gt;
&lt;td&gt;Test cases that leave the agent under test uncertain&lt;/td&gt;
&lt;td&gt;Physically executable cases whose ground_truth matches the target difficulty&lt;/td&gt;
&lt;td&gt;Physically executable cases whose ground_truth matches the target difficulty&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
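&lt;p&gt;The self-consistency threshold in the Agent0 column can be sketched as follows (a minimal illustration; the example answer lists and the [0.3, 0.8] band are assumptions based on the papers, not the exact Agent0 code):&lt;/p&gt;

```python
from collections import Counter

def self_consistency(answers):
    """Fraction of answers agreeing with the majority answer."""
    majority, count = Counter(answers).most_common(1)[0]
    return count / len(answers)

def is_informative(answers, low=0.3, high=0.8):
    """Keep questions that are neither trivial nor impossible."""
    c = self_consistency(answers)
    return c >= low and high >= c

# ["2", "2", "3", "2"] has consistency 0.75 -- inside the band, so kept
assert is_informative(["2", "2", "3", "2"])
# a unanimously answered question is too easy -- dropped
assert not is_informative(["2", "2", "2", "2"])
```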

</description>
    </item>
    <item>
      <title>P3 - Applying the CuES training-data synthesis approach to agent testing</title>
      <dc:creator>Zhaopeng Xuan</dc:creator>
      <pubDate>Mon, 12 Jan 2026 12:12:10 +0000</pubDate>
      <link>https://dev.to/robin_xuan_nl/p3-ceus-he-cheng-xun-lian-shu-ju-ji-de-fang-shi-zai-agentce-shi-zhong-de-ying-yong-3bbh</link>
      <guid>https://dev.to/robin_xuan_nl/p3-ceus-he-cheng-xun-lian-shu-ju-ji-de-fang-shi-zai-agentce-shi-zhong-de-ying-yong-3bbh</guid>
      <description>&lt;p&gt;Paper: &lt;a href="https://arxiv.org/pdf/2512.01311" rel="noopener noreferrer"&gt;link&lt;/a&gt;&lt;br&gt;
Code: &lt;a href="https://github.com/modelscope/AgentEvolver/tree/main/research/CuES" rel="noopener noreferrer"&gt;link&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Exploration in CuES
&lt;/h3&gt;

&lt;p&gt;CuES expects a human to provide an exploration direction (e.g. a single sentence), while the interactive sandbox supplies a concept pool. In the actual implementation, this concept pool is built by reading 100 seed tasks from the sandbox and letting an LLM extract the nouns/concepts in those seed tasks. This is what CuES calls top-down guidance.&lt;/p&gt;
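&lt;p&gt;The top-down concept pool can be sketched roughly like this (a hedged illustration: the function names and the noun-extraction stand-in are assumptions, not the actual CuES code):&lt;/p&gt;

```python
def build_concept_pool(seed_tasks, llm_extract_nouns):
    """Collect the nouns/concepts an LLM extracts from each seed task."""
    pool = set()
    for task in seed_tasks:
        pool.update(llm_extract_nouns(task))
    return sorted(pool)

# Stand-in for the LLM call, for illustration only: keep capitalized words.
fake_llm = lambda task: [w for w in task.split() if w.istitle()]
print(build_concept_pool(["Book a Flight to Paris", "Cancel a Hotel"], fake_llm))
```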

&lt;p&gt;The other CuES mechanism is bottom-up exploration. The sandbox first provides its system prompt as the initial observation Observation0; then, guided by the concept pool and the exploration direction, the agent explores for up to 20 steps, producing up to 20 triplets. After those 20 consecutive steps, an LLM computes an exploratory memory over all triplets generated under the current env_id/task_id; that memory is later used as guidance in the task-synthesis stage.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="n"&gt;history&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;init_obs&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="c1"&gt;#这里将沙盒提供的system prompt也作为历史
&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="bp"&gt;...&lt;/span&gt;
&lt;span class="n"&gt;obs_str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;step&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;#这一步将推理出的action传入沙盒执行
&lt;/span&gt;&lt;span class="bp"&gt;...&lt;/span&gt;

&lt;span class="n"&gt;save_history&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:]&lt;/span&gt;
&lt;span class="nc"&gt;Triplet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;env_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;env_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;save_history&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;:]),&lt;/span&gt;
    &lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;observation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;obs_str&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Task synthesis in CuES
&lt;/h3&gt;

&lt;p&gt;All triplets generated in the previous step for each task_id/env_id, together with the exploratory memory of that env_id/task_id, are used to synthesize new tasks. Each synthesized task also contains predicted trajectory steps; this predicted path to completing the synthetic task is saved as its ground_truth. &lt;/p&gt;
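&lt;p&gt;The synthesis step can be sketched as follows (a minimal sketch: the synthesize_llm helper and the task fields are assumptions based on the description here, not the actual CuES code):&lt;/p&gt;

```python
from collections import defaultdict

def synthesize_tasks(triplets, memories, synthesize_llm):
    """Group triplets by env_id, then infer a new task per environment."""
    by_env = defaultdict(list)
    for t in triplets:                      # t = (env_id, s, a, o)
        by_env[t[0]].append(t[1:])
    tasks = []
    for env_id, steps in by_env.items():
        prompt = f"memory: {memories.get(env_id, '')}\nsteps: {steps}"
        task = synthesize_llm(prompt)       # returns query, confidence, action_sequence
        task["env_id"] = env_id
        # the predicted action sequence is stored as the ground_truth
        task["ground_truth"] = task.pop("action_sequence")
        tasks.append(task)
    return tasks
```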

&lt;h3&gt;
  
  
  Validating CuES synthetic tasks
&lt;/h3&gt;

&lt;p&gt;Likewise, each synthetic task is executed under its corresponding task_id/env_id, again as a multi-turn execution. The main criterion for validating a synthetic task's executability is to pass the executed steps and the synthetic task to an LLM, i.e. LLM-as-a-judge decides whether the task was completed correctly.&lt;/p&gt;
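&lt;p&gt;The LLM-as-a-judge check can be sketched like this (env.run and judge_llm are hypothetical stand-ins for the sandbox execution and the judging model, not CuES APIs):&lt;/p&gt;

```python
def validate_task(task, env, judge_llm):
    """Execute the synthetic task, then let an LLM judge the outcome."""
    steps = env.run(task["query"])          # multi-turn execution of the task
    verdict = judge_llm(
        f"task: {task['query']}\nexecuted steps: {steps}\n"
        "Was the task completed correctly? Answer yes or no."
    )
    return verdict.strip().lower().startswith("yes")
```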

&lt;h3&gt;
  
  
  Rewriting successfully executed synthetic tasks
&lt;/h3&gt;

&lt;p&gt;Note that this step does not rewrite the ground_truth of previously successful synthetic tasks; it rewrites the query of each successfully executed synthetic task, making it harder or easier and thereby expanding into new synthetic tasks.&lt;/p&gt;
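&lt;p&gt;The expansion step can be sketched like this (rewrite_llm is a hypothetical stand-in; only the query changes, matching the point that the stored ground_truth is not rewritten here):&lt;/p&gt;

```python
def expand_task(task, rewrite_llm, levels=("easier", "harder")):
    """Produce difficulty-varied variants of a successful synthetic task."""
    variants = []
    for level in levels:
        new_task = dict(task)               # the stored ground_truth is untouched
        new_task["query"] = rewrite_llm(
            f"Rewrite this query to be {level}, keeping the same meaning:\n"
            f"{task['query']}"
        )
        variants.append(new_task)
    return variants
```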

&lt;h3&gt;
  
  
  Main implementation flow
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feoxv5f5hl174jvki25nu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feoxv5f5hl174jvki25nu.png" alt=" " width="800" height="519"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffcfz5ws1po4suupqjt00.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffcfz5ws1po4suupqjt00.png" alt=" " width="800" height="1032"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>P2 - A close reading of the self-question code in AgentEvolver</title>
      <dc:creator>Zhaopeng Xuan</dc:creator>
      <pubDate>Fri, 09 Jan 2026 17:21:33 +0000</pubDate>
      <link>https://dev.to/robin_xuan_nl/p2-agentevolver-de-self-question-bu-fen-dai-ma-jing-du-3k4i</link>
      <guid>https://dev.to/robin_xuan_nl/p2-agentevolver-de-self-question-bu-fen-dai-ma-jing-du-3k4i</guid>
      <description>&lt;p&gt;Paper: &lt;a href="https://arxiv.org/pdf/2511.10395" rel="noopener noreferrer"&gt;link&lt;/a&gt;&lt;br&gt;
Code: &lt;a href="https://github.com/modelscope/AgentEvolver" rel="noopener noreferrer"&gt;link&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Goal: from a quality-engineering perspective, understand the implementation of the self-question module in AgentEvolver, and consider whether it could serve as a test-generation method for agent/multi-agent workflows.&lt;/p&gt;

&lt;p&gt;One module of AgentEvolver is self-question. The overall idea: inside an interactive sandbox, given a human-written environment profile and the sandbox's built-in seed tasks, a large SOTA model (qwen-plus) explores the sandbox; the result of exploration is the sandbox's capability range (trajectories). Each seed task and the corresponding trajectories are then passed to an even larger SOTA model (qwen3-235b-a22b-instruct-2507), which infers new synthetic tasks (broader in scope, deeper in difficulty) from that capability range. After two rounds of filtering (duplicate tasks, non-executable tasks), the executable synthetic tasks remain, and their trajectories become the ground_truth. Throughout the self-question stage there is therefore no reward mechanism for self-question itself. &lt;/p&gt;

&lt;p&gt;So the core logic of self-question is to use a large model to generate synthetic tasks whose trajectories serve as ground_truth, which is then used for reinforcement learning on a smaller model (qwen-2.5-14B-Instruct).&lt;/p&gt;
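&lt;p&gt;The pipeline can be sketched end to end as follows (every callable argument is a hypothetical stand-in for the real AgentEvolver component, not its actual API):&lt;/p&gt;

```python
def self_question(env_profile, seed_tasks, explore, synthesize, executable):
    """Sketch of explore -> synthesize -> filter (dedup, executability)."""
    trajectories = [explore(env_profile, seed) for seed in seed_tasks]
    tasks = [synthesize(env_profile, traj) for traj in trajectories]
    # filter 1: drop tasks with duplicate queries
    unique, seen = [], set()
    for t in tasks:
        if t["query"] not in seen:
            seen.add(t["query"])
            unique.append(t)
    # filter 2: keep only tasks that actually execute; the executed
    # trajectory becomes the ground_truth for RL on the smaller model
    kept = []
    for t in unique:
        ok, traj = executable(t)
        if ok:
            t["ground_truth"] = traj
            kept.append(t)
    return kept
```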

&lt;p&gt;Below are the key steps of each self-question stage: &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhimq8a8iebweotzpcs89.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhimq8a8iebweotzpcs89.png" alt="step1" width="800" height="1036"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdl7ept8b8rl1vqkq6jn4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdl7ept8b8rl1vqkq6jn4.png" alt="Step 2" width="800" height="987"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fic0irymsd90cyvf1uhux.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fic0irymsd90cyvf1uhux.png" alt="Step 3" width="800" height="889"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Reflections
&lt;/h3&gt;

&lt;h4&gt;
  
  
  1. The peculiarities and limitations of the interactive sandbox
&lt;/h4&gt;

&lt;p&gt;A sandbox is finite, but the real user world is unbounded. Synthetic tasks are confined to the current sandbox; covering a new world requires building a different interactive sandbox, which raises the cost of testing.&lt;/p&gt;

&lt;p&gt;Beyond that, the core idea of self-question is that on a SOTA LLM base, the trajectory of any task it can complete becomes the ground truth; this is the step that generates ground truth. But if the state machine inside the interactive sandbox itself has bugs, the generated tests may not be entirely correct either. The sandbox's state machine then needs testing too, which adds cost and uncertainty to using the self-question module on its own for test-case generation.&lt;/p&gt;

&lt;h4&gt;
  
  
  2. Can task generation from large-to-small-model reinforcement training be transplanted to QA test-case generation?
&lt;/h4&gt;

&lt;p&gt;Partially. When testing an agent / multi-agent workflow, the tests that self-question currently generates are all tasks a large model can already complete. If a task the large model executes successfully fails on the small model, such a test case exposes the small model's limitations, but one could object: why doesn't my agent/multi-agent workflow under test simply use the large model? This kind of test is useful in certain situations, e.g. when the agent/multi-agent workflow must use a smaller model for cost and efficiency reasons. &lt;/p&gt;

&lt;h4&gt;
  
  
  3. Double cost
&lt;/h4&gt;

&lt;p&gt;Exploring the interactive sandbox requires actually running a high-temperature SOTA model, and in the filtering stage, to generate the ground_truth, we still need to run the SOTA model again to actually execute the synthetic tasks.&lt;/p&gt;

&lt;h4&gt;
  
  
  4. Depth and breadth of task generation
&lt;/h4&gt;

&lt;p&gt;Synthesizing broader and deeper tasks requires a very large LLM base, here qwen3-235b-a22b-instruct-2507. The breadth and depth of task generation are controlled entirely by a static prompt. The upside is that the whole generation process needs no vLLM; the downside is that a SOTA model is used throughout, and the control over generation is one-dimensional.&lt;/p&gt;

&lt;h4&gt;
  
  
  5. Multi-turn interaction during exploration and replay
&lt;/h4&gt;

&lt;p&gt;Both the exploration and the filtering stages use multi-turn interaction, which yields more accurate trajectories, i.e. a more precise ground_truth. &lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>llm</category>
      <category>testing</category>
    </item>
    <item>
      <title>P1 - A close reading of the curriculum agent in Agent0</title>
      <dc:creator>Zhaopeng Xuan</dc:creator>
      <pubDate>Tue, 06 Jan 2026 23:22:20 +0000</pubDate>
      <link>https://dev.to/robin_xuan_nl/p1-agent0-zhong-de-curriculum-agentjing-du-39mi</link>
      <guid>https://dev.to/robin_xuan_nl/p1-agent0-zhong-de-curriculum-agentjing-du-39mi</guid>
      <description>&lt;p&gt;Paper: &lt;a href="https://arxiv.org/pdf/2511.16043" rel="noopener noreferrer"&gt;Agent0: Unleashing Self-Evolving Agents from Zero Data&lt;br&gt;
via Tool-Integrated Reasoning&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Code: &lt;a href="https://github.com/aiming-lab/Agent0" rel="noopener noreferrer"&gt;https://github.com/aiming-lab/Agent0&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Goal: from a quality-engineering perspective, understand how Agent0's curriculum agent is trained, and whether it can be used for automated dynamic testing of agents/agent workflows. &lt;/p&gt;
&lt;h3&gt;
  
  
  Highlights of the paper
&lt;/h3&gt;

&lt;p&gt;The paper's thesis is to set up two agents in a continuously rising spiral (a virtuous cycle) that improves problem-solving ability without any human-labeled data (aka zero data). One agent, the curriculum agent, generates "questions"; the other, the executor agent, produces the corresponding "answers". At time t, the curriculum agent tries its best to find questions that confuse the executor agent of time t-1; the executor agent then tries its best to solve the questions the curriculum agent generated at time t, producing the weights (policy) of time t. Round after round, the curriculum agent keeps locating the executor agent's capability edge, while the executor agent keeps expanding its capability range.&lt;/p&gt;
&lt;h3&gt;
  
  
  Highlights of the curriculum agent
&lt;/h3&gt;

&lt;p&gt;As a QA, I mainly care about how to find the executor agent's capability range, and how to generate effective test cases at the edge of that range, or at blind spots inside it, to uncover quality degradation. &lt;/p&gt;

&lt;p&gt;The curriculum agent's main workflow is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;S1 - Use LLM L1 as the initial model/weights: the code uses Qwen/Qwen3-4B-Base&lt;/li&gt;
&lt;li&gt;S2 - Start reinforcement learning from LLM L1's current weights&lt;/li&gt;
&lt;li&gt;S3 - RL: LLM L1 generates X predictions from the following prompt, each containing a question and a reference answer
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are an expert competition-math problem setter.&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;FIRST, in your private scratch-pad, think step-by-step to design a brand-new, non-trivial problem. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The problem could come from any field of mathematics, including but not limited to algebra, geometry, number theory, combinatorics, prealgebra, probability, statistics, and calculus. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Aim for a difficulty such that fewer than 30 % of advanced high-school students could solve it. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Avoid re-using textbook clichés or famous contest problems.&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;THEN, without revealing any of your private thoughts, output **exactly** the following two blocks:&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;question&amp;gt;&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{The full problem statement on one or more lines}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;/question&amp;gt;&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;\boxed{final_answer}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Do NOT output anything else—no explanations, no extra markup.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Generate one new, challenging reasoning question now. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Remember to format the output exactly as instructed.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ul&gt;
&lt;li&gt;S4 - RL: send the newly generated questions to the executor agent (without reference answers); each question is answered N times, yielding N answers&lt;/li&gt;
&lt;li&gt;S5 - RL: for each question x and its N answers, compute the reward and the advantage, then update the weights with GRPO&lt;/li&gt;
&lt;li&gt;Repeat&lt;/li&gt;
&lt;/ul&gt;
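&lt;p&gt;The group-relative advantage computed in S5 can be sketched as follows (a minimal, dependency-free illustration of GRPO-style normalization, not the verl implementation):&lt;/p&gt;

```python
def grpo_advantages(rewards, eps=1e-6):
    """Normalize each reward against its group's mean and std (GRPO-style)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# rewards above the group mean get a positive advantage, below a negative one
print(grpo_advantages([1.0, 0.5, 0.0]))
```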
&lt;h3&gt;
  
  
  The curriculum agent's reward function
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;R(format): format correctness&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;R(uncertain): self-consistency, i.e. the agreement among multiple answers to the same question&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;R(tool_usage): whether external tools are called frequently&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;-R(repetition): the repetitiveness of the generated questions&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
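&lt;p&gt;One way these four signals could combine into a scalar reward (the uncertainty formula and the tool-reward constants follow the paper and code discussed in this post; the aggregation itself is a simplified assumption, not the exact Agent0 formula):&lt;/p&gt;

```python
def curriculum_reward(fmt_ok, consistency, tool_calls, is_repeat,
                      tool_weight=0.05, tool_cap=4):
    """Combine the four signals into one scalar (simplified aggregation)."""
    r_format = 1.0 if fmt_ok else 0.0
    r_uncertain = 1.0 - 2.0 * abs(consistency - 0.5)   # peaks at 0.5
    r_tool = min(tool_calls, tool_cap) * tool_weight   # capped tool bonus
    r_repetition = 1.0 if is_repeat else 0.0           # repetition penalty
    return r_format + r_uncertain + r_tool - r_repetition
```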
&lt;h3&gt;
  
  
  Potential risks of training the CA in isolation, without the EA
&lt;/h3&gt;
&lt;h4&gt;
  
  
  1. Incorrect reference answers (golden answers)
&lt;/h4&gt;

&lt;blockquote&gt;
&lt;p&gt;The code below appears in start_vllm_server_tool.py. While the CA is being trained, this code launches the EA on vLLM; once the EA produces answers under its current weights, the self-consistency score is computed. Note: this issue does not arise during EA training; it matters only when we want to use the CA on its own. During EA training, the EA reads the questions the trained CA generated, but it does not use the answers the CA provides; instead it generates multiple answers to the same question again and derives a pseudo-label by majority vote to train the EA. EA training is out of scope for this post.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The curriculum agent must be smarter than the executor agent. In step &lt;code&gt;S3&lt;/code&gt; above, if the generated reference answer (golden answer) is incorrect while the executor agent's answer is correct, a correct score cannot be produced (if the reference answer != majority_ans, the score is set to 0 even if one was computed). Here is the corresponding Agent0 code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="bp"&gt;...&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;    &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;grade_answer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;majority_ans&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;golden_answer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mf"&gt;0.1&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="bp"&gt;...&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Challenges in extending the curriculum agent
&lt;/h3&gt;

&lt;h4&gt;
  
  
  1. Computing self-consistency
&lt;/h4&gt;

&lt;p&gt;One important component of the reward function uses answer uncertainty to locate the executor agent's capability edge: it selects questions whose self-consistency lies in [0.3, 0.8] (neither too easy nor too hard, but centered on questions that confuse the EA).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqkkfzpddqsq5ed68tqp8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqkkfzpddqsq5ed68tqp8.png" alt=" " width="676" height="110"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In Agent0 this is computed via a majority vote over the answers. Suppose question x1 gets the answers [y1, y2, y3, y4]; for example, if x1 is "What is 1+1?" and the answers are [2, 2, 3, 2], the majority vote is "2", and the self-consistency is:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                           3 / 4 = 0.75
            R(uncertain) = 1 - 2 * | 0.75 - 0.5 | = 0.5
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
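&lt;p&gt;The calculation above in code (a sketch; Agent0's real implementation additionally normalizes answer strings with grade_answer before voting):&lt;/p&gt;

```python
from collections import Counter

def r_uncertain(answers):
    """Majority-vote self-consistency mapped into the uncertainty reward."""
    majority_count = Counter(answers).most_common(1)[0][1]
    p = majority_count / len(answers)       # 3/4 = 0.75 for ["2","2","3","2"]
    return 1.0 - 2.0 * abs(p - 0.5)

assert r_uncertain(["2", "2", "3", "2"]) == 0.5   # 1 - 2*|0.75 - 0.5| = 0.5
```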

&lt;p&gt;Here is the problem: for subjective rather than objective questions, say x2 is "What is the weather like in the Netherlands today?" with the answers ["nice weather", "rain", "light rain", "cloudy turning to light rain"], the current Agent0 code cannot compute self-consistency directly. &lt;/p&gt;

&lt;h4&gt;
  
  
  2. Tool-call incentives
&lt;/h4&gt;

&lt;p&gt;Another important incentive in the CA is the number of tool calls: the paper rewards between 0 and 4 tool calls. Importantly, the tool-call count is extracted from the text of the question/task the CA generates, not from the trajectory of the EA's execution. Describing tool calls explicitly in the question/task has a big advantage: the CA cannot collapse because of the EA's capability gaps. If we instead extracted tool calls from the EA's executed trajectory and the EA lacked the ability to run some tool, the CA would never be rewarded. I think a combination of the two approaches might work better.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# 以下是Agent0中的实际实现的代码，在CA计算reward的时候
&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;calculate_tool_reward&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;weight&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.05&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cap&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;

    &lt;span class="n"&gt;tool_call_count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;findall&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;\`\`\`output&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="n"&gt;capped_calls&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tool_call_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cap&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;capped_calls&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;weight&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  3. Multi-turn dialogue
&lt;/h4&gt;

&lt;p&gt;While the CA is being trained, the Agent0 paper suggests giving the EA a special system prompt so that it holds a multi-turn dialogue instead of answering directly; the goal is better EA execution results, which in turn lets the CA evolve better. During CA training, the EA is really just an LLM wrapper with no tool calling, so the prompt below handles this specially, forcing the EA to output the code to be executed whenever it needs a tool (running Python).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# 这个prompt是来自Agent0的代码，这个和论文中A.2 Prompt不太一样，但是目的一样
&lt;/span&gt;
&lt;span class="n"&gt;SYSTEM_PROMPT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Solve the following problem step by step. You now have the ability to selectively write executable Python code to enhance your reasoning process.&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;First, provide your reasoning and write a self-contained Python code block wrapped in \`\`\`python ... \`\`\` to help you calculate the answer. You must use the `print()` function to output the results.&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;After you write the code block, STOP. I will execute it for you.&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;I will then provide the output under &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Code execution result:&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;. You must use this result (even if it&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s an error) to continue your reasoning and provide the final answer.&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The final answer must be enclosed in &lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s"&gt;boxed{...}.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Code Format:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Each code snippet is wrapped between \`\`\`. You need to use print() to output intermediate results.&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Answer Format:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The last part of your response should be: &lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s"&gt;boxed{...}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  4. Using vLLM
&lt;/h4&gt;

&lt;p&gt;When the executor agent infers answers, Agent0 uses verl to run batched inference on vLLM, so obtaining answers in bulk is cheap; all of this is thanks to vLLM. Now suppose the executor agent's LLM base is not served by vLLM but is a SOTA model (e.g. Gemini 3), or the executor agent is not a single agent but an agent workflow: then we bear a substantial time cost (running the executor agent) and money cost (token consumption), and may even hit rate limits.&lt;/p&gt;

&lt;h3&gt;
  
  
  Experiments
&lt;/h3&gt;

&lt;p&gt;I chose Runpod to train only the curriculum agent; the following points need attention:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;For both the curriculum agent and the executor agent I used Qwen/Qwen3-0.6B-Base on 2x32G CUDA devices, isolating the two agents on separate CUDA devices&lt;/li&gt;
&lt;li&gt;With Qwen/Qwen3-4B-Base, training the curriculum agent needs at least 48G&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>machinelearning</category>
      <category>testing</category>
    </item>
    <item>
      <title>Say Hello to Your New QA Teammate: E2E Test AI Agent</title>
      <dc:creator>Zhaopeng Xuan</dc:creator>
      <pubDate>Mon, 25 Aug 2025 23:18:20 +0000</pubDate>
      <link>https://dev.to/robin_xuan_nl/5-minutes-of-human-ai-interaction-from-requirements-to-e2e-test-result-1o71</link>
      <guid>https://dev.to/robin_xuan_nl/5-minutes-of-human-ai-interaction-from-requirements-to-e2e-test-result-1o71</guid>
      <description>&lt;p&gt;&lt;em&gt;For some background, see &lt;a href="https://dev.to/robin_xuan_nl/practical-applications-of-ai-in-test-automation-context-demo-with-ui-tars-llm-midscene-part-1-5dbh"&gt;The Past Story about using UI-Tars for AI Testing&lt;/a&gt;, which shows practical AI applications in test automation using UI-Tars and Midscene.js.&lt;/em&gt;&lt;/p&gt;




&lt;blockquote&gt;
&lt;p&gt;Last updated: 2025-09-11&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  What you’ll gain from this blog
&lt;/h2&gt;

&lt;p&gt;1️⃣ &lt;strong&gt;You’ll see how AI can supercharge End-to-End testing — not in theory, but through real-world practical demos.&lt;/strong&gt; I’ll walk you through a real-world demo that solves a few day-to-day use cases and demonstrates the following &lt;strong&gt;KEY features of this E2E Test AI-Agent&lt;/strong&gt;: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Write tests entirely in natural language — no coding required&lt;/li&gt;
&lt;li&gt;Create tests interactively, like pair-testing with a QA engineer&lt;/li&gt;
&lt;li&gt;AI Agent handles all test data preparation (e.g. users, products)&lt;/li&gt;
&lt;li&gt;Generated test can be executed later as part of the existing Regression Test suite&lt;/li&gt;
&lt;li&gt;Automatically heal broken element locators and test steps&lt;/li&gt;
&lt;li&gt;Achieve higher stability with less flakiness than manually written scripts&lt;/li&gt;
&lt;li&gt;Identify elements by image prompts&lt;/li&gt;
&lt;li&gt;Run tests at the same speed as Playwright scripts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;2️⃣ I will introduce &lt;strong&gt;the architecture&lt;/strong&gt; of this E2E Test AI Agent &lt;strong&gt;on top of your own E2E automation framework with &lt;a href="https://midscenejs.com/" rel="noopener noreferrer"&gt;Midscene.js&lt;/a&gt;&lt;/strong&gt;, so that you can build your own Agent.&lt;/p&gt;

&lt;p&gt;3️⃣ &lt;strong&gt;In conclusion&lt;/strong&gt;, I’ll share future steps and upcoming challenges, drawing on insights gathered from a real workshop with a UX designer, a Product Manager, and a QA engineer.&lt;/p&gt;

&lt;p&gt;🚀 &lt;strong&gt;In the end, you can build your own E2E Test AI Agent in your organization and benefit from it as your new teammate, achieving a shift-left, scalable, stable, fast, low-maintenance, and low-cost AI-based E2E test solution.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Let's see the Demo first
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;00:00 Interactively generate a test by reading acceptance criteria&lt;/li&gt;
&lt;li&gt;01:50 Run the generated test from the beginning&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/pWwptoHq-mY"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;




&lt;h2&gt;
  
  
  Use cases demonstrated via the E2E Test AI Agent
&lt;/h2&gt;

&lt;p&gt;These are the goals I would like to achieve by applying LLMs to E2E automation testing.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Make E2E test automation writing and maintenance more efficient&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;from &lt;strong&gt;~30 minutes&lt;/strong&gt; to write 2 or 3 similar E2E tests to &lt;strong&gt;0 minutes&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;from &lt;strong&gt;endless maintenance&lt;/strong&gt; to &lt;strong&gt;a few hours&lt;/strong&gt; of maintenance per week. &lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;Shift E2E automation testing left&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;from &lt;strong&gt;QA drafting test cases&lt;/strong&gt; to &lt;strong&gt;reusing PM's acceptance criteria &amp;amp; UX's design as test cases&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;Unblock Product managers and UX designers in E2E automation testing&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;from &lt;strong&gt;coding skills required&lt;/strong&gt; to &lt;strong&gt;no coding skills required&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;




&lt;h2&gt;
  
  
  Key features with supported use cases
&lt;/h2&gt;

&lt;h4&gt;
  
  
  1. Principles
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F15d86nnvv4j93vbiwwkm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F15d86nnvv4j93vbiwwkm.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Imagine every Quality participant has an E2E Test AI Agent sitting beside them, listening closely as they walk through the acceptance criteria for each feature to validate, almost like pair testing with an AI partner.&lt;/p&gt;

&lt;p&gt;As a human: you describe your acceptance criteria step by step, feature by feature, aligning them with the UX design, while collaborating with the AI Agent.&lt;/p&gt;

&lt;p&gt;As the E2E Test AI Agent: it listens to each request in sequence, generating code along the way and eventually assembling it into a complete, reusable E2E test script that can be executed repeatedly in CI.&lt;/p&gt;

&lt;h4&gt;
  
  
  2. Natural language creates tests interactively
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcyew9xqf9yo17eq3y8z6.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcyew9xqf9yo17eq3y8z6.gif" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The E2E Test AI Agent we developed at CreativeFabrica has knowledge of the application’s context and is able to interpret references such as what “Home page” means.&lt;/p&gt;

&lt;h4&gt;
  
  
  3. Handles Test data preparation
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6hc0hqqw0e8s2crfjf8p.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6hc0hqqw0e8s2crfjf8p.gif" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Unlike typical QA AI agents, the solution we developed integrates with our existing E2E automation framework. As a result, it can go beyond simple browser interactions to perform backend operations, such as preparing test data without relying on the UI.&lt;/p&gt;

&lt;h4&gt;
  
  
  4. Generates less flaky &amp;amp; self-healing test code
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmxb1gw9ozgcz2gsr78wk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmxb1gw9ozgcz2gsr78wk.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Rather than simply translating natural language into &lt;a href="https://midscenejs.com/" rel="noopener noreferrer"&gt;Midscene.js&lt;/a&gt; calls, we designed our code generation to follow testing best practices. This way, we avoid the flaky behavior you’d often see with plain midscene.js, especially around async waits and lazy loading.&lt;/p&gt;

&lt;p&gt;For example, when the human says "Click the 1st Product", the agent generates the following code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;      &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;aiWaitFor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;1st product is visible&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;all&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
        &lt;span class="nx"&gt;cleanPage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;waitForURL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;href&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;includes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;/product/autopub-graphic/&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;}),&lt;/span&gt;
        &lt;span class="nf"&gt;aiTap&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;1st product&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It leverages AI to identify and click the “1st product,” but the action is wrapped in our best-practice code, so the generated test remains stable while still being self-healing.&lt;/p&gt;

&lt;h4&gt;
  
  
  5. Locate elements by screenshot
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0f2i72iu5wuqcmcce901.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0f2i72iu5wuqcmcce901.gif" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Thanks to the new feature in &lt;a href="https://midscenejs.com/" rel="noopener noreferrer"&gt;Midscene.js&lt;/a&gt;, our E2E Test AI Agent can now use images as locators! Instead of relying only on traditional selectors, we can simply take a screenshot of the application or even a UI design and use it directly as a locator. &lt;/p&gt;

&lt;p&gt;This makes it much easier to test complex systems, especially those built with Canvas or even AI-driven interfaces—where conventional locators often fall short.&lt;/p&gt;

&lt;h4&gt;
  
  
  6. Fast test execution &amp;amp; relatively low cost
&lt;/h4&gt;

&lt;p&gt;Thanks to the &lt;a href="https://midscenejs.com/" rel="noopener noreferrer"&gt;Midscene.js&lt;/a&gt; caching mechanism, once a test case is generated, most elements used during AI actions are stored. This means we don’t need to call the LLM on every CI execution (saving budget), only when elements change or when the planned steps are updated. Since the underlying test execution is Playwright-based, you get Playwright’s execution speed.&lt;/p&gt;




&lt;h2&gt;
  
  
  E2E Test AI Agent Architecture
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc8lb9o27qxxvyv1vp6xi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc8lb9o27qxxvyv1vp6xi.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At the heart of our design is a “plug-and-play” philosophy. We treat existing tools like midscene.js functions, Playwright functions, and our in-house E2E automation framework as interchangeable modules that the AI agent can call, almost like snapping together Lego blocks.&lt;/p&gt;

&lt;p&gt;But we didn’t stop there. Both midscene.js and our framework are repackaged with best practices baked right into the code generation process, so the generated code is stable and reliable by design.&lt;/p&gt;

&lt;p&gt;To boost the agent’s reasoning ability, we also added a dual-layer RAG setup:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The first layer helps the agent with semantic understanding of the human's input.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The second layer supports the LLM's action planning of each step, invoked through the &lt;code&gt;aiAction()&lt;/code&gt; function exposed by Midscene.js.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This way, the agent doesn’t just execute blindly, it understands the intent, makes a plan, and then carries it out with solid testing practices behind it.&lt;/p&gt;
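&lt;p&gt;&lt;em&gt;As a rough illustration of the dual-layer idea, here is a minimal sketch in which simple keyword matching stands in for the real retrieval, and the entries are invented examples rather than our actual knowledge base:&lt;/em&gt;&lt;/p&gt;

```typescript
// Minimal sketch of the dual-layer RAG idea. Illustrative only: keyword
// matching stands in for real embedding-based retrieval, and the entries
// below are invented examples.

const semanticLayer: Record<string, string> = {
  // Layer 1: application context that grounds the human's wording.
  "home page": "URL path /, shows the product grid",
};

const planningLayer: Record<string, string> = {
  // Layer 2: best-practice hints consumed during aiAction() planning.
  "click": "wrap aiTap() in Promise.all with waitForURL to avoid flakiness",
};

// Layer 1 lookup: ground a natural-language step in application context.
export function groundInput(humanStep: string): string | undefined {
  const key = Object.keys(semanticLayer).find((k) =>
    humanStep.toLowerCase().includes(k)
  );
  return key ? semanticLayer[key] : undefined;
}

// Layer 2 lookup: fetch the best-practice hint for a planned action.
export function planningHint(action: string): string | undefined {
  return planningLayer[action];
}
```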

&lt;h3&gt;
  
  
  Key Components/Layers
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;- Layer 1: From Human Language to Defined Tools&lt;/strong&gt;&lt;br&gt;
Translates natural-language acceptance criteria into structured, executable actions and assertions, and produces outputs in different formats.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;- Layer 2: Framework Integration&lt;/strong&gt;&lt;br&gt;
Wraps your existing E2E test framework, Midscene.js functions, and native Playwright functions as LLM tools, and ensures all LLM tools share a single Playwright browser context.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;- Layer 3: AI-Driven Planning&lt;/strong&gt;&lt;br&gt;
Midscene.js orchestrates the rest: planning AI steps, interpreting the current screenshot and HTML DOM, and deciding the next best action autonomously.&lt;/p&gt;
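&lt;p&gt;&lt;em&gt;To make Layer 2 concrete, here is a minimal, hypothetical sketch of how an existing framework helper could be registered as an LLM tool. The names (ToolDef, createUserViaApi, sharedContext) are illustrative, not our actual implementation:&lt;/em&gt;&lt;/p&gt;

```typescript
// Hypothetical sketch of Layer 2: repackaging an existing framework helper
// as an LLM tool. All names here are illustrative placeholders.

type ToolDef = {
  name: string;
  description: string; // shown to the LLM so it can choose the right tool
  run: (args: Record<string, string>) => Promise<string>;
};

// Stand-in for the single shared Playwright browser context/session state.
const sharedContext = { baseUrl: "https://example.test" };

// An existing framework helper (backend test-data preparation) as a tool.
const createUserViaApi: ToolDef = {
  name: "create_test_user",
  description: "Prepares a test user via the backend, bypassing the UI.",
  run: async (args) => `created user ${args.email} on ${sharedContext.baseUrl}`,
};

const registry = new Map<string, ToolDef>([[createUserViaApi.name, createUserViaApi]]);

// The agent resolves a tool by name (as chosen by the LLM) and invokes it.
export async function invokeTool(name: string, args: Record<string, string>): Promise<string> {
  const tool = registry.get(name);
  if (!tool) throw new Error(`Unknown tool: ${name}`);
  return tool.run(args);
}
```

&lt;p&gt;The point of the registry is that every tool closes over the same shared context, so backend data preparation and UI steps operate against one browser session.&lt;/p&gt;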

&lt;h3&gt;
  
  
  The test case metadata
&lt;/h3&gt;

&lt;p&gt;In the current paradigm, a test case created and autonomously executed by an E2E Test AI Agent contains three core metadata layers:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Acceptance Criteria (Human Input):&lt;br&gt;
The intent, expressed directly in natural language by humans.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Executable Test Code (Auto-generated):&lt;br&gt;
Playwright-based code that integrates seamlessly with your existing E2E framework.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Element Locator Cache (AI-generated by Midscene.js):&lt;br&gt;
Cached mappings of HTML elements for fast execution. Only when the cache expires or is missing does the agent call the LLM again to resolve new locators.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;With these three, a test case is no longer a static artifact, but a self-adaptive entity that balances human intent, system execution, and AI-assisted resilience.&lt;/p&gt;
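&lt;p&gt;&lt;em&gt;A hypothetical TypeScript shape for these three layers (field names are illustrative, not the actual exported file format):&lt;/em&gt;&lt;/p&gt;

```typescript
// Hypothetical shape of the three metadata layers described above.
// Field names are illustrative, not the actual exported file format.

interface E2ETestCaseMetadata {
  // 1. Human input: the intent in natural language.
  acceptanceCriteria: string[];
  // 2. Auto-generated: Playwright-based executable test code.
  testCode: { filePath: string; framework: "playwright" };
  // 3. AI-generated by Midscene.js: cached element locator mappings.
  locatorCache: { prompt: string; cachedAt: string }[];
}

export const example: E2ETestCaseMetadata = {
  acceptanceCriteria: [
    "Given a logged-in user, when they open the Home page, then the 1st product is visible",
  ],
  testCode: { filePath: "tests/home.spec.ts", framework: "playwright" },
  locatorCache: [{ prompt: "1st product", cachedAt: "2025-09-11" }],
};
```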

&lt;p&gt;Here are the exported files from the demo in the video:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzbearc6xu07ewlktjfcp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzbearc6xu07ewlktjfcp.png" alt="Exported files"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  My Conclusions
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;AI in E2E automation testing is poised to grow rapidly and inevitably expand into many other areas of testing. This is not just a possibility, but a clear and unstoppable trend. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;With the current state of AI, relying solely on LLMs to handle complex automated testing is still challenging. However, when combined with existing E2E test automation frameworks, it can deliver a much better and more practical experience.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;We will undoubtedly see more AI-powered practical testing tools emerging. But I strongly recommend using Midscene.js when it comes to integration with Playwright. We may soon see a new solution called Vibium (which, as I understand, is still under development and has been proposed by the creator of Selenium).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;For QA engineering, this is more than just the development of a new tool, it represents a transformation in how we approach the entire testing process and quality management.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Challenges
&lt;/h2&gt;

&lt;p&gt;I set up a 1-hour workshop session with a QA engineer, a Product Manager, and a UX designer to play with this E2E Test AI Agent, backed by the Gemini 2.5 Flash LLM. I collected the following challenges: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Intelligence: Can the system accurately understand the intent of a test step? To achieve this, we need well-defined acceptance criteria, along with clear guidelines on how to write acceptance criteria in a way that can be interpreted by an LLM.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Accuracy: Can the LLM’s vision correctly determine the coordinates of elements to interact with? When using gemini-2.5-flash, the model can accurately locate around 80% of larger HTML elements, but about 20% have coordinate deviations. For smaller elements, like checkboxes, it often fails to locate them entirely. To address this, we employ different models for different tasks: a smaller model for semantic analysis, a “deep-thinking” model such as DeepSeek-R1 for planning, and a vision-optimized model like UI-Tars for precise element localization.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Step Interactivity, Repeatability, Reversibility, and Stability: Each interaction between the human and the AI should be repeatable, reversible, and stable. We observed that the AI sometimes performs unnecessary actions, which can still generate code. Humans may need to correct previous mistakes, so every step must be atomic to ensure reliability and proper rollback.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
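&lt;p&gt;&lt;em&gt;The per-task model routing mentioned above can be sketched as a simple lookup. The routing table below is hypothetical, not our production configuration, and the model identifiers are placeholders:&lt;/em&gt;&lt;/p&gt;

```typescript
// Illustrative model-routing sketch for the multi-model idea above.
// The routing table is hypothetical; the identifiers are placeholders.

type TaskKind = "semantic-analysis" | "planning" | "element-localization";

const modelByTask: Record<TaskKind, string> = {
  "semantic-analysis": "gemini-2.5-flash", // small, cheap model
  "planning": "deepseek-reasoner",         // deliberate "deep-thinking" model
  "element-localization": "ui-tars-7b-sft", // vision-optimized for coordinates
};

export function pickModel(task: TaskKind): string {
  return modelByTask[task];
}
```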

&lt;h2&gt;
  
  
  I’d love to hear from you!
&lt;/h2&gt;

&lt;p&gt;Feel free to like, comment, or share this blog, your feedback means a lot!&lt;/p&gt;

&lt;h3&gt;
  
  
  Big thanks to these guys:
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@software{Midscene.js,
  author = {Zhou, Xiao and Yu, Tao},
  title = {Midscene.js: Let AI be your browser operator.},
  year = {2025},
  publisher = {GitHub},
  url = {https://github.com/web-infra-dev/midscene}
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>ai</category>
      <category>qa</category>
      <category>playwright</category>
      <category>testing</category>
    </item>
    <item>
      <title>Stage Conclusion: UI-Tars + RAG = E2E/Exploratory AI Test Agent (Part 3)</title>
      <dc:creator>Zhaopeng Xuan</dc:creator>
      <pubDate>Wed, 26 Feb 2025 09:54:27 +0000</pubDate>
      <link>https://dev.to/robin_xuan_nl/stage-conclusion-ui-tars-rag-a-new-approach-for-automated-e2e-exploratory-testing-part-3-p65</link>
      <guid>https://dev.to/robin_xuan_nl/stage-conclusion-ui-tars-rag-a-new-approach-for-automated-e2e-exploratory-testing-part-3-p65</guid>
      <description>&lt;p&gt;&lt;strong&gt;Articles in this series:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dev.to/robin_xuan_nl/practical-applications-of-ai-in-test-automation-context-demo-with-ui-tars-llm-midscene-part-1-5dbh"&gt;Part 1 - Practical Applications of AI in Test Automation — Context, Demo with UI-Tars LLM &amp;amp; Midscene&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/robin_xuan_nl/practical-applications-of-ai-in-test-automation-ui-tars-vs-gpt-4opart-2-in-midscene-3cci"&gt;Part 2 - Data: UI-Tars VS GPT-4o in Midscene&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/robin_xuan_nl/stage-conclusion-ui-tars-rag-a-new-approach-for-automated-e2e-exploratory-testing-part-3-p65"&gt;Part 3 - Stage Conclusion: UI-Tars + RAG = Stage Conclusion: UI-Tars + RAG = E2E/Exploratory AI Test Agent&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the final article in this series. &lt;/p&gt;

&lt;p&gt;I will present an example demonstrating:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How to integrate UI-Tars' system-2 reasoning with our locally built RAG, using Ollama and LangChain, to create a system that understands high-level user instructions and automates execution based on browser screenshots at each stage.&lt;/li&gt;
&lt;li&gt;Verify the capability of &lt;code&gt;system-2 reasoning&lt;/code&gt; after combining &lt;code&gt;UI-Tars&lt;/code&gt; with a local &lt;code&gt;RAG&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  1. End-to-End Demo
&lt;/h2&gt;

&lt;p&gt;This demo uses &lt;a href="//miro.com"&gt;Miro&lt;/a&gt; as an example to demonstrate the capability to handle a non-B2C, complicated system. The AI Agent follows &lt;strong&gt;a user's single instruction: "Create a new board with 2 sticky notes and link these 2 sticky notes by a line."&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;⚠️VERY IMPORTANT⚠️&lt;/strong&gt;: This demo uses Miro's Free Plan, which is open to everyone. The test was executed fewer than 10 times to verify stability. The authentication part uses my personal Miro Free account (hardcoded already to avoid any other risks). I strongly urge any reader who wants to reproduce this test against a customer-facing product: you &lt;strong&gt;should NOT&lt;/strong&gt; impact the normal usage of the product, and you &lt;strong&gt;MUST&lt;/strong&gt; follow the product's policies.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/L-_vJ23O118"&gt;
&lt;/iframe&gt;
&lt;br&gt;
&lt;strong&gt;(👀 Except for authentication, the rest of the actions are planned and executed by the AI Agent after reading a single &lt;code&gt;High-level User instruction&lt;/code&gt;)&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  2. The explanation
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Before the test, we need to deploy &lt;code&gt;UI-Tars-7B-SFT&lt;/code&gt;:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Step 1: Deploy &lt;code&gt;UI-Tars-7B-SFT&lt;/code&gt; to Hugging Face on 1 x L40S GPU; you can follow the steps from &lt;a href="https://juniper-switch-f10.notion.site/UI-TARS-Model-Deployment-Guide-17b5350241e280058e98cea60317de71" rel="noopener noreferrer"&gt;here&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Step 2: Configure the &lt;code&gt;.env&lt;/code&gt; file for &lt;code&gt;Midscene&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;OPENAI_API_KEY="hf_" // this is can be the HF access key
OPENAI_BASE_URL="https://something/v1" // this can be HF Endpoint URL
MIDSCENE_MODEL_NAME="ui-tars-7b-sft" 
MIDSCENE_USE_VLM_UI_TARS=1 // must tell Midscene to switch to UI-Tars

MIDSCENE_LANGSMITH_DEBUG=1 // enable trace send to Langsmith, help us debug and test this AI Agent

LANGSMITH_API_KEY= 
LANGSMITH_ENDPOINT="https://api.smith.langchain.com"
LANGSMITH_TRACING_V2=true
LANGSMITH_PROJECT=

DEBUG=true //This will enable the OpenAI SDK to print Debug logs to the console
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Step 3: Build your own &lt;code&gt;Ollama&lt;/code&gt; + &lt;code&gt;Langchain&lt;/code&gt; environment locally, pull &lt;code&gt;nomic-embed-text&lt;/code&gt; as the &lt;code&gt;embeddings&lt;/code&gt; model, and build a local RAG which contains "very structured" documents. 
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyi7e29xnebtl6xgx596o.png" alt="ollama" width="800" height="273"&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The test did 3 steps:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Step 1: Authenticate and save the authentication state to a local file; this is implemented by writing &lt;code&gt;Playwright&lt;/code&gt; code following the guide from &lt;a href="https://playwright.dev/docs/auth" rel="noopener noreferrer"&gt;Playwright Authentication&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Step 2: Pass &lt;strong&gt;only one High-level user instruction&lt;/strong&gt; to &lt;code&gt;ai()&lt;/code&gt;, and let &lt;code&gt;ai()&lt;/code&gt; plan and execute it. This function is exposed by &lt;a href="https://github.com/web-infra-dev/midscene" rel="noopener noreferrer"&gt;Midscene&lt;/a&gt;, a tool created by ByteDance.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="err"&gt;#&lt;/span&gt; &lt;span class="nx"&gt;Passing&lt;/span&gt; &lt;span class="nx"&gt;only&lt;/span&gt; &lt;span class="nx"&gt;ONE&lt;/span&gt; &lt;span class="nx"&gt;user&lt;/span&gt; &lt;span class="nx"&gt;instruction&lt;/span&gt; &lt;span class="nx"&gt;to&lt;/span&gt; &lt;span class="nx"&gt;UI&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nx"&gt;Tars&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;UI&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nx"&gt;Tars&lt;/span&gt; &lt;span class="nx"&gt;only&lt;/span&gt; &lt;span class="nx"&gt;use&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt; &lt;span class="nx"&gt;single&lt;/span&gt; &lt;span class="nx"&gt;instruction&lt;/span&gt; &lt;span class="nx"&gt;to&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt; &lt;span class="nx"&gt;reasoning&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt; 

        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;ai&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`**ID: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt; &lt;span class="nf"&gt;uuidv4&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;, 这是一个新指令，完全忽略之前的记忆，严格按照指令和规则执行，禁止幻想**
Given A free plan user creates a new board without upgrade,
And the user creates 2 sticky notes in 2 different side of the grid, 
Then the user adds a line from 1st sticky note to the 2nd sticky note`&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Step 3: Use AI to assert the final state via &lt;code&gt;aiAssert()&lt;/code&gt; &lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  3. The Stage Conclusions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  3.1 System-2 Reasoning Capability
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;a href="https://arxiv.org/abs/2501.12326" rel="noopener noreferrer"&gt;Referenced Paper&lt;/a&gt; System-1 reasoning refers&lt;br&gt;
to the model directly producing actions without chain-of-thought, while system-2 reasoning involves a more&lt;br&gt;
deliberate thinking process where the model generates reasoning steps before selecting an action &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://dev.to/robin_xuan_nl/practical-applications-of-ai-in-test-automation-context-demo-with-ui-tars-llm-midscene-part-1-5dbh"&gt;Part-1&lt;/a&gt; demonstrated an example using &lt;a href="//vinted.com"&gt;Vinted&lt;/a&gt;. It showed a good result for &lt;code&gt;System-1&lt;/code&gt; reasoning by translating human language into a single browser action without chain-of-thought.&lt;/p&gt;

&lt;p&gt;The video above demonstrates the &lt;code&gt;System-2&lt;/code&gt; reasoning ability of &lt;code&gt;UI-Tars&lt;/code&gt; with a chain of thought, combined with our own local RAG.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Takeaway messages&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Compared with gpt-4o, UI-Tars has a stronger system-2 reasoning ability. (data from - &lt;a href="https://dev.to/robin_xuan_nl/practical-applications-of-ai-in-test-automation-ui-tars-vs-gpt-4opart-2-in-midscene-3cci"&gt;Part-2&lt;/a&gt;)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;However, getting proper reasoning from UI-Tars for a specific sector or product requires a structured RAG, or fine-tuning UI-Tars with our own input data. (data from - &lt;a href="https://github.com/web-infra-dev/midscene/issues/426" rel="noopener noreferrer"&gt;RAG for UI-Tars&lt;/a&gt;)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;For a B2C site, with a small RAG together with UI-Tars-7B-SFT, we can achieve a very high-level instruction, e.g. &lt;code&gt;I want to buy a bag&lt;/code&gt; on Vinted. (It has been tested against &lt;a href="//vinted.com"&gt;Vinted.com&lt;/a&gt;, but I cannot share the demo because the test was treated as a robot and the action breached Vinted's policy.)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3.2 Applicable products and scenarios
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;This solution can already serve as a supplement to existing regression E2E automated testing for B2C websites, even without building a RAG, as long as only a few system-2 reasoning steps are required. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;This solution can serve as part of automated exploratory testing for B2C applications; other domains require building a RAG or fine-tuning UI-Tars-SFT with their own structured business data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;This solution can perform GUI and OCR checks to replace the current screenshot assertions in your existing Playwright/Puppeteer tests (only if the purpose of your screenshot assertions is NOT comparing the UI style with given pictures).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3.3 Problems &amp;amp; Risk
&lt;/h3&gt;

&lt;p&gt;You might be excited! It can already address real challenges in QA engineering, but it requires cautious use and ongoing development.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;One of the problems in Quality Assurance engineering may shift from "the automated test is flaky/outdated" to "the AI-decided automated test result is not trustworthy". &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;When using RAG together with UI-Tars, it works stably with 7B-SFT but doesn't work well (or at all) with 7B-DPO, so more effort is required to make the solution scalable. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Applying this AI Agent as a virtual QA engineer in an existing SDLC faces challenges in terms of when and how, and of result stability, although it is already quite stable for system-1 reasoning and partially stable for system-2 reasoning.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If one of your goals is high test-execution velocity, it may not work very well for complicated web applications, such as some SaaS applications, when too much System-2 reasoning is used: UI-Tars has its own short-term and long-term memory, and jumping to an "incorrect" cache will lead the result to the Moon...&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>web</category>
      <category>testing</category>
      <category>ai</category>
      <category>uitars</category>
    </item>
    <item>
      <title>Data - UI-Tars VS GPT-4o in Midscene (Part 2)</title>
      <dc:creator>Zhaopeng Xuan</dc:creator>
      <pubDate>Mon, 24 Feb 2025 18:32:36 +0000</pubDate>
      <link>https://dev.to/robin_xuan_nl/practical-applications-of-ai-in-test-automation-ui-tars-vs-gpt-4opart-2-in-midscene-3cci</link>
      <guid>https://dev.to/robin_xuan_nl/practical-applications-of-ai-in-test-automation-ui-tars-vs-gpt-4opart-2-in-midscene-3cci</guid>
      <description>&lt;p&gt;&lt;strong&gt;Articles in this series:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dev.to/robin_xuan_nl/practical-applications-of-ai-in-test-automation-context-demo-with-ui-tars-llm-midscene-part-1-5dbh"&gt;Part 1 - Practical Applications of AI in Test Automation — Context, Demo with UI-Tars LLM &amp;amp; Midscene&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/robin_xuan_nl/practical-applications-of-ai-in-test-automation-ui-tars-vs-gpt-4opart-2-in-midscene-3cci"&gt;Part 2 - Data: UI-Tars VS GPT-4o in Midscene&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/robin_xuan_nl/stage-conclusion-ui-tars-rag-a-new-approach-for-automated-e2e-exploratory-testing-part-3-p65"&gt;Part 3 - Stage Conclusion: UI-Tars + RAG = Stage Conclusion: UI-Tars + RAG = E2E/Exploratory AI Test Agent&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Reading Part 1 first helps with the context for the data below.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is part of a series. In &lt;a href="https://dev.to/robin_xuan_nl/practical-applications-of-ai-in-test-automation-context-demo-with-ui-tars-llm-midscene-part-1-5dbh"&gt;Part 1&lt;/a&gt;, we covered what the &lt;code&gt;UI-Tars&lt;/code&gt; LLM is and how &lt;code&gt;Midscene&lt;/code&gt; orchestrates it. In this article, I want to dig deeper into comparing &lt;code&gt;UI-Tars&lt;/code&gt; and &lt;code&gt;GPT-4o&lt;/code&gt; in AI-powered automation testing, using &lt;code&gt;Midscene&lt;/code&gt;, to identify the differences, pros, and cons.&lt;/p&gt;

&lt;p&gt;As of February 2025, &lt;code&gt;Midscene&lt;/code&gt; supports &lt;code&gt;gpt-4o&lt;/code&gt;, &lt;code&gt;qwen-2.5VL&lt;/code&gt;, and &lt;code&gt;ui-tars&lt;/code&gt; by default.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Comparing GPT-4o and UI-Tars results with the same test case as input
&lt;/h2&gt;

&lt;p&gt;We will analyze and compare different kinds of test steps. &lt;/p&gt;

&lt;h3&gt;
  
  
  1.1 Step 1 - AI Assertion
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;aiWaitFor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;The country selection popup is visible&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;GPT-4o&lt;/th&gt;
&lt;th&gt;UI-Tars:7B&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;return_format&lt;/td&gt;
&lt;td&gt;JSON object&lt;/td&gt;
&lt;td&gt;Plain text (a JSON string)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Steps&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;duration&lt;/td&gt;
&lt;td&gt;3.09s&lt;/td&gt;
&lt;td&gt;2.27s 👍&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;cost&lt;/td&gt;
&lt;td&gt;0.0035775$&lt;/td&gt;
&lt;td&gt;0.0011$ 👍&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;token&lt;/td&gt;
&lt;td&gt;1390&lt;/td&gt;
&lt;td&gt;2924&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;temperature&lt;/td&gt;
&lt;td&gt;0.1&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Message&lt;/td&gt;
&lt;td&gt;Browser Screenshot, Instructions&lt;/td&gt;
&lt;td&gt;Browser Screenshot, Instructions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;result&lt;/td&gt;
&lt;td&gt;{"pass":true, "thought":"The screenshot clearly shows a country selection popup with a list of countries, confirming the assertion." }&lt;/td&gt;
&lt;td&gt;{   "pass": true,   "thought": "The country selection popup is visible on the screen, as indicated by the title 'Where do you live?' and the list of countries with their respective flags." }&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
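&lt;p&gt;As the table shows, GPT-4o replies with a parsed JSON object while UI-Tars replies with plain text that is itself a JSON string, so an orchestrator has to normalize the two formats before acting on the assertion result. A minimal sketch (the helper name is hypothetical, not Midscene's actual API):&lt;/p&gt;

```typescript
// Hypothetical helper to normalize an assertion reply from either model.
// Assumption: GPT-4o returns an already-parsed JSON object, while
// UI-Tars returns plain text that still needs JSON.parse.
interface AssertionResult {
  pass: boolean;
  thought: string;
}

function normalizeAssertion(reply: unknown): AssertionResult {
  if (typeof reply === "string") {
    // UI-Tars case: plain text containing a JSON string
    return JSON.parse(reply) as AssertionResult;
  }
  // GPT-4o case: already a JSON object
  return reply as AssertionResult;
}

// UI-Tars-style reply, shortened from the table above
const uiTarsReply = '{"pass": true, "thought": "The country selection popup is visible."}';
const result = normalizeAssertion(uiTarsReply);
```

&lt;p&gt;Either reply shape now yields the same &lt;code&gt;{ pass, thought }&lt;/code&gt; structure for the test runner to act on.&lt;/p&gt;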

&lt;h3&gt;
  
  
  1.2 Step 2 - Simple Action Step
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;    &lt;span class="nx"&gt;Select&lt;/span&gt; &lt;span class="nx"&gt;France&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nx"&gt;the&lt;/span&gt; &lt;span class="nx"&gt;country&lt;/span&gt; &lt;span class="nx"&gt;that&lt;/span&gt; &lt;span class="nx"&gt;I&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;m living
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;GPT-4o&lt;/th&gt;
&lt;th&gt;UI-Tars:7B&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;return_format&lt;/td&gt;
&lt;td&gt;JSON object&lt;/td&gt;
&lt;td&gt;Plain text, formatted&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Steps&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;duration&lt;/td&gt;
&lt;td&gt;3.22s 👍&lt;/td&gt;
&lt;td&gt;4.54s (2.54s + 1.99s)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;cost&lt;/td&gt;
&lt;td&gt;0.0109475$&lt;/td&gt;
&lt;td&gt;0.0023$ 👍&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;token&lt;/td&gt;
&lt;td&gt;4184&lt;/td&gt;
&lt;td&gt;14987 (5110 + 9877)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;temperature&lt;/td&gt;
&lt;td&gt;0.1&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Message&lt;/td&gt;
&lt;td&gt;Screenshot, Instructions, part of HTML Tree&lt;/td&gt;
&lt;td&gt;Screenshot, Instructions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;result&lt;/td&gt;
&lt;td&gt;{    "actions":[       {          "thought":"..",          "type":"Tap",          "param":null,          "locate":{             "id":"cngod",             "prompt":"..."          }       }    ],    "finish":true,    "log":"...",    "error":null }&lt;/td&gt;
&lt;td&gt;Iterated 2 steps; each LLM call returns "Thought: ... Action: click(...)"&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;To click "France" in the country popup,  GPT-4o only requires 1 LLM call due to GPT-4o returning the &lt;code&gt;action&lt;/code&gt; and &lt;code&gt;finish&lt;/code&gt; in one reply. But UI-Tars requires 2 LLM calls based on its reasoning, the first call returns the action, and then after the &lt;code&gt;action&lt;/code&gt; is executed, the 2nd LLM call sends a screenshot again to *&lt;em&gt;check whether the "user's instruction" is finished. *&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;
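&lt;p&gt;The act-then-verify behavior described above can be sketched as a small loop: each iteration is one LLM call that either proposes the next GUI action or, after re-inspecting a screenshot, declares the instruction finished. The names below are illustrative, not Midscene's real API:&lt;/p&gt;

```typescript
// Illustrative sketch of UI-Tars' act-then-verify loop (hypothetical
// names, not Midscene's real API). Each loop iteration is one LLM call.
type TarsReply = { thought: string; action: string };

function runInstruction(
  replies: TarsReply[],              // scripted model replies, one per LLM call
  execute: (action: string) => void  // browser-side executor (e.g. a click)
): number {
  let llmCalls = 0;
  for (const reply of replies) {
    llmCalls = llmCalls + 1;
    if (reply.action === "finished()") {
      break;                         // final check passed: instruction achieved
    }
    execute(reply.action);           // run the action, then loop with a fresh screenshot
  }
  return llmCalls;
}

// Clicking "France": one action call plus one final-check call = 2 LLM calls
const executed: string[] = [];
const calls = runInstruction(
  [
    { thought: "I should click France", action: "click(France)" },
    { thought: "France is selected", action: "finished()" },
  ],
  (a) => executed.push(a)
);
```

&lt;p&gt;The extra iteration is exactly the final-check call that doubles UI-Tars' call count for simple steps.&lt;/p&gt;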

&lt;h3&gt;
  
  
  1.3 Step 3 - Complicated Step
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nx"&gt;Click&lt;/span&gt; &lt;span class="nx"&gt;Search&lt;/span&gt; &lt;span class="nx"&gt;bar&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;then&lt;/span&gt; &lt;span class="nx"&gt;Search&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Chanel&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;and&lt;/span&gt; &lt;span class="nx"&gt;press&lt;/span&gt; &lt;span class="nx"&gt;the&lt;/span&gt; &lt;span class="nx"&gt;Enter&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;GPT-4o&lt;/th&gt;
&lt;th&gt;UI-Tars:7B&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;return_format&lt;/td&gt;
&lt;td&gt;JSON object&lt;/td&gt;
&lt;td&gt;Plain text, formatted&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Steps&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;duration&lt;/td&gt;
&lt;td&gt;3.52s 👍&lt;/td&gt;
&lt;td&gt;12.16s (4s + 2.77s + 2.36s + 3.03s)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;cost&lt;/td&gt;
&lt;td&gt;0.01268$&lt;/td&gt;
&lt;td&gt;0.00608$ 👍&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;token&lt;/td&gt;
&lt;td&gt;4646&lt;/td&gt;
&lt;td&gt;49123&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;temperature&lt;/td&gt;
&lt;td&gt;0.1&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Message&lt;/td&gt;
&lt;td&gt;Screenshot, Instructions, part of HTML Tree&lt;/td&gt;
&lt;td&gt;Screenshot, Instructions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;result&lt;/td&gt;
&lt;td&gt;Returns 3 actions in one reply {    "actions":[       {...}, {...}    ],    "finish":true,    "log":"...",    "error":null }&lt;/td&gt;
&lt;td&gt;Iterated 4 steps; each LLM call returns "Thought: ... Action: ..."&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;This step is a bit complicated for both models, and both can handle it, but very differently. There is no reasoning in GPT-4o's result: it generated 3 &lt;code&gt;actions&lt;/code&gt; and marked &lt;code&gt;finish&lt;/code&gt; as true before any action started. UI-Tars, on the other hand, is relatively slow due to its reasoning: it performs a single action, reflects on the result, and then decides the next action. In total, it also generates 3 &lt;code&gt;actions&lt;/code&gt;, plus one &lt;code&gt;check&lt;/code&gt; at the end to verify that the current state meets the user's expectation based on the given "user's instruction". &lt;/p&gt;
&lt;/blockquote&gt;
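&lt;p&gt;By contrast, GPT-4o's single reply already contains the whole plan, so the orchestrator just executes the action list with no further LLM calls. A rough sketch, loosely following the &lt;code&gt;actions&lt;/code&gt;/&lt;code&gt;finish&lt;/code&gt; JSON shape shown in the table (any field beyond those two is an assumption for illustration):&lt;/p&gt;

```typescript
// Sketch: GPT-4o returns the whole plan in one reply, so the orchestrator
// executes every action blindly, with no screenshot re-check in between.
interface PlannedAction { type: string; prompt: string }
interface Gpt4oPlan { actions: PlannedAction[]; finish: boolean }

function executePlan(plan: Gpt4oPlan, execute: (a: PlannedAction) => void): number {
  for (const action of plan.actions) {
    execute(action); // no verification between actions
  }
  return 1; // the entire instruction cost a single LLM call
}

const log: string[] = [];
const llmCalls = executePlan(
  {
    actions: [
      { type: "Tap", prompt: "search bar" },
      { type: "Input", prompt: "Chanel" },
      { type: "KeyboardPress", prompt: "Enter" },
    ],
    finish: true, // marked true before anything has actually run
  },
  (a) => log.push(a.type)
);
```

&lt;p&gt;This is why GPT-4o is faster here: three browser actions, one LLM call, but no check that the page ended up in the expected state.&lt;/p&gt;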

&lt;h3&gt;
  
  
  1.4 Step 4 - Scrolling to an unpredictable position
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nx"&gt;Scroll&lt;/span&gt; &lt;span class="nx"&gt;down&lt;/span&gt; &lt;span class="nx"&gt;to&lt;/span&gt; &lt;span class="nx"&gt;the&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="nx"&gt;st&lt;/span&gt; &lt;span class="nx"&gt;product&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;GPT-4o&lt;/th&gt;
&lt;th&gt;UI-Tars:7B&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;return_format&lt;/td&gt;
&lt;td&gt;JSON object&lt;/td&gt;
&lt;td&gt;Plain text, formatted&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Steps&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;duration&lt;/td&gt;
&lt;td&gt;7.58s (3.01s + 4.57s)&lt;/td&gt;
&lt;td&gt;5.47s (3.18+2.29) 👍&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;cost&lt;/td&gt;
&lt;td&gt;0.01268$&lt;/td&gt;
&lt;td&gt;0.002735$ 👍&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;token&lt;/td&gt;
&lt;td&gt;12557&lt;/td&gt;
&lt;td&gt;14994&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;temperature&lt;/td&gt;
&lt;td&gt;0.1&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Message&lt;/td&gt;
&lt;td&gt;Screenshot, Instructions, part of HTML Tree&lt;/td&gt;
&lt;td&gt;Screenshot, Instructions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;result&lt;/td&gt;
&lt;td&gt;{    "actions":[       {...}    ],    "finish":true,    "log":"...",    "error":null }&lt;/td&gt;
&lt;td&gt;Iterated 2 steps; each LLM call returns "Thought: ... Action: ..."&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;Wow! Because GPT-4o does no reasoning, scrolling to an unknown position requires 2 GPT-4o calls: the 1st call generates a &lt;code&gt;scrolling action&lt;/code&gt;, and the 2nd call &lt;code&gt;checks&lt;/code&gt; the screenshot and decides that no further scrolling is needed. &lt;br&gt;
UI-Tars, by contrast, just works as usual: the 1st call decides the action, and the 2nd validates the new screenshot after the browser action is executed.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  1.5 “Vision” Comparison
&lt;/h3&gt;

&lt;p&gt;UI-Tars and GPT-4o each have their own way of "seeing" the page. &lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;GPT-4o&lt;/th&gt;
&lt;th&gt;UI-Tars:7B&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Vision&lt;/td&gt;
&lt;td&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4gspeqgahkswa588qe7j.png" alt="Image description" width="800" height="450"&gt;&lt;/td&gt;
&lt;td&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqpgzs3ht1bpzkur6kcm3.png" alt="Image description" width="800" height="468"&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;GPT-4o&lt;/code&gt; is unable to recognize some small or overlapping elements, whereas UI-Tars:7B achieves almost complete recognition. Unfortunately, &lt;code&gt;gpt-4o&lt;/code&gt; cannot even identify the "login" button in the top banner.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  2. Summary
&lt;/h2&gt;

&lt;h3&gt;
  
  
  2.1 Ability to handle complicated cases
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;GPT-4o:&lt;/strong&gt; Because it has no reasoning (no system-2 reasoning), &lt;code&gt;gpt-4o&lt;/code&gt; can handle straightforward test steps, but for very general steps such as &lt;code&gt;i login&lt;/code&gt;, it essentially cannot handle the case at all. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;UI-Tars:7B:&lt;/strong&gt; Because it has reasoning, it can support up to 15 steps, meaning 14 actions + 1 final check, which is its reasoning limit. &lt;/p&gt;

&lt;h3&gt;
  
  
  2.2 Speed
&lt;/h3&gt;

&lt;p&gt;In most cases, you have to double the LLM calls when using &lt;code&gt;UI-Tars&lt;/code&gt; compared with &lt;code&gt;GPT-4o&lt;/code&gt;, because &lt;code&gt;UI-Tars&lt;/code&gt; requires a final check for each "user instruction". &lt;/p&gt;

&lt;p&gt;But &lt;code&gt;gpt-4o&lt;/code&gt; in most cases doesn't require validation for "user instructions" after it generates actions. &lt;/p&gt;

&lt;p&gt;So &lt;code&gt;gpt-4o&lt;/code&gt; is faster, but this can be risky for the reliability of the test result.&lt;/p&gt;
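&lt;p&gt;From the measurements above, a rough rule of thumb emerges: UI-Tars needs one LLM call per action plus one final check, while GPT-4o usually plans everything in a single call (the scrolling step in 1.4 was the exception). A quick sanity check of that rule against the "Steps" rows in the tables:&lt;/p&gt;

```typescript
// Rough call-count model inferred from the data above (an assumption,
// not a documented Midscene formula): one call per action + 1 final check.
function uiTarsCalls(actions: number): number {
  return actions + 1;
}

const simpleStep = uiTarsCalls(1);      // step 1.2: "Select France"
const complicatedStep = uiTarsCalls(3); // step 1.3: click bar, type, press Enter
```

&lt;p&gt;These predict 2 and 4 calls, which matches the "Steps" values reported for UI-Tars in sections 1.2 and 1.3.&lt;/p&gt;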

&lt;h3&gt;
  
  
  2.3 Level of Trust
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;UI-Tars&lt;/code&gt; always checks whether the previously planned action achieved the "user instruction", so &lt;code&gt;UI-Tars&lt;/code&gt; trades time for accuracy: it only marks &lt;code&gt;finish=true&lt;/code&gt; after checking the screenshot again following the action. &lt;code&gt;gpt-4o&lt;/code&gt;, however, directly returns &lt;code&gt;finish=true&lt;/code&gt; even before the generated actions are executed... &lt;/p&gt;

&lt;h3&gt;
  
  
  2.4 Perception of GUI Element
&lt;/h3&gt;

&lt;p&gt;From Section 1.5, we can clearly see that &lt;code&gt;UI-Tars&lt;/code&gt; can identify even small and overlapping elements on the page, while &lt;code&gt;gpt-4o&lt;/code&gt; failed to do so. &lt;/p&gt;

&lt;h3&gt;
  
  
  2.5 Additional input for LLM
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;UI-Tars&lt;/code&gt; makes decisions based only on the screenshot (mimicking human vision), but &lt;code&gt;gpt-4o&lt;/code&gt; additionally requires building a partial HTML tree, which may slow &lt;code&gt;gpt-4o&lt;/code&gt; down as the size of the HTML grows.&lt;/p&gt;

&lt;h3&gt;
  
  
  2.6 Costs
&lt;/h3&gt;

&lt;p&gt;If we deploy &lt;code&gt;UI-Tars&lt;/code&gt; on our own infrastructure, we can save 50% - 75% of the cost while achieving the same result as &lt;code&gt;gpt-4o&lt;/code&gt;, or an even better one. &lt;/p&gt;
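&lt;p&gt;The 50% - 75% figure can be sanity-checked against the per-step costs measured in Section 1 (GPT-4o vs. self-hosted UI-Tars:7B); the computed fractions land roughly in that range, with two steps slightly above it:&lt;/p&gt;

```typescript
// Fraction of cost saved by UI-Tars relative to GPT-4o, using the
// per-step dollar costs from the tables in Section 1.
function savings(gpt4oCost: number, uiTarsCost: number): number {
  return (gpt4oCost - uiTarsCost) / gpt4oCost;
}

const assertionStep = savings(0.0035775, 0.0011);   // step 1.1, ~69%
const simpleAction = savings(0.0109475, 0.0023);    // step 1.2, ~79%
const complicatedStep = savings(0.01268, 0.00608);  // step 1.3, ~52%
const scrollingStep = savings(0.01268, 0.002735);   // step 1.4, ~78%
```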

&lt;h2&gt;
  
  
  3. Conclusions
&lt;/h2&gt;

&lt;p&gt;Overall, if we plan to apply AI in real day-to-day work, I believe &lt;code&gt;UI-Tars&lt;/code&gt; can do a better job than &lt;code&gt;gpt-4o&lt;/code&gt; in the above context. &lt;/p&gt;

&lt;p&gt;However, speeding up &lt;code&gt;UI-Tars&lt;/code&gt;'s reasoning will be one of the challenges in the near future.  &lt;/p&gt;

</description>
      <category>webdev</category>
      <category>ai</category>
      <category>testing</category>
      <category>uitars</category>
    </item>
    <item>
      <title>Practical Applications of AI in Test Automation - Context, UI-Tars LLM , Midscene (Part 1)</title>
      <dc:creator>Zhaopeng Xuan</dc:creator>
      <pubDate>Mon, 24 Feb 2025 12:30:23 +0000</pubDate>
      <link>https://dev.to/robin_xuan_nl/practical-applications-of-ai-in-test-automation-context-demo-with-ui-tars-llm-midscene-part-1-5dbh</link>
      <guid>https://dev.to/robin_xuan_nl/practical-applications-of-ai-in-test-automation-context-demo-with-ui-tars-llm-midscene-part-1-5dbh</guid>
      <description>&lt;p&gt;&lt;strong&gt;Articles in this series:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dev.to/robin_xuan_nl/practical-applications-of-ai-in-test-automation-context-demo-with-ui-tars-llm-midscene-part-1-5dbh"&gt;Part 1 - Practical Applications of AI in Test Automation — Context, Demo with UI-Tars LLM &amp;amp; Midscene&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/robin_xuan_nl/practical-applications-of-ai-in-test-automation-ui-tars-vs-gpt-4opart-2-in-midscene-3cci"&gt;Part 2 - Data: UI-Tars VS GPT-4o in Midscene&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/robin_xuan_nl/stage-conclusion-ui-tars-rag-a-new-approach-for-automated-e2e-exploratory-testing-part-3-p65"&gt;Part 3 - Stage Conclusion: UI-Tars + RAG = Stage Conclusion: UI-Tars + RAG = E2E/Exploratory AI Test Agent&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This series of articles will provide a clear and practical guide to AI applications in end-to-end test automation. I will use AI to verify a product's end-to-end functionality, ensuring that it meets the required specifications. &lt;/p&gt;

&lt;p&gt;Let's watch a demo first 👀👀, and then I will elaborate on how it works after.&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/-JHyjn6EXvg"&gt;
&lt;/iframe&gt;
&lt;br&gt;
(The video is not accelerated. I use Vinted.com as an example because I heard about it very frequently from my wife...)&lt;/p&gt;

&lt;p&gt;I tell the AI Agent that it must open the home page, search for a product, then open a product detail page, and check the price on the Product page.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The video above demonstrates how the AI Agent perceives the process—autonomously interpreting business-oriented test cases, evaluating the webpage's current state(screenshot), making plans and decisions, and executing the test. It engages in multi-step decision-making, leveraging various types of reasoning to achieve its goal.&lt;/strong&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  1. Reviewing the Role of End-to-End Testing
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F050wh8x3lm77ect9b7dr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F050wh8x3lm77ect9b7dr.png" alt="Image description" width="800" height="421"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It is crucial to emphasize once again that:&lt;/strong&gt; the primary goal of end-to-end testing is to validate that new features and regression functionality match the product requirements and design by simulating the customers' behavior.&lt;/p&gt;

&lt;p&gt;End-to-end testing is a testing approach widely used in regression validation. It can be performed either manually (such as by writing a sanity checklist and executing tests by hand) or through automation by writing test scripts using tools like Playwright or Appium.&lt;/p&gt;

&lt;p&gt;The three key aspects of end-to-end testing are described in the above figure:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Understand How Users Benefit from the Feature&lt;/strong&gt; – Identify the value the feature brings to users.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Design Test Cases from the User's Perspective&lt;/strong&gt; – Create test scenarios that align with real user interactions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Iterate Test Execution During Development&lt;/strong&gt; – Continuously run test cases throughout the development process to verify that the implemented code meets the required functionality.&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  2. In Action – Executing Your Test Cases with an AI Agent
&lt;/h2&gt;

&lt;p&gt;In traditional end-to-end test automation, the typical approach is as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Analyze the functionality&lt;/strong&gt; – Understand the feature and its expected behavior.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Analyze and write test cases&lt;/strong&gt; – Define test scenarios based on user interactions and requirements.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Write automation scripts&lt;/strong&gt; – Implement test cases using automation frameworks.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When writing automated test cases, we usually create Page Object-like classes to represent the HTML tree, allowing the test script to interact with or retrieve elements efficiently. &lt;/p&gt;

&lt;p&gt;Now, let's see how an AI Agent can optimize this process. &lt;/p&gt;
&lt;h3&gt;
  
  
  2.1 - A multi-decision AI Agent can optimize the test process
&lt;/h3&gt;

&lt;p&gt;From the above video, this is the test case that the AI Agent reads:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight gherkin"&gt;&lt;code&gt;&lt;span class="kn"&gt;Scenario&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; Customer can search and open product detail page from a search result

&lt;span class="err"&gt;Go to https&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="err"&gt;//www.vinted.com&lt;/span&gt;
&lt;span class="err"&gt;The&lt;/span&gt; &lt;span class="err"&gt;country&lt;/span&gt; &lt;span class="err"&gt;selection&lt;/span&gt; &lt;span class="err"&gt;popup&lt;/span&gt; &lt;span class="err"&gt;is&lt;/span&gt; &lt;span class="err"&gt;visible&lt;/span&gt;
&lt;span class="err"&gt;Select&lt;/span&gt; &lt;span class="err"&gt;France&lt;/span&gt; &lt;span class="err"&gt;as&lt;/span&gt; &lt;span class="err"&gt;the&lt;/span&gt; &lt;span class="err"&gt;country&lt;/span&gt; &lt;span class="err"&gt;that&lt;/span&gt; &lt;span class="err"&gt;I'm&lt;/span&gt; &lt;span class="err"&gt;living&lt;/span&gt;
&lt;span class="err"&gt;Accept&lt;/span&gt; &lt;span class="err"&gt;all&lt;/span&gt; &lt;span class="err"&gt;privacy&lt;/span&gt; &lt;span class="err"&gt;preferences&lt;/span&gt;
&lt;span class="err"&gt;Search&lt;/span&gt; &lt;span class="err"&gt;'Chanel',&lt;/span&gt; &lt;span class="err"&gt;and&lt;/span&gt; &lt;span class="err"&gt;press&lt;/span&gt; &lt;span class="err"&gt;the&lt;/span&gt; &lt;span class="err"&gt;Enter&lt;/span&gt;
&lt;span class="err"&gt;Scroll&lt;/span&gt; &lt;span class="err"&gt;down&lt;/span&gt; &lt;span class="err"&gt;to&lt;/span&gt; &lt;span class="err"&gt;the&lt;/span&gt; &lt;span class="err"&gt;1st&lt;/span&gt; &lt;span class="err"&gt;product&lt;/span&gt;
&lt;span class="err"&gt;Click&lt;/span&gt; &lt;span class="err"&gt;the&lt;/span&gt; &lt;span class="err"&gt;2nd&lt;/span&gt; &lt;span class="err"&gt;product&lt;/span&gt; &lt;span class="err"&gt;from&lt;/span&gt; &lt;span class="err"&gt;the&lt;/span&gt; &lt;span class="err"&gt;1st&lt;/span&gt; &lt;span class="err"&gt;row&lt;/span&gt; &lt;span class="err"&gt;in&lt;/span&gt; &lt;span class="err"&gt;the&lt;/span&gt; &lt;span class="err"&gt;product&lt;/span&gt; &lt;span class="err"&gt;list&lt;/span&gt;
&lt;span class="err"&gt;Assert&lt;/span&gt; &lt;span class="err"&gt;Price&lt;/span&gt; &lt;span class="err"&gt;is&lt;/span&gt; &lt;span class="err"&gt;visible&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Thus, the AI Agent actually reads business-oriented language, instead of low-level commands like "Click A" / "Type B". The AI Agent itself reasons about and plans the steps.&lt;/p&gt;

&lt;p&gt;To run this test, it requires the following hardware:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Nvidia L40s - 1 x GPU, 48GB GPU Memory, 7 x vCPU, 40GB CPU memory&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2.2 - What problems were solved by this AI Agent
&lt;/h3&gt;

&lt;p&gt;Reflecting on what we mentioned in Section 1, end-to-end testing can be performed in two ways: by writing automated test scripts or executing tests manually.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;With the introduction of an AI Agent, a new approach emerges—simply providing your test cases to the AI Agent without writing test scripts. The AI Agent then replaces manual execution by autonomously carrying out the test cases.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Specifically, it addresses the following problems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Reduces Manual Testing Costs&lt;/strong&gt; – The AI Agent can interpret test cases written by anyone, eliminating the need to write test scripts and allowing tests to be executed at any time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lowers Test Script Maintenance Effort&lt;/strong&gt; – The AI Agent autonomously determines the next browser action, reducing the need to modify tests for minor UI changes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Increases Accessibility and Participation&lt;/strong&gt; – Shifting from traditional QA engineers writing automation scripts to a decentralized model where developers contribute, and now to a stage where anyone proficient in English and familiar with the product can write end-to-end test cases.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2.3 - Embedded into Playwright
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;
  &lt;span class="nf"&gt;test&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;[UI-Tars - Business]a user can search then view a product&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;ai&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;aiAssert&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;aiWaitFor&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;goto&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;https://www.vinted.com&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;aiWaitFor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;The country selection popup is visible&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;ai&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Select France as the country that I'm living&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;waitForURL&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;URL&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;hostname&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;indexOf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;vinted.fr&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;ai&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Accept all privacy preferences&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;ai&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Click Search bar, then Search 'Chanel', and press the Enter&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;ai&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Scroll down to the 1st product&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;ai&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Click the 2nd product from the 1st row in the product list&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;url&lt;/span&gt;&lt;span class="p"&gt;()).&lt;/span&gt;&lt;span class="nf"&gt;toContain&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;/items/&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;aiAssert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Price is visible&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;})&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  3. Introducing the UI-Tars LLM and Midscene
&lt;/h2&gt;

&lt;h3&gt;
  
  
  3.1 UI-Tars
&lt;/h3&gt;

&lt;p&gt;UI-Tars is a native, open-source multimodal GUI LLM, rebuilt on top of qwen-2.5-VL (通义千问2.5 VL). The model can process both &lt;strong&gt;text&lt;/strong&gt; and &lt;strong&gt;GUI images&lt;/strong&gt; simultaneously, and it is provided in two variants, SFT and DPO, trained on a huge amount of GUI screenshots. UI-Tars is specifically designed for interacting with GUIs. &lt;/p&gt;

&lt;p&gt;It performs well in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Browser applications&lt;/li&gt;
&lt;li&gt;Desktop and desktop applications&lt;/li&gt;
&lt;li&gt;Mobile and mobile applications&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It supports prompts in 2 languages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Chinese&lt;/li&gt;
&lt;li&gt;English&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;More details - please read from the &lt;a href="https://github.com/bytedance/UI-TARS/blob/main/UI_TARS_paper.pdf" rel="noopener noreferrer"&gt;Paper&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  3.2 Midscene
&lt;/h3&gt;

&lt;p&gt;Midscene is a state machine: it builds a multi-step-reasoning AI Agent with the provided models. &lt;br&gt;
It supports the following LLMs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;UI-Tars (the main branch doesn't support &lt;code&gt;aiAssert&lt;/code&gt;, &lt;code&gt;aiQuery&lt;/code&gt;, and &lt;code&gt;aiWaitFor&lt;/code&gt;, but you can check my branch)&lt;/li&gt;
&lt;li&gt;Qwen-2.5 VL (通义千问2.5, I really love this name...)&lt;/li&gt;
&lt;li&gt;GPT-4o&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  4 The mechanism between UI-Tars &amp;amp; Midscene
&lt;/h2&gt;

&lt;h3&gt;
  
  
  4.1 Orchestrations and Comparisons
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8d6jid8shah8hououbso.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8d6jid8shah8hououbso.png" alt="Image description" width="800" height="304"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Its core capability is to plan, reason, and execute multiple steps autonomously, just like a human, based on both visual input and instructions—continuing until it determines that the task is complete.&lt;/p&gt;

&lt;p&gt;It possesses 3 key abilities:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multi-Step Planning Across Platforms – Given an instruction, it can plan multiple actions across web browsers, desktop, or mobile applications.&lt;/li&gt;
&lt;li&gt;Tool Utilization for Execution – It can leverage external tools to carry out the planned actions.&lt;/li&gt;
&lt;li&gt;Autonomous Reasoning &amp;amp; Adaptation – It can determine whether the task is complete or take additional actions if necessary.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I compared the most popular solutions on the market as of the end of February 2025: &lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Solutions&lt;/th&gt;
&lt;th&gt;Is it an AI agent?&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;th&gt;Additional input to the LLM&lt;/th&gt;
&lt;th&gt;How elements are located&lt;/th&gt;
&lt;th&gt;Multiple-step decisions &amp;amp; autonomous reasoning&lt;/th&gt;
&lt;th&gt;Playwright integration&lt;/th&gt;
&lt;th&gt;Mobile app support&lt;/th&gt;
&lt;th&gt;Desktop app support&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;UI-Tars(/GPT-4o) + Midscene&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;$1.80/h with UI-Tars-7B, or ~$0.1263875/test with OpenAI GPT-4o&lt;/td&gt;
&lt;td&gt;GUI Screenshot&lt;/td&gt;
&lt;td&gt;GUI Screenshot Processing&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Llama 3.2 + Bound Tools + LangGraph&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;&lt;a href="https://medium.com/@abhyankarharshal22/dynamic-browser-automation-with-langchain-agent-and-playwright-tools-fill-tool-implementation-5a4953d514ac" rel="noopener noreferrer"&gt;0.2$ / tests&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;HTML&lt;/td&gt;
&lt;td&gt;HTML DOM processing&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Not yet&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ZeroStep / auto-playwright&lt;/td&gt;
&lt;td&gt;Kind of&lt;/td&gt;
&lt;td&gt;Unknown&lt;/td&gt;
&lt;td&gt;HTML&lt;/td&gt;
&lt;td&gt;HTML DOM processing&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Unknown&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;StageHand(GPT-4o or Claude 3.5)&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Unknown&lt;/td&gt;
&lt;td&gt;HTML &amp;amp; GUI Screenshot&lt;/td&gt;
&lt;td&gt;HTML DOM Processing&lt;/td&gt;
&lt;td&gt;Not yet&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Not yet&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;To summarise: UI-Tars (or GPT-4o) combined with Midscene appears to be the most applicable and cheapest approach. &lt;/p&gt;

&lt;h3&gt;
  
  
  4.2 Multiple-Step decisions and reasoning
&lt;/h3&gt;

&lt;p&gt;Let's have a look at an actual step from the above example: &lt;code&gt;Search 'Chanel', and press Enter&lt;/code&gt;. &lt;/p&gt;

&lt;h4&gt;
  
  
  4.2.1 Midscene sends a system message to UI-Tars
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feaqpz93b4hmlt6fvfvbu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feaqpz93b4hmlt6fvfvbu.png" alt="Image description" width="800" height="487"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqunt89fz0nm7eim2zw56.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqunt89fz0nm7eim2zw56.png" alt="Image description" width="800" height="467"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Midscene sends the test step as part of the System Message to the LLM, together with the current screenshot. &lt;/p&gt;
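&lt;p&gt;The payload for one reasoning turn can be sketched roughly as follows. The field names follow the common OpenAI-style chat shape and are illustrative, not Midscene's actual internals: the user step travels in the system message, and the screenshot rides along as an image part of the user message.&lt;/p&gt;

```typescript
// Illustrative sketch of the per-turn payload Midscene assembles: the user
// step goes into the system message, the screenshot into the user message.
// Field names follow the common OpenAI-style chat shape, not Midscene's
// actual internals.
type MessagePart =
  | { type: "text"; text: string }
  | { type: "image_url"; url: string };

interface ChatMessage {
  role: "system" | "user";
  content: MessagePart[];
}

function buildTurnMessages(userStep: string, screenshotUrl: string): ChatMessage[] {
  return [
    {
      role: "system",
      content: [{ type: "text", text: "You are a GUI agent. Target step: " + userStep }],
    },
    { role: "user", content: [{ type: "image_url", url: screenshotUrl }] },
  ];
}
```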

&lt;h4&gt;
  
  
  4.2.2 UI-Tars returns the Thought and Action
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6kkk551qfyrwfmbdzq70.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6kkk551qfyrwfmbdzq70.png" alt="Image description" width="800" height="156"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Because this "User step" requires multiple browser actions, like identifying where is the search bar, then click the search bar, then type "channel", and pressing "Enter" at the end. Thus UI-Tars make 1st decision to "click the search bar".&lt;/p&gt;

&lt;h4&gt;
  
  
  4.2.3 UI-Tars iteratively reasons about and plans the next browser actions for the same user step
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffo4zojwhqhogk3enixug.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffo4zojwhqhogk3enixug.png" alt="Image description" width="800" height="151"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu6vc1a432vo6htesdqb4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu6vc1a432vo6htesdqb4.png" alt="Image description" width="800" height="155"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Midscene currently takes a fresh screenshot before each reasoning turn, so UI-Tars always sees the latest state of the browser. In addition, Midscene currently sends the entire chat history back to UI-Tars on every reasoning call. &lt;/p&gt;
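&lt;p&gt;That "fresh screenshot + full history on every call" behaviour can be sketched as below; the shapes are hypothetical, not Midscene's internals.&lt;/p&gt;

```typescript
// Sketch of the "fresh screenshot + full history on every call" behaviour:
// the log replays everything exchanged so far, then appends the newest
// screenshot, so the model always sees the latest browser state.
type Turn = { screenshot: string; reply: string };

class ConversationLog {
  private turns: Turn[] = [];

  addTurn(screenshot: string, reply: string): void {
    this.turns.push({ screenshot, reply });
  }

  // Everything exchanged so far is replayed to the model, followed by a
  // freshly captured screenshot.
  payloadForNextCall(latestScreenshot: string): string[] {
    const history = this.turns.flatMap(t => [t.screenshot, t.reply]);
    return [...history, latestScreenshot];
  }
}
```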

&lt;h4&gt;
  
  
  4.2.4 UI-Tars autonomously checks whether the user step is achieved
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkmgsjd7eliatcrzdrc2t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkmgsjd7eliatcrzdrc2t.png" alt="Image description" width="800" height="139"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Code
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/xuanzhaopeng/midscene-playwright-uitars" rel="noopener noreferrer"&gt;PoC &amp;amp; example tests&lt;/a&gt; &lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/web-infra-dev/midscene/pull/412" rel="noopener noreferrer"&gt;Expand MidScene to support UI-Tars&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  6. The cost for this PoC
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fol64ke66cfz4t7m3qemn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fol64ke66cfz4t7m3qemn.png" alt="Image description" width="800" height="105"&gt;&lt;/a&gt;&lt;br&gt;
(UI-Tars costs)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi9g874k7moiasgpdbcss.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi9g874k7moiasgpdbcss.png" alt="Image description" width="800" height="445"&gt;&lt;/a&gt;&lt;br&gt;
(GPT-4o costs)&lt;/p&gt;
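&lt;p&gt;A quick back-of-envelope comparison of the two pricing models is shown below. Only the $1.8/h GPU figure and the ~$0.1263875/test GPT-4o figure come from the screenshots above; the test run-time and token prices in the example call are hypothetical inputs for illustration.&lt;/p&gt;

```typescript
// Back-of-envelope cost comparison. Only the $1.8/h GPU figure and the
// ~$0.1263875/test GPT-4o figure come from the article; the run-time and
// token prices below are hypothetical inputs for illustration.
function selfHostedCostPerTest(gpuDollarsPerHour: number, testMinutes: number): number {
  return gpuDollarsPerHour * (testMinutes / 60);
}

function apiCostPerTest(
  inputTokens: number,
  outputTokens: number,
  inputPricePerMillion: number,
  outputPricePerMillion: number,
): number {
  return (inputTokens * inputPricePerMillion + outputTokens * outputPricePerMillion) / 1e6;
}

// A 5-minute test on the $1.8/h GPU works out to $0.15:
const hostedCost = selfHostedCostPerTest(1.8, 5); // 0.15
```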

&lt;h2&gt;
  
  
  7. References
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@software{Midscene.js,
  author = {Zhou, Xiao and Yu, Tao},
  title = {Midscene.js: Let AI be your browser operator.},
  year = {2025},
  publisher = {GitHub},
  url = {https://github.com/web-infra-dev/midscene}
}

@article{qin2025ui,
  title={UI-TARS: Pioneering Automated GUI Interaction with Native Agents},
  author={Qin, Yujia and Ye, Yining and Fang, Junjie and Wang, Haoming and Liang, Shihao and Tian, Shizuo and Zhang, Junda and Li, Jiahao and Li, Yunxin and Huang, Shijue and others},
  journal={arXiv preprint arXiv:2501.12326},
  year={2025}
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>webdev</category>
      <category>ai</category>
      <category>testing</category>
      <category>uitars</category>
    </item>
  </channel>
</rss>
