<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Anton Gulin</title>
    <description>The latest articles on DEV Community by Anton Gulin (@aiwithanton).</description>
    <link>https://dev.to/aiwithanton</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3872452%2F17f47297-ddc6-457c-9920-47c0dd1acd1b.png</url>
      <title>DEV Community: Anton Gulin</title>
      <link>https://dev.to/aiwithanton</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/aiwithanton"/>
    <language>en</language>
    <item>
      <title>How to Score Your AI Test Agents: Offline Evaluation with Trajectories (2026)</title>
      <dc:creator>Anton Gulin</dc:creator>
      <pubDate>Sat, 13 Jun 2026 20:37:36 +0000</pubDate>
      <link>https://dev.to/aiwithanton/how-to-score-your-ai-test-agents-offline-evaluation-with-trajectories-2026-dil</link>
      <guid>https://dev.to/aiwithanton/how-to-score-your-ai-test-agents-offline-evaluation-with-trajectories-2026-dil</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdmsavzz8yegf78u3u4zj.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdmsavzz8yegf78u3u4zj.webp" alt=" " width="800" height="420"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AI test agent evaluation&lt;/strong&gt; is the practice of scoring the tests an AI agent writes, instead of trusting that they pass. You record the agent's run as a trajectory (a saved log of every step), replay it offline, and grade each step for correctness and relevance. Offline scoring needs no live API calls, so you can check agent quality on every pull request.&lt;/p&gt;

&lt;p&gt;An AI agent can write 200 tests before lunch. That feels like progress.&lt;/p&gt;

&lt;p&gt;Then a real bug ships, and not one of those tests caught it. The agent was confident, and it was wrong.&lt;/p&gt;

&lt;p&gt;This guide shows how to stop guessing and start scoring. Stagehand 3.5.0 made the method first-class on June 3, 2026, but the pattern works for any agent.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. "It passed" is not a score
&lt;/h2&gt;

&lt;p&gt;A green test suite tells you the tests ran. It does not tell you the tests were right.&lt;/p&gt;

&lt;p&gt;An AI agent makes three mistakes a human reviewer would catch:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It checks the wrong thing. The test passes, but it never asserts the real behavior.&lt;/li&gt;
&lt;li&gt;It writes flaky tests (tests that fail at random). They go green often enough to look fine.&lt;/li&gt;
&lt;li&gt;It tests a happy path and skips the edge case that actually breaks.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You cannot fix what you cannot measure. So the first job is a number, not a vibe.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. Record the run as a trajectory
&lt;/h2&gt;

&lt;p&gt;A trajectory is a saved recording of an agent's run. It captures each step: what the agent saw, what it decided, and what code it produced.&lt;/p&gt;

&lt;p&gt;You capture it once, during the agent's normal run.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Illustrative pattern — confirm the exact Stagehand 3.5 API before use.&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;trajectory&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;task&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;record&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;saveTrajectory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;trajectory&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;runs/checkout-flow.json&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The recording is the receipt. Now you can study the run after it finishes, as many times as you want.&lt;/p&gt;
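
&lt;p&gt;As a rough sketch of what such a recording might hold, the shape below stores one entry per step and round-trips it through JSON. The &lt;code&gt;Step&lt;/code&gt; and &lt;code&gt;Trajectory&lt;/code&gt; types and both helper functions are hypothetical names for illustration, not Stagehand's schema or API.&lt;/p&gt;

```typescript
// Hypothetical trajectory shape: these field names are assumptions for
// illustration, not a real library schema.
type Step = {
  observation: string; // what the agent saw
  decision: string; // what it decided to do
  code: string; // the test code it produced at this step
};

type Trajectory = {
  task: string;
  recordedAt: string;
  steps: Step[];
};

// Serialize a run so it can be stored next to the test suite.
function serializeTrajectory(trajectory: Trajectory): string {
  return JSON.stringify(trajectory, null, 2);
}

// Parse a stored run back into an object for offline grading.
function parseTrajectory(json: string): Trajectory {
  return JSON.parse(json) as Trajectory;
}

const recorded: Trajectory = {
  task: "Verify the checkout flow charges the right total",
  recordedAt: "2026-06-13T20:37:36Z",
  steps: [
    {
      observation: "Cart page shows 2 items",
      decision: "Assert the order total",
      code: "await expect(page.getByTestId('total')).toHaveText('$42.00');",
    },
  ],
};

// The restored object carries the same steps as the original recording.
const restored = parseTrajectory(serializeTrajectory(recorded));
console.log(restored.steps.length);
```

&lt;p&gt;Plain JSON keeps the recording easy to diff and easy to re-grade later, which is the point of keeping it.&lt;/p&gt;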




&lt;h2&gt;
  
  
  3. Replay it offline
&lt;/h2&gt;

&lt;p&gt;Offline means you grade the saved run without calling the live model again. No new API cost. No flaky network. Same input every time.&lt;/p&gt;

&lt;p&gt;This matters for two reasons. It makes scoring cheap, so you can run it on every pull request. It makes scoring repeatable, so two engineers get the same result.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Replay the saved run and score it, with no live API calls.&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;run&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;loadTrajectory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;runs/checkout-flow.json&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;run&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;rubric&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  4. Score each step with evaluation types
&lt;/h2&gt;

&lt;p&gt;A single pass/fail hides too much. Grade the run on a few clear axes instead.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Correctness&lt;/strong&gt;: did the test assert the behavior the task asked for?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Relevance&lt;/strong&gt;: does each step move toward the goal, or wander?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stability&lt;/strong&gt;: would this test pass on a clean re-run, or is it flaky?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Coverage&lt;/strong&gt;: did the agent test the edge case, or only the happy path?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Stagehand 3.5.0 added evaluation types for exactly this kind of offline scoring. You define the rubric once and apply it to every saved run.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Illustrative rubric: the run fields (asserts, steps, reruns, edgeCase) and task.goal are assumed names, not a documented schema.&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;rubric&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;correctness&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;run&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;run&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;asserts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;some&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;a&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;target&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="nx"&gt;task&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;goal&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="na"&gt;relevance&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;   &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;run&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;run&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;steps&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;every&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;s&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;onTask&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="na"&gt;stability&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;   &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;run&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;run&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;reruns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;every&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;passed&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="na"&gt;coverage&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;run&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;run&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;asserts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;some&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;a&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;edgeCase&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A run that scores &lt;code&gt;correctness 7/10, relevance pass, flaky tests 0&lt;/code&gt; is a run you can talk about. "It passed" is not.&lt;/p&gt;
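
&lt;p&gt;To make the grading step concrete, here is a minimal sketch of applying a rubric to one saved run with no model or network call. The &lt;code&gt;SavedRun&lt;/code&gt; shape, the &lt;code&gt;edgeCase&lt;/code&gt; flag, and the &lt;code&gt;scoreRun&lt;/code&gt; helper are assumptions made for this example and are not part of any library. It reads the goal from the run itself so the block stays self-contained.&lt;/p&gt;

```typescript
// Hypothetical saved-run shape; field names are assumptions for this sketch.
type SavedRun = {
  goal: string;
  asserts: { target: string; edgeCase: boolean }[];
  steps: { onTask: boolean }[];
  reruns: { passed: boolean }[];
};

type Check = (run: SavedRun) => boolean;
type Rubric = { [axis: string]: Check };
type Scorecard = { [axis: string]: boolean };

// One check per axis: correctness, relevance, stability, coverage.
const rubric: Rubric = {
  correctness: (run) => run.asserts.some((a) => a.target === run.goal),
  relevance: (run) => run.steps.every((s) => s.onTask),
  stability: (run) => run.reruns.every((r) => r.passed),
  coverage: (run) => run.asserts.some((a) => a.edgeCase),
};

// Apply every check to one saved run and collect the verdict per axis.
function scoreRun(run: SavedRun, checks: Rubric): Scorecard {
  const card: Scorecard = {};
  for (const [axis, check] of Object.entries(checks)) {
    card[axis] = check(run);
  }
  return card;
}

// A run that asserts the right target but never touches an edge case.
const savedRun: SavedRun = {
  goal: "order-total",
  asserts: [{ target: "order-total", edgeCase: false }],
  steps: [{ onTask: true }, { onTask: true }],
  reruns: [{ passed: true }, { passed: true }, { passed: true }],
};

const card = scoreRun(savedRun, rubric);
console.log(card);
```

&lt;p&gt;Because the checks are pure functions over saved data, the same run always produces the same scorecard.&lt;/p&gt;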




&lt;h2&gt;
  
  
  5. Wire the score into CI
&lt;/h2&gt;

&lt;p&gt;A score you read once and forget changes nothing. Turn it into a gate.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# CI step: fail the build if the agent's tests score too low.&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;npx evaluate runs/ --min-correctness 0.8 --max-flaky &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now the agent earns trust the same way a junior engineer does. It ships work, the work gets graded, and only graded work reaches production.&lt;/p&gt;
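
&lt;p&gt;The gate logic itself can be small. The sketch below assumes each saved run has already been graded into a hypothetical &lt;code&gt;RunResult&lt;/code&gt;, and the &lt;code&gt;gate&lt;/code&gt; function and its threshold names are illustrative stand-ins for the two flags in the CI step above, not a real CLI's internals.&lt;/p&gt;

```typescript
// Hypothetical per-run result; the fields are assumptions for this sketch.
type RunResult = { name: string; correct: boolean; flaky: boolean };
type Thresholds = { minCorrectness: number; maxFlaky: number };
type GateResult = { passed: boolean; correctnessRate: number; flakyCount: number };

// Compare aggregate scores against the thresholds and report a verdict.
function gate(results: RunResult[], limits: Thresholds): GateResult {
  const correctCount = results.filter((r) => r.correct).length;
  const flakyCount = results.filter((r) => r.flaky).length;
  // An empty result set scores 0 so that missing recordings fail the gate.
  const correctnessRate = results.length === 0 ? 0 : correctCount / results.length;
  const meetsCorrectness = correctnessRate >= limits.minCorrectness;
  const meetsStability = limits.maxFlaky >= flakyCount;
  const passed = [meetsCorrectness, meetsStability].every(Boolean);
  return { passed, correctnessRate, flakyCount };
}

const results: RunResult[] = [
  { name: "checkout-flow", correct: true, flaky: false },
  { name: "login-flow", correct: true, flaky: false },
  { name: "search-flow", correct: true, flaky: false },
  { name: "profile-flow", correct: true, flaky: false },
  { name: "refund-flow", correct: false, flaky: false },
];

const outcome = gate(results, { minCorrectness: 0.8, maxFlaky: 0 });
console.log(outcome);

// Throwing makes the process exit with a non-zero code, which fails the CI step.
if (!outcome.passed) {
  throw new Error("Agent test quality is below the gate");
}
```

&lt;p&gt;Treating an empty result set as a failure is a design choice: a gate that passes when nothing was recorded would hide a broken pipeline.&lt;/p&gt;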




&lt;h2&gt;
  
  
  6. Where this sits: the Evidence Layer
&lt;/h2&gt;

&lt;p&gt;I design AI test systems on a 3-Layer System:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Orchestration&lt;/strong&gt;: decides what to test.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Execution&lt;/strong&gt;: runs the tests. This is where the agent writes code.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Evidence&lt;/strong&gt;: proves the work is right.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most teams build the first two layers and stop. They let the agent write tests and assume the green check means quality.&lt;/p&gt;

&lt;p&gt;Offline evaluation is the Evidence Layer. It is the difference between an agent you hope works and an agent you can prove works.&lt;/p&gt;




&lt;h2&gt;
  
  
  The 5-line checklist
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Record every agent run as a trajectory.&lt;/li&gt;
&lt;li&gt;Replay it offline, with no live API calls.&lt;/li&gt;
&lt;li&gt;Score it on correctness, relevance, stability, and coverage.&lt;/li&gt;
&lt;li&gt;Gate your build on the score.&lt;/li&gt;
&lt;li&gt;Keep the trajectory, so you can re-grade when the rubric improves.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Build the agent. Then prove it works.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Anton Gulin is the AI QA Architect — the first person to claim this title on LinkedIn. He builds AI-powered test automation systems where AI agents and human engineers collaborate on quality. Former Apple SDET (Apple.com / Apple Card pre-release testing). Find him at &lt;a href="https://anton.qa" rel="noopener noreferrer"&gt;anton.qa&lt;/a&gt; or on &lt;a href="https://linkedin.com/in/antongulin" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>testing</category>
      <category>ai</category>
      <category>automation</category>
      <category>playwright</category>
    </item>
    <item>
      <title>Playwright Codegen: The Complete Guide (2026)</title>
      <dc:creator>Anton Gulin</dc:creator>
      <pubDate>Mon, 08 Jun 2026 05:38:38 +0000</pubDate>
      <link>https://dev.to/aiwithanton/playwright-codegen-the-complete-guide-2026-3kf0</link>
      <guid>https://dev.to/aiwithanton/playwright-codegen-the-complete-guide-2026-3kf0</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhdaabap7z1zkbj7w3zkc.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhdaabap7z1zkbj7w3zkc.webp" alt=" " width="800" height="420"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Playwright Codegen&lt;/strong&gt; is a native CLI (command-line interface) tool that generates test scripts automatically as you interact with a browser. It records your actions—like clicks, form inputs, and page navigation—and translates them into clean TypeScript or JavaScript test code.&lt;/p&gt;

&lt;p&gt;For most developers, writing test locators (how tests find buttons) takes up 60% of test writing time. &lt;/p&gt;

&lt;p&gt;Codegen cuts most of that time by generating the locators for you.&lt;/p&gt;

&lt;p&gt;Here is how to use it, and how to scale it from a simple draft tool to a full production architecture.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. How to Launch Playwright Codegen
&lt;/h2&gt;

&lt;p&gt;To start the generator, run this command in your terminal:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx playwright codegen demo.playwright.dev/todomvc
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This launch command opens two windows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;A browser window&lt;/strong&gt;: This is where you click, type, and record your test steps.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Playwright Inspector&lt;/strong&gt;: This is a tool window that displays the generated code in real time.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;As you click on the page, the tool writes the test code automatically.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. Capturing Assertions
&lt;/h2&gt;

&lt;p&gt;A test without assertions (checks to verify behavior) is just a script. &lt;/p&gt;

&lt;p&gt;Codegen allows you to record checks directly from the UI. &lt;/p&gt;

&lt;p&gt;In the browser window, click one of the assertion buttons in the toolbar, then click the element you want to check:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Assert Visibility&lt;/strong&gt;: Verifies that an element is visible on the screen.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Assert Text&lt;/strong&gt;: Verifies that an element contains specific text.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Assert Value&lt;/strong&gt;: Verifies that a form field has a specific value.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This generates standard assertions like &lt;code&gt;await expect(locator).toBeVisible()&lt;/code&gt; instantly.&lt;/p&gt;




&lt;h2&gt;
  
  
  3. Playwright Codegen Best Practices
&lt;/h2&gt;

&lt;p&gt;Generated code is a draft. &lt;/p&gt;

&lt;p&gt;To make it production-ready, apply these three rules:&lt;/p&gt;

&lt;h3&gt;
  
  
  Avoid Hardcoded Wait Times
&lt;/h3&gt;

&lt;p&gt;Codegen does not generate sleep statements. &lt;/p&gt;

&lt;p&gt;Playwright uses auto-waiting (waiting for elements to be ready). &lt;/p&gt;

&lt;p&gt;Keep it that way. &lt;/p&gt;

&lt;p&gt;Do not add manual timeouts.&lt;/p&gt;

&lt;h3&gt;
  
  
  Use Semantic Locators
&lt;/h3&gt;

&lt;p&gt;Playwright prefers locators that reflect how users see the page, such as roles, labels, and visible text.&lt;/p&gt;

&lt;p&gt;Codegen generates these by default:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Good: accessible locator&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getByRole&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;button&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Submit&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;}).&lt;/span&gt;&lt;span class="nf"&gt;click&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="c1"&gt;// Bad: fragile CSS selector&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;locator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;#submit-btn-2&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;click&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Keep the accessible versions. &lt;/p&gt;

&lt;p&gt;They prevent flaky tests (tests that fail randomly).&lt;/p&gt;

&lt;h3&gt;
  
  
  Isolate Your Auth State
&lt;/h3&gt;

&lt;p&gt;Do not record login steps in every single test. &lt;/p&gt;

&lt;p&gt;Use Codegen to save your authentication state once. &lt;/p&gt;

&lt;p&gt;Run Codegen with this save option:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx playwright codegen &lt;span class="nt"&gt;--save-storage&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;auth.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, configure your tests to load &lt;code&gt;auth.json&lt;/code&gt; before running. &lt;/p&gt;

&lt;p&gt;This saves hours of run time in CI (continuous integration servers).&lt;/p&gt;
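
&lt;p&gt;One way to load the saved state is the &lt;code&gt;storageState&lt;/code&gt; option in the Playwright config. The fragment below is a minimal sketch that assumes &lt;code&gt;auth.json&lt;/code&gt; sits where the command above wrote it. It uses a plain object to stay self-contained, though real projects usually wrap the config in &lt;code&gt;defineConfig&lt;/code&gt; from &lt;code&gt;@playwright/test&lt;/code&gt;.&lt;/p&gt;

```typescript
// playwright.config.ts
// Minimal sketch: every test starts with the cookies and local storage
// that Codegen saved via --save-storage=auth.json.
const config = {
  use: {
    // Assumed path; point this at wherever your auth.json actually lives.
    storageState: "auth.json",
  },
};

export default config;
```

&lt;p&gt;Because &lt;code&gt;auth.json&lt;/code&gt; holds live session cookies, keep it out of version control.&lt;/p&gt;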




&lt;h2&gt;
  
  
  4. The Architectural View: From Draft to System
&lt;/h2&gt;

&lt;p&gt;As an AI QA Architect, I view Codegen as a helper. &lt;/p&gt;

&lt;p&gt;It is the entry point of the &lt;strong&gt;Execution Layer&lt;/strong&gt; in the 3-Layer System:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Orchestration&lt;/strong&gt;: Decides when to run tests.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Execution&lt;/strong&gt;: The code that runs (where Codegen helps).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Evidence&lt;/strong&gt;: Gathers logs and traces.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Codegen writes the initial code. &lt;/p&gt;

&lt;p&gt;But it cannot design the framework. &lt;/p&gt;

&lt;p&gt;It cannot handle API mocks (fake servers). &lt;/p&gt;

&lt;p&gt;It cannot govern agentic testing systems (where AI agents write and heal tests).&lt;/p&gt;

&lt;p&gt;Use Codegen to build the first block. &lt;/p&gt;

&lt;p&gt;Then build the architecture around it.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Anton Gulin is the AI QA Architect — the first person to claim this title on LinkedIn. He builds AI-powered test automation systems where AI agents and human engineers collaborate on quality. Former Apple SDET (Apple.com / Apple Card pre-release testing). Find him at &lt;a href="https://anton.qa" rel="noopener noreferrer"&gt;anton.qa&lt;/a&gt; or on &lt;a href="https://linkedin.com/in/antongulin" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>playwright</category>
      <category>testing</category>
      <category>automation</category>
      <category>devops</category>
    </item>
    <item>
      <title>Playwright vs Cypress vs Selenium in 2026</title>
      <dc:creator>Anton Gulin</dc:creator>
      <pubDate>Sun, 24 May 2026 06:59:50 +0000</pubDate>
      <link>https://dev.to/aiwithanton/playwright-vs-cypress-vs-selenium-in-2026-35fg</link>
      <guid>https://dev.to/aiwithanton/playwright-vs-cypress-vs-selenium-in-2026-35fg</guid>
      <description>&lt;p&gt;Playwright is the best default for new browser test automation in 2026. It gives cross-browser runs, parallel CI, API checks, and AI-agent evidence in one tool. Cypress still fits JavaScript-heavy teams that want fast local feedback. Selenium still fits legacy grids and strict browser labs.&lt;/p&gt;

&lt;p&gt;That is the short answer.&lt;/p&gt;

&lt;p&gt;The better answer depends on your system.&lt;/p&gt;

&lt;p&gt;If AI agents will read your failures, the question changes.&lt;br&gt;
You are no longer picking only a test runner.&lt;br&gt;
You are picking the evidence layer.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Changed In 2026
&lt;/h2&gt;

&lt;p&gt;Most comparison posts still ask old questions.&lt;/p&gt;

&lt;p&gt;They ask which tool has cleaner syntax.&lt;br&gt;
They ask which tool is easier to learn.&lt;br&gt;
They ask which tool starts faster.&lt;/p&gt;

&lt;p&gt;Those questions still matter.&lt;br&gt;
They are no longer enough.&lt;/p&gt;

&lt;p&gt;AI agents need proof they can inspect.&lt;br&gt;
Proof means screenshots, traces, browser state, and readable failures.&lt;/p&gt;

&lt;p&gt;The human reviewer still owns the decision.&lt;br&gt;
The agent only helps when the evidence is clear.&lt;/p&gt;

&lt;p&gt;That is why Playwright now has the default seat.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pick Playwright When Evidence Matters
&lt;/h2&gt;

&lt;p&gt;Pick Playwright for new end-to-end test systems.&lt;br&gt;
End-to-end means browser checks.&lt;/p&gt;

&lt;p&gt;Playwright gives you one model across Chromium, Firefox, and WebKit.&lt;br&gt;
Those are browser engines.&lt;br&gt;
They are how pages run.&lt;/p&gt;

&lt;p&gt;That matters for real product risk.&lt;/p&gt;

&lt;p&gt;It also matters for AI-agent workflows.&lt;br&gt;
AI agents are tools that act.&lt;/p&gt;

&lt;p&gt;Playwright now documents Test Agents.&lt;br&gt;
Those agents plan, generate, and repair tests.&lt;/p&gt;

&lt;p&gt;The tool also has strong receipts.&lt;br&gt;
Traces show what happened.&lt;br&gt;
Screenshots show where it happened.&lt;br&gt;
Reports help humans review the failure.&lt;/p&gt;

&lt;p&gt;Use Playwright when you need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;browser coverage across engines&lt;/li&gt;
&lt;li&gt;parallel CI at scale&lt;/li&gt;
&lt;li&gt;trace-based debugging&lt;/li&gt;
&lt;li&gt;API and UI checks together&lt;/li&gt;
&lt;li&gt;AI-agent review paths&lt;/li&gt;
&lt;li&gt;long-term framework ownership&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;CI means automated test runs on a build server.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pick Cypress When Local Feedback Matters Most
&lt;/h2&gt;

&lt;p&gt;Cypress is still useful.&lt;/p&gt;

&lt;p&gt;That sentence matters.&lt;br&gt;
Tool debates get lazy when one side becomes a villain.&lt;/p&gt;

&lt;p&gt;Cypress can be a strong fit for frontend teams.&lt;br&gt;
It works well when developers want quick local feedback.&lt;br&gt;
It also fits teams already built around Cypress Cloud.&lt;/p&gt;

&lt;p&gt;Cypress documents cross-browser testing.&lt;br&gt;
It also documents parallel runs through Cypress Cloud.&lt;/p&gt;

&lt;p&gt;That can be enough for many product teams.&lt;/p&gt;

&lt;p&gt;Use Cypress when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;your app is JavaScript-first&lt;/li&gt;
&lt;li&gt;developers own most browser checks&lt;/li&gt;
&lt;li&gt;fast local debugging is the main goal&lt;/li&gt;
&lt;li&gt;Cypress Cloud is already approved&lt;/li&gt;
&lt;li&gt;browser coverage needs are narrow&lt;/li&gt;
&lt;li&gt;the suite is not agent-driven yet&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The risk appears later.&lt;/p&gt;

&lt;p&gt;As the suite grows, evidence gets more important.&lt;br&gt;
That is where Playwright usually wins.&lt;/p&gt;

&lt;h2&gt;
  
  
  Keep Selenium When Migration Risk Is Higher
&lt;/h2&gt;

&lt;p&gt;Selenium is not dead.&lt;/p&gt;

&lt;p&gt;It is still the right answer for some teams.&lt;/p&gt;

&lt;p&gt;Keep Selenium when a grid already exists.&lt;br&gt;
Keep it when policy requires it.&lt;br&gt;
Keep it when migration risk is higher than tool value.&lt;/p&gt;

&lt;p&gt;But do not choose Selenium by default for new AI QA work.&lt;/p&gt;

&lt;p&gt;You will spend too much time rebuilding the evidence layer.&lt;br&gt;
You will also carry older suite habits forward.&lt;/p&gt;

&lt;p&gt;Selenium can be stable.&lt;br&gt;
The question is whether it helps the next system.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Decision Table
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Need&lt;/th&gt;
&lt;th&gt;Best default&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;New AI-agent test system&lt;/td&gt;
&lt;td&gt;Playwright&lt;/td&gt;
&lt;td&gt;Best evidence path&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Broad browser engine coverage&lt;/td&gt;
&lt;td&gt;Playwright&lt;/td&gt;
&lt;td&gt;One model across major engines&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fast frontend feedback&lt;/td&gt;
&lt;td&gt;Cypress&lt;/td&gt;
&lt;td&gt;Strong local developer loop&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Existing Cypress investment&lt;/td&gt;
&lt;td&gt;Cypress&lt;/td&gt;
&lt;td&gt;Migration may not pay yet&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Legacy grid policy&lt;/td&gt;
&lt;td&gt;Selenium&lt;/td&gt;
&lt;td&gt;Use what the organization can run&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Greenfield QA architecture&lt;/td&gt;
&lt;td&gt;Playwright&lt;/td&gt;
&lt;td&gt;Better long-term receipts&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  My Four-Question Test
&lt;/h2&gt;

&lt;p&gt;I use four questions before I choose.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Who reads the failure first?&lt;/li&gt;
&lt;li&gt;What proof do they need?&lt;/li&gt;
&lt;li&gt;Where will the suite run?&lt;/li&gt;
&lt;li&gt;What happens when the UI changes?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If the answer includes AI agents, I lean Playwright.&lt;/p&gt;

&lt;p&gt;If the answer is one frontend team, Cypress can fit.&lt;/p&gt;

&lt;p&gt;If the answer is legacy policy, keep Selenium.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Practical Recommendation
&lt;/h2&gt;

&lt;p&gt;Start new projects with Playwright.&lt;/p&gt;

&lt;p&gt;Keep Cypress when it already serves the team.&lt;/p&gt;

&lt;p&gt;Keep Selenium when migration would create more risk.&lt;/p&gt;

&lt;p&gt;Then build the same rule across all three:&lt;/p&gt;

&lt;p&gt;Every failed test needs a receipt.&lt;/p&gt;

&lt;p&gt;That receipt should show:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;what action ran&lt;/li&gt;
&lt;li&gt;what page state existed&lt;/li&gt;
&lt;li&gt;what assertion failed&lt;/li&gt;
&lt;li&gt;what changed before failure&lt;/li&gt;
&lt;li&gt;what a human must decide&lt;/li&gt;
&lt;/ul&gt;
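
&lt;p&gt;The five items above map naturally onto a small data structure. This is only a sketch: the &lt;code&gt;FailureReceipt&lt;/code&gt; type, its field names, and the &lt;code&gt;formatReceipt&lt;/code&gt; helper are hypothetical and do not come from Playwright, Cypress, or Selenium.&lt;/p&gt;

```typescript
// Hypothetical receipt shape built from the five items in the list.
type FailureReceipt = {
  action: string; // what action ran
  pageState: string; // what page state existed
  failedAssertion: string; // what assertion failed
  changesBeforeFailure: string[]; // what changed before failure
  humanDecision: string; // what a human must decide
};

// Render a receipt as plain text that a reviewer or an agent can read.
function formatReceipt(receipt: FailureReceipt): string {
  const changes =
    receipt.changesBeforeFailure.length > 0
      ? receipt.changesBeforeFailure.join("; ")
      : "none recorded";
  return [
    "Action: " + receipt.action,
    "Page state: " + receipt.pageState,
    "Failed assertion: " + receipt.failedAssertion,
    "Changes before failure: " + changes,
    "Decision needed: " + receipt.humanDecision,
  ].join("\n");
}

const example: FailureReceipt = {
  action: "click Submit",
  pageState: "checkout form filled, total visible",
  failedAssertion: "expected confirmation banner to be visible",
  changesBeforeFailure: ["payment request returned 500"],
  humanDecision: "product bug or test environment issue?",
};

const receiptText = formatReceipt(example);
console.log(receiptText);
```

&lt;p&gt;Keeping the receipt tool-agnostic means the same review path works whichever runner produced the failure.&lt;/p&gt;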

&lt;p&gt;The tool only matters because the evidence matters.&lt;/p&gt;

&lt;p&gt;In 2026, that is the real comparison.&lt;/p&gt;

&lt;h2&gt;
  
  
  Author Bio
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Anton Gulin is the AI QA Architect — the first person to claim this title on LinkedIn. He builds AI-powered test automation systems where AI agents and human engineers collaborate on quality. Former Apple SDET (Apple.com / Apple Card pre-release testing). Find him at &lt;a href="https://anton.qa" rel="noopener noreferrer"&gt;anton.qa&lt;/a&gt; or on &lt;a href="https://linkedin.com/in/antongulin" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>automation</category>
      <category>testing</category>
      <category>tooling</category>
    </item>
    <item>
      <title>Playwright v1.60 Turns Test Failures Into Evidence</title>
      <dc:creator>Anton Gulin</dc:creator>
      <pubDate>Mon, 18 May 2026 04:54:00 +0000</pubDate>
      <link>https://dev.to/aiwithanton/playwright-v160-turns-test-failures-into-evidence-1ban</link>
      <guid>https://dev.to/aiwithanton/playwright-v160-turns-test-failures-into-evidence-1ban</guid>
      <description>&lt;p&gt;Playwright v1.60 makes failure evidence easier to capture during the run.&lt;/p&gt;

&lt;p&gt;The main change is scoped HAR recording.&lt;/p&gt;

&lt;p&gt;HAR means HTTP Archive, a log file of network requests.&lt;/p&gt;

&lt;p&gt;It shows what the browser sent and received.&lt;/p&gt;

&lt;p&gt;The release also adds file drops, ARIA boxes, and hard test aborts.&lt;/p&gt;

&lt;p&gt;ARIA means Accessible Rich Internet Applications, the attributes that describe a page to assistive tools.&lt;/p&gt;

&lt;p&gt;Together, these changes help CI failures explain themselves.&lt;/p&gt;

&lt;p&gt;CI means automated build server.&lt;/p&gt;

&lt;h2&gt;
  
  
  The practical update
&lt;/h2&gt;

&lt;p&gt;Use &lt;code&gt;context.tracing.startHar()&lt;/code&gt; when network failures waste review time.&lt;/p&gt;

&lt;p&gt;It records a HAR file inside Playwright tracing.&lt;/p&gt;

&lt;p&gt;Tracing means run evidence capture.&lt;/p&gt;

&lt;p&gt;Use &lt;code&gt;locator.drop()&lt;/code&gt; when upload tests use custom events.&lt;/p&gt;

&lt;p&gt;Drop API means file drop simulation.&lt;/p&gt;

&lt;p&gt;Use &lt;code&gt;page.ariaSnapshot({ boxes: true })&lt;/code&gt; when AI tools inspect pages.&lt;/p&gt;

&lt;p&gt;Boxes mean element positions.&lt;/p&gt;

&lt;p&gt;Use &lt;code&gt;test.abort()&lt;/code&gt; when a fixture finds unsafe state.&lt;/p&gt;

&lt;p&gt;Fixtures mean shared test setup.&lt;/p&gt;

&lt;h2&gt;
  
  
  Code example
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;expect&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@playwright/test&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="nf"&gt;test&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;upload records network evidence&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;context&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;using&lt;/span&gt; &lt;span class="nx"&gt;har&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;tracing&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;startHar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;upload.har&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;embed&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;minimal&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;urlFilter&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sr"&gt;/&lt;/span&gt;&lt;span class="se"&gt;\/&lt;/span&gt;&lt;span class="sr"&gt;api&lt;/span&gt;&lt;span class="se"&gt;\/&lt;/span&gt;&lt;span class="sr"&gt;upload/&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;newPage&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;goto&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/upload&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;locator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;#dropzone&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;drop&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;files&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;note.txt&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;mimeType&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;text/plain&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;buffer&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Buffer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;from&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;hello&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getByText&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Upload complete&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="nf"&gt;toBeVisible&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;HAR (HTTP Archive) recording starts before the page opens, so no matching request is missed.&lt;/p&gt;

&lt;p&gt;The drop step sends an in-memory file.&lt;/p&gt;

&lt;p&gt;When the test scope ends, Playwright finalizes the HAR.&lt;/p&gt;
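
&lt;p&gt;Once the HAR is on disk, it can be reviewed without a browser. The following is a minimal Python sketch, assuming the standard HAR layout (&lt;code&gt;log.entries&lt;/code&gt;, each with a &lt;code&gt;request&lt;/code&gt; and a &lt;code&gt;response&lt;/code&gt;). The function name &lt;code&gt;summarize_har&lt;/code&gt; and the sample data are hypothetical and not part of Playwright:&lt;/p&gt;

```python
import json


def summarize_har(har_text):
    """Return (method, url, status) for every entry in a HAR document."""
    har = json.loads(har_text)
    entries = har.get("log", {}).get("entries", [])
    return [
        (e["request"]["method"], e["request"]["url"], e["response"]["status"])
        for e in entries
    ]


# Hypothetical sample shaped like a HAR file, for illustration only.
SAMPLE_HAR = json.dumps({
    "log": {
        "entries": [
            {
                "request": {"method": "POST", "url": "https://example.test/api/upload"},
                "response": {"status": 201},
            }
        ]
    }
})

if __name__ == "__main__":
    for method, url, status in summarize_har(SAMPLE_HAR):
        print(method, url, status)
```

&lt;p&gt;In a real run you would pass the contents of &lt;code&gt;upload.har&lt;/code&gt; instead of the sample string, then attach the summary to the test report.&lt;/p&gt;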

&lt;h2&gt;
  
  
  The rule
&lt;/h2&gt;

&lt;p&gt;Do not treat this release as a feature list.&lt;/p&gt;

&lt;p&gt;Treat it as an evidence upgrade.&lt;/p&gt;

&lt;p&gt;Better tests do not just pass or fail.&lt;/p&gt;

&lt;p&gt;They explain what happened.&lt;/p&gt;

&lt;p&gt;Read the canonical version:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.anton.qa/blog/posts/playwright-v1-60-evidence-first-testing" rel="noopener noreferrer"&gt;https://www.anton.qa/blog/posts/playwright-v1-60-evidence-first-testing&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Anton Gulin is the AI QA Architect — the first person to claim this title on LinkedIn. He builds AI-powered test automation systems where AI agents and human engineers collaborate on quality. Former Apple SDET (Apple.com / Apple Card pre-release testing). Find him at &lt;a href="https://anton.qa" rel="noopener noreferrer"&gt;anton.qa&lt;/a&gt; or on &lt;a href="https://linkedin.com/in/antongulin" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>automation</category>
      <category>news</category>
      <category>testing</category>
      <category>tooling</category>
    </item>
    <item>
      <title>AI Test Automation Architecture: The 3-Layer System</title>
      <dc:creator>Anton Gulin</dc:creator>
      <pubDate>Sun, 17 May 2026 23:53:47 +0000</pubDate>
      <link>https://dev.to/aiwithanton/ai-test-automation-architecture-the-3-layer-system-2078</link>
      <guid>https://dev.to/aiwithanton/ai-test-automation-architecture-the-3-layer-system-2078</guid>
      <description>&lt;p&gt;AI test automation architecture is the system that tells AI what to test.&lt;/p&gt;

&lt;p&gt;It also defines how to run tests and prove the result.&lt;/p&gt;

&lt;p&gt;I split it into three layers: orchestration, execution, and evidence.&lt;/p&gt;

&lt;p&gt;Without all three, AI testing becomes prompt output with no production gate.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why tool lists fail
&lt;/h2&gt;

&lt;p&gt;Most AI testing content starts with tools.&lt;/p&gt;

&lt;p&gt;That is backwards.&lt;/p&gt;

&lt;p&gt;AI means software that predicts.&lt;/p&gt;

&lt;p&gt;Predictions can help QA teams move faster.&lt;/p&gt;

&lt;p&gt;But predictions do not prove quality.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 3-layer model
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Plain meaning&lt;/th&gt;
&lt;th&gt;Main question&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Orchestration&lt;/td&gt;
&lt;td&gt;test control plan&lt;/td&gt;
&lt;td&gt;What risk should this cover?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Execution&lt;/td&gt;
&lt;td&gt;actual test run&lt;/td&gt;
&lt;td&gt;Did it run in the real pipeline?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Evidence&lt;/td&gt;
&lt;td&gt;proof from runs&lt;/td&gt;
&lt;td&gt;Can a human review it?&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  The practical gate
&lt;/h2&gt;

&lt;p&gt;Use this before AI-generated tests ship:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Gate&lt;/th&gt;
&lt;th&gt;Pass condition&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Scope&lt;/td&gt;
&lt;td&gt;The test maps to one named risk&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data&lt;/td&gt;
&lt;td&gt;Test data setup is explicit&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;State&lt;/td&gt;
&lt;td&gt;Browser state is controlled&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Run&lt;/td&gt;
&lt;td&gt;The test passes in CI&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Evidence&lt;/td&gt;
&lt;td&gt;Trace or equivalent proof exists&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Review&lt;/td&gt;
&lt;td&gt;A human can explain the failure mode&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
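
&lt;p&gt;The gate table above can be sketched as a small checklist object. This is only an illustration: the class and field names are hypothetical, and each boolean would be filled in by your own pipeline or reviewer.&lt;/p&gt;

```python
from dataclasses import dataclass, fields


@dataclass
class GateResult:
    """One boolean per gate in the table; field names are illustrative."""
    scope: bool      # the test maps to one named risk
    data: bool       # test data setup is explicit
    state: bool      # browser state is controlled
    run: bool        # the test passes in CI
    evidence: bool   # a trace or equivalent proof exists
    review: bool     # a human can explain the failure mode


def failed_gates(result):
    """Return the names of gates that did not pass, in table order."""
    return [f.name for f in fields(result) if not getattr(result, f.name)]


def may_ship(result):
    """A test ships only when every gate passes."""
    return not failed_gates(result)


if __name__ == "__main__":
    candidate = GateResult(scope=True, data=True, state=True,
                           run=True, evidence=False, review=True)
    print(failed_gates(candidate))  # ['evidence']
    print(may_ship(candidate))      # False
```

&lt;p&gt;Returning the failed gate names, rather than a single pass or fail flag, keeps the result reviewable by a human.&lt;/p&gt;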

&lt;p&gt;CI (continuous integration) is the automated build-and-test server.&lt;/p&gt;

&lt;p&gt;MCP (Model Context Protocol) is a standard for connecting AI agents to tools.&lt;/p&gt;

&lt;p&gt;Playwright is a browser test tool.&lt;/p&gt;

&lt;p&gt;Together, they can help AI agents run useful tests.&lt;/p&gt;

&lt;p&gt;But the architecture must prove each run.&lt;/p&gt;

&lt;h2&gt;
  
  
  The rule
&lt;/h2&gt;

&lt;p&gt;Never ask AI to expand test coverage first.&lt;/p&gt;

&lt;p&gt;Build the proof system before that.&lt;/p&gt;

&lt;p&gt;Generation is cheap.&lt;/p&gt;

&lt;p&gt;Evidence is the architecture.&lt;/p&gt;

&lt;p&gt;Read the canonical version:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.anton.qa/blog/posts/ai-test-automation-architecture-3-layer-system" rel="noopener noreferrer"&gt;https://www.anton.qa/blog/posts/ai-test-automation-architecture-3-layer-system&lt;/a&gt;&lt;/p&gt;





</description>
      <category>ai</category>
      <category>architecture</category>
      <category>automation</category>
      <category>testing</category>
    </item>
    <item>
      <title>How to Test MCP Servers Before They Break Your CI</title>
      <dc:creator>Anton Gulin</dc:creator>
      <pubDate>Mon, 11 May 2026 18:15:50 +0000</pubDate>
      <link>https://dev.to/aiwithanton/how-to-test-mcp-servers-before-they-break-your-ci-p7b</link>
      <guid>https://dev.to/aiwithanton/how-to-test-mcp-servers-before-they-break-your-ci-p7b</guid>
      <description>&lt;p&gt;Most teams install an MCP server and hope it works.&lt;/p&gt;

&lt;p&gt;That is how you get 3 AM pages.&lt;/p&gt;

&lt;p&gt;An MCP server is a bridge between AI agents and your tools. It can crash, leak data, or silently return garbage. If your AI agent relies on it, your whole pipeline breaks.&lt;/p&gt;

&lt;p&gt;MCP stands for Model Context Protocol, a standard for connecting AI agents to tools.&lt;/p&gt;

&lt;p&gt;Do not only test startup. Test behavior and permissions too.&lt;/p&gt;

&lt;p&gt;This post is the checklist I run on every MCP server before it touches production.&lt;/p&gt;

&lt;h2&gt;
  
  
  The three-layer test stack
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;What it catches&lt;/th&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Discovery&lt;/td&gt;
&lt;td&gt;Missing tools, broken metadata&lt;/td&gt;
&lt;td&gt;MCP Inspector&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Behavior&lt;/td&gt;
&lt;td&gt;Silent failures, wrong output&lt;/td&gt;
&lt;td&gt;pytest smoke tests&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Security&lt;/td&gt;
&lt;td&gt;Over-permissions, data leaks&lt;/td&gt;
&lt;td&gt;Permission audit&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Layer 1: Discovery with MCP Inspector
&lt;/h2&gt;

&lt;p&gt;MCP Inspector is the official debugging tool. Start it with:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx @modelcontextprotocol/inspector node dist/server.js
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check three things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Does the server start without errors?&lt;/li&gt;
&lt;li&gt;Does it list the tools it promises?&lt;/li&gt;
&lt;li&gt;Does a sample request return the right shape?&lt;/li&gt;
&lt;/ol&gt;
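
&lt;p&gt;The second and third checks can also be scripted against a saved response. This is a rough sketch, assuming the reply to a &lt;code&gt;tools/list&lt;/code&gt; request carries a &lt;code&gt;result.tools&lt;/code&gt; array whose items have a &lt;code&gt;name&lt;/code&gt; and an &lt;code&gt;inputSchema&lt;/code&gt;. The helper names, tool names, and sample response are hypothetical:&lt;/p&gt;

```python
def missing_tools(response, expected_names):
    """Return expected tool names absent from a tools/list response."""
    tools = response.get("result", {}).get("tools", [])
    listed = {tool.get("name") for tool in tools}
    return sorted(set(expected_names) - listed)


def tools_without_schema(response):
    """Return names of listed tools that lack an inputSchema object."""
    tools = response.get("result", {}).get("tools", [])
    return [t.get("name") for t in tools
            if not isinstance(t.get("inputSchema"), dict)]


# Hypothetical response, shaped like a JSON-RPC reply to "tools/list".
SAMPLE_RESPONSE = {
    "jsonrpc": "2.0",
    "id": 2,
    "result": {
        "tools": [
            {"name": "read_file", "inputSchema": {"type": "object"}},
            {"name": "list_directory"},
        ]
    },
}

if __name__ == "__main__":
    print(missing_tools(SAMPLE_RESPONSE, ["read_file", "write_file"]))  # ['write_file']
    print(tools_without_schema(SAMPLE_RESPONSE))                        # ['list_directory']
```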

&lt;h2&gt;
  
  
  Layer 2: Behavior with pytest
&lt;/h2&gt;

&lt;p&gt;Here is a minimal smoke test. It checks that initialization returns valid JSON-RPC:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;subprocess&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_mcp_server_responds&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="c1"&gt;# Start the reference filesystem server over stdio, scoped to /tmp&lt;/span&gt;
    &lt;span class="n"&gt;proc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;subprocess&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Popen&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;npx&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-y&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;@modelcontextprotocol/server-filesystem&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/tmp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;stdin&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;subprocess&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;PIPE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;stdout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;subprocess&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;PIPE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Send the initialize request as one newline-terminated JSON-RPC message&lt;/span&gt;
        &lt;span class="n"&gt;proc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stdin&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;jsonrpc&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2.0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;method&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;initialize&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;params&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;protocolVersion&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2024-11-05&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;capabilities&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:{},&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;clientInfo&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;test&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;version&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1.0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}}&lt;/span&gt;
        &lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;proc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stdin&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;flush&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;proc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stdout&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;readline&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
        &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
        &lt;span class="c1"&gt;# A successful initialize reply carries a result member; an error reply does not&lt;/span&gt;
        &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;result&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;
    &lt;span class="k"&gt;finally&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Stop the server even when an assertion fails&lt;/span&gt;
        &lt;span class="n"&gt;proc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;terminate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Layer 3: Security with a permission audit
&lt;/h2&gt;

&lt;p&gt;Check three things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Does it need file system access (disk read/write)? Which paths?&lt;/li&gt;
&lt;li&gt;Does it make network calls (external requests)? To which hosts?&lt;/li&gt;
&lt;li&gt;Does it run shell commands (terminal execution)? Under which user?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If the answers are "all files, any host, root user," block it.&lt;/p&gt;
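
&lt;p&gt;The block rule can be written down as a check. MCP itself does not define a permission manifest, so the dictionary format below (&lt;code&gt;fs_paths&lt;/code&gt;, &lt;code&gt;network_hosts&lt;/code&gt;, &lt;code&gt;shell&lt;/code&gt;, &lt;code&gt;run_as&lt;/code&gt;) is a hypothetical record you would fill in by hand during the audit:&lt;/p&gt;

```python
def audit_permissions(manifest):
    """Return a list of findings; an empty list means nothing was flagged."""
    findings = []
    # "/" as an allowed path means the server can touch every file.
    if "/" in manifest.get("fs_paths", []):
        findings.append("file system access covers all files")
    # "*" as an allowed host means the server can call anywhere.
    if "*" in manifest.get("network_hosts", []):
        findings.append("network access allows any host")
    # Shell execution is only flagged here when it runs as root.
    if manifest.get("shell") and manifest.get("run_as") == "root":
        findings.append("shell commands run as root")
    return findings


if __name__ == "__main__":
    too_broad = {"fs_paths": ["/"], "network_hosts": ["*"],
                 "shell": True, "run_as": "root"}
    scoped = {"fs_paths": ["/srv/reports"], "network_hosts": ["api.example.test"],
              "shell": False, "run_as": "mcp"}
    print(audit_permissions(too_broad))
    print(audit_permissions(scoped))  # []
```

&lt;p&gt;A CI job could fail whenever the findings list is not empty.&lt;/p&gt;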

&lt;h2&gt;
  
  
  Where to find servers worth testing
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Official MCP Registry&lt;/strong&gt;: &lt;a href="https://registry.modelcontextprotocol.io" rel="noopener noreferrer"&gt;https://registry.modelcontextprotocol.io&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GitHub&lt;/strong&gt;: Search &lt;code&gt;modelcontextprotocol&lt;/code&gt; topics&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;npm / pip&lt;/strong&gt;: Search &lt;code&gt;@modelcontextprotocol/server-*&lt;/code&gt; on npm or &lt;code&gt;mcp-server-*&lt;/code&gt; on PyPI
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Red flags: no commits in 6+ months, no tests, no README, permission requests that are too broad.&lt;/p&gt;

&lt;h2&gt;
  
  
  Verdict
&lt;/h2&gt;

&lt;p&gt;Testing MCP servers is not optional. An untested server is a bug waiting to become an incident.&lt;/p&gt;

&lt;p&gt;The three-layer stack catches the common failure modes: MCP Inspector for manual checks, pytest for CI gates, and a permission audit as the last line of defense.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Anton Gulin is an AI QA Architect — the first person to claim this title on LinkedIn. He builds AI-powered test automation systems where AI agents and human engineers collaborate on quality. Former Apple SDET, now Lead Software Engineer in Test. &lt;a href="https://anton.qa" rel="noopener noreferrer"&gt;anton.qa&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>cicd</category>
      <category>mcp</category>
      <category>testing</category>
    </item>
    <item>
      <title>Playwright MCP v0.0.73: How to Configure Browser Paths via Environment Variables</title>
      <dc:creator>Anton Gulin</dc:creator>
      <pubDate>Mon, 04 May 2026 21:37:17 +0000</pubDate>
      <link>https://dev.to/aiwithanton/playwright-mcp-v0073-how-to-configure-browser-paths-via-environment-variables-3fap</link>
      <guid>https://dev.to/aiwithanton/playwright-mcp-v0073-how-to-configure-browser-paths-via-environment-variables-3fap</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;This post was originally published on &lt;a href="https://www.anton.qa/blog/posts/playwright-mcp-v0-0-73" rel="noopener noreferrer"&gt;anton.qa&lt;/a&gt;. The canonical version lives there.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Playwright MCP v0.0.73 fixes a critical gap where extension channels and executable paths could not be resolved from CI/CD environment variables.&lt;/p&gt;

&lt;p&gt;If you run Playwright MCP in Docker, Kubernetes, or ephemeral CI workers, this release removes a class of environment-specific debugging that typically consumes 15–30 minutes per incident.&lt;/p&gt;

&lt;h2&gt;
  
  
  What changed
&lt;/h2&gt;

&lt;p&gt;Two interconnected bug fixes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Extension &lt;code&gt;channel&lt;/code&gt; and &lt;code&gt;executablePath&lt;/code&gt; now resolve from CLI flags and environment variables&lt;/strong&gt; (&lt;a href="https://github.com/microsoft/playwright/pull/40572" rel="noopener noreferrer"&gt;#40572&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;--browser&lt;/code&gt; channel flags now propagate on &lt;code&gt;--extension&lt;/code&gt; paths&lt;/strong&gt; (&lt;a href="https://github.com/microsoft/playwright/pull/40567" rel="noopener noreferrer"&gt;#40567&lt;/a&gt;)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Combined, these changes mean your Playwright MCP setup can now be fully environment-driven.&lt;/p&gt;

&lt;h2&gt;
  
  
  The pattern
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;PLAYWRIGHT_BROWSERS_CHANNEL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;chromium
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;PLAYWRIGHT_EXTENSION_PATH&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/path/to/browser-extension
npx playwright &lt;span class="nb"&gt;test&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The resolution hierarchy is now:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;CLI flags (highest priority)&lt;/li&gt;
&lt;li&gt;Environment variables&lt;/li&gt;
&lt;li&gt;Config file defaults&lt;/li&gt;
&lt;li&gt;Built-in channel defaults&lt;/li&gt;
&lt;/ol&gt;
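
&lt;p&gt;The priority order above amounts to "first value that is set wins". Here is a minimal sketch of that idea. It is not Playwright MCP's actual implementation, and the function and parameter names are hypothetical:&lt;/p&gt;

```python
def resolve_setting(cli_value=None, env_value=None, config_value=None, default=None):
    """Return the first value that is set, checked in priority order."""
    for value in (cli_value, env_value, config_value):
        if value is not None:
            return value
    # Nothing was set explicitly, so fall back to the built-in default.
    return default


if __name__ == "__main__":
    # The environment variable beats the config file when no CLI flag is given.
    print(resolve_setting(env_value="chrome", config_value="msedge", default="chromium"))  # chrome
    # A CLI flag beats everything else.
    print(resolve_setting(cli_value="firefox", env_value="chrome"))  # firefox
    # With nothing set, the built-in default applies.
    print(resolve_setting(default="chromium"))  # chromium
```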

&lt;h2&gt;
  
  
  MCP Registry listing
&lt;/h2&gt;

&lt;p&gt;Playwright MCP is now published to the official &lt;a href="https://registry.modelcontextprotocol.io" rel="noopener noreferrer"&gt;MCP Registry&lt;/a&gt; on each release. This simplifies enterprise procurement and governance for teams evaluating AI-assisted testing infrastructure.&lt;/p&gt;

&lt;h2&gt;
  
  
  The gotcha
&lt;/h2&gt;

&lt;p&gt;Environment variables set in your shell may not propagate to the MCP process spawned by your AI tool. Test this before deploying to production.&lt;/p&gt;
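
&lt;p&gt;One way to test propagation is to spawn a child process and ask what it sees. The sketch below uses a plain Python child as a stand-in for the MCP process. The variable name &lt;code&gt;DEMO_BROWSER_CHANNEL&lt;/code&gt; and the helper &lt;code&gt;child_sees&lt;/code&gt; are hypothetical:&lt;/p&gt;

```python
import os
import subprocess
import sys


def child_sees(name, env):
    """Spawn a child Python process with env and return its view of name."""
    code = "import os, sys; sys.stdout.write(os.environ.get(sys.argv[1], ''))"
    result = subprocess.run(
        [sys.executable, "-c", code, name],
        env=env, capture_output=True, text=True, check=True,
    )
    return result.stdout


if __name__ == "__main__":
    name = "DEMO_BROWSER_CHANNEL"  # hypothetical variable name
    # Environment passed on explicitly: the child sees the value.
    inherited = dict(os.environ, **{name: "chromium"})
    # Environment with the variable left out: the child sees nothing.
    stripped = {k: v for k, v in os.environ.items() if k != name}
    print(repr(child_sees(name, inherited)))  # 'chromium'
    print(repr(child_sees(name, stripped)))   # ''
```

&lt;p&gt;If the second case matches what your AI tool does when it launches the server, set the variables in the tool's own server configuration rather than in your shell.&lt;/p&gt;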

&lt;p&gt;For the full breakdown — including CI/CD examples and the subprocess propagation fix — read the canonical post at &lt;a href="https://www.anton.qa/blog/posts/playwright-mcp-v0-0-73" rel="noopener noreferrer"&gt;anton.qa&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Anton Gulin is an AI QA Architect. Former Apple SDET, now Lead Software Engineer in Test. &lt;a href="https://anton.qa" rel="noopener noreferrer"&gt;anton.qa&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>automation</category>
      <category>cicd</category>
      <category>mcp</category>
    </item>
    <item>
      <title>Native Drag-and-Drop Automation Arrives in Playwright MCP</title>
      <dc:creator>Anton Gulin</dc:creator>
      <pubDate>Tue, 28 Apr 2026 18:43:12 +0000</pubDate>
      <link>https://dev.to/aiwithanton/native-drag-and-drop-automation-arrives-in-playwright-mcp-3e16</link>
      <guid>https://dev.to/aiwithanton/native-drag-and-drop-automation-arrives-in-playwright-mcp-3e16</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;Playwright MCP v0.0.71 ships &lt;code&gt;browser_drop&lt;/code&gt;. It gives you native drag-and-drop from any MCP client. No more &lt;code&gt;evaluate&lt;/code&gt; scripts. No more &lt;code&gt;mouse.move&lt;/code&gt; chains. Grid reordering, file drop zones, and text editor drags all behave the way they do for a real user.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Release Matters
&lt;/h2&gt;

&lt;p&gt;QA teams either abandon drag-and-drop testing or hack around it. But sortable grids, file uploads, and rich text editors are everywhere. And they have been painful to test forever.&lt;/p&gt;

&lt;p&gt;I ran into this firsthand on one project. Solid Playwright coverage for clicks, typing, and navigation. But drag-and-drop? We used &lt;code&gt;evaluate&lt;/code&gt; scripts. Or we tested it by hand. Both paths broke across browsers. Both were impossible to keep working.&lt;/p&gt;

&lt;p&gt;Playwright MCP v0.0.71 fixes this with &lt;code&gt;browser_drop&lt;/code&gt;. It uses Playwright's own &lt;code&gt;Locator.drop&lt;/code&gt; — the same API your tests already use. Now any MCP client can call it.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Use browser_drop
&lt;/h2&gt;

&lt;p&gt;Here's a complete example combining &lt;code&gt;browser_drop&lt;/code&gt; with the new response body capture from &lt;code&gt;browser_network_requests&lt;/code&gt; and the simplified expression support in &lt;code&gt;browser_evaluate&lt;/code&gt;. This pipeline automates a file upload scenario, validates the server response, and confirms the UI state update:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;McpServer&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;@modelcontextprotocol/sdk/server/mcp.js&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;server&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;McpServer&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;file-upload-automation&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;1.0.0&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="c1"&gt;// Drop zone and file item selectors for a document management UI&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;dropZoneSelector&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;[data-testid="upload-zone"]&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;fileItemSelector&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;[data-testid="file-item"]&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;uploadedStatusSelector&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;[data-testid="upload-status"]&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;// Tool: Simulate file drag-and-drop onto upload zone&lt;/span&gt;
&lt;span class="nx"&gt;server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addTool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;upload_document_flow&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Upload a document via drag-and-drop and validate response&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;inputSchema&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;object&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;properties&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;fileName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;string&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Name of file to upload&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="na"&gt;fileId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;string&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Unique file identifier&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;fileName&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;fileId&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Navigate to upload interface&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;browser_navigate&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;https://internal-docs.example.com/upload&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;

    &lt;span class="c1"&gt;// Locate drag source and drop target&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;dragSource&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;`text=&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;fileName&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;dropTarget&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;dropZoneSelector&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="c1"&gt;// Execute native drag-and-drop operation&lt;/span&gt;
    &lt;span class="c1"&gt;// browser_drop wraps Locator.drop - no evaluate or mouse.move workarounds&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;dropResult&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;browser_drop&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;source&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;dragSource&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;target&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;dropTarget&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;

    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;dropResult&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;success&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;error&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`Drop operation failed: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;dropResult&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;// Inspect server response body with mime-type detection&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;networkCapture&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;browser_network_requests&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;urlPattern&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;**/api/upload**&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;responseBody&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;responseHeaders&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;

    &lt;span class="c1"&gt;// Extract upload confirmation&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;uploadResponse&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;networkCapture&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;?.[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;uploadResponse&lt;/span&gt;&lt;span class="p"&gt;?.&lt;/span&gt;&lt;span class="nx"&gt;responseBody&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;error&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;No upload response captured&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;// Validate response using plain expression (no function wrapper needed)&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;validationResult&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;browser_evaluate&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;expression&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`JSON.parse(arguments[0]).status === "success"`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;uploadResponse&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;responseBody&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;

    &lt;span class="c1"&gt;// Confirm UI reflects successful upload&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;statusText&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;browser_evaluate&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;expression&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`document.querySelector("&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;uploadedStatusSelector&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;")?.textContent`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;uploaded&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;serverResponse&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;uploadResponse&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;responseBody&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;uiStatus&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;statusText&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;validationPassed&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;validationResult&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;};&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three new tools working together: &lt;code&gt;browser_drop&lt;/code&gt; handles the drag. &lt;code&gt;browser_network_requests&lt;/code&gt; captures the server response (full body, not just status codes). &lt;code&gt;browser_evaluate&lt;/code&gt; runs plain JavaScript — no function wrapper needed.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Gotcha Nobody Is Talking About
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;browser_drop&lt;/code&gt; needs both elements to be on screen. That's correct Playwright behavior. But here's the catch: if you navigate to a page and the drag target sits below the fold, the drop fails.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix&lt;/strong&gt;: Call &lt;code&gt;browser_evaluate&lt;/code&gt; to scroll the target into view before calling &lt;code&gt;browser_drop&lt;/code&gt;, or use the scroll option if your Playwright version supports it. This catches teams off guard in CI, where viewports are often smaller than on a local development machine.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Before browser_drop: ensure target is in viewport&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;browser_evaluate&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;expression&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`document.querySelector("&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;dropTarget&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;").scrollIntoView()`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is not a bug. It's how Playwright works. An element that is visible on a big local screen can be off screen in the pipeline, so a drop that passes on your machine fails in CI.&lt;/p&gt;
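&lt;p&gt;A minimal sketch of the scroll guard as a reusable helper. It only builds the expression string. The name &lt;code&gt;buildScrollGuardExpression&lt;/code&gt; is hypothetical, and the sketch assumes the drop target is a CSS selector and that &lt;code&gt;browser_evaluate&lt;/code&gt; accepts a plain expression as in the snippet above. &lt;code&gt;JSON.stringify&lt;/code&gt; quotes the selector so that quotes inside it cannot break the generated JavaScript, and optional chaining avoids a thrown error when nothing matches.&lt;/p&gt;

```typescript
// Hypothetical helper: builds the expression string handed to browser_evaluate.
// Assumes the target is a CSS selector understood by document.querySelector.
function buildScrollGuardExpression(cssSelector: string): string {
  // JSON.stringify yields a valid JavaScript string literal, escaping any quotes.
  const quoted = JSON.stringify(cssSelector);
  // block: "center" keeps the element away from the viewport edges.
  return `document.querySelector(${quoted})?.scrollIntoView({ block: "center" })`;
}

console.log(buildScrollGuardExpression('[data-testid="drop-zone"]'));
```

&lt;p&gt;Pass the returned string as the &lt;code&gt;expression&lt;/code&gt; argument before the drop call. If the selector matches nothing, the expression evaluates to &lt;code&gt;undefined&lt;/code&gt; instead of throwing, so the drop step is still the one that reports the failure.&lt;/p&gt;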

&lt;h2&gt;
  
  
  What This Changes in Your CI Pipeline
&lt;/h2&gt;

&lt;p&gt;With &lt;code&gt;browser_drop&lt;/code&gt;, you can test drag-and-drop flows through MCP. Not by hand. Not with broken scripts.&lt;/p&gt;

&lt;p&gt;On one project, migrating from Selenium to Playwright gave us 40% faster tests. But drag-and-drop still broke in headless mode. We wrote &lt;code&gt;evaluate&lt;/code&gt; scripts. They stopped working every sprint. &lt;code&gt;browser_drop&lt;/code&gt; puts native drag-and-drop into MCP. No scripts. No workarounds.&lt;/p&gt;

&lt;p&gt;What this actually means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Fewer flaky tests.&lt;/strong&gt; Native drag-and-drop is tested across browsers. &lt;code&gt;evaluate&lt;/code&gt; + &lt;code&gt;mouse.move&lt;/code&gt; sequences are not.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Simpler AI test generation.&lt;/strong&gt; AI tools call &lt;code&gt;browser_drop&lt;/code&gt; directly. No fragile mouse chains.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Faster CI.&lt;/strong&gt; Native operations run faster than JavaScript-injected drag scripts.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Verdict
&lt;/h2&gt;

&lt;p&gt;Playwright MCP v0.0.71 is worth the upgrade for &lt;code&gt;browser_drop&lt;/code&gt; alone. Response body capture and plain expression support are welcome extras. But drag-and-drop was the missing piece. Now it's there.&lt;/p&gt;

&lt;p&gt;The catch is real but small. Scroll your target into view before you drop. One line. Add it to your tool definitions and move on.&lt;/p&gt;

&lt;p&gt;If you run MCP-based test infrastructure, this kills the last reason to fall back to &lt;code&gt;evaluate&lt;/code&gt; for drag-and-drop. Upgrade. Add the scroll guard. Ship.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reference&lt;/strong&gt;: &lt;a href="https://playwright.dev/docs/api/class-locator#locator-drop" rel="noopener noreferrer"&gt;Playwright Locator.drop API documentation&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Anton Gulin is an AI QA Architect — the first person to claim this title on LinkedIn. He builds AI-powered test automation systems where AI agents and human engineers collaborate on quality. Former Apple SDET, now Lead Software Engineer in Test. Find him at &lt;a href="https://anton.qa" rel="noopener noreferrer"&gt;anton.qa&lt;/a&gt; or on &lt;a href="https://linkedin.com/in/antongulin" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>automation</category>
      <category>mcp</category>
      <category>testing</category>
    </item>
    <item>
      <title>Playwright Just Shipped the Fix For Flaky Tests I Built 3 Years Ago</title>
      <dc:creator>Anton Gulin</dc:creator>
      <pubDate>Fri, 24 Apr 2026 19:46:48 +0000</pubDate>
      <link>https://dev.to/aiwithanton/playwright-just-shipped-the-fix-for-flaky-tests-i-built-3-years-ago-56nf</link>
      <guid>https://dev.to/aiwithanton/playwright-just-shipped-the-fix-for-flaky-tests-i-built-3-years-ago-56nf</guid>
      <description>&lt;p&gt;I shipped a self-healing test framework three years ago. Nobody called it agentic then. The word "agent" was what your antivirus company ran on your laptop.&lt;/p&gt;

&lt;p&gt;I called my three internal components Planner, Generator, and Healer. Not because I'd read a paper — because those were the three jobs the pipeline needed and I was out of clever names.&lt;/p&gt;

&lt;p&gt;Last October, Playwright v1.56 shipped native Test Agents. Three of them.&lt;/p&gt;

&lt;p&gt;They're called &lt;strong&gt;Planner&lt;/strong&gt;, &lt;strong&gt;Generator&lt;/strong&gt;, and &lt;strong&gt;Healer&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This week's v1.59 release added the infrastructure that makes the three-role pattern viable in production: video receipts via &lt;code&gt;page.screencast&lt;/code&gt;, MCP interop via &lt;code&gt;browser.bind()&lt;/code&gt;, and async disposables for clean resource management. The agents shipped in October. The infrastructure they need shipped this week.&lt;/p&gt;

&lt;p&gt;So this is a post about a pattern that just got validated by the team that ships the framework I bet my career on. It's also a post about what the Microsoft implementation gets right, where it's still missing the part that actually makes this work in production, and how to start using it whether or not you migrate today.&lt;/p&gt;

&lt;p&gt;If you're a QA architect, test lead, or SDET who's ever been told to "just make the flaky tests pass" — this one's for you.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem: the flake tax nobody budgets for
&lt;/h2&gt;

&lt;p&gt;Here's a number every engineering manager underestimates: the flake tax.&lt;/p&gt;

&lt;p&gt;On a team I worked with years ago — mid-stage B2B SaaS, 12 engineers, 8 services — the suite had about 1,200 end-to-end tests. Roughly 4% flaked per run. Sounds tolerable. It wasn't.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;1,200 tests × 4% flake × 20 PR runs per day = ~1,000 spurious failures per day&lt;/li&gt;
&lt;li&gt;Every spurious failure triggers a re-run, a triage, a Slack thread&lt;/li&gt;
&lt;li&gt;On a good week, 3 engineers burned a full day each chasing ghosts&lt;/li&gt;
&lt;li&gt;On a bad week (release freeze, CI degradation, upstream flake) it could take the whole team for 2 sprints&lt;/li&gt;
&lt;/ul&gt;
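&lt;p&gt;Using the figures above (about 1,200 tests, roughly 4% flaking per run, 20 PR runs per day), the expected number of spurious failures can be sketched as follows. The function name is illustrative, and the calculation assumes each test flakes independently, which real suites only approximate.&lt;/p&gt;

```typescript
// Expected spurious failures, assuming each test flakes independently.
function spuriousFailures(tests: number, flakeRate: number, runs: number): number {
  return Math.round(tests * flakeRate * runs);
}

const perRun = spuriousFailures(1200, 0.04, 1);  // 48 flaky failures in a single run
const perDay = spuriousFailures(1200, 0.04, 20); // 960 across 20 PR runs
console.log({ perRun, perDay });
```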

&lt;p&gt;That's the flake tax. It's paid in engineer-weeks, not dollars, which is why it doesn't show up on the budget but shows up everywhere else — missed deadlines, canceled demos, the senior engineer quietly looking for a new job because they're tired of being the flake-whisperer.&lt;/p&gt;

&lt;p&gt;The traditional fix is discipline: write better locators, wait on the right events, don't trust the backend, quarantine flakes, review the quarantine weekly, blah blah. All true. All inadequate at scale. Discipline is linear; flake is exponential.&lt;/p&gt;

&lt;p&gt;Eventually I stopped fighting flake as a writer and started designing against it as an architect.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Drama: the 2-week death-march that broke me
&lt;/h2&gt;

&lt;p&gt;I won't name the company or the release. I will say that at one point I had a test suite that was green locally, yellow on a clean CI build, and red only when run in parallel with the next suite over.&lt;/p&gt;

&lt;p&gt;The failure was non-deterministic. The reproduction wasn't. It happened every Tuesday between 10:14 AM and 10:22 AM.&lt;/p&gt;

&lt;p&gt;We lost two weeks to it. I tried everything. I tried everything again. I tried everything in a different order. On day 11 I sat in a conference room at 9 PM with a whiteboard full of arrows and realized the tests were not the problem. The &lt;em&gt;test infrastructure&lt;/em&gt; was the problem. My framework assumed the application was the only thing being tested. It wasn't. The CI runner was being tested too. So was the database snapshot restore job. So was the deployment timing on the staging environment.&lt;/p&gt;

&lt;p&gt;We fixed that specific bug. But the death-march taught me the thing I'd refused to see: &lt;strong&gt;test maintenance is not a writing problem. It's an architecture problem.&lt;/strong&gt; The tests don't need more discipline. The framework around them needs more intelligence.&lt;/p&gt;

&lt;p&gt;That's where the three-role pattern was born.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Solution: An AI Test Automation Architecture in Three Roles
&lt;/h2&gt;

&lt;p&gt;Here's the pattern, condensed. The names are mine, but the ideas were obvious once I stopped pretending they weren't separate jobs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Planner
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Job:&lt;/strong&gt; given a feature, a user story, or a production incident, produce a structured test plan. Not test code — a plan. A list of flows, edge cases, pre-conditions, cleanup.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it's separate:&lt;/strong&gt; planning and writing are different skills. If one component does both jobs, tests drift from plans. You get tests the agent couldn't describe, and gaps where no code pattern existed to copy from. Planning first forces completeness before cleverness.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What I built three years ago:&lt;/strong&gt; a template-driven plan generator that read from PR descriptions, Jira tickets, and production alerts, and produced a Markdown spec engineers reviewed before any code was written. Approval rate on plans was ~85%, and the rejected 15% were caught in minutes, not days of debugging.&lt;/p&gt;
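&lt;p&gt;A plan is easier to review when it has a fixed shape. The sketch below is one hypothetical way to model it. The &lt;code&gt;TestPlan&lt;/code&gt; type, its field names, and &lt;code&gt;renderPlanMarkdown&lt;/code&gt; are illustrative assumptions, not the format of any particular tool.&lt;/p&gt;

```typescript
// Hypothetical plan shape: flows, edge cases, pre-conditions, cleanup.
interface TestPlan {
  feature: string;
  preconditions: string[];
  flows: string[];
  edgeCases: string[];
  cleanup: string[];
}

// Render one Markdown section; empty sections are marked so reviewers notice gaps.
function renderSection(title: string, items: string[]): string {
  const body = items.length > 0
    ? items.map((item) => `- ${item}`).join("\n")
    : "- (none listed)";
  return `## ${title}\n${body}`;
}

// Turn a plan into a Markdown spec that can be version-controlled and reviewed.
function renderPlanMarkdown(plan: TestPlan): string {
  return [
    `# Test plan: ${plan.feature}`,
    renderSection("Pre-conditions", plan.preconditions),
    renderSection("Flows", plan.flows),
    renderSection("Edge cases", plan.edgeCases),
    renderSection("Cleanup", plan.cleanup),
  ].join("\n\n");
}

console.log(renderPlanMarkdown({
  feature: "File upload",
  preconditions: ["User is logged in"],
  flows: ["Drag a file onto the drop zone"],
  edgeCases: [],
  cleanup: ["Delete the uploaded file"],
}));
```

&lt;p&gt;Printing an explicit placeholder for empty sections is a deliberate choice: a missing edge-case list is the kind of gap a reviewer should see before any code is written.&lt;/p&gt;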

&lt;h3&gt;
  
  
  Generator
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Job:&lt;/strong&gt; take an approved plan and produce the test code. Choose the locators, write the assertions, set up the fixtures.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it's separate:&lt;/strong&gt; code generation benefits from narrow context (the plan), not broad context (the whole codebase). A focused generator with one plan is better than a generalist agent with the whole repo.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What I built:&lt;/strong&gt; a generator that output Playwright/TypeScript tests from plan Markdown, with locator strategies (data-testid preferred, role-based fallback, text-based last-resort), fixture scaffolding, and soft-assertion patterns baked in.&lt;/p&gt;
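&lt;p&gt;The locator priority described above can be sketched as a small pure function. The &lt;code&gt;ElementHints&lt;/code&gt; shape and &lt;code&gt;chooseLocator&lt;/code&gt; are hypothetical names for illustration. The emitted strings use Playwright's &lt;code&gt;getByTestId&lt;/code&gt;, &lt;code&gt;getByRole&lt;/code&gt;, and &lt;code&gt;getByText&lt;/code&gt; page methods.&lt;/p&gt;

```typescript
// Hypothetical description of an element the generator needs to target.
interface ElementHints {
  testId?: string;
  role?: string;
  accessibleName?: string;
  text?: string;
}

// Pick a locator expression in priority order:
// data-testid first, role-based second, text-based as a last resort.
function chooseLocator(hints: ElementHints): string {
  if (hints.testId) {
    return `page.getByTestId(${JSON.stringify(hints.testId)})`;
  }
  if (hints.role) {
    const options = hints.accessibleName
      ? `, { name: ${JSON.stringify(hints.accessibleName)} }`
      : "";
    return `page.getByRole(${JSON.stringify(hints.role)}${options})`;
  }
  if (hints.text) {
    return `page.getByText(${JSON.stringify(hints.text)})`;
  }
  // Refuse to guess: a plan step without any hint should go back to review.
  throw new Error("No usable locator hint for this element");
}

console.log(chooseLocator({ role: "button", accessibleName: "Upload" }));
```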

&lt;h3&gt;
  
  
  Healer
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Job:&lt;/strong&gt; when a test fails, diagnose whether the failure is real (app bug), structural (locator stale after a UI refactor), or environmental (CI flake). Fix the structural ones. Flag the real ones. Quarantine the environmental ones with context.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it's separate:&lt;/strong&gt; and this is the one nobody wanted to believe at the time — healing is not about re-running failed tests until they pass. That's not healing; that's hiding. Real healing is triage plus targeted mutation plus review.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What I built:&lt;/strong&gt; a healer that diffed the current DOM against the last green run, proposed three locator candidates when the old one was stale, scored each against a stability heuristic, and opened a PR with the one-line change for a human to review. Merge rate on healer PRs was ~80%. The other 20% were caught in review, which is exactly what a healer is supposed to look like.&lt;/p&gt;
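&lt;p&gt;A rough sketch of the two Healer decisions: triage first, then choosing among locator candidates. Everything here is an assumption for illustration, including the signal names, the rule that a stale locator means a structural failure, and the stability weights. The clean retry is used only as a diagnostic signal, not as the fix.&lt;/p&gt;

```typescript
type FailureKind = "real" | "structural" | "environmental";

// Hypothetical signals a healer might collect about a failed test.
interface FailureSignals {
  locatorFound: boolean;       // did the original locator still match an element?
  passedOnCleanRetry: boolean; // did an untouched retry pass?
}

// Triage sketch: stale locator => structural, passes on retry => environmental,
// anything else is treated as a real failure and flagged for a human.
function classifyFailure(signals: FailureSignals): FailureKind {
  if (!signals.locatorFound) return "structural";
  if (signals.passedOnCleanRetry) return "environmental";
  return "real";
}

interface LocatorCandidate {
  selector: string;
  kind: "testId" | "role" | "text" | "css";
}

// Illustrative stability weights: higher means less likely to break on refactors.
const STABILITY: { [kind: string]: number } = { testId: 3, role: 2, text: 1, css: 0 };

// Return the candidate with the highest stability score (first one wins ties).
function pickMostStable(candidates: LocatorCandidate[]): LocatorCandidate {
  if (candidates.length === 0) throw new Error("No locator candidates proposed");
  return candidates.reduce((best, next) =>
    STABILITY[next.kind] > STABILITY[best.kind] ? next : best
  );
}

console.log(classifyFailure({ locatorFound: false, passedOnCleanRetry: false }));
console.log(pickMostStable([
  { selector: ".btn-primary", kind: "css" },
  { selector: "button[name=Upload]", kind: "role" },
]));
```

&lt;p&gt;The output of &lt;code&gt;pickMostStable&lt;/code&gt; is a proposal, not a merge. It still goes into a PR for a human to review.&lt;/p&gt;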

&lt;h2&gt;
  
  
  The Numbers
&lt;/h2&gt;

&lt;p&gt;I don't love citing numbers without naming the shop, but I'm not going to name the shops. So here's what's defensible:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;On one project, the three-role pattern let us grow the test suite by ~3× over 18 months while the flake rate stayed flat.&lt;/li&gt;
&lt;li&gt;On another, we cut the test-maintenance time-per-engineer by more than a third in the first quarter after rollout.&lt;/li&gt;
&lt;li&gt;On a third, the Healer caught a UI-refactor regression pattern (100+ tests stale from a single CSS rename) and produced a single-PR fix overnight. The alternative would have been a 3-week cleanup sprint.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These numbers are not magic. They are the mechanical consequence of separating concerns and instrumenting the boundary between them. If you already do this with your services in production, you already know why it works.&lt;/p&gt;

&lt;h2&gt;
  
  
  Now Playwright Ships It Natively
&lt;/h2&gt;

&lt;p&gt;Playwright v1.56 (October 2025) released a set of Test Agents in the VS Code extension and the CLI:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Planner agent&lt;/strong&gt; — explores the app, writes structured test plans&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Generator agent&lt;/strong&gt; — converts plans into test code&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Healer agent&lt;/strong&gt; — fixes failing tests with AI assistance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The release notes span three versions: &lt;a href="https://github.com/microsoft/playwright/releases/tag/v1.56.0" rel="noopener noreferrer"&gt;v1.56&lt;/a&gt; shipped the agents themselves, &lt;a href="https://github.com/microsoft/playwright/releases/tag/v1.58.0" rel="noopener noreferrer"&gt;v1.58&lt;/a&gt; shipped the token-efficient CLI (&lt;code&gt;playwright-cli&lt;/code&gt;), and &lt;a href="https://github.com/microsoft/playwright/releases/tag/v1.59.0" rel="noopener noreferrer"&gt;v1.59&lt;/a&gt; shipped the agent-facing APIs — &lt;a href="https://playwright.dev/docs/api/class-browser#browser-bind" rel="noopener noreferrer"&gt;&lt;code&gt;browser.bind()&lt;/code&gt;&lt;/a&gt; for MCP interop and &lt;a href="https://playwright.dev/docs/api/class-page#page-screencast" rel="noopener noreferrer"&gt;&lt;code&gt;page.screencast&lt;/code&gt;&lt;/a&gt; for video receipts. The naming and the split are what matter — Microsoft built the same architecture I built. They built it better in several specific ways and worse in one.&lt;/p&gt;

&lt;h3&gt;
  
  
  What Microsoft got right
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Each agent is separate.&lt;/strong&gt; You can run Planner alone, pass its output to Generator, and never touch Healer. That separation is the whole point — an agent system where everything is entangled is just one big prompt.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The agents are optional.&lt;/strong&gt; You don't have to buy in all at once. You can drop Healer into an existing suite and leave Planner and Generator out. That's how adoption actually happens in enterprise shops.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;They shipped the infrastructure, not just the agents.&lt;/strong&gt; Two pieces matter here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://playwright.dev/docs/api/class-browser#browser-bind" rel="noopener noreferrer"&gt;&lt;code&gt;browser.bind()&lt;/code&gt;&lt;/a&gt; — added in v1.59. It exposes a running browser over a named pipe or websocket. Any MCP client can attach.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://chromewebstore.google.com/detail/playwright-mcp-bridge/mmlmfjhmonkocbjadbfplnigmagldckm" rel="noopener noreferrer"&gt;Playwright MCP Bridge&lt;/a&gt; — a free Chrome extension that connects your already-open tabs to a local Playwright MCP server. Your real cookies. Your real profile. Your real logged-in session. Claude, Cursor, or your own agent acts on that tab — no fresh browser, no cookie-copying, no SSO-mocking.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Together, those two things do something QA teams have been hacking around for years: they let AI agents work on your actual authenticated browser instead of a fresh empty one. Microsoft built the plumbing. You don't have to.&lt;/p&gt;

&lt;h3&gt;
  
  
  What the official implementation is still missing
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The contract.&lt;/strong&gt; Self-healing is not a feature; it's a contract between the test, the app, and the team. The Healer agent will happily propose fixes — but who reviews them? Who owns the approval policy? Who escalates when the Healer's fix rate drops? The official implementation ships the agent; it doesn't ship the ops pattern around the agent.&lt;/p&gt;

&lt;p&gt;That ops pattern is the hard part. It's also the part you have to build regardless of whether you adopt Microsoft's agents or keep your own. A Healer without a review loop is just a regression generator with a nicer UI.&lt;/p&gt;

&lt;h2&gt;
  
  
  What to Do (Whether or Not You Migrate)
&lt;/h2&gt;

&lt;p&gt;If you're running Playwright already, the path is obvious: try the Planner agent in VS Code next sprint. Feed it one real user story. Compare its output to what you'd have written. Repeat ten times. If it's producing plans you'd ship to a junior engineer, you've just found a 2–3x productivity lever.&lt;/p&gt;

&lt;p&gt;If you're on Selenium, Cypress, or something older, the migration math got better with v1.59 this month — but the pattern is portable. You don't need Microsoft's implementation to build this. You need:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Plans as artifacts.&lt;/strong&gt; Markdown. Version-controlled. Reviewable.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Generators with narrow context.&lt;/strong&gt; One plan in. One test file out. No repo-wide reasoning.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A healer with a review loop.&lt;/strong&gt; It proposes, a human approves, CI enforces. If the human always approves, your healer is working. If the human always rejects, your healer is broken. If it's 80/20, it's doing its job.&lt;/li&gt;
&lt;/ol&gt;
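&lt;p&gt;One way to keep the review loop honest is to track how often healer proposals get approved. The sketch below is a hypothetical metric, not part of Playwright. The &lt;code&gt;ReviewOutcome&lt;/code&gt; type, the function names, and the 0.5 floor are assumptions you would tune for your own team.&lt;/p&gt;

```typescript
type ReviewOutcome = "approved" | "rejected";

// Share of healer proposals that reviewers approved, from 0 to 1.
function approvalRate(outcomes: ReviewOutcome[]): number {
  if (outcomes.length === 0) return 0;
  const approved = outcomes.filter((outcome) => outcome === "approved").length;
  return approved / outcomes.length;
}

// Hypothetical escalation rule: raise a flag when approvals fall below a floor.
// With no reviews yet there is nothing to escalate.
function needsEscalation(outcomes: ReviewOutcome[], floor = 0.5): boolean {
  return outcomes.length > 0 ? floor > approvalRate(outcomes) : false;
}

const recent: ReviewOutcome[] = ["approved", "approved", "approved", "approved", "rejected"];
console.log(approvalRate(recent), needsEscalation(recent));
```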

&lt;p&gt;Start with the Healer if you're drowning. Start with the Planner if you're understaffed. Start with the Generator last — it's the sexiest one, but it's the least useful without the other two.&lt;/p&gt;

&lt;p&gt;And if you're the AI QA Architect on a team that doesn't have this yet: this post is your new case study. Print it. Paste it in your design doc. Replace "I built" with "the team can build" and take it to your next architecture review.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Takeaway
&lt;/h2&gt;

&lt;p&gt;Three years ago this pattern was a weird thing a weird architect built because nothing off the shelf solved the problem.&lt;/p&gt;

&lt;p&gt;Last October it shipped as a native feature in the framework every serious web team uses. This week's v1.59 release added the infrastructure that makes it production-viable — video receipts, MCP interop, async disposables.&lt;/p&gt;

&lt;p&gt;If you're still treating flake as a writing problem, you're three years behind the curve. If you're treating it as an architecture problem, you're on the curve. If you've been treating it as an architecture problem for a while, you're ahead of the team that ships the framework. That's a fine place to be.&lt;/p&gt;

&lt;p&gt;The pattern worked then. It ships natively now — agents in v1.56, infrastructure in v1.59. The contract around it is still yours to build.&lt;/p&gt;

&lt;p&gt;That's the job.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Anton Gulin is the AI QA Architect — the first person to claim this title on LinkedIn. He builds AI-powered test automation systems where AI agents and human engineers collaborate on quality. Former Apple SDET (Apple.com / Apple Card pre-release testing). Find him at &lt;a href="https://anton.qa" rel="noopener noreferrer"&gt;anton.qa&lt;/a&gt; or on &lt;a href="https://linkedin.com/in/antongulin" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Originally published at &lt;a href="https://www.anton.qa" rel="noopener noreferrer"&gt;https://www.anton.qa&lt;/a&gt; on April 23, 2026.&lt;/p&gt;

</description>
      <category>agents</category>
      <category>automation</category>
      <category>javascript</category>
      <category>testing</category>
    </item>
    <item>
      <title>Create Video Receipts for AI Agents with Playwright Screencast API</title>
      <dc:creator>Anton Gulin</dc:creator>
      <pubDate>Sat, 18 Apr 2026 01:40:47 +0000</pubDate>
      <link>https://dev.to/aiwithanton/create-video-receipts-for-ai-agents-with-playwright-screencast-api-1014</link>
      <guid>https://dev.to/aiwithanton/create-video-receipts-for-ai-agents-with-playwright-screencast-api-1014</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;Playwright v1.59.0 ships the Screencast API, letting AI agents produce verifiable video evidence of their work. Engineers can replay agent actions with chapter markers and action annotations—no manual test replay required. Setup is three lines: start the screencast, run your agent logic, stop and save. This is the observability layer agentic workflows have been missing.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Release
&lt;/h2&gt;

&lt;p&gt;Playwright v1.59.0 dropped last week and the headline feature is the Screencast API. Full disclosure: I've been watching the agentic testing space closely, and the honest assessment is that most of what passes for "AI testing" is smoke and mirrors—agents clicking around without verifiable evidence of what they actually did. The Screencast API is different. It gives you a real video of the agent's session with semantic overlays, not just a trace file you have to manually load and interpret.&lt;/p&gt;

&lt;p&gt;The API surface is straightforward: &lt;code&gt;page.screencast.start()&lt;/code&gt; initiates recording and &lt;code&gt;page.screencast.stop()&lt;/code&gt; finalizes it. Between those calls, Playwright captures JPEG frames in real-time and lets you annotate them with chapter titles and action labels. You get a video file you can attach to a ticket, drop in a Slack thread, or store as audit evidence.&lt;/p&gt;

&lt;p&gt;This release also includes &lt;code&gt;browser.bind()&lt;/code&gt; for MCP integration, a CLI debugger, and async disposables—but for this post, I'm focusing on the Screencast API because it's the feature that directly addresses the verification problem in agentic workflows.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Matters for Engineers and QA
&lt;/h2&gt;

&lt;p&gt;If you're building or evaluating AI coding agents that interact with browsers, you face a fundamental trust problem. How do you verify that the agent actually clicked the right button, waited for the correct network response, and didn't accidentally trigger a destructive flow? Logs help, but they're not persuasive in a code review. Screenshots help more, but they don't capture temporal sequences well.&lt;/p&gt;

&lt;p&gt;Video receipts solve this. You get a playback of the full session with chapter markers at key decision points. Your PM can watch a 90-second clip instead of reading 200 lines of trace output. Your security team gets evidence they can archive. Your CI system gets an artifact to attach to the test report.&lt;/p&gt;

&lt;p&gt;For QA teams specifically, this changes the audit story. When a flaky test gets investigated, you currently spend 20-30 minutes reproducing the environment, loading traces, and reconstructing what happened. With a screencast, you open a video. That's a real workflow improvement, even if it's not a headline-grabbing metric.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Use It
&lt;/h2&gt;

&lt;p&gt;Here's the implementation. The API supports chapter titles, action annotations, and visual overlays. You can configure frame capture rate and output format.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;chromium&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@playwright/test&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;recordAgentSession&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;browser&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;chromium&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;launch&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;newPage&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;


  &lt;span class="c1"&gt;// Start screencast with chapter title&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;screencast&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;dir&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;./screencasts&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;fileName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`session-&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nb"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()}&lt;/span&gt;&lt;span class="s2"&gt;.webm`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;fps&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;15&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="c1"&gt;// Add chapter marker&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;screencast&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addChapter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Login Flow&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;startTime&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Authentication&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="c1"&gt;// Your agent logic goes here&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;goto&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getByLabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Username&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;fill&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;testuser&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getByLabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Password&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;fill&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;password123&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getByRole&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;button&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Sign In&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;}).&lt;/span&gt;&lt;span class="nf"&gt;click&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

  &lt;span class="c1"&gt;// Add action annotation overlay&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;screencast&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;annotate&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;action&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;label&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Clicked: Sign In&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;position&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;400&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;y&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;300&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="c1"&gt;// Capture frame for AI vision processing&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;frame&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;screencast&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;captureFrame&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

  &lt;span class="c1"&gt;// Stop and finalize&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;recording&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;screencast&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stop&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Recording saved:&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;recording&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;filePath&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nf"&gt;recordAgentSession&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;https://app.example.com/dashboard&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;captureFrame()&lt;/code&gt; method is what makes this useful for AI vision workflows. You pass the JPEG buffer to your vision model for validation or further processing. The agent produces the evidence; you decide what to do with it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Gotcha Nobody Is Talking About
&lt;/h2&gt;

&lt;p&gt;Here's what the release notes don't emphasize: screencast recording in headless mode is not pixel-perfect. If your agent is doing precise visual assertions—checking exact colors, pixel-level positioning, or anti-aliased text rendering—the video artifacts may not match what you'd see in headed mode. I've seen this bite teams who expected the screencast to replace visual regression testing.&lt;/p&gt;

&lt;p&gt;The API works correctly and the implementation is solid, but it's recording a compressed video, not a pixel-accurate capture of the render pipeline. Use it for workflow verification, not for asserting that #FF5733 exactly matches your design token. For that use case, you still need Playwright's built-in visual comparisons or a dedicated visual regression tool.&lt;/p&gt;

&lt;p&gt;Also worth noting: the output file can get large quickly. A 5-minute session at 15 fps with visual overlays will easily be 50-100MB. You'll want to configure retention policies in your CI system if you're storing these as test artifacts. Don't let this become your next storage incident.&lt;/p&gt;
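&lt;p&gt;A minimal sketch of such a retention policy, assuming the screencasts land as &lt;code&gt;.webm&lt;/code&gt; files in a single directory. The &lt;code&gt;pruneScreencasts&lt;/code&gt; helper and the age threshold are hypothetical and not part of Playwright; the sketch uses only Node's &lt;code&gt;fs&lt;/code&gt; and &lt;code&gt;path&lt;/code&gt; modules.&lt;/p&gt;

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Hypothetical helper (not a Playwright API): delete .webm files older than
// maxAgeDays from a directory and return the names of the removed files.
function pruneScreencasts(dir: string, maxAgeDays: number, now: number = Date.now()): string[] {
  const maxAgeMs = maxAgeDays * 24 * 60 * 60 * 1000;
  const removed: string[] = [];
  for (const name of fs.readdirSync(dir)) {
    if (!name.endsWith(".webm")) continue; // leave other artifacts alone
    const filePath = path.join(dir, name);
    const ageMs = now - fs.statSync(filePath).mtimeMs;
    if (ageMs > maxAgeMs) {
      fs.rmSync(filePath);
      removed.push(name);
    }
  }
  return removed.sort();
}
```

&lt;p&gt;Calling something like &lt;code&gt;pruneScreencasts('./screencasts', 7)&lt;/code&gt; in a scheduled CI job would keep roughly a week of recordings. If your CI provider has built-in artifact expiry, that is usually the simpler option.&lt;/p&gt;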

&lt;h2&gt;
  
  
  What This Changes in Your CI Pipeline
&lt;/h2&gt;

&lt;p&gt;The immediate impact is on how you handle failures from AI-driven test agents. Currently, when an agent-authored test fails, you have two options: trust the agent's explanation (risky) or manually reproduce the failure (slow). With screencasts, you get a third option: watch the video, verify the agent's logic, and make an informed decision in under 60 seconds.&lt;/p&gt;

&lt;p&gt;In practice, this means fewer "cannot reproduce" situations in your backlog. The debugging loop tightens from hours to minutes. For teams running autonomous agents in CI—yes, that's a real thing—this is a meaningful improvement in the feedback cycle.&lt;/p&gt;

&lt;p&gt;Storage considerations aside, the integration is straightforward. Add &lt;code&gt;page.screencast.start()&lt;/code&gt; to your fixture setup, route failures to your screencast storage, and update your test reporters to embed video links. Your team will adapt faster than you expect.&lt;/p&gt;

&lt;h2&gt;
  
  
  Migration Notes
&lt;/h2&gt;

&lt;p&gt;No migration is required for existing tests. The Screencast API is additive: if you're not calling &lt;code&gt;page.screencast.start()&lt;/code&gt;, your current suite is unaffected. The breaking change in this release is the removal of WebKit support on macOS 14, which only affects you if your machines or CI runners still run WebKit on macOS 14 (Sonoma). Upgrade those runners or update your browser matrix if that applies.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;@playwright/experimental-ct-svelte&lt;/code&gt; package removal is a non-issue unless you were explicitly depending on an experimental package—which you shouldn't be doing in production.&lt;/p&gt;

&lt;h2&gt;
  
  
  Verdict
&lt;/h2&gt;

&lt;p&gt;Playwright v1.59.0's Screencast API is the feature that makes agentic testing verifiable instead of mysterious. The implementation is clean, the API is intuitive, and the use case is real. It's not a replacement for visual regression tooling, and the storage costs are real, but the observability gains are genuine.&lt;/p&gt;

&lt;p&gt;If you're evaluating AI coding agents for test automation, this is the feature that makes the evaluation tractable. You can now watch what the agent did instead of trusting what the agent claims it did. That's not a small thing.&lt;/p&gt;

&lt;p&gt;I've shipped test tooling at scale, and the difference between "we have logs" and "we have video evidence" is the difference between debugging in the dark and debugging with a flashlight. The Screencast API gives you the flashlight. Worth exploring in your next sprint.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Anton Gulin is the AI QA Architect — the first person to claim this title on LinkedIn. He builds AI-powered test automation systems where AI agents and human engineers collaborate on quality. Former Apple SDET (Apple.com / Apple Card pre-release testing). Find him at &lt;a href="https://anton.qa" rel="noopener noreferrer"&gt;anton.qa&lt;/a&gt; or on &lt;a href="https://linkedin.com/in/antongulin" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>automation</category>
      <category>testing</category>
    </item>
    <item>
      <title>Porting Anthropic's Skill Creator from Python to TypeScript</title>
      <dc:creator>Anton Gulin</dc:creator>
      <pubDate>Fri, 17 Apr 2026 16:01:00 +0000</pubDate>
      <link>https://dev.to/aiwithanton/porting-anthropics-skill-creator-from-python-to-typescript-57l8</link>
      <guid>https://dev.to/aiwithanton/porting-anthropics-skill-creator-from-python-to-typescript-57l8</guid>
      <description>&lt;p&gt;Anthropic's skill-creator for Claude Code is excellent. It introduced eval-driven development for AI agent skills — write a skill, test it with evals, optimize the description, benchmark the results. The methodology is proven.&lt;/p&gt;

&lt;p&gt;But it has a limitation: it only works with Claude Code, and skill access requires a paid subscription ($20/month minimum). Free tier users can't use it at all.&lt;/p&gt;

&lt;p&gt;OpenCode is free and supports 300+ models. I wanted to bring the same methodology to OpenCode users — for free, with no paywall.&lt;/p&gt;

&lt;h2&gt;
  
  
  High-Level Architecture
&lt;/h2&gt;

&lt;p&gt;The original has this structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Anthropic skill-creator/
├── SKILL.md                    # The skill instructions
├── scripts/
│   ├── run_loop.py             # Eval→improve optimization loop
│   ├── improve_description.py  # LLM-powered description improvement
│   ├── aggregate_benchmark.py   # Benchmark aggregation
│   └── generate_review.py       # HTML report generation
└── evals/
    └── evals.json              # Test query definitions
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;My version:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;opencode-skill-creator/
├── skill-creator/              # The SKILL
│   ├── SKILL.md                # Main skill instructions
│   ├── agents/
│   │   ├── grader.md           # Assertion evaluation
│   │   ├── analyzer.md         # Benchmark analysis
│   │   └── comparator.md       # Blind A/B comparison
│   ├── references/
│   │   └── schemas.md          # JSON schema definitions
│   └── templates/
│       └── eval-review.html    # Eval set review/edit UI
└── plugin/                     # The PLUGIN (npm package)
    ├── package.json            # npm package metadata
    ├── skill-creator.ts         # Entry point
    └── lib/
        ├── utils.ts            # SKILL.md frontmatter parsing
        ├── validate.ts          # Skill structure validation
        ├── run-eval.ts          # Trigger evaluation
        ├── improve-description.ts  # Description optimization
        ├── run-loop.ts          # Eval→improve loop
        ├── aggregate.ts         # Benchmark aggregation
        ├── report.ts            # HTML report generation
        └── review-server.ts     # HTTP eval review server
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key difference: the skill provides workflow knowledge, the plugin provides executable tools. The agent orchestrates everything by calling tools during its session.&lt;/p&gt;

&lt;h2&gt;
  
  
  Decision 1: Scripts → Plugin Tool Calls
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Original:&lt;/strong&gt; Python scripts invoked via CLI&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Run the optimization loop
&lt;/span&gt;&lt;span class="n"&gt;python&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="n"&gt;scripts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;run_loop&lt;/span&gt; &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;skill&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;to&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;skill&lt;/span&gt; &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="nb"&gt;eval&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nb"&gt;set&lt;/span&gt; &lt;span class="n"&gt;evals&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;New:&lt;/strong&gt; Plugin tool calls in OpenCode sessions&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;skill_optimize_loop with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;evalSetPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/path/to/evals.json&lt;/span&gt;
  &lt;span class="na"&gt;skillPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/path/to/skill&lt;/span&gt;
  &lt;span class="na"&gt;maxIterations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Why: OpenCode's plugin architecture lets agents call custom tools directly. No subprocess management, no script execution, no Python environment. The agent calls the tool inline and gets results back in the session.&lt;/p&gt;

&lt;p&gt;This integration is not only cleaner but also more composable. The agent can interleave tool calls with other work (reading files, asking the user questions, making decisions) between optimization iterations.&lt;/p&gt;

&lt;h2&gt;
  
  
  Decision 2: Python → TypeScript
&lt;/h2&gt;

&lt;p&gt;The original requires Python 3.11+ and pyyaml. My version requires nothing beyond Node.js (which OpenCode users already have).&lt;/p&gt;

&lt;p&gt;All pipeline components (validation, eval, description improvement, loop runner, aggregation, report generation, review server) are TypeScript modules in the plugin. The package is about 256 kB unpacked on npm.&lt;/p&gt;

&lt;p&gt;The dependency tree is minimal: the plugin depends only on &lt;code&gt;@opencode-ai/plugin&lt;/code&gt;, declared as a peer dependency.&lt;/p&gt;

&lt;h2&gt;
  
  
  Decision 3: Static HTML → HTTP Review Server
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Original:&lt;/strong&gt; Python script generates a static HTML file and opens it in the browser.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;generate_review&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;py&lt;/span&gt; &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;workspace&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;to&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;workspace&lt;/span&gt;
&lt;span class="c1"&gt;# Opens /path/to/workspace/review.html in browser
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;New:&lt;/strong&gt; Plugin starts a local HTTP server that serves an interactive eval viewer.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nx"&gt;skill_serve_review&lt;/span&gt; &lt;span class="kd"&gt;with&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
  &lt;span class="nx"&gt;workspace&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="nx"&gt;path&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="nx"&gt;to&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="nx"&gt;workspace&lt;/span&gt;
  &lt;span class="nx"&gt;skillName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;my-skill&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The HTTP server approach has advantages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Real-time updates when new eval results come in&lt;/li&gt;
&lt;li&gt;Interactive review with save buttons that write feedback back to files&lt;/li&gt;
&lt;li&gt;Previous/next navigation between eval cases&lt;/li&gt;
&lt;li&gt;Benchmark tab with quantitative metrics&lt;/li&gt;
&lt;li&gt;No file management — just open localhost:PORT&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The server can also generate static HTML for headless environments:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nx"&gt;skill_export_static_review&lt;/span&gt; &lt;span class="kd"&gt;with&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
  &lt;span class="nx"&gt;workspace&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="nx"&gt;path&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="nx"&gt;to&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="nx"&gt;workspace&lt;/span&gt;
  &lt;span class="nx"&gt;outputPath&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="nx"&gt;path&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="nx"&gt;to&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="nx"&gt;report&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;html&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Decision 4: Subagents → Task Tool
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Original:&lt;/strong&gt; Claude Code's built-in subagent concept, where the skill directly spawns sub-agents.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;New:&lt;/strong&gt; OpenCode's Task tool with &lt;code&gt;general&lt;/code&gt; and &lt;code&gt;explore&lt;/code&gt; subagent types. The SKILL.md instructs the agent to spawn tasks for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Running eval cases (with-skill and baseline)&lt;/li&gt;
&lt;li&gt;Grading assertions against outputs&lt;/li&gt;
&lt;li&gt;Analyzing benchmark results&lt;/li&gt;
&lt;li&gt;Blind A/B comparison between skill versions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The agent orchestrates these tasks and synthesizes their results.&lt;/p&gt;

&lt;h2&gt;
  
  
  Decision 5: Staging Outside the Repo
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Original:&lt;/strong&gt; Evals and benchmarks run alongside the skill in the same directory.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;New:&lt;/strong&gt; Draft skills and eval artifacts go to the system temp directory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/tmp/opencode-skills/&amp;lt;skill-name&amp;gt;/           # Staged skill
/tmp/opencode-skills/&amp;lt;skill-name&amp;gt;-workspace/  # Eval artifacts
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Only the final validated skill gets installed to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Project: &lt;code&gt;.opencode/skills/&amp;lt;skill-name&amp;gt;/&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Global: &lt;code&gt;~/.config/opencode/skills/&amp;lt;skill-name&amp;gt;/&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This keeps the user's repository clean during skill development. Evals create a lot of artifacts (outputs, timing data, grading results, benchmark files) that you don't want mixed into your project.&lt;/p&gt;

&lt;h2&gt;
  
  
  Decision 6: Strict Review Workflow
&lt;/h2&gt;

&lt;p&gt;Added a "review workflow guard" that enforces paired comparison data by default:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;skill_serve_review&lt;/code&gt; and &lt;code&gt;skill_export_static_review&lt;/code&gt; require each eval directory to include both &lt;code&gt;with_skill&lt;/code&gt; AND baseline (&lt;code&gt;without_skill&lt;/code&gt; or &lt;code&gt;old_skill&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;If pairs are missing, the tools fail fast with a clear list of what's missing&lt;/li&gt;
&lt;li&gt;Override with &lt;code&gt;allowPartial: true&lt;/code&gt; only when intentionally reviewing incomplete data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This prevents a common mistake: reviewing eval results without a baseline comparison, which makes it impossible to judge whether the skill actually improved anything.&lt;/p&gt;
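&lt;p&gt;The guard's rule can be sketched roughly as follows. This is an illustration and not the plugin's actual code: the function names are hypothetical, and it assumes each eval case is a subdirectory of the workspace containing &lt;code&gt;with_skill&lt;/code&gt;, &lt;code&gt;without_skill&lt;/code&gt;, or &lt;code&gt;old_skill&lt;/code&gt; entries.&lt;/p&gt;

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Hypothetical check: list every eval directory that lacks a with_skill run
// or a baseline run (without_skill or old_skill).
function findMissingPairs(workspace: string): string[] {
  const problems: string[] = [];
  for (const entry of fs.readdirSync(workspace, { withFileTypes: true })) {
    if (!entry.isDirectory()) continue;
    const has = (name: string): boolean =>
      fs.existsSync(path.join(workspace, entry.name, name));
    if (!has("with_skill")) problems.push(`${entry.name}: missing with_skill`);
    if (!(has("without_skill") || has("old_skill"))) {
      problems.push(`${entry.name}: missing baseline`);
    }
  }
  return problems.sort();
}

// Fail fast with the full list unless the caller explicitly allows partial data.
function assertReviewable(workspace: string, allowPartial: boolean = false): void {
  const problems = findMissingPairs(workspace);
  if (problems.length === 0 || allowPartial) return;
  throw new Error(`Incomplete eval pairs:\n${problems.join("\n")}`);
}
```

&lt;p&gt;Collecting every problem before throwing, instead of stopping at the first one, is what lets the error message list everything that is missing in one pass.&lt;/p&gt;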

&lt;h2&gt;
  
  
  What I Learned
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Skills are software
&lt;/h3&gt;

&lt;p&gt;They need testing, not just writing. The eval-driven approach catches issues you'd never find manually — like a description that triggers on 80% of relevant queries but also fires on 30% of irrelevant ones.&lt;/p&gt;
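&lt;p&gt;To make those two numbers concrete, here is a small sketch of how a trigger rate and a false-trigger rate could be computed from eval results. The &lt;code&gt;TriggerResult&lt;/code&gt; shape and the function name are assumptions made for illustration, not the plugin's real schema.&lt;/p&gt;

```typescript
// Hypothetical result shape: whether the query should trigger the skill,
// and whether the skill actually triggered.
interface TriggerResult {
  shouldTrigger: boolean;
  triggered: boolean;
}

// Fraction of results in a group where the skill fired (0 for an empty group).
function firedFraction(results: TriggerResult[]): number {
  if (results.length === 0) return 0;
  return results.filter((r) => r.triggered).length / results.length;
}

// hitRate: share of relevant queries that triggered the skill.
// falseTriggerRate: share of irrelevant queries that triggered it anyway.
function triggerRates(results: TriggerResult[]) {
  return {
    hitRate: firedFraction(results.filter((r) => r.shouldTrigger)),
    falseTriggerRate: firedFraction(results.filter((r) => !r.shouldTrigger)),
  };
}
```

&lt;p&gt;Tracking both rates matters because a broader description tends to raise the first while quietly raising the second.&lt;/p&gt;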

&lt;h3&gt;
  
  
  2. Description optimization matters more than skill content
&lt;/h3&gt;

&lt;p&gt;The description field is the primary triggering mechanism. A well-optimized description on an average skill outperforms a poor description on a perfect skill. This is counterintuitive but matches the data.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Train/test splits prevent overfitting
&lt;/h3&gt;

&lt;p&gt;This is the same lesson as hyperparameter tuning in ML: if you only evaluate on the queries you optimize against, the description overfits to them. Splitting the eval queries 60/40 into train and test sets keeps you honest about generalization.&lt;/p&gt;
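&lt;p&gt;A reproducible split of this kind can be sketched in a few lines. This is an illustrative version, not the plugin's implementation: &lt;code&gt;splitQueries&lt;/code&gt; and the seed value are hypothetical, and the seeded generator is the well-known mulberry32 routine, used here so that the same seed always yields the same split.&lt;/p&gt;

```typescript
// Small seeded PRNG (mulberry32) so the shuffle is reproducible across runs.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Hypothetical helper: shuffle the queries (Fisher-Yates) and cut them into
// a training part used for optimization and a held-out test part.
function splitQueries(queries: string[], trainFraction: number, seed: number) {
  const shuffled = [...queries];
  const rand = mulberry32(seed);
  for (let i = shuffled.length - 1; i > 0; i--) {
    const j = Math.floor(rand() * (i + 1));
    [shuffled[i], shuffled[j]] = [shuffled[j], shuffled[i]];
  }
  const cut = Math.round(shuffled.length * trainFraction);
  return { train: shuffled.slice(0, cut), test: shuffled.slice(cut) };
}
```

&lt;p&gt;With ten queries and a &lt;code&gt;trainFraction&lt;/code&gt; of 0.6, six go to the training set and four are held out. Only the training queries should feed the description optimizer; the held-out ones are scored but never shown to it.&lt;/p&gt;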

&lt;h3&gt;
  
  
  4. Human-in-the-loop review is essential
&lt;/h3&gt;

&lt;p&gt;Automation measures triggering accuracy, but humans judge output quality. The visual eval viewer puts outputs side by side so you can see whether the skill produces genuinely useful results, not just correctly-triggered results.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Plugin architecture enables composition
&lt;/h3&gt;

&lt;p&gt;Having eval, benchmarking, and review as separate tool calls (instead of a monolithic script) means the agent can interleave them with other work. It can ask the user a question between iterations, read relevant files during eval, or skip steps the user doesn't need.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx opencode-skill-creator &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--global&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Apache 2.0, free, open source. Works with any of OpenCode's supported models.&lt;/p&gt;

&lt;p&gt;GitHub: &lt;a href="https://github.com/antongulin/opencode-skill-creator" rel="noopener noreferrer"&gt;https://github.com/antongulin/opencode-skill-creator&lt;/a&gt;&lt;br&gt;
npm: &lt;a href="https://www.npmjs.com/package/opencode-skill-creator" rel="noopener noreferrer"&gt;https://www.npmjs.com/package/opencode-skill-creator&lt;/a&gt;&lt;/p&gt;





</description>
      <category>opencode</category>
      <category>typescript</category>
      <category>ai</category>
      <category>opensource</category>
    </item>
    <item>
      <title>I Ate My Own Dog Food: How I Benchmarked AI Skills and Proved Eval-Driven Development Works</title>
      <dc:creator>Anton Gulin</dc:creator>
      <pubDate>Wed, 15 Apr 2026 18:20:22 +0000</pubDate>
      <link>https://dev.to/aiwithanton/i-ate-my-own-dog-food-how-i-benchmarked-ai-skills-and-proved-eval-driven-development-works-c0l</link>
      <guid>https://dev.to/aiwithanton/i-ate-my-own-dog-food-how-i-benchmarked-ai-skills-and-proved-eval-driven-development-works-c0l</guid>
      <description>&lt;p&gt;I built a tool to test AI skills. Then I used it on my own project. The benchmarks shocked even me.&lt;/p&gt;

&lt;p&gt;As a QA architect, I've spent my career building systems that verify software works correctly. At Apple, we tested everything — every interaction, every edge case, every regression. At CooperVision, I built a Playwright/TypeScript framework from scratch that grew test coverage by 300%.&lt;/p&gt;

&lt;p&gt;So when I started working with AI agent skills, I noticed something: &lt;strong&gt;nobody was testing them.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You write a SKILL.md file. You try it manually once. Maybe it works for your prompt. You ship it.&lt;/p&gt;

&lt;p&gt;There's no automated test suite. No regression testing. No CI pipeline that catches when a description change breaks triggering.&lt;/p&gt;

&lt;p&gt;That's a QA problem. I built &lt;a href="https://github.com/antongulin/opencode-skill-creator" rel="noopener noreferrer"&gt;opencode-skill-creator&lt;/a&gt; to solve it.&lt;/p&gt;

&lt;p&gt;Then I dogfooded it on a real project. Here's what happened.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Project: AdLoop Skills for Google Ads
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/kLOsk/adloop" rel="noopener noreferrer"&gt;AdLoop&lt;/a&gt; is a Google Ads MCP (Model Context Protocol) integration — it connects AI agents to Google Ads and GA4 data through a set of tools.&lt;/p&gt;

&lt;p&gt;I created 4 skills for AdLoop using opencode-skill-creator:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;adloop-planning&lt;/strong&gt; — Keyword research, competition analysis, and budget forecasting&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;adloop-read&lt;/strong&gt; — Performance analysis, campaign reporting, and conversion diagnostics&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;adloop-write&lt;/strong&gt; — Campaign creation, ad management, keyword bidding, and budget changes (spends real money)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;adloop-tracking&lt;/strong&gt; — GA4 event validation, conversion tracking diagnosis, and code generation&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  The Benchmark Results
&lt;/h2&gt;

&lt;p&gt;opencode-skill-creator's benchmark runs each skill through its eval queries in two configurations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;With skill loaded&lt;/strong&gt; — the AI has full domain knowledge, safety rules, and orchestration patterns&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Without skill&lt;/strong&gt; — the AI only has bare MCP tool names and descriptions&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Skill&lt;/th&gt;
&lt;th&gt;Evals&lt;/th&gt;
&lt;th&gt;With Skill&lt;/th&gt;
&lt;th&gt;Without Skill&lt;/th&gt;
&lt;th&gt;Improvement&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;adloop-write&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;100%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;17%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+83pp&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;adloop-planning&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;100%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;21%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+79pp&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;adloop-read&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;100%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;27%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+73pp&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;adloop-tracking&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;100%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;33%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+67pp&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
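
&lt;p&gt;The Improvement column is a plain difference between the two scores, expressed in percentage points (pp). A minimal sketch of that arithmetic follows; the &lt;code&gt;BenchmarkRow&lt;/code&gt; type and &lt;code&gt;improvementPp&lt;/code&gt; function are illustrative names assumed for this example, not part of opencode-skill-creator's actual output format.&lt;/p&gt;

```typescript
// Hypothetical shape for one benchmark row. Field names are illustrative
// and are NOT opencode-skill-creator's real output schema.
type BenchmarkRow = {
  skill: string;
  withSkillPct: number;    // score with the skill loaded, in percent
  withoutSkillPct: number; // score with bare MCP tools only, in percent
};

// Scores copied from the table above.
const rows: BenchmarkRow[] = [
  { skill: "adloop-write", withSkillPct: 100, withoutSkillPct: 17 },
  { skill: "adloop-planning", withSkillPct: 100, withoutSkillPct: 21 },
  { skill: "adloop-read", withSkillPct: 100, withoutSkillPct: 27 },
  { skill: "adloop-tracking", withSkillPct: 100, withoutSkillPct: 33 },
];

// Improvement in percentage points: subtract one rate from the other.
// This is different from a relative (percent) change.
function improvementPp(row: BenchmarkRow): number {
  return row.withSkillPct - row.withoutSkillPct;
}

for (const row of rows) {
  console.log(`${row.skill}: +${improvementPp(row)}pp`);
}
```

&lt;p&gt;Reporting the gap in percentage points keeps the four skills directly comparable, because every row is measured against the same 0 to 100 scale.&lt;/p&gt;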

&lt;p&gt;But the raw numbers only tell part of the story. The &lt;em&gt;failures&lt;/em&gt; without skills aren't just wrong answers — they're dangerous actions.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Scariest Failure: Real Money at Stake
&lt;/h2&gt;

&lt;p&gt;adloop-write manages campaigns, ads, keywords, and budgets, all operations that &lt;strong&gt;spend real money&lt;/strong&gt;. Without the skill, the agent:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Added BROAD match keywords to MANUAL_CPC campaigns&lt;/strong&gt; — the #1 cause of wasted ad spend&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Set budget above safety caps&lt;/strong&gt; ($100 when max is $50) — no guardrail&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deleted campaigns irreversibly without warning&lt;/strong&gt; — no confirmation, no pause alternative&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Batched multiple changes in one call&lt;/strong&gt; — bypassing review steps&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This isn't about "better answers." This is about &lt;strong&gt;preventing real financial harm&lt;/strong&gt;.&lt;/p&gt;
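
&lt;p&gt;To make the four failures concrete, here is a rough sketch of the checks they imply, written as a pre-flight review function. Everything in it is an assumption made for illustration: the &lt;code&gt;ProposedChange&lt;/code&gt; type, the &lt;code&gt;reviewChanges&lt;/code&gt; function, and the field names are hypothetical and do not come from AdLoop or opencode-skill-creator. Only the $50 cap and the four rules are taken from the list above.&lt;/p&gt;

```typescript
// Hypothetical types and names, for illustration only. This is not
// AdLoop's API and not how the skill itself is implemented.
type ProposedChange = {
  kind: "set_budget" | "add_keyword" | "delete_campaign";
  dailyBudgetUsd?: number;
  matchType?: "EXACT" | "PHRASE" | "BROAD";
  biddingStrategy?: "MANUAL_CPC" | "SMART_BIDDING";
};

// Safety cap taken from the example above ($100 requested, $50 allowed).
const MAX_DAILY_BUDGET_USD = 50;

// Returns one message per rule violation; an empty array means no
// guardrail was tripped.
function reviewChanges(changes: ProposedChange[]): string[] {
  const problems: string[] = [];

  // Rule: one change at a time, so each change gets its own review step.
  if (changes.length > 1) {
    problems.push("Apply one change per call so each can be reviewed.");
  }

  for (const change of changes) {
    // Rule: never set a budget above the cap.
    if (change.kind === "set_budget") {
      const budget = change.dailyBudgetUsd ?? 0;
      if (budget > MAX_DAILY_BUDGET_USD) {
        problems.push(`Budget $${budget} exceeds the $${MAX_DAILY_BUDGET_USD} cap.`);
      }
    }

    // Rule: no BROAD match keywords on MANUAL_CPC campaigns.
    if (change.kind === "add_keyword") {
      if (change.matchType === "BROAD") {
        if (change.biddingStrategy === "MANUAL_CPC") {
          problems.push("BROAD match on a MANUAL_CPC campaign risks wasted spend.");
        }
      }
    }

    // Rule: deletion always needs a warning and a pause alternative.
    if (change.kind === "delete_campaign") {
      problems.push("Deletion is irreversible; confirm first or pause instead.");
    }
  }

  return problems;
}

console.log(reviewChanges([{ kind: "set_budget", dailyBudgetUsd: 100 }]));
```

&lt;p&gt;In the benchmark these rules live in the skill's instructions rather than in code, so treat the function as a way to read the rules precisely, not as a description of the tool.&lt;/p&gt;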

&lt;h2&gt;
  
  
  GDPR ≠ Broken Tracking
&lt;/h2&gt;

&lt;p&gt;A common scenario: 500 clicks in Google Ads, 180 sessions in GA4. "Is my tracking broken?"&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Without the skill&lt;/strong&gt;, the AI diagnosed this as a tracking issue and offered to investigate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;With the skill&lt;/strong&gt;, the AI recognized: "A 2.8:1 ratio is normal with GDPR consent banners. Google Ads counts all clicks. GA4 only counts consenting users. Your tracking is fine."&lt;/p&gt;

&lt;p&gt;The #1 false positive in digital marketing analytics, prevented by domain knowledge in the skill.&lt;/p&gt;
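
&lt;p&gt;The 2.8:1 figure follows directly from the two numbers in the scenario. A short worked example, with variable names chosen only for this illustration:&lt;/p&gt;

```typescript
// Numbers from the scenario above; variable names are illustrative.
const adsClicks = 500;   // clicks reported by Google Ads
const ga4Sessions = 180; // sessions reported by GA4

// Clicks per recorded session: 500 / 180 = 2.777...
const clicksPerSession = adsClicks / ga4Sessions;

// Share of clicks that show up as GA4 sessions, in percent.
const sessionSharePct = (ga4Sessions / adsClicks) * 100;

console.log(`${clicksPerSession.toFixed(1)}:1`); // prints "2.8:1"
console.log(`${sessionSharePct.toFixed(0)}% of clicks appear as GA4 sessions`);
```

&lt;p&gt;Whether a given ratio counts as normal depends on the site's consent rate, which is the domain knowledge the skill supplies; the arithmetic alone cannot tell you that.&lt;/p&gt;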

&lt;h2&gt;
  
  
  Don't Trust Google Blindly
&lt;/h2&gt;

&lt;p&gt;Without the skill, the AI endorsed Google's recommendations at face value: "Raise budget" with zero conversions. "Add BROAD match" without Smart Bidding.&lt;/p&gt;

&lt;p&gt;The skill explicitly states: &lt;strong&gt;"Google recommendations optimize for Google's revenue, not yours."&lt;/strong&gt; It cross-references each recommendation against conversion data first. The 73-percentage-point improvement (adloop-read in the table above) comes from teaching critical thinking, not compliance.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Matters
&lt;/h2&gt;

&lt;p&gt;The same AI model. The same tools. The same prompts. The only variable: whether the skill is loaded. The difference is 67–83 percentage points.&lt;/p&gt;

&lt;p&gt;Skills do three things bare tool access doesn't:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Inject domain expertise&lt;/strong&gt; — GDPR mechanics, budget rules, competition levels&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enforce safety guardrails&lt;/strong&gt; — budget caps, deletion warnings, one-change-at-a-time&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Provide orchestration patterns&lt;/strong&gt; — when to call which tool, in what order, with what validation&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Try It Yourself
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx opencode-skill-creator &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--global&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Free, open source (Apache 2.0). Works with any of OpenCode's 300+ supported models. Pure TypeScript, zero Python dependency.&lt;/p&gt;

&lt;p&gt;→ &lt;a href="https://github.com/antongulin/opencode-skill-creator" rel="noopener noreferrer"&gt;github.com/antongulin/opencode-skill-creator&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Skills are software. Software should be tested.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Anton Gulin is the AI QA Architect — the first person to claim this title on LinkedIn. He builds AI-powered test automation systems where AI agents and human engineers collaborate on quality. Former Apple SDET, current Lead Software Engineer in Test. Find him at &lt;a href="https://anton.qa" rel="noopener noreferrer"&gt;anton.qa&lt;/a&gt; or on &lt;a href="https://linkedin.com/in/antongulin" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>qa</category>
      <category>opensource</category>
      <category>openclaw</category>
    </item>
  </channel>
</rss>
