<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Anton Gulin</title>
    <description>The latest articles on DEV Community by Anton Gulin (@aiwithanton).</description>
    <link>https://dev.to/aiwithanton</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3872452%2F17f47297-ddc6-457c-9920-47c0dd1acd1b.png</url>
      <title>DEV Community: Anton Gulin</title>
      <link>https://dev.to/aiwithanton</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/aiwithanton"/>
    <language>en</language>
    <item>
      <title>Playwright MCP v0.0.73: How to Configure Browser Paths via Environment Variables</title>
      <dc:creator>Anton Gulin</dc:creator>
      <pubDate>Mon, 04 May 2026 21:37:17 +0000</pubDate>
      <link>https://dev.to/aiwithanton/playwright-mcp-v0073-how-to-configure-browser-paths-via-environment-variables-3fap</link>
      <guid>https://dev.to/aiwithanton/playwright-mcp-v0073-how-to-configure-browser-paths-via-environment-variables-3fap</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;This post was originally published on &lt;a href="https://www.anton.qa/blog/posts/playwright-mcp-v0-0-73" rel="noopener noreferrer"&gt;anton.qa&lt;/a&gt;. The canonical version lives there.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Playwright MCP v0.0.73 fixes a critical gap where extension channels and executable paths could not be resolved from CI/CD environment variables.&lt;/p&gt;

&lt;p&gt;If you run Playwright MCP in Docker, Kubernetes, or ephemeral CI workers, this release removes a class of environment-specific debugging that typically consumes 15–30 minutes per incident.&lt;/p&gt;

&lt;h2&gt;
  
  
  What changed
&lt;/h2&gt;

&lt;p&gt;Two interconnected bug fixes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Extension &lt;code&gt;channel&lt;/code&gt; and &lt;code&gt;executablePath&lt;/code&gt; now resolve from CLI flags and environment variables&lt;/strong&gt; (&lt;a href="https://github.com/microsoft/playwright/pull/40572" rel="noopener noreferrer"&gt;#40572&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;--browser&lt;/code&gt; channel flags now propagate on &lt;code&gt;--extension&lt;/code&gt; paths&lt;/strong&gt; (&lt;a href="https://github.com/microsoft/playwright/pull/40567" rel="noopener noreferrer"&gt;#40567&lt;/a&gt;)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Combined, these changes mean your Playwright MCP setup can now be fully environment-driven.&lt;/p&gt;

&lt;h2&gt;
  
  
  The pattern
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;PLAYWRIGHT_BROWSERS_CHANNEL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;chromium
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;PLAYWRIGHT_EXTENSION_PATH&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/path/to/browser-extension
npx playwright &lt;span class="nb"&gt;test&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The resolution hierarchy is now:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;CLI flags (highest priority)&lt;/li&gt;
&lt;li&gt;Environment variables&lt;/li&gt;
&lt;li&gt;Config file defaults&lt;/li&gt;
&lt;li&gt;Built-in channel defaults&lt;/li&gt;
&lt;/ol&gt;
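
&lt;p&gt;As a rough illustration (not the actual Playwright MCP source), the precedence behaves like a nullish-coalescing chain:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;// Illustrative sketch only, not the real implementation.
// It shows the precedence described above, using the env var name from the shell example.
function resolveChannel(cliFlag?: string, configDefault?: string): string {
  return (
    cliFlag ??                                  // 1. CLI flags win
    process.env.PLAYWRIGHT_BROWSERS_CHANNEL ??  // 2. then environment variables
    configDefault ??                            // 3. then config file defaults
    "chromium"                                  // 4. then the built-in channel default
  );
}
&lt;/code&gt;&lt;/pre&gt;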

&lt;h2&gt;
  
  
  MCP Registry listing
&lt;/h2&gt;

&lt;p&gt;Playwright MCP is now published to the official &lt;a href="https://registry.modelcontextprotocol.io" rel="noopener noreferrer"&gt;MCP Registry&lt;/a&gt; on each release. This simplifies enterprise procurement and governance for teams evaluating AI-assisted testing infrastructure.&lt;/p&gt;

&lt;h2&gt;
  
  
  The gotcha
&lt;/h2&gt;

&lt;p&gt;Environment variables set in your shell may not propagate to the MCP process spawned by your AI tool. Test this before deploying to production.&lt;/p&gt;
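
&lt;p&gt;If you control the spawn yourself, the safer pattern is to hand the variables to the child process explicitly instead of relying on the parent shell. A minimal Node sketch, assuming the &lt;code&gt;@playwright/mcp&lt;/code&gt; package and a stdio transport:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import { spawn } from "node:child_process";

// Minimal sketch: pass the variables to the spawned MCP server explicitly.
// The package invocation and paths are assumptions for illustration.
const mcp = spawn("npx", ["@playwright/mcp@latest"], {
  env: {
    ...process.env,
    PLAYWRIGHT_BROWSERS_CHANNEL: "chromium",
    PLAYWRIGHT_EXTENSION_PATH: "/path/to/browser-extension",
  },
  stdio: ["pipe", "pipe", "inherit"],
});
&lt;/code&gt;&lt;/pre&gt;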

&lt;p&gt;For the full breakdown — including CI/CD examples and the subprocess propagation fix — read the canonical post at &lt;a href="https://www.anton.qa/blog/posts/playwright-mcp-v0-0-73" rel="noopener noreferrer"&gt;anton.qa&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Anton Gulin is an AI QA Architect. Former Apple SDET, now Lead Software Engineer in Test. &lt;a href="https://anton.qa" rel="noopener noreferrer"&gt;anton.qa&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>automation</category>
      <category>cicd</category>
      <category>mcp</category>
    </item>
    <item>
      <title>Native Drag-and-Drop Automation Arrives in Playwright MCP</title>
      <dc:creator>Anton Gulin</dc:creator>
      <pubDate>Tue, 28 Apr 2026 18:43:12 +0000</pubDate>
      <link>https://dev.to/aiwithanton/native-drag-and-drop-automation-arrives-in-playwright-mcp-3e16</link>
      <guid>https://dev.to/aiwithanton/native-drag-and-drop-automation-arrives-in-playwright-mcp-3e16</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;Playwright MCP v0.0.71 ships &lt;code&gt;browser_drop&lt;/code&gt;. It gives you native drag-and-drop from any MCP client. No more &lt;code&gt;evaluate&lt;/code&gt; scripts. No more &lt;code&gt;mouse.move&lt;/code&gt; chains. Grid reordering, file drop zones, text editor drags — all work the same way a real user does.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Release Matters
&lt;/h2&gt;

&lt;p&gt;QA teams either abandon drag-and-drop testing or hack around it. But sortable grids, file uploads, and rich text editors are everywhere. And they have been painful to test forever.&lt;/p&gt;

&lt;p&gt;I ran into this firsthand on one project. Solid Playwright coverage for clicks, typing, and navigation. But drag-and-drop? We used &lt;code&gt;evaluate&lt;/code&gt; scripts. Or we tested it by hand. Both paths broke across browsers. Both were impossible to keep working.&lt;/p&gt;

&lt;p&gt;Playwright MCP v0.0.71 fixes this with &lt;code&gt;browser_drop&lt;/code&gt;. It uses Playwright's own &lt;code&gt;Locator.drop&lt;/code&gt; — the same API your tests already use. Now any MCP client can call it.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Use browser_drop
&lt;/h2&gt;

&lt;p&gt;Here's a complete example combining &lt;code&gt;browser_drop&lt;/code&gt; with the new response body capture from &lt;code&gt;browser_network_requests&lt;/code&gt; and the simplified expression support in &lt;code&gt;browser_evaluate&lt;/code&gt;. This pipeline automates a file upload scenario, validates the server response, and confirms the UI state update:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;McpServer&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;@modelcontextprotocol/sdk/server/mcp.js&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;server&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;McpServer&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;file-upload-automation&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;1.0.0&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="c1"&gt;// Drop zone and file item selectors for a document management UI&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;dropZoneSelector&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;[data-testid="upload-zone"]&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;fileItemSelector&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;[data-testid="file-item"]&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;uploadedStatusSelector&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;[data-testid="upload-status"]&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;// Tool: Simulate file drag-and-drop onto upload zone&lt;/span&gt;
&lt;span class="nx"&gt;server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addTool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;upload_document_flow&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Upload a document via drag-and-drop and validate response&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;inputSchema&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;object&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;properties&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;fileName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;string&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Name of file to upload&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="na"&gt;fileId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;string&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Unique file identifier&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;fileName&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;fileId&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Navigate to upload interface&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;browser_navigate&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;https://internal-docs.example.com/upload&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;

    &lt;span class="c1"&gt;// Locate drag source and drop target&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;dragSource&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;`text=&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;fileName&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;dropTarget&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;dropZoneSelector&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="c1"&gt;// Execute native drag-and-drop operation&lt;/span&gt;
    &lt;span class="c1"&gt;// browser_drop wraps Locator.drop - no evaluate or mouse.move workarounds&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;dropResult&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;browser_drop&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;source&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;dragSource&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;target&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;dropTarget&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;

    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;dropResult&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;success&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;error&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`Drop operation failed: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;dropResult&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;// Inspect server response body with mime-type detection&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;networkCapture&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;browser_network_requests&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;urlPattern&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;**/api/upload**&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;responseBody&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;responseHeaders&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;

    &lt;span class="c1"&gt;// Extract upload confirmation&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;uploadResponse&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;networkCapture&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;?.[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;uploadResponse&lt;/span&gt;&lt;span class="p"&gt;?.&lt;/span&gt;&lt;span class="nx"&gt;responseBody&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;error&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;No upload response captured&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;// Validate response using plain expression (no function wrapper needed)&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;validationResult&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;browser_evaluate&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;expression&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`JSON.parse(arguments[0]).status === "success"`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;uploadResponse&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;responseBody&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;

    &lt;span class="c1"&gt;// Confirm UI reflects successful upload&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;statusText&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;browser_evaluate&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;expression&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`document.querySelector("&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;uploadedStatusSelector&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;")?.textContent`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;uploaded&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;serverResponse&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;uploadResponse&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;responseBody&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;uiStatus&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;statusText&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;validationPassed&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;validationResult&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;};&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three new tools working together: &lt;code&gt;browser_drop&lt;/code&gt; handles the drag. &lt;code&gt;browser_network_requests&lt;/code&gt; captures the server response (full body, not just status codes). &lt;code&gt;browser_evaluate&lt;/code&gt; runs plain JavaScript — no function wrapper needed.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Gotcha Nobody Is Talking About
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;browser_drop&lt;/code&gt; needs both elements to be on screen. That's correct Playwright behavior. But here's the catch: if you navigate to a page and the drag target sits below the fold, the drop fails.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix&lt;/strong&gt;: Call &lt;code&gt;browser_evaluate&lt;/code&gt; to scroll the target into view before calling &lt;code&gt;browser_drop&lt;/code&gt;, or use the scroll option if your Playwright version supports it. This catches teams off guard in CI where viewport sizes are smaller than local development.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Before browser_drop: ensure target is in viewport&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;browser_evaluate&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;expression&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`document.querySelector("&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;dropTarget&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;").scrollIntoView()`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is not a bug. It's how Playwright works. But it catches teams when they test on a big screen and deploy to CI. CI viewports are smaller. The element you tested locally is off screen in the pipeline.&lt;/p&gt;
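
&lt;p&gt;One way to take the drift out of the equation is to pin the viewport in your Playwright config so local runs and CI render the same layout (the sizes below are illustrative):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;// playwright.config.ts: pin the viewport so local and CI agree on what sits below the fold
import { defineConfig } from "@playwright/test";

export default defineConfig({
  use: {
    viewport: { width: 1280, height: 720 },
  },
});
&lt;/code&gt;&lt;/pre&gt;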

&lt;h2&gt;
  
  
  What This Changes in Your CI Pipeline
&lt;/h2&gt;

&lt;p&gt;With &lt;code&gt;browser_drop&lt;/code&gt;, you can test drag-and-drop flows through MCP. Not by hand. Not with broken scripts.&lt;/p&gt;

&lt;p&gt;On one project, migrating from Selenium to Playwright gave us 40% faster test runs. But drag-and-drop still broke in headless mode. We wrote &lt;code&gt;evaluate&lt;/code&gt; scripts. They stopped working every sprint. &lt;code&gt;browser_drop&lt;/code&gt; puts native drag-and-drop into MCP. No scripts. No workarounds.&lt;/p&gt;

&lt;p&gt;What this actually means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Fewer flaky tests.&lt;/strong&gt; Native drag-and-drop is tested across browsers. &lt;code&gt;evaluate&lt;/code&gt; + &lt;code&gt;mouse.move&lt;/code&gt; sequences are not.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Simpler AI test generation.&lt;/strong&gt; AI tools call &lt;code&gt;browser_drop&lt;/code&gt; directly. No fragile mouse chains.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Faster CI.&lt;/strong&gt; Native operations run faster than JavaScript-injected drag scripts.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Verdict
&lt;/h2&gt;

&lt;p&gt;Playwright MCP v0.0.71 is worth upgrading for &lt;code&gt;browser_drop&lt;/code&gt; alone. The response body capture and plain expression support make it better. But drag-and-drop was the missing piece. Now it's there.&lt;/p&gt;

&lt;p&gt;The catch is real but small. Scroll your target into view before you drop. One line. Add it to your tool definitions and move on.&lt;/p&gt;

&lt;p&gt;If you run MCP-based test infrastructure, this kills the last reason to fall back to &lt;code&gt;evaluate&lt;/code&gt; for drag-and-drop. Upgrade. Add the scroll guard. Ship.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reference&lt;/strong&gt;: &lt;a href="https://playwright.dev/docs/api/class-locator#locator-drop" rel="noopener noreferrer"&gt;Playwright Locator.drop API documentation&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Anton Gulin is an AI QA Architect — the first person to claim this title on LinkedIn. He builds AI-powered test automation systems where AI agents and human engineers collaborate on quality. Former Apple SDET, now Lead Software Engineer in Test. Find him at &lt;a href="https://anton.qa" rel="noopener noreferrer"&gt;anton.qa&lt;/a&gt; or on &lt;a href="https://linkedin.com/in/antongulin" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>automation</category>
      <category>mcp</category>
      <category>testing</category>
    </item>
    <item>
      <title>Playwright Just Shipped the Fix For Flaky Tests I Built 3 Years Ago</title>
      <dc:creator>Anton Gulin</dc:creator>
      <pubDate>Fri, 24 Apr 2026 19:46:48 +0000</pubDate>
      <link>https://dev.to/aiwithanton/playwright-just-shipped-the-fix-for-flaky-tests-i-built-3-years-ago-56nf</link>
      <guid>https://dev.to/aiwithanton/playwright-just-shipped-the-fix-for-flaky-tests-i-built-3-years-ago-56nf</guid>
      <description>&lt;p&gt;I shipped a self-healing test framework three years ago. Nobody called it agentic then. The word "agent" was what your antivirus company ran on your laptop.&lt;/p&gt;

&lt;p&gt;I called my three internal components Planner, Generator, and Healer. Not because I'd read a paper — because those were the three jobs the pipeline needed and I was out of clever names.&lt;/p&gt;

&lt;p&gt;Last October, Playwright v1.56 shipped native Test Agents. Three of them.&lt;/p&gt;

&lt;p&gt;They're called &lt;strong&gt;Planner&lt;/strong&gt;, &lt;strong&gt;Generator&lt;/strong&gt;, and &lt;strong&gt;Healer&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This week's v1.59 release added the infrastructure that makes the three-role pattern actually viable in production — video receipts via &lt;code&gt;page.screencast&lt;/code&gt;, MCP interop via &lt;code&gt;browser.bind()&lt;/code&gt;, and async disposables for clean resource management. The agents shipped in October. The AI test automation architecture they need shipped this week.&lt;/p&gt;

&lt;p&gt;So this is a post about a pattern that just got validated by the team that ships the framework I bet my career on. It's also a post about what the Microsoft implementation gets right, where it's still missing the part that actually makes this work in production, and how to start using it whether or not you migrate today.&lt;/p&gt;

&lt;p&gt;If you're a QA architect, test lead, or SDET who's ever been told to "just make the flaky tests pass" — this one's for you.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem: the flake tax nobody budgets for
&lt;/h2&gt;

&lt;p&gt;Here's a number every engineering manager underestimates: the flake tax.&lt;/p&gt;

&lt;p&gt;On a team I worked with years ago — mid-stage B2B SaaS, 12 engineers, 8 services — the suite had about 1,200 end-to-end tests. Roughly 4% flaked per run. Sounds tolerable. It wasn't.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;4% flake across 1,200 tests × 20 PR runs per day ≈ 1,000 spurious failures per day&lt;/li&gt;
&lt;li&gt;Every spurious failure triggers a re-run, a triage, a Slack thread&lt;/li&gt;
&lt;li&gt;On a good week, 3 engineers burned a full day each chasing ghosts&lt;/li&gt;
&lt;li&gt;On a bad week (release freeze, CI degradation, upstream flake) it could take the whole team for 2 sprints&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's the flake tax. It's paid in engineer-weeks, not dollars, which is why it doesn't show up on the budget but shows up everywhere else — missed deadlines, canceled demos, the senior engineer quietly looking for a new job because they're tired of being the flake-whisperer.&lt;/p&gt;

&lt;p&gt;The traditional fix is discipline: write better locators, wait on the right events, don't trust the backend, quarantine flakes, review the quarantine weekly, blah blah. All true. All inadequate at scale. Discipline is linear; flake is exponential.&lt;/p&gt;

&lt;p&gt;Eventually I stopped fighting flake as a writer and started designing against it as an architect.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Drama: the 2-week death-march that broke me
&lt;/h2&gt;

&lt;p&gt;I won't name the company or the release. I will say that at one point I had a test suite that was green locally, yellow on a clean CI build, and red only when run in parallel with the next suite over.&lt;/p&gt;

&lt;p&gt;The failure was non-deterministic. The reproduction wasn't. It happened every Tuesday between 10:14 AM and 10:22 AM.&lt;/p&gt;

&lt;p&gt;We lost two weeks to it. I tried everything. I tried everything again. I tried everything in a different order. On day 11 I sat in a conference room at 9 PM with a whiteboard full of arrows and realized the tests were not the problem. The &lt;em&gt;test infrastructure&lt;/em&gt; was the problem. My framework assumed the application was the only thing being tested. It wasn't. The CI runner was being tested too. So was the database snapshot restore job. So was the deployment timing on the staging environment.&lt;/p&gt;

&lt;p&gt;We fixed that specific bug. But the death-march taught me the thing I'd refused to see: &lt;strong&gt;test maintenance is not a writing problem. It's an architecture problem.&lt;/strong&gt; The tests don't need more discipline. The framework around them needs more intelligence.&lt;/p&gt;

&lt;p&gt;That's where the three-role pattern was born.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Solution: An AI Test Automation Architecture in Three Roles
&lt;/h2&gt;

&lt;p&gt;Here's the pattern, condensed. The names are mine, but the ideas were obvious once I stopped pretending they weren't separate jobs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Planner
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Job:&lt;/strong&gt; given a feature, a user story, or a production incident, produce a structured test plan. Not test code — a plan. A list of flows, edge cases, pre-conditions, cleanup.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it's separate:&lt;/strong&gt; planning and writing are different skills. If one component does both jobs, tests drift from plans. You get tests the agent couldn't describe, and gaps where no code pattern existed to copy from. Planning first forces completeness before cleverness.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What I built three years ago:&lt;/strong&gt; a template-driven plan generator that read from PR descriptions, Jira tickets, and production alerts, and produced a Markdown spec engineers reviewed before any code was written. Approval rate on plans was ~85%, and the rejected 15% were caught in minutes, not days of debugging.&lt;/p&gt;
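
&lt;p&gt;For reference, a plan artifact in this style reduces to a small, reviewable shape. The field names below are hypothetical, not the original framework's:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;// Hypothetical plan-as-artifact shape, mirroring the Markdown specs described above.
interface TestPlan {
  source: "pr-description" | "jira-ticket" | "production-alert";
  flows: string[];         // user flows to cover
  edgeCases: string[];
  preconditions: string[];
  cleanup: string[];
  approved: boolean;       // a human signs off before any code is generated
}
&lt;/code&gt;&lt;/pre&gt;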

&lt;h3&gt;
  
  
  Generator
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Job:&lt;/strong&gt; take an approved plan and produce the test code. Choose the locators, write the assertions, set up the fixtures.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it's separate:&lt;/strong&gt; code generation benefits from narrow context (the plan), not broad context (the whole codebase). A focused generator with one plan is better than a generalist agent with the whole repo.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What I built:&lt;/strong&gt; a generator that output Playwright/TypeScript tests from plan Markdown, with locator strategies (data-testid preferred, role-based fallback, text-based last-resort), fixture scaffolding, and soft-assertion patterns baked in.&lt;/p&gt;
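
&lt;p&gt;A minimal sketch of that locator preference order, using standard Playwright locator APIs (the function and option names are hypothetical):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import type { Page, Locator } from "@playwright/test";

// Illustrative: data-testid first, role-based fallback, text as a brittle last resort.
function preferredLocator(
  page: Page,
  opts: { testId?: string; role?: Parameters&lt;Page["getByRole"]&gt;[0]; name?: string; text?: string }
): Locator {
  if (opts.testId) return page.getByTestId(opts.testId);                 // most stable
  if (opts.role) return page.getByRole(opts.role, { name: opts.name });  // semantic fallback
  return page.getByText(opts.text ?? "");                                // last resort
}
&lt;/code&gt;&lt;/pre&gt;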

&lt;h3&gt;
  
  
  Healer
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Job:&lt;/strong&gt; when a test fails, diagnose whether the failure is real (app bug), structural (locator stale after a UI refactor), or environmental (CI flake). Fix the structural ones. Flag the real ones. Quarantine the environmental ones with context.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it's separate:&lt;/strong&gt; this is the one nobody wanted to believe at the time. Healing is not about re-running failed tests until they pass. That's not healing; that's hiding. Real healing is triage plus targeted mutation plus review.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What I built:&lt;/strong&gt; a healer that diffed the current DOM against the last green run, proposed three locator candidates when the old one was stale, scored each against a stability heuristic, and opened a PR with the one-line change for a human to review. Merge rate on healer PRs was ~80%. The other 20% were caught in review, which is exactly what a working healer loop looks like.&lt;/p&gt;
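
&lt;p&gt;The stability heuristic can be as small as a lookup table plus a penalty for long, deeply nested selectors. A hypothetical sketch:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;// Hypothetical scoring in the spirit of the healer described above:
// prefer test ids, then roles; penalize text matches and long selector chains.
type Candidate = { locator: string; kind: "testid" | "role" | "text" | "css" };

function stabilityScore(c: Candidate): number {
  const base = { testid: 100, role: 75, text: 40, css: 20 }[c.kind];
  const lengthPenalty = Math.floor(c.locator.length / 20); // long chains break on refactors
  return base - lengthPenalty;
}
&lt;/code&gt;&lt;/pre&gt;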

&lt;h2&gt;
  
  
  The Numbers
&lt;/h2&gt;

&lt;p&gt;I don't love citing numbers without naming the shop, but naming them isn't an option. So here's what's defensible:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;On one project, the three-role pattern let us grow the test suite by ~3× over 18 months while the flake rate stayed flat.&lt;/li&gt;
&lt;li&gt;On another, we cut the test-maintenance time-per-engineer by more than a third in the first quarter after rollout.&lt;/li&gt;
&lt;li&gt;On a third, the Healer caught a UI-refactor regression pattern (100+ tests stale from a single CSS rename) and produced a single-PR fix overnight. The alternative would have been a 3-week cleanup sprint.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These numbers are not magic. They are the mechanical consequence of separating concerns and instrumenting the boundary between them. If you already do this with your services in production, you already know why it works.&lt;/p&gt;

&lt;h2&gt;
  
  
  Now Playwright Ships It Natively
&lt;/h2&gt;

&lt;p&gt;Playwright v1.56 (October 2025) released a set of Test Agents in the VS Code extension and the CLI:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Planner agent&lt;/strong&gt; — explores the app, writes structured test plans&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Generator agent&lt;/strong&gt; — converts plans into test code&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Healer agent&lt;/strong&gt; — fixes failing tests with AI assistance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The release notes span three versions: &lt;a href="https://github.com/microsoft/playwright/releases/tag/v1.56.0" rel="noopener noreferrer"&gt;v1.56&lt;/a&gt; shipped the agents themselves, &lt;a href="https://github.com/microsoft/playwright/releases/tag/v1.58.0" rel="noopener noreferrer"&gt;v1.58&lt;/a&gt; shipped the token-efficient CLI (&lt;code&gt;playwright-cli&lt;/code&gt;), and &lt;a href="https://github.com/microsoft/playwright/releases/tag/v1.59.0" rel="noopener noreferrer"&gt;v1.59&lt;/a&gt; shipped the agent-facing APIs — &lt;a href="https://playwright.dev/docs/api/class-browser#browser-bind" rel="noopener noreferrer"&gt;&lt;code&gt;browser.bind()&lt;/code&gt;&lt;/a&gt; for MCP interop and &lt;a href="https://playwright.dev/docs/api/class-page#page-screencast" rel="noopener noreferrer"&gt;&lt;code&gt;page.screencast&lt;/code&gt;&lt;/a&gt; for video receipts. The naming and the split are what matter — Microsoft built the same architecture I built. They built it better in several specific ways and worse in one.&lt;/p&gt;

&lt;h3&gt;
  
  
  What Microsoft got right
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Each agent is separate.&lt;/strong&gt; You can run Planner alone, pass its output to Generator, and never touch Healer. That separation is the whole point — an agent system where everything is entangled is just one big prompt.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The agents are optional.&lt;/strong&gt; You don't have to buy in all at once. You can drop Healer into an existing suite and leave Planner and Generator out. That's how adoption actually happens in enterprise shops.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;They shipped the infrastructure, not just the agents.&lt;/strong&gt; Two pieces matter here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://playwright.dev/docs/api/class-browser#browser-bind" rel="noopener noreferrer"&gt;&lt;code&gt;browser.bind()&lt;/code&gt;&lt;/a&gt; — added in v1.59. It exposes a running browser over a named pipe or websocket. Any MCP client can attach.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://chromewebstore.google.com/detail/playwright-mcp-bridge/mmlmfjhmonkocbjadbfplnigmagldckm" rel="noopener noreferrer"&gt;Playwright MCP Bridge&lt;/a&gt; — a free Chrome extension that connects your already-open tabs to a local Playwright MCP server. Your real cookies. Your real profile. Your real logged-in session. Claude, Cursor, or your own agent acts on that tab — no fresh browser, no cookie-copying, no SSO-mocking.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Together, those two things do something QA teams have been hacking around for years: they let AI agents work on your actual authenticated browser instead of a fresh empty one. Microsoft built the plumbing. You don't have to.&lt;/p&gt;

&lt;h3&gt;
  
  
  What the official implementation is still missing
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The contract.&lt;/strong&gt; Self-healing is not a feature; it's a contract between the test, the app, and the team. The Healer agent will happily propose fixes — but who reviews them? Who owns the approval policy? Who escalates when the Healer's fix rate drops? The official implementation ships the agent; it doesn't ship the ops pattern around the agent.&lt;/p&gt;

&lt;p&gt;That ops pattern is the hard part. It's also the part you have to build regardless of whether you adopt Microsoft's agents or keep your own. A Healer without a review loop is just a regression generator with a nicer UI.&lt;/p&gt;

&lt;h2&gt;
  
  
  What to Do (Whether or Not You Migrate)
&lt;/h2&gt;

&lt;p&gt;If you're running Playwright already, the path is obvious: try the Planner agent in VS Code next sprint. Feed it one real user story. Compare its output to what you'd have written. Repeat ten times. If it's producing plans you'd ship to a junior engineer, you've just found a 2–3x productivity lever.&lt;/p&gt;

&lt;p&gt;If you're on Selenium, Cypress, or something older, the migration math got better with v1.59 this month — but the pattern is portable. You don't need Microsoft's implementation to build this. You need:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Plans as artifacts.&lt;/strong&gt; Markdown. Version-controlled. Reviewable.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Generators with narrow context.&lt;/strong&gt; One plan in. One test file out. No repo-wide reasoning.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A healer with a review loop.&lt;/strong&gt; It proposes, a human approves, CI enforces. If the human always approves, your healer is working. If the human always rejects, your healer is broken. If it's 80/20, it's doing its job.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Start with the Healer if you're drowning. Start with the Planner if you're understaffed. Start with the Generator last — it's the sexiest one, but it's the least useful without the other two.&lt;/p&gt;

&lt;p&gt;And if you're the AI QA Architect on a team that doesn't have this yet: this post is your new case study. Print it. Paste it in your design doc. Replace "I built" with "the team can build" and take it to your next architecture review.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Takeaway
&lt;/h2&gt;

&lt;p&gt;Three years ago this pattern was a weird thing a weird architect built because nothing off the shelf solved the problem.&lt;/p&gt;

&lt;p&gt;Last October it shipped as a native feature in the framework every serious web team uses. This week's v1.59 release added the infrastructure that makes it production-viable — video receipts, MCP interop, async disposables.&lt;/p&gt;

&lt;p&gt;If you're still treating flake as a writing problem, you're three years behind the curve. If you're treating it as an architecture problem, you're on the curve. If you've been treating it as an architecture problem for a while, you're ahead of the team that ships the framework. That's a fine place to be.&lt;/p&gt;

&lt;p&gt;The pattern worked then. It ships natively now — agents in v1.56, infrastructure in v1.59. The contract around it is still yours to build.&lt;/p&gt;

&lt;p&gt;That's the job.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Anton Gulin is the AI QA Architect — the first person to claim this title on LinkedIn. He builds AI-powered test automation systems where AI agents and human engineers collaborate on quality. Former Apple SDET (Apple.com / Apple Card pre-release testing). Find him at &lt;a href="https://anton.qa" rel="noopener noreferrer"&gt;anton.qa&lt;/a&gt; or on &lt;a href="https://linkedin.com/in/antongulin" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Originally published at &lt;a href="https://www.anton.qa" rel="noopener noreferrer"&gt;https://www.anton.qa&lt;/a&gt; on April 23, 2026.&lt;/p&gt;

</description>
      <category>agents</category>
      <category>automation</category>
      <category>javascript</category>
      <category>testing</category>
    </item>
    <item>
      <title>Create Video Receipts for AI Agents with Playwright Screencast API</title>
      <dc:creator>Anton Gulin</dc:creator>
      <pubDate>Sat, 18 Apr 2026 01:40:47 +0000</pubDate>
      <link>https://dev.to/aiwithanton/create-video-receipts-for-ai-agents-with-playwright-screencast-api-1014</link>
      <guid>https://dev.to/aiwithanton/create-video-receipts-for-ai-agents-with-playwright-screencast-api-1014</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;Playwright v1.59.0 ships the Screencast API, letting AI agents produce verifiable video evidence of their work. Engineers can replay agent actions with chapter markers and action annotations—no manual test replay required. Setup is three lines: start the screencast, run your agent logic, stop and save. This is the observability layer agentic workflows have been missing.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Release
&lt;/h2&gt;

&lt;p&gt;Playwright v1.59.0 dropped last week and the headline feature is the Screencast API. Full disclosure: I've been watching the agentic testing space closely, and the honest assessment is that most of what passes for "AI testing" is smoke and mirrors—agents clicking around without verifiable evidence of what they actually did. The Screencast API is different. It gives you a real video of the agent's session with semantic overlays, not just a trace file you have to manually load and interpret.&lt;/p&gt;

&lt;p&gt;The API surface is straightforward: &lt;code&gt;page.screencast.start()&lt;/code&gt; initiates recording and &lt;code&gt;page.screencast.stop()&lt;/code&gt; finalizes it. Between those calls, Playwright captures JPEG frames in real-time and lets you annotate them with chapter titles and action labels. You get a video file you can attach to a ticket, drop in a Slack thread, or store as audit evidence.&lt;/p&gt;

&lt;p&gt;This release also includes &lt;code&gt;browser.bind()&lt;/code&gt; for MCP integration, a CLI debugger, and async disposables—but for this post, I'm focusing on the Screencast API because it's the feature that directly addresses the verification problem in agentic workflows.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Matters for Engineers and QA
&lt;/h2&gt;

&lt;p&gt;If you're building or evaluating AI coding agents that interact with browsers, you face a fundamental trust problem. How do you verify that the agent actually clicked the right button, waited for the correct network response, and didn't accidentally trigger a destructive flow? Logs help, but they're not persuasive in a code review. Screenshots help more, but they don't capture temporal sequences well.&lt;/p&gt;

&lt;p&gt;Video receipts solve this. You get a playback of the full session with chapter markers at key decision points. Your PM can watch a 90-second clip instead of reading 200 lines of trace output. Your security team gets evidence they can archive. Your CI system gets an artifact to attach to the test report.&lt;/p&gt;

&lt;p&gt;For QA teams specifically, this changes the audit story. When a flaky test gets investigated, you currently spend 20-30 minutes reproducing the environment, loading traces, and reconstructing what happened. With a screencast, you open a video. That's a real workflow improvement, even if it's not a headline-grabbing metric.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Use It
&lt;/h2&gt;

&lt;p&gt;Here's the implementation. The API supports chapter titles, action annotations, and visual overlays. You can configure frame capture rate and output format.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;chromium&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@playwright/test&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;recordAgentSession&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;browser&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;chromium&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;launch&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;newPage&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;


  &lt;span class="c1"&gt;// Start screencast with chapter title&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;screencast&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;dir&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;./screencasts&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;fileName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`session-&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nb"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()}&lt;/span&gt;&lt;span class="s2"&gt;.webm`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;fps&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;15&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="c1"&gt;// Add chapter marker&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;screencast&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addChapter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Login Flow&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;startTime&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Authentication&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="c1"&gt;// Your agent logic goes here&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;goto&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getByLabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Username&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;fill&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;testuser&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getByLabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Password&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;fill&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;password123&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getByRole&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;button&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Sign In&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;}).&lt;/span&gt;&lt;span class="nf"&gt;click&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

  &lt;span class="c1"&gt;// Add action annotation overlay&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;screencast&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;annotate&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;action&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;label&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Clicked: Sign In&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;position&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;400&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;y&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;300&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="c1"&gt;// Capture frame for AI vision processing&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;frame&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;screencast&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;captureFrame&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

  &lt;span class="c1"&gt;// Stop and finalize&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;recording&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;screencast&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stop&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Recording saved:&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;recording&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;filePath&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nf"&gt;recordAgentSession&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;https://app.example.com/dashboard&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;captureFrame()&lt;/code&gt; method is what makes this useful for AI vision workflows. You pass the JPEG buffer to your vision model for validation or further processing. The agent produces the evidence; you decide what to do with it.&lt;/p&gt;
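
&lt;p&gt;A minimal sketch of that handoff, continuing the example above. The endpoint and payload shape are placeholders for whatever vision model you use:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;// Forward a captured frame to a vision endpoint for validation.
// The URL and payload are illustrative, not a real service.
const frame = await page.screencast.captureFrame(); // JPEG buffer, per the example above
const verdict = await fetch("https://vision.example.com/validate", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    image_base64: frame.toString("base64"),
    question: "Is the dashboard showing a logged-in state?",
  }),
});
console.log(await verdict.json());
&lt;/code&gt;&lt;/pre&gt;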

&lt;h2&gt;
  
  
  The Gotcha Nobody Is Talking About
&lt;/h2&gt;

&lt;p&gt;Here's what the release notes don't emphasize: screencast recording in headless mode is not pixel-perfect. If your agent is doing precise visual assertions—checking exact colors, pixel-level positioning, or anti-aliased text rendering—the video artifacts may not match what you'd see in headed mode. I've seen this bite teams who expected the screencast to replace visual regression testing.&lt;/p&gt;

&lt;p&gt;The API works correctly and the implementation is solid, but it's recording a compressed video, not a pixel-accurate capture of the render pipeline. Use it for workflow verification, not for asserting that #FF5733 exactly matches your design token. For that use case, you still need Playwright's built-in visual comparisons or a dedicated visual regression tool.&lt;/p&gt;

&lt;p&gt;Also worth noting: the output file can get large quickly. A 5-minute session at 15 fps with visual overlays will easily be 50–100 MB. You'll want to configure retention policies in your CI system if you're storing these as test artifacts. Don't let this become your next storage incident.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Changes in Your CI Pipeline
&lt;/h2&gt;

&lt;p&gt;The immediate impact is on how you handle failures from AI-driven test agents. Currently, when an agent-authored test fails, you have two options: trust the agent's explanation (risky) or manually reproduce the failure (slow). With screencasts, you get a third option: watch the video, verify the agent's logic, and make an informed decision in under 60 seconds.&lt;/p&gt;

&lt;p&gt;In practice, this means fewer "cannot reproduce" situations in your backlog. The debugging loop tightens from hours to minutes. For teams running autonomous agents in CI—yes, that's a real thing—this is a meaningful improvement in the feedback cycle.&lt;/p&gt;

&lt;p&gt;Storage considerations aside, the integration is straightforward. Add &lt;code&gt;page.screencast.start()&lt;/code&gt; to your fixture setup, route failures to your screencast storage, and update your test reporters to embed video links. Your team will adapt faster than you expect.&lt;/p&gt;
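
&lt;p&gt;Here is a rough sketch of that wiring as an automatic fixture, assuming the &lt;code&gt;page.screencast&lt;/code&gt; shape described in this post:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;import { test as base } from '@playwright/test';

// Auto-fixture sketch: record every test, attach the video only on failure.
export const test = base.extend&amp;lt;{ screencast: void }&amp;gt;({
  screencast: [async ({ page }, use, testInfo) =&amp;gt; {
    const recorder = (page as any).screencast; // shape from this post, not the typed Page API
    await recorder.start();
    await use();
    const recording = await recorder.stop();
    if (testInfo.status !== testInfo.expectedStatus) {
      await testInfo.attach('agent-screencast', { path: recording.filePath });
    }
  }, { auto: true }],
});
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
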

&lt;h2&gt;
  
  
  Migration Notes
&lt;/h2&gt;

&lt;p&gt;No migration required for existing tests. The Screencast API is additive—if you're not calling &lt;code&gt;page.screencast.start()&lt;/code&gt;, your current suite is unaffected. The breaking change in this release is the removal of WebKit support on macOS 14, which only affects you if you still run WebKit tests on that older macOS release. Update your browser matrix if that applies.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;@playwright/experimental-ct-svelte&lt;/code&gt; package removal is a non-issue unless you were explicitly depending on an experimental package—which you shouldn't be doing in production.&lt;/p&gt;

&lt;h2&gt;
  
  
  Verdict
&lt;/h2&gt;

&lt;p&gt;Playwright v1.59.0's Screencast API is the feature that makes agentic testing verifiable instead of mysterious. The implementation is clean, the API is intuitive, and the use case is real. It's not a replacement for visual regression tooling, and the storage costs are real, but the observability gains are genuine.&lt;/p&gt;

&lt;p&gt;If you're evaluating AI coding agents for test automation, this is the feature that makes the evaluation tractable. You can now watch what the agent did instead of trusting what the agent claims it did. That's not a small thing.&lt;/p&gt;

&lt;p&gt;I've shipped test tooling at scale, and the difference between "we have logs" and "we have video evidence" is the difference between debugging in the dark and debugging with a flashlight. The Screencast API gives you the flashlight. Worth exploring in your next sprint.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Anton Gulin is the AI QA Architect — the first person to claim this title on LinkedIn. He builds AI-powered test automation systems where AI agents and human engineers collaborate on quality. Former Apple SDET (Apple.com / Apple Card pre-release testing). Find him at &lt;a href="https://anton.qa" rel="noopener noreferrer"&gt;anton.qa&lt;/a&gt; or on &lt;a href="https://linkedin.com/in/antongulin" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>automation</category>
      <category>testing</category>
    </item>
    <item>
      <title>Porting Anthropic's Skill Creator from Python to TypeScript</title>
      <dc:creator>Anton Gulin</dc:creator>
      <pubDate>Fri, 17 Apr 2026 16:01:00 +0000</pubDate>
      <link>https://dev.to/aiwithanton/porting-anthropics-skill-creator-from-python-to-typescript-57l8</link>
      <guid>https://dev.to/aiwithanton/porting-anthropics-skill-creator-from-python-to-typescript-57l8</guid>
      <description>&lt;p&gt;Anthropic's skill-creator for Claude Code is excellent. It introduced eval-driven development for AI agent skills — write a skill, test it with evals, optimize the description, benchmark the results. The methodology is proven.&lt;/p&gt;

&lt;p&gt;But it has a limitation: it only works with Claude Code, and skill access requires a paid subscription ($20/month minimum). Free tier users can't use it at all.&lt;/p&gt;

&lt;p&gt;OpenCode is free and supports 300+ models. I wanted to bring the same methodology to OpenCode users — for free, with no paywall.&lt;/p&gt;

&lt;h2&gt;
  
  
  High-Level Architecture
&lt;/h2&gt;

&lt;p&gt;The original has this structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Anthropic skill-creator/
├── SKILL.md                    # The skill instructions
├── scripts/
│   ├── run_loop.py             # Eval→improve optimization loop
│   ├── improve_description.py  # LLM-powered description improvement
│   ├── aggregate_benchmark.py   # Benchmark aggregation
│   └── generate_review.py       # HTML report generation
└── evals/
    └── evals.json              # Test query definitions
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;My version:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;opencode-skill-creator/
├── skill-creator/              # The SKILL
│   ├── SKILL.md                # Main skill instructions
│   ├── agents/
│   │   ├── grader.md           # Assertion evaluation
│   │   ├── analyzer.md         # Benchmark analysis
│   │   └── comparator.md       # Blind A/B comparison
│   ├── references/
│   │   └── schemas.md          # JSON schema definitions
│   └── templates/
│       └── eval-review.html    # Eval set review/edit UI
└── plugin/                     # The PLUGIN (npm package)
    ├── package.json            # npm package metadata
    ├── skill-creator.ts         # Entry point
    └── lib/
        ├── utils.ts            # SKILL.md frontmatter parsing
        ├── validate.ts          # Skill structure validation
        ├── run-eval.ts          # Trigger evaluation
        ├── improve-description.ts  # Description optimization
        ├── run-loop.ts          # Eval→improve loop
        ├── aggregate.ts         # Benchmark aggregation
        ├── report.ts            # HTML report generation
        └── review-server.ts     # HTTP eval review server
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key difference: the skill provides workflow knowledge, the plugin provides executable tools. The agent orchestrates everything by calling tools during its session.&lt;/p&gt;

&lt;h2&gt;
  
  
  Decision 1: Scripts → Plugin Tool Calls
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Original:&lt;/strong&gt; Python scripts invoked via CLI&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Run the optimization loop
&lt;/span&gt;&lt;span class="n"&gt;python&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="n"&gt;scripts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;run_loop&lt;/span&gt; &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;skill&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;to&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;skill&lt;/span&gt; &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="nb"&gt;eval&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nb"&gt;set&lt;/span&gt; &lt;span class="n"&gt;evals&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;New:&lt;/strong&gt; Plugin tool calls in OpenCode sessions&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;skill_optimize_loop with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;evalSetPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/path/to/evals.json&lt;/span&gt;
  &lt;span class="na"&gt;skillPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/path/to/skill&lt;/span&gt;
  &lt;span class="na"&gt;maxIterations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Why: OpenCode's plugin architecture lets agents call custom tools directly. No subprocess management, no script execution, no Python environment. The agent calls the tool inline and gets results back in the session.&lt;/p&gt;

&lt;p&gt;This is not only cleaner integration but also more composable. The agent can interleave tool calls with other work — read files, ask the user questions, make decisions — between optimization iterations.&lt;/p&gt;
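
&lt;p&gt;On the plugin side, that call lands in an ordinary TypeScript function. The sketch below is illustrative only; real registration goes through &lt;code&gt;@opencode-ai/plugin&lt;/code&gt; and its actual API, so treat the field names here as hypothetical:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Illustrative shape only; the real tool is registered via @opencode-ai/plugin,
// and these field names are hypothetical.
interface ToolSketch {
  name: string;
  description: string;
  execute(args: { evalSetPath: string; skillPath: string; maxIterations: number }): Promise&amp;lt;string&amp;gt;;
}

const skillOptimizeLoop: ToolSketch = {
  name: 'skill_optimize_loop',
  description: 'Run the eval-improve loop against a staged skill.',
  async execute(args) {
    // The real implementation lives in plugin/lib/run-loop.ts.
    return 'completed ' + args.maxIterations + ' iterations for ' + args.skillPath;
  },
};
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
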

&lt;h2&gt;
  
  
  Decision 2: Python → TypeScript
&lt;/h2&gt;

&lt;p&gt;The original requires Python 3.11+ and pyyaml. My version requires nothing beyond Node.js (which OpenCode users already have).&lt;/p&gt;

&lt;p&gt;All pipeline components — validation, eval, description improvement, loop runner, aggregation, report generation, review server — are TypeScript modules in the plugin. ~256kB unpacked on npm.&lt;/p&gt;

&lt;p&gt;Dependency tree is minimal: the plugin only depends on &lt;code&gt;@opencode-ai/plugin&lt;/code&gt; (peer dependency).&lt;/p&gt;

&lt;h2&gt;
  
  
  Decision 3: Static HTML → HTTP Review Server
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Original:&lt;/strong&gt; Python script generates a static HTML file and opens it in the browser.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;generate_review&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;py&lt;/span&gt; &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;workspace&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;to&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;workspace&lt;/span&gt;
&lt;span class="c1"&gt;# Opens /path/to/workspace/review.html in browser
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;New:&lt;/strong&gt; Plugin starts a local HTTP server that serves an interactive eval viewer.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nx"&gt;skill_serve_review&lt;/span&gt; &lt;span class="kd"&gt;with&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
  &lt;span class="nx"&gt;workspace&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="nx"&gt;path&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="nx"&gt;to&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="nx"&gt;workspace&lt;/span&gt;
  &lt;span class="nx"&gt;skillName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;my-skill&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The HTTP server approach has advantages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Real-time updates when new eval results come in&lt;/li&gt;
&lt;li&gt;Interactive review with save buttons that write feedback back to files&lt;/li&gt;
&lt;li&gt;Previous/next navigation between eval cases&lt;/li&gt;
&lt;li&gt;Benchmark tab with quantitative metrics&lt;/li&gt;
&lt;li&gt;No file management — just open localhost:PORT&lt;/li&gt;
&lt;/ul&gt;
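
&lt;p&gt;Under the hood, nothing more than Node's built-in &lt;code&gt;http&lt;/code&gt; module is required. A minimal sketch of the idea (file names are illustrative, not the actual &lt;code&gt;review-server.ts&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;import { createServer } from 'node:http';
import { readFile, writeFile } from 'node:fs/promises';
import { join } from 'node:path';

// Minimal sketch of the review-server idea: serve eval results to the viewer
// and persist reviewer feedback back into the workspace.
export function serveReview(workspace: string, port = 4173) {
  return createServer(async (req, res) =&amp;gt; {
    if (req.method === 'POST') {
      let body = '';
      for await (const chunk of req) body += chunk;
      await writeFile(join(workspace, 'feedback.json'), body, 'utf8');
      res.end('saved');
      return;
    }
    const results = await readFile(join(workspace, 'results.json'), 'utf8');
    res.setHeader('content-type', 'application/json');
    res.end(results);
  }).listen(port);
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
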

&lt;p&gt;The server can also generate static HTML for headless environments:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nx"&gt;skill_export_static_review&lt;/span&gt; &lt;span class="kd"&gt;with&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
  &lt;span class="nx"&gt;workspace&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="nx"&gt;path&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="nx"&gt;to&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="nx"&gt;workspace&lt;/span&gt;
  &lt;span class="nx"&gt;outputPath&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="nx"&gt;path&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="nx"&gt;to&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="nx"&gt;report&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;html&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Decision 4: Subagents → Task Tool
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Original:&lt;/strong&gt; Claude Code's built-in subagent concept, where the skill directly spawns sub-agents.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;New:&lt;/strong&gt; OpenCode's Task tool with &lt;code&gt;general&lt;/code&gt; and &lt;code&gt;explore&lt;/code&gt; subagent types. The SKILL.md instructs the agent to spawn tasks for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Running eval cases (with-skill and baseline)&lt;/li&gt;
&lt;li&gt;Grading assertions against outputs&lt;/li&gt;
&lt;li&gt;Analyzing benchmark results&lt;/li&gt;
&lt;li&gt;Blind A/B comparison between skill versions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The agent orchestrates these tasks and synthesizes their results.&lt;/p&gt;

&lt;h2&gt;
  
  
  Decision 5: Staging Outside the Repo
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Original:&lt;/strong&gt; Evals and benchmarks run alongside the skill in the same directory.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;New:&lt;/strong&gt; Draft skills and eval artifacts go to the system temp directory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/tmp/opencode-skills/&amp;lt;skill-name&amp;gt;/           # Staged skill
/tmp/opencode-skills/&amp;lt;skill-name&amp;gt;-workspace/  # Eval artifacts
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Only the final validated skill gets installed to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Project: &lt;code&gt;.opencode/skills/&amp;lt;skill-name&amp;gt;/&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Global: &lt;code&gt;~/.config/opencode/skills/&amp;lt;skill-name&amp;gt;/&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This keeps the user's repository clean during skill development. Evals create a lot of artifacts (outputs, timing data, grading results, benchmark files) that you don't want mixed into your project.&lt;/p&gt;
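
&lt;p&gt;A sketch of how those staging paths can be derived, keeping the layout above but resolving the temp directory portably:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;import { tmpdir } from 'node:os';
import { join } from 'node:path';

// Sketch: derive the staging locations shown above from the system temp dir.
export function stagingPaths(skillName: string) {
  const root = join(tmpdir(), 'opencode-skills');
  return {
    skillDir: join(root, skillName),                     // staged skill draft
    workspaceDir: join(root, skillName + '-workspace'),  // eval artifacts
  };
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
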

&lt;h2&gt;
  
  
  Decision 6: Strict Review Workflow
&lt;/h2&gt;

&lt;p&gt;Added a "review workflow guard" that enforces paired comparison data by default:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;skill_serve_review&lt;/code&gt; and &lt;code&gt;skill_export_static_review&lt;/code&gt; require each eval directory to include both &lt;code&gt;with_skill&lt;/code&gt; AND baseline (&lt;code&gt;without_skill&lt;/code&gt; or &lt;code&gt;old_skill&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;If pairs are missing, the tools fail fast with a clear list of what's missing&lt;/li&gt;
&lt;li&gt;Override with &lt;code&gt;allowPartial: true&lt;/code&gt; only when intentionally reviewing incomplete data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This prevents a common mistake: reviewing eval results without a baseline comparison, which makes it impossible to judge whether the skill actually improved anything.&lt;/p&gt;
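
&lt;p&gt;The core of the guard is a simple pairing check over the workspace. A sketch of the idea (illustrative, not the plugin's exact code):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;import { existsSync, readdirSync } from 'node:fs';
import { join } from 'node:path';

// Sketch of the guard: every eval directory needs with_skill plus a baseline
// (without_skill or old_skill) unless allowPartial is set.
export function findUnpairedEvals(workspace: string, allowPartial = false): string[] {
  if (allowPartial) return [];
  const missing: string[] = [];
  for (const evalDir of readdirSync(workspace)) {
    const hasWithSkill = existsSync(join(workspace, evalDir, 'with_skill'));
    const hasBaseline =
      existsSync(join(workspace, evalDir, 'without_skill')) ||
      existsSync(join(workspace, evalDir, 'old_skill'));
    if (!hasWithSkill || !hasBaseline) missing.push(evalDir);
  }
  return missing; // callers fail fast and report this list
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
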

&lt;h2&gt;
  
  
  What I Learned
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Skills are software
&lt;/h3&gt;

&lt;p&gt;They need testing, not just writing. The eval-driven approach catches issues you'd never find manually — like a description that triggers on 80% of relevant queries but also fires on 30% of irrelevant ones.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Description optimization matters more than skill content
&lt;/h3&gt;

&lt;p&gt;The description field is the primary triggering mechanism. A well-optimized description on an average skill outperforms a poor description on a perfect skill. This is counterintuitive but matches the data.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Train/test splits prevent overfitting
&lt;/h3&gt;

&lt;p&gt;Same lesson as ML hyperparameter tuning. If you only evaluate on the queries you optimize for, descriptions become overfit. The 60/40 split keeps you honest about generalization.&lt;/p&gt;
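
&lt;p&gt;The split itself is trivial; what matters is that the optimizer never sees the held-out slice. A minimal sketch, assuming the eval list is already shuffled:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Minimal 60/40 split sketch; assumes evals are already in random order.
export function splitEvals(evalIds: string[], trainRatio = 0.6) {
  const cut = Math.round(evalIds.length * trainRatio);
  return {
    train: evalIds.slice(0, cut),  // the optimizer sees these
    test: evalIds.slice(cut),      // held out to detect overfitting
  };
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
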

&lt;h3&gt;
  
  
  4. Human-in-the-loop review is essential
&lt;/h3&gt;

&lt;p&gt;Automation measures triggering accuracy, but humans judge output quality. The visual eval viewer puts outputs side by side so you can see whether the skill produces genuinely useful results, not just correctly-triggered results.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Plugin architecture enables composition
&lt;/h3&gt;

&lt;p&gt;Having eval, benchmarking, and review as separate tool calls (instead of a monolithic script) means the agent can interleave them with other work. It can ask the user a question between iterations, read relevant files during eval, or skip steps the user doesn't need.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx opencode-skill-creator &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--global&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Apache 2.0, free, open source. Works with any of OpenCode's supported models.&lt;/p&gt;

&lt;p&gt;GitHub: &lt;a href="https://github.com/antongulin/opencode-skill-creator" rel="noopener noreferrer"&gt;https://github.com/antongulin/opencode-skill-creator&lt;/a&gt;&lt;br&gt;
npm: &lt;a href="https://www.npmjs.com/package/opencode-skill-creator" rel="noopener noreferrer"&gt;https://www.npmjs.com/package/opencode-skill-creator&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;opencode-skill-creator is free and open source (Apache 2.0). Star it on GitHub. Install: &lt;code&gt;npx opencode-skill-creator install --global&lt;/code&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>opencode</category>
      <category>typescript</category>
      <category>ai</category>
      <category>opensource</category>
    </item>
    <item>
      <title>I Ate My Own Dog Food: How I Benchmarked AI Skills and Proved Eval-Driven Development Works</title>
      <dc:creator>Anton Gulin</dc:creator>
      <pubDate>Wed, 15 Apr 2026 18:20:22 +0000</pubDate>
      <link>https://dev.to/aiwithanton/i-ate-my-own-dog-food-how-i-benchmarked-ai-skills-and-proved-eval-driven-development-works-c0l</link>
      <guid>https://dev.to/aiwithanton/i-ate-my-own-dog-food-how-i-benchmarked-ai-skills-and-proved-eval-driven-development-works-c0l</guid>
      <description>&lt;p&gt;I built a tool to test AI skills. Then I used it on my own project. The benchmarks shocked even me.&lt;/p&gt;

&lt;p&gt;As a QA architect, I've spent my career building systems that verify software works correctly. At Apple, we tested everything — every interaction, every edge case, every regression. At CooperVision, I built a Playwright/TypeScript framework from scratch that grew test coverage by 300%.&lt;/p&gt;

&lt;p&gt;So when I started working with AI agent skills, I noticed something: &lt;strong&gt;nobody was testing them.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You write a SKILL.md file. You try it manually once. Maybe it works for your prompt. You ship it.&lt;/p&gt;

&lt;p&gt;There's no automated test suite. No regression testing. No CI pipeline that catches when a description change breaks triggering.&lt;/p&gt;

&lt;p&gt;That's a QA problem. I built &lt;a href="https://github.com/antongulin/opencode-skill-creator" rel="noopener noreferrer"&gt;opencode-skill-creator&lt;/a&gt; to solve it.&lt;/p&gt;

&lt;p&gt;Then I dogfooded it on a real project. Here's what happened.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Project: AdLoop Skills for Google Ads
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/kLOsk/adloop" rel="noopener noreferrer"&gt;AdLoop&lt;/a&gt; is a Google Ads MCP (Model Context Protocol) integration — it connects AI agents to Google Ads and GA4 data through a set of tools.&lt;/p&gt;

&lt;p&gt;I created 4 skills for AdLoop using opencode-skill-creator:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;adloop-planning&lt;/strong&gt; — Keyword research, competition analysis, and budget forecasting&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;adloop-read&lt;/strong&gt; — Performance analysis, campaign reporting, and conversion diagnostics&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;adloop-write&lt;/strong&gt; — Campaign creation, ad management, keyword bidding, and budget changes (spends real money)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;adloop-tracking&lt;/strong&gt; — GA4 event validation, conversion tracking diagnosis, and code generation&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  The Benchmark Results
&lt;/h2&gt;

&lt;p&gt;opencode-skill-creator's benchmark runs each skill through its eval queries in two configurations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;With skill loaded&lt;/strong&gt; — the AI has full domain knowledge, safety rules, and orchestration patterns&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Without skill&lt;/strong&gt; — the AI only has bare MCP tool names and descriptions&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Skill&lt;/th&gt;
&lt;th&gt;Evals&lt;/th&gt;
&lt;th&gt;With Skill&lt;/th&gt;
&lt;th&gt;Without Skill&lt;/th&gt;
&lt;th&gt;Improvement&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;adloop-write&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;100%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;17%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+83pp&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;adloop-planning&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;100%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;21%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+79pp&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;adloop-read&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;100%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;27%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+73pp&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;adloop-tracking&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;100%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;33%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+67pp&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;But the raw numbers only tell part of the story. The &lt;em&gt;failures&lt;/em&gt; without skills aren't just wrong answers — they're dangerous actions.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Scariest Failure: Real Money at Stake
&lt;/h2&gt;

&lt;p&gt;adloop-write manages campaigns, ads, keywords, and budgets — operations that &lt;strong&gt;spend real money&lt;/strong&gt;. Without the skill:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Added BROAD match keywords to MANUAL_CPC campaigns&lt;/strong&gt; — the #1 cause of wasted ad spend&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Set budget above safety caps&lt;/strong&gt; ($100 when max is $50) — no guardrail&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deleted campaigns irreversibly without warning&lt;/strong&gt; — no confirmation, no pause alternative&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Batched multiple changes in one call&lt;/strong&gt; — bypassing review steps&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This isn't about "better answers." This is about &lt;strong&gt;preventing real financial harm&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  GDPR ≠ Broken Tracking
&lt;/h2&gt;

&lt;p&gt;A common scenario: 500 clicks in Google Ads, 180 sessions in GA4. "Is my tracking broken?"&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Without the skill&lt;/strong&gt;, AI diagnosed this as a tracking issue and offered to investigate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;With the skill&lt;/strong&gt;, AI recognized: "A 2.8:1 ratio is normal with GDPR consent banners. Google Ads counts all clicks. GA4 only counts consenting users. Your tracking is fine."&lt;/p&gt;

&lt;p&gt;The #1 false positive in digital marketing analytics, prevented by domain knowledge in the skill.&lt;/p&gt;

&lt;h2&gt;
  
  
  Don't Trust Google Blindly
&lt;/h2&gt;

&lt;p&gt;Without the skill, AI endorsed Google's recommendations at face value: "Raise budget" with zero conversions. "Add BROAD match" without Smart Bidding.&lt;/p&gt;

&lt;p&gt;The skill explicitly states: &lt;strong&gt;"Google recommendations optimize for Google's revenue, not yours."&lt;/strong&gt; It cross-references against conversion data first. The 73% improvement comes from teaching critical thinking, not compliance.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Matters
&lt;/h2&gt;

&lt;p&gt;The same AI model. The same tools. The same prompts. The only variable: whether the skill is loaded. The difference is 67–83 percentage points.&lt;/p&gt;

&lt;p&gt;Skills do three things bare tool access doesn't:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Inject domain expertise&lt;/strong&gt; — GDPR mechanics, budget rules, competition levels&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enforce safety guardrails&lt;/strong&gt; — budget caps, deletion warnings, one-change-at-a-time&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Provide orchestration patterns&lt;/strong&gt; — when to call which tool, in what order, with what validation&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Try It Yourself
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx opencode-skill-creator &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--global&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Free, open source (Apache 2.0). Works with any of OpenCode's 300+ supported models. Pure TypeScript, zero Python dependency.&lt;/p&gt;

&lt;p&gt;→ &lt;a href="https://github.com/antongulin/opencode-skill-creator" rel="noopener noreferrer"&gt;github.com/antongulin/opencode-skill-creator&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Skills are software. Software should be tested.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Anton Gulin is the AI QA Architect — the first person to claim this title on LinkedIn. He builds AI-powered test automation systems where AI agents and human engineers collaborate on quality. Former Apple SDET, current Lead Software Engineer in Test. Find him at &lt;a href="https://anton.qa" rel="noopener noreferrer"&gt;anton.qa&lt;/a&gt; or on &lt;a href="https://linkedin.com/in/antongulin" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>qa</category>
      <category>opensource</category>
      <category>opencode</category>
    </item>
    <item>
      <title>Eval-Driven Development for AI Agent Skills</title>
      <dc:creator>Anton Gulin</dc:creator>
      <pubDate>Wed, 15 Apr 2026 17:58:00 +0000</pubDate>
      <link>https://dev.to/aiwithanton/eval-driven-development-for-ai-agent-skills-3jpg</link>
      <guid>https://dev.to/aiwithanton/eval-driven-development-for-ai-agent-skills-3jpg</guid>
      <description>&lt;h2&gt;
  
  
  The Problem with Writing Skills by Hand
&lt;/h2&gt;

&lt;p&gt;You've written a skill for your AI coding agent. It's got clear instructions, proper formatting, a good description. You test it in a session — it works. Ship it, right?&lt;/p&gt;

&lt;p&gt;Not so fast.&lt;/p&gt;

&lt;p&gt;Skills trigger based on their description field — a 1-2 sentence summary in the SKILL.md frontmatter. And here's the thing: descriptions that seem crystal clear to humans often trigger wrong. Too specific, and the skill never activates when it should. Too broad, and it fires on unrelated prompts.&lt;/p&gt;

&lt;p&gt;The result: skills that feel right in theory but fail unpredictably in practice. And there's no systematic way to measure whether a skill is getting better or worse across iterations.&lt;/p&gt;

&lt;p&gt;This is the same problem software engineering solved decades ago with automated testing. Skills are software. They need testing too.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Is Eval-Driven Development?
&lt;/h2&gt;

&lt;p&gt;Eval-driven development is the practice of:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Writing test cases&lt;/strong&gt; that define expected behavior&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Running those tests automatically&lt;/strong&gt; to measure actual vs. expected outcomes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Using the results to improve&lt;/strong&gt; iteratively, with quantifiable evidence&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For AI agent skills, this means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Generating test prompts (should-trigger and should-not-trigger queries)&lt;/li&gt;
&lt;li&gt;Running each prompt with and without the skill&lt;/li&gt;
&lt;li&gt;Comparing outputs to see if the skill actually improves results&lt;/li&gt;
&lt;li&gt;Optimizing the description so the skill triggers on the right prompts&lt;/li&gt;
&lt;/ul&gt;
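
&lt;p&gt;Concretely, an eval case is a small record. A sketch of the shape (field names may differ from the plugin's actual &lt;code&gt;evals.json&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Sketch of an eval case; field names may differ from the real evals.json.
interface EvalCase {
  id: number;
  prompt: string;            // a realistic user query
  shouldTrigger: boolean;    // should the skill load for this prompt?
  expectedOutput?: string;   // optional description of a passing answer
}

interface EvalSet {
  skillName: string;
  evals: EvalCase[];
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
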

&lt;h2&gt;
  
  
  The Skill Creation Lifecycle
&lt;/h2&gt;

&lt;p&gt;opencode-skill-creator implements eval-driven development as a structured lifecycle:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Create → Evaluate → Optimize → Benchmark → Install
   ↑                                      |
   └──────────── Iterate ─────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  1. Create
&lt;/h3&gt;

&lt;p&gt;Start with an intake interview. The skill-creator asks 3-5 targeted questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What should this skill enable the agent to do?&lt;/li&gt;
&lt;li&gt;When should it trigger?&lt;/li&gt;
&lt;li&gt;What output format is expected?&lt;/li&gt;
&lt;li&gt;What workflow steps must be preserved exactly?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This captures intent before writing any code.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Evaluate
&lt;/h3&gt;

&lt;p&gt;Auto-generate eval test sets — realistic prompts categorized as should-trigger or should-not-trigger. Run each test case twice:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;With skill&lt;/strong&gt;: The agent has the skill loaded&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Without skill&lt;/strong&gt;: The agent runs without it (baseline)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This measures whether the skill actually improves the output for relevant prompts.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Optimize
&lt;/h3&gt;

&lt;p&gt;The description optimization loop treats triggering accuracy as a search problem:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;For each iteration (up to 5):
  1. Evaluate current description on train set (60%)
  2. Analyze failure patterns
  3. LLM proposes improved description
  4. Evaluate on both train AND test (40%) sets
  5. Select best description by test score
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The 60/40 train/test split prevents overfitting. A description that works perfectly on train queries but fails on held-out test queries is overfit — it has memorized specific prompts rather than learned the general pattern.&lt;/p&gt;
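
&lt;p&gt;The selection step is what enforces that: candidates are ranked by their held-out score, not their train score. A sketch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Sketch of the selection step: keep the candidate with the best held-out
// (test) score, not the best train score.
interface DescriptionCandidate {
  description: string;
  trainScore: number;  // fraction of train queries that trigger correctly
  testScore: number;   // fraction of held-out queries that trigger correctly
}

export function pickBestDescription(candidates: DescriptionCandidate[]) {
  let best = candidates[0];
  for (const candidate of candidates) {
    if (candidate.testScore &amp;gt; best.testScore) best = candidate;
  }
  return best;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
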

&lt;h3&gt;
  
  
  4. Benchmark
&lt;/h3&gt;

&lt;p&gt;Run the full eval suite across multiple iterations with variance analysis. This answers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Is the skill getting consistently better?&lt;/li&gt;
&lt;li&gt;Are there eval cases where the skill never triggers correctly?&lt;/li&gt;
&lt;li&gt;How much variance is there across runs?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The benchmark includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pass rates (with-skill vs. baseline)&lt;/li&gt;
&lt;li&gt;Timing data (tokens, duration)&lt;/li&gt;
&lt;li&gt;Mean ± standard deviation for each metric&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  5. Install
&lt;/h3&gt;

&lt;p&gt;Install the final validated skill to project-level (&lt;code&gt;.opencode/skills/&lt;/code&gt;) or global (&lt;code&gt;~/.config/opencode/skills/&lt;/code&gt;). Only the final version gets installed — eval artifacts stay in the staging directory.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Works
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Skills are software
&lt;/h3&gt;

&lt;p&gt;They have inputs (prompts), outputs (agent behavior), and a triggering mechanism (the description). Just like any software, they need testing.&lt;/p&gt;

&lt;h3&gt;
  
  
  Manual testing doesn't scale
&lt;/h3&gt;

&lt;p&gt;You can test a skill manually in a session, but that's one prompt, one run, no measurement. Eval-driven development gives you 20+ test cases, multiple runs per case, and quantitative metrics.&lt;/p&gt;

&lt;h3&gt;
  
  
  Description optimization is more impactful than skill content
&lt;/h3&gt;

&lt;p&gt;The description field is the primary triggering mechanism. A perfectly-written skill with a poor description won't trigger. An average skill with an optimized description will trigger reliably. The optimization loop focuses effort where it matters most.&lt;/p&gt;

&lt;h3&gt;
  
  
  Train/test splits prevent overfitting
&lt;/h3&gt;

&lt;p&gt;If you only test on the same queries you optimize for, descriptions become overfit — they work on those specific prompts but fail on real-world usage. The 60/40 split keeps you honest.&lt;/p&gt;

&lt;h3&gt;
  
  
  Human review catches what automation misses
&lt;/h3&gt;

&lt;p&gt;The visual eval viewer puts outputs side by side so you can see with your own eyes whether the skill is producing good results. Quantitative metrics tell you if it's triggering correctly; human review tells you if the output is actually useful.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx opencode-skill-creator &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--global&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then ask OpenCode to create or improve a skill. The eval-driven workflow starts automatically.&lt;/p&gt;

&lt;p&gt;Apache 2.0, free, open source. Works with any of OpenCode's supported models.&lt;/p&gt;

&lt;p&gt;GitHub: &lt;a href="https://github.com/antongulin/opencode-skill-creator" rel="noopener noreferrer"&gt;https://github.com/antongulin/opencode-skill-creator&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;opencode-skill-creator is free and open source (Apache 2.0). Star it on GitHub. Install: &lt;code&gt;npx opencode-skill-creator install --global&lt;/code&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>opencode</category>
      <category>typescript</category>
      <category>ai</category>
      <category>opensource</category>
    </item>
    <item>
      <title>How to Create Custom OpenCode Skills (Step-by-Step Guide)</title>
      <dc:creator>Anton Gulin</dc:creator>
      <pubDate>Sun, 12 Apr 2026 18:52:31 +0000</pubDate>
      <link>https://dev.to/aiwithanton/how-to-create-custom-opencode-skills-step-by-step-guide-4ijd</link>
      <guid>https://dev.to/aiwithanton/how-to-create-custom-opencode-skills-step-by-step-guide-4ijd</guid>
      <description>&lt;h2&gt;
  
  
  Why Custom Skills Matter
&lt;/h2&gt;

&lt;p&gt;Out-of-the-box AI coding agents are powerful, but they don't know your team's conventions, your deployment process, or your documentation style. Skills let you encode that knowledge so the agent follows your workflows every time.&lt;/p&gt;

&lt;p&gt;But creating skills has been guesswork. You write a SKILL.md file, test it manually in a session, maybe tweak the description, and hope it works. There's no feedback loop, no measurement, no way to know if a change actually improved things.&lt;/p&gt;

&lt;p&gt;opencode-skill-creator changes this by providing a structured workflow for the full skill lifecycle: create, evaluate, optimize, benchmark, and install.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;OpenCode installed and configured&lt;/li&gt;
&lt;li&gt;Node.js 18+ (for the npm package)&lt;/li&gt;
&lt;li&gt;5 minutes&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Step 1: Install
&lt;/h2&gt;

&lt;p&gt;One command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx opencode-skill-creator &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--global&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This adds the plugin to your global OpenCode config. Restart OpenCode to activate it.&lt;/p&gt;

&lt;p&gt;Verify the install:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;ls&lt;/span&gt; ~/.config/opencode/skills/skill-creator/SKILL.md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then ask OpenCode: &lt;code&gt;Create a skill that helps with Docker compose files&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;You should see it use the skill-creator workflow and tools.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 2: Describe What You Want
&lt;/h2&gt;

&lt;p&gt;The skill-creator starts with an intake interview. It asks 3-5 targeted questions about what your skill should do:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What should this skill enable OpenCode to do end-to-end?&lt;/li&gt;
&lt;li&gt;When should this skill trigger?&lt;/li&gt;
&lt;li&gt;What output format and quality bar are expected?&lt;/li&gt;
&lt;li&gt;What workflow steps must be preserved vs. where can the agent improvise?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Don't skip this. The interview captures your intent before any code is written. Think of it as shadowing a teammate — you're the domain expert, the agent is the new hire learning your workflow.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 3: Review the Skill Draft
&lt;/h2&gt;

&lt;p&gt;Based on your interview, the skill-creator produces a draft SKILL.md with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Proper YAML frontmatter (name and description)&lt;/li&gt;
&lt;li&gt;Markdown instructions for the agent&lt;/li&gt;
&lt;li&gt;Optional supporting files (references, agents, templates)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The draft goes to a staging directory (outside your repo) so your project stays clean:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/tmp/opencode-skills/your-skill-name/
├── SKILL.md
├── agents/
├── references/
└── templates/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Review this draft. Make sure the description is accurate (it's the primary triggering mechanism) and the instructions reflect your actual workflow.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 4: Generate Eval Test Cases
&lt;/h2&gt;

&lt;p&gt;The skill-creator automatically generates test cases — realistic prompts that an OpenCode user would actually type:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"skill_name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"docker-compose"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"evals"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"prompt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"help me set up a compose file for my Node app with a Postgres database"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"expected_output"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Skill triggers and provides Docker compose guidance"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"should_trigger"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"prompt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"explain how Kubernetes deployments work"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"should_trigger"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Good eval queries are realistic and specific — not abstract like "help with containers" but concrete like "ok so my boss just sent me this xlsx file (its in my downloads, called something like 'Q4 sales final FINAL v2.xlsx')..."&lt;/p&gt;

&lt;p&gt;Review the eval set. Add or modify test cases that reflect your real usage.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 5: Run Evals
&lt;/h2&gt;

&lt;p&gt;The eval system runs each test case twice — once with the skill and once without (baseline). This measures whether the skill actually improves the output.&lt;/p&gt;

&lt;p&gt;For each test case:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;OpenCode runs with the skill loaded&lt;/li&gt;
&lt;li&gt;OpenCode runs without the skill&lt;/li&gt;
&lt;li&gt;Both outputs are saved for comparison&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Timing data (tokens used, duration) is captured automatically.&lt;/p&gt;
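
&lt;p&gt;A sketch of that double run, where &lt;code&gt;runAgent&lt;/code&gt; is a hypothetical placeholder for however the plugin actually invokes OpenCode:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Sketch of the per-case double run; runAgent() is a hypothetical placeholder
// for the plugin's actual way of invoking OpenCode.
type AgentRunner = (prompt: string, withSkill: boolean) =&amp;gt; Promise&amp;lt;string&amp;gt;;

async function runEvalCase(prompt: string, runAgent: AgentRunner) {
  const started = Date.now();
  const withSkillOutput = await runAgent(prompt, true);   // skill loaded
  const baselineOutput = await runAgent(prompt, false);   // baseline, no skill
  return {
    prompt,
    withSkillOutput,
    baselineOutput,
    durationMs: Date.now() - started,  // timing captured alongside outputs
  };
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
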

&lt;h2&gt;
  
  
  Step 6: Review Results Visually
&lt;/h2&gt;

&lt;p&gt;The skill-creator launches an HTML eval viewer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;Call skill_serve_review with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;workspace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/tmp/opencode-skills/your-skill-name-workspace/iteration-1&lt;/span&gt;
  &lt;span class="na"&gt;skillName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your-skill-name"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The viewer shows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Outputs tab&lt;/strong&gt;: Each test case with with-skill and without-skill outputs side by side&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Benchmark tab&lt;/strong&gt;: Quantitative metrics — pass rates, timing, token usage&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Feedback fields&lt;/strong&gt;: Leave comments on each test case&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Review the outputs. Give specific feedback on what's working and what's not. Empty feedback means "looks good."&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 7: Iterate and Improve
&lt;/h2&gt;

&lt;p&gt;Based on your feedback, the skill-creator improves the skill:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Applies your feedback&lt;/li&gt;
&lt;li&gt;Reruns all test cases (new iteration)&lt;/li&gt;
&lt;li&gt;Launches the reviewer with previous iteration for comparison&lt;/li&gt;
&lt;li&gt;You review again&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Repeat until you're satisfied or feedback is all empty.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 8: Optimize the Description
&lt;/h2&gt;

&lt;p&gt;Even with perfect skill instructions, the skill won't trigger correctly if the description field isn't right. The description is what OpenCode reads to decide whether to load your skill.&lt;/p&gt;

&lt;p&gt;The optimization loop:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Generates 20 eval queries (should-trigger and should-not-trigger)&lt;/li&gt;
&lt;li&gt;Splits them 60/40 into train/test&lt;/li&gt;
&lt;li&gt;Evaluates each query 3 times for statistical reliability&lt;/li&gt;
&lt;li&gt;Analyzes failure patterns&lt;/li&gt;
&lt;li&gt;LLM proposes improved descriptions&lt;/li&gt;
&lt;li&gt;Re-evaluates on both train and test&lt;/li&gt;
&lt;li&gt;Selects the best description by test score&lt;/li&gt;
&lt;li&gt;Repeats up to 5 iterations
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Tell OpenCode:&lt;/span&gt;
&lt;span class="s2"&gt;"Optimize the description of my docker-compose skill"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This takes some time — grab a coffee while it runs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 9: Install the Final Skill
&lt;/h2&gt;

&lt;p&gt;Once you're satisfied with the skill and its description:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Project-level&lt;/strong&gt;: &lt;code&gt;.opencode/skills/your-skill-name/SKILL.md&lt;/code&gt; — available only in this project&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Global&lt;/strong&gt;: &lt;code&gt;~/.config/opencode/skills/your-skill-name/SKILL.md&lt;/code&gt; — available everywhere
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Project-level install&lt;/span&gt;
&lt;span class="nb"&gt;cp&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; /tmp/opencode-skills/your-skill-name/ .opencode/skills/your-skill-name/

&lt;span class="c"&gt;# Global install&lt;/span&gt;
&lt;span class="nb"&gt;cp&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; /tmp/opencode-skills/your-skill-name/ ~/.config/opencode/skills/your-skill-name/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Only the final validated skill gets installed. All eval artifacts stay in the staging directory.&lt;/p&gt;

&lt;h2&gt;
  
  
  Real-World Example: Docker Compose Skill
&lt;/h2&gt;

&lt;p&gt;Here's what the full workflow looks like in practice:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Ask OpenCode&lt;/strong&gt;: "Create a skill that helps with Docker compose files"&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Interview&lt;/strong&gt;: The skill-creator asks about your conventions (multi-service vs. single container, development vs. production, preferred base images)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Draft&lt;/strong&gt;: Produces a SKILL.md with Docker compose best practices, service configuration patterns, volume mount strategies&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Eval&lt;/strong&gt;: Generates test cases like "my api keeps crashing on startup, can you help me debug my compose file" (should trigger) and "what's the difference between Docker and Podman" (should not trigger)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Review&lt;/strong&gt;: You look at the outputs, give feedback: "the skill should prioritize security configurations in production compose files"&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Iterate&lt;/strong&gt;: Improved skill draft, better outputs&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Optimize&lt;/strong&gt;: Description goes from "Help with Docker compose files" to something much more specific that triggers reliably&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Install&lt;/strong&gt;: Copy to &lt;code&gt;~/.config/opencode/skills/docker-compose/&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Tips for Great Skills
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Be specific in the intake interview&lt;/strong&gt;: The more context you give, the better the draft&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Don't skip evals&lt;/strong&gt;: They catch triggering issues you'd never find manually&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use realistic test prompts&lt;/strong&gt;: Write them the way you'd actually type them, typos and all&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Iterate at least twice&lt;/strong&gt;: First drafts are rarely perfect&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Optimize the description&lt;/strong&gt;: It's the #1 factor in whether your skill triggers correctly&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Install globally for general skills, project-level for specific ones&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx opencode-skill-creator &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--global&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then ask OpenCode to create a skill. That's it.&lt;/p&gt;

&lt;p&gt;GitHub: &lt;a href="https://github.com/antongulin/opencode-skill-creator" rel="noopener noreferrer"&gt;https://github.com/antongulin/opencode-skill-creator&lt;/a&gt;&lt;br&gt;
npm: &lt;a href="https://www.npmjs.com/package/opencode-skill-creator" rel="noopener noreferrer"&gt;https://www.npmjs.com/package/opencode-skill-creator&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;opencode-skill-creator is free and open source (Apache 2.0). Star it on GitHub. Install: &lt;code&gt;npx opencode-skill-creator install --global&lt;/code&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>typescript</category>
      <category>ai</category>
      <category>opencode</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
