<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Anton Gulin</title>
    <description>The latest articles on DEV Community by Anton Gulin (@aiwithanton).</description>
    <link>https://dev.to/aiwithanton</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3872452%2F17f47297-ddc6-457c-9920-47c0dd1acd1b.png</url>
      <title>DEV Community: Anton Gulin</title>
      <link>https://dev.to/aiwithanton</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/aiwithanton"/>
    <language>en</language>
    <item>
      <title>How to Implement AI in QA (2026): A Practical Framework</title>
      <dc:creator>Anton Gulin</dc:creator>
      <pubDate>Sat, 04 Jul 2026 15:40:25 +0000</pubDate>
      <link>https://dev.to/aiwithanton/how-to-implement-ai-in-qa-2026-a-practical-framework-481f</link>
      <guid>https://dev.to/aiwithanton/how-to-implement-ai-in-qa-2026-a-practical-framework-481f</guid>
      <description>&lt;h2&gt;
  
  
  Key takeaways
&lt;/h2&gt;

&lt;p&gt;To implement AI in QA, let AI do the work and let a fixed check decide the result. Use AI to generate tests, repair broken selectors, and explore your app. Never let AI grade its own output. Add an independent check the AI cannot change, test for repeatable results, and test that the agent did only what you asked.&lt;/p&gt;

&lt;h2&gt;
  
  
  What "AI in QA" actually means
&lt;/h2&gt;

&lt;p&gt;AI in QA means using an AI model to help test software. That is the whole idea. The model can write test cases, fix tests that broke, click through your app like a user, or read a failure and guess the cause.&lt;/p&gt;

&lt;p&gt;It does not mean the AI replaces testing. It means the AI does some of the work a tester used to do by hand. The judgment stays with you.&lt;/p&gt;

&lt;p&gt;I test software for a living. The teams that win with AI in QA all draw the same line. AI does the work. A human, and a fixed check, decide if the work is good.&lt;/p&gt;

&lt;h2&gt;
  
  
  The one rule that keeps it safe
&lt;/h2&gt;

&lt;p&gt;Here is the rule the rest of this guide hangs on.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Let AI do the work. Never let AI judge its own work.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Picture an AI agent that tests your login. It clicks around. It reports green. Everyone relaxes. But the agent decided what "pass" means. If it is too kind, it passes a broken page. Now you have shipped a bug with a green check on top.&lt;/p&gt;

&lt;p&gt;An AI that grades its own work is not a test. It is an opinion.&lt;/p&gt;

&lt;p&gt;So you give the AI room to explore, and you keep one thing it cannot touch. A fixed check. A known-good answer. A hard assert (a check that fails loudly) on the real outcome. The agent finds the path. The fixed check says pass or fail.&lt;/p&gt;

&lt;p&gt;In testing this fixed answer has a name: an oracle. The oracle is the source of truth for the expected outcome, and neither the system under test nor the agent doing the testing is allowed to influence it. Keep your oracle out of the AI's reach and most AI-in-QA risk goes away.&lt;/p&gt;
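
&lt;p&gt;A minimal sketch of that separation, under stated assumptions: &lt;code&gt;AgentReport&lt;/code&gt;, &lt;code&gt;ObservedState&lt;/code&gt;, and &lt;code&gt;oracleVerdict&lt;/code&gt; are hypothetical names for illustration, not part of any library. The point is only that the verdict function never receives the agent's own claim.&lt;/p&gt;

```typescript
// Hypothetical shapes for illustration; adapt them to your own system.
type AgentReport = { claimedSuccess: boolean; notes: string };
type ObservedState = { bookingExists: boolean; durationMin: number };

// The oracle takes only observed state and the expected value.
// It never receives the AgentReport, so the agent cannot influence it.
function oracleVerdict(observed: ObservedState, expectedDurationMin: number): "pass" | "fail" {
  if (!observed.bookingExists) return "fail";
  if (observed.durationMin !== expectedDurationMin) return "fail";
  return "pass";
}

// The agent says it succeeded, but the real state disagrees.
const report: AgentReport = { claimedSuccess: true, notes: "Booked room B." };
const observed: ObservedState = { bookingExists: false, durationMin: 0 };

const verdict = oracleVerdict(observed, 30);
console.log(`agent says: ${report.claimedSuccess ? "pass" : "fail"}, oracle says: ${verdict}`);
```

&lt;p&gt;Because the report is not a parameter, there is no code path by which a too-kind agent can turn a failing state into a pass.&lt;/p&gt;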

&lt;h2&gt;
  
  
  Where AI helps in QA (the 4 good jobs)
&lt;/h2&gt;

&lt;p&gt;These four jobs are where AI pays off today.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Writing tests.&lt;/strong&gt; Point the model at a page or a user story. It drafts test cases, including edge cases a tired human skips. You review and keep the good ones.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fixing broken tests.&lt;/strong&gt; A button moved and the test broke. AI can find the new selector (how a test finds a button) and propose the fix. This is the biggest time-saver for most teams.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Exploring the app.&lt;/strong&gt; An AI agent can wander your app like a curious user and report what feels broken. Great for finding the bug nobody wrote a test for.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reading failures.&lt;/strong&gt; When a test fails, AI can read the log and the trace and suggest the likely cause. It turns a wall of red into a short list to check.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In all four, the AI proposes. You and your fixed checks dispose.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where AI lies in QA (the 4 traps)
&lt;/h2&gt;

&lt;p&gt;This is the part most guides skip. AI in QA fails in four ways. Plan for each.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;It passes a broken thing.&lt;/strong&gt; A too-kind agent calls a broken page "fine." Fix: an independent oracle the agent cannot move.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;It is not repeatable.&lt;/strong&gt; The same input passes now and fails in ten minutes. Fix: run the same input twice and compare the outcome (which room, what duration), not the wording of the reply.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;It does too much.&lt;/strong&gt; You asked for one thing. The agent also changed a setting or sent a message. Fix: a scope check on what it touched.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;It depends on a model that can change.&lt;/strong&gt; The model under your tool updates, or even goes offline, and your tests shift with it. Fix: pin the model version and keep a fallback.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That last one is not theoretical. In June 2026 a widely used model was pulled offline overnight. Teams that pinned their model switched to a fallback in one line. Teams that did not found out when their build broke.&lt;/p&gt;
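
&lt;p&gt;One way to make that one-line switch possible is to keep both pinned versions in a single config and choose between them at startup. This is a sketch under assumptions: the model ids are placeholders, and &lt;code&gt;isAvailable&lt;/code&gt; stands in for whatever health check your provider actually offers.&lt;/p&gt;

```typescript
// Hypothetical config shape; the model ids below are placeholders, not real model names.
type ModelConfig = { primary: string; fallback: string };

const MODEL_CONFIG: ModelConfig = {
  primary: "vendor/model-a-2026-01-15", // exact, dated version: no "latest" alias
  fallback: "vendor/model-b-2025-11-02",
};

// Pick the pinned primary if it is available, otherwise the pinned fallback.
// `isAvailable` is an assumption: replace it with your provider's health check.
function pickModel(config: ModelConfig, isAvailable: (id: string) => boolean): string {
  if (isAvailable(config.primary)) return config.primary;
  if (isAvailable(config.fallback)) return config.fallback;
  throw new Error(`Neither ${config.primary} nor ${config.fallback} is available`);
}

// Simulate the primary being pulled offline.
const offline = ["vendor/model-a-2026-01-15"];
const chosen = pickModel(MODEL_CONFIG, (id) => offline.indexOf(id) === -1);
console.log(chosen); // vendor/model-b-2025-11-02
```

&lt;p&gt;Throwing when both models are gone is deliberate: a loud failure at startup is easier to diagnose than tests that silently run against an unpinned default.&lt;/p&gt;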

&lt;h2&gt;
  
  
  A working example: an AI agent with an oracle it cannot move
&lt;/h2&gt;

&lt;p&gt;Here is the pattern in code. An AI agent books a meeting room. Then three fixed checks decide if it really worked. The agent never grades itself.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;expect&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@playwright/test&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;Stagehand&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@browserbasehq/stagehand&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="nf"&gt;test&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;AI books a room, and only that&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// Pin the model. Do not let it auto-upgrade under your tests.&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;stage&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Stagehand&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;LOCAL&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;anthropic/claude-opus-4-8&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;stage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;init&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

  &lt;span class="c1"&gt;// 1) Let the AI do the work. It decides HOW to book the room.&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;stage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;act&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Book room B for 2pm tomorrow, for 30 minutes&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="c1"&gt;// 2) The fixed check the AI cannot move (the oracle).&lt;/span&gt;
  &lt;span class="c1"&gt;//    These helpers read your real database, not the agent's report.&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;booking&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;getBookingFromDb&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;room&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;B&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;time&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;14:00&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;booking&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toBeTruthy&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;          &lt;span class="c1"&gt;// it did the task&lt;/span&gt;
  &lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;booking&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;durationMin&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toBe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  &lt;span class="c1"&gt;// exactly what we asked for&lt;/span&gt;

  &lt;span class="c1"&gt;// 3) Scope check. Did it touch anything it should not have?&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;otherChanges&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;getChangesExcept&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;booking&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
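  &lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;otherChanges&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toHaveLength&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  &lt;span class="c1"&gt;// no surprise side effects&lt;/span&gt;

  &lt;span class="c1"&gt;// Close the agent's browser so it does not leak into the next test.&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;stage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;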
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Read the three checks again. The agent's own "I booked it" is never trusted. The database is the oracle. The duration check catches a sloppy booking. The scope check catches the agent doing extra. (&lt;code&gt;getBookingFromDb&lt;/code&gt; and &lt;code&gt;getChangesExcept&lt;/code&gt; are your own helpers — they read real state, not the agent's words.)&lt;/p&gt;
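
&lt;p&gt;How you build the scope helper depends on your system. One possible approach, sketched here with hypothetical names, is to snapshot the relevant records before and after the agent runs and diff them, ignoring the one record the agent was allowed to touch. The &lt;code&gt;Snapshot&lt;/code&gt; shape and the record ids are assumptions for illustration.&lt;/p&gt;

```typescript
// Hypothetical snapshot: record id -> serialized record contents.
type Snapshot = { [id: string]: string };

// Return the ids of every record that was added, removed, or modified
// between the two snapshots, except the one id the agent was allowed to touch.
function changedIdsExcept(before: Snapshot, after: Snapshot, allowedId: string): string[] {
  const ids = Object.keys(before).concat(Object.keys(after));
  const changed: string[] = [];
  for (const id of ids) {
    if (id === allowedId) continue;
    if (changed.indexOf(id) !== -1) continue; // already recorded
    if (before[id] !== after[id]) changed.push(id);
  }
  return changed.sort();
}

const before: Snapshot = { "setting:theme": "light" };
const after: Snapshot = {
  "setting:theme": "dark",       // a surprise side effect
  "booking:42": "room=B;min=30", // the change we asked for
};

// Only the setting change is reported; the booking itself is allowed.
console.log(changedIdsExcept(before, after, "booking:42"));
```

&lt;p&gt;A snapshot diff catches changes in the data you snapshot. Side effects outside it, such as sent emails, need their own check.&lt;/p&gt;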

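&lt;p&gt;To catch the repeatability trap, run the same prompt twice, each time from the same clean starting state, and compare the outcomes. (&lt;code&gt;runBooking&lt;/code&gt; is your own helper: it runs the agent and returns the booking it created.)&lt;br&gt;
&lt;/p&gt;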

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nf"&gt;test&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;same request, same result&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;a&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;runBooking&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Book room B for 2pm tomorrow, 30 minutes&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;b&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;runBooking&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Book room B for 2pm tomorrow, 30 minutes&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="c1"&gt;// The wording of the agent's reply may differ. The outcome may not.&lt;/span&gt;
  &lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;room&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toBe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;room&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;durationMin&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toBe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;durationMin&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The model is allowed to phrase its answer differently each time. It is not allowed to book a different room.&lt;/p&gt;
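
&lt;p&gt;When a run has more than two fields to compare, it can help to reduce each run to a single key built only from the fields that must stay stable. A small sketch; the &lt;code&gt;RunResult&lt;/code&gt; shape and &lt;code&gt;outcomeKey&lt;/code&gt; are hypothetical, not part of Stagehand or Playwright.&lt;/p&gt;

```typescript
// Hypothetical result of one agent run: free-text reply plus the recorded outcome.
type RunResult = { reply: string; room: string; startTime: string; durationMin: number };

// Reduce a run to the fields that must be stable. The reply text is left out on purpose.
function outcomeKey(run: RunResult): string {
  return JSON.stringify([run.room, run.startTime, run.durationMin]);
}

const first: RunResult = { reply: "Done! Room B is yours.", room: "B", startTime: "14:00", durationMin: 30 };
const second: RunResult = { reply: "I have booked room B.", room: "B", startTime: "14:00", durationMin: 30 };
const drifted: RunResult = { reply: "Done! Room B is yours.", room: "C", startTime: "14:00", durationMin: 30 };

console.log(outcomeKey(first) === outcomeKey(second));  // true: wording differs, outcome matches
console.log(outcomeKey(first) === outcomeKey(drifted)); // false: same wording, different room
```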

&lt;h2&gt;
  
  
  The 3 tests every AI feature needs
&lt;/h2&gt;

&lt;p&gt;If you ship an AI feature to users, these three tests catch the failures that page you at 2am. Most teams write only the easy test ("does it give a good answer?") and skip all three.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The wrong-input test.&lt;/strong&gt; Feed it junk, empty fields, another language, a user trying to break it. A good feature fails safely. A bad one is confidently wrong.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The same-input-twice test.&lt;/strong&gt; Run the exact input twice. Same kind of answer? Different wording is fine. Pass-then-fail is not.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The scope test.&lt;/strong&gt; Did it do only what you asked? Or did it also change a setting, send a message, or touch a file? Extra is not helpful. Extra is a future incident.&lt;/li&gt;
&lt;/ul&gt;
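
&lt;p&gt;The wrong-input test fits a table-driven shape: list the bad inputs, run each one, and assert that none is accepted. In the sketch below, &lt;code&gt;parseDuration&lt;/code&gt; is a hypothetical, deterministic stand-in for your AI feature, and the 480-minute limit is an arbitrary assumption.&lt;/p&gt;

```typescript
// Hypothetical feature under test: parses a requested meeting length in minutes.
// It returns a typed error instead of guessing when the input is not usable.
type ParseResult = { ok: true; minutes: number } | { ok: false; reason: string };

function parseDuration(input: string): ParseResult {
  const trimmed = input.trim();
  if (trimmed === "") return { ok: false, reason: "empty input" };
  if (!/^\d+$/.test(trimmed)) return { ok: false, reason: "not a whole number" };
  const minutes = Number(trimmed);
  if (minutes === 0 || minutes > 480) return { ok: false, reason: "out of range" };
  return { ok: true, minutes };
}

// Junk, empty fields, another language, and an attempt to break it.
const badInputs = ["", "   ", "asdf", "trente minutes", "30; DROP TABLE bookings", "-5", "99999"];

// Collect every bad input the feature wrongly accepted.
const wronglyAccepted = badInputs.filter((input) => parseDuration(input).ok);
console.log(wronglyAccepted.length); // 0: every bad input was rejected
```

&lt;p&gt;With a real AI feature the call is slower and less predictable, but the assertion stays the same: the list of wrongly accepted inputs must be empty.&lt;/p&gt;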

&lt;h2&gt;
  
  
  Where to start on Monday
&lt;/h2&gt;

&lt;p&gt;You do not need a platform or a budget. Start small.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Pick one flaky test. Let an AI tool propose the fix. You keep final say.&lt;/li&gt;
&lt;li&gt;Add one oracle. Take your most important flow and add a hard check on the real outcome, not the agent's report.&lt;/li&gt;
&lt;li&gt;Pin your model. Lock the version your AI tools use. Add a fallback.&lt;/li&gt;
&lt;li&gt;Add the scope check to one agent run. See what it touches when you are not looking.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Do those four and you have AI in QA that you can trust. The AI does more work. You keep the judgment. The fixed checks keep everyone honest.&lt;/p&gt;

&lt;p&gt;That is the whole job: the gap between "the AI says it passed" and "it passed, for the right reason."&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Anton Gulin is the AI QA Architect — the first person to claim this title on LinkedIn. He builds AI-powered test automation systems where AI agents and human engineers collaborate on quality. Former Apple SDET (Apple.com / Apple Card pre-release testing). Find him at &lt;a href="https://anton.qa" rel="noopener noreferrer"&gt;anton.qa&lt;/a&gt; or on &lt;a href="https://linkedin.com/in/antongulin" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>qa</category>
      <category>testing</category>
      <category>testautomation</category>
    </item>
    <item>
      <title>How to Test Passkey (WebAuthn) Login in Playwright (2026)</title>
      <dc:creator>Anton Gulin</dc:creator>
      <pubDate>Sat, 27 Jun 2026 15:43:03 +0000</pubDate>
      <link>https://dev.to/aiwithanton/how-to-test-passkey-webauthn-login-in-playwright-2026-4a71</link>
      <guid>https://dev.to/aiwithanton/how-to-test-passkey-webauthn-login-in-playwright-2026-4a71</guid>
      <description>&lt;h2&gt;
  
  
  Key takeaways
&lt;/h2&gt;

&lt;p&gt;You can now test passkey login in Playwright with no hardware key. Playwright 1.61 added a virtual authenticator (a fake security key). Your test seeds a passkey, turns it on, and the page signs in as if a real key answered. It works in all three browser engines Playwright drives (Chromium, Firefox, and WebKit) and runs in CI. The API is &lt;code&gt;browserContext.credentials&lt;/code&gt;, with three methods: &lt;code&gt;create()&lt;/code&gt;, &lt;code&gt;install()&lt;/code&gt;, and &lt;code&gt;get()&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is a passkey, in plain words
&lt;/h2&gt;

&lt;p&gt;A passkey is a login with no password. You sign in with your face, your fingerprint, or a device PIN. The hard part lives in your device. The website only sees a signed reply.&lt;/p&gt;

&lt;p&gt;The browser standard behind this is called WebAuthn, short for "Web Authentication". When you log in, the browser runs a small back-and-forth with the site. The site asks. Your device answers and signs. This is the part that used to need real hardware.&lt;/p&gt;
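
&lt;p&gt;The ask-and-sign exchange can be sketched with Node's built-in &lt;code&gt;crypto&lt;/code&gt; module. This is a simplified illustration, not the WebAuthn protocol itself: a real authenticator signs more than the raw challenge, and the variable names here are made up for the example.&lt;/p&gt;

```typescript
import { generateKeyPairSync, randomBytes, sign, verify } from "node:crypto";

// The "device" holds a key pair. Only the public key is shared with the site at registration.
const device = generateKeyPairSync("ec", { namedCurve: "P-256" });

// The site asks: it sends a fresh random challenge.
const challenge = randomBytes(32);

// The device answers: it signs the challenge with its private key.
const signature = sign("sha256", challenge, device.privateKey);

// The site checks the signed reply against the public key it stored earlier.
const accepted = verify("sha256", challenge, device.publicKey, signature);
console.log(accepted); // true

// The same signature does not verify against a different challenge.
const replayed = verify("sha256", randomBytes(32), device.publicKey, signature);
console.log(replayed); // false
```

&lt;p&gt;The private key never leaves the device in this exchange, which is why the site only ever sees a signed reply.&lt;/p&gt;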

&lt;p&gt;Apple, Google, and most banks ship passkeys now. If your app has a "sign in with a passkey" button, you have this flow in production today.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why nobody tested this flow
&lt;/h2&gt;

&lt;p&gt;Here is the part nobody talks about. Almost nobody tests the passkey login.&lt;/p&gt;

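&lt;p&gt;For years it was hard. In most setups, testing a passkey meant a real security key plugged into the machine. (Chromium did expose a virtual authenticator through the Chrome DevTools Protocol, but that was Chromium-only and low-level.) You cannot plug a USB key into a CI server. CI is a remote build machine with no hands and no ports.&lt;/p&gt;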

&lt;p&gt;So the most important login flow shipped untested. The one thing a user does first. The one thing that locks them out if it breaks.&lt;/p&gt;

&lt;p&gt;I test software for a living. An untested login is the scariest gap on the list. If sign-in breaks, nothing else matters. The cart, the dashboard, the settings page, all of it sits behind the door. On one project I watched a broken auth path block every other test for two days. The fix took ten minutes. Finding it took the two days.&lt;/p&gt;

&lt;p&gt;Playwright 1.61 closed this gap.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Playwright 1.61 added
&lt;/h2&gt;

&lt;p&gt;Playwright 1.61 shipped on June 15, 2026. It added a virtual authenticator (a fake security key).&lt;/p&gt;

&lt;p&gt;A virtual authenticator is software that pretends to be a hardware key. Your test creates one. It seeds a passkey into it. From then on, when the page calls the browser to sign in, Playwright answers for the key. No real device. No USB port. The page cannot tell the difference.&lt;/p&gt;

&lt;p&gt;You reach it through a new class called &lt;code&gt;Credentials&lt;/code&gt;, on the browser context: &lt;code&gt;browserContext.credentials&lt;/code&gt;. It has three methods you will use most:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;create()&lt;/code&gt; seeds a test passkey for a site.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;install()&lt;/code&gt; turns the virtual key on for the page.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;get()&lt;/code&gt; reads back any passkey the page registered, so you can save it and reuse it later.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This works in all three browser engines Playwright drives. So one test covers Chromium, Firefox, and WebKit.&lt;/p&gt;
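
&lt;p&gt;To run one test file against all three engines, Playwright uses projects in its config file. A typical setup looks like the fragment below; the project names are a common convention, not a requirement.&lt;/p&gt;

```typescript
// playwright.config.ts
import { defineConfig, devices } from '@playwright/test';

export default defineConfig({
  // One project per browser engine. Every test file runs once per project.
  projects: [
    { name: 'chromium', use: { ...devices['Desktop Chrome'] } },
    { name: 'firefox', use: { ...devices['Desktop Firefox'] } },
    { name: 'webkit', use: { ...devices['Desktop Safari'] } },
  ],
});
```

&lt;p&gt;With this in place, &lt;code&gt;npx playwright test&lt;/code&gt; runs every test in all three engines, and &lt;code&gt;npx playwright test --project=webkit&lt;/code&gt; narrows the run to one.&lt;/p&gt;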

&lt;h2&gt;
  
  
  How to test passkey login — a working example
&lt;/h2&gt;

&lt;p&gt;Here is a complete test. It seeds a passkey, turns on the virtual key, then signs in. Read the comments for what each step does.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// passkey-login.spec.ts&lt;/span&gt;
&lt;span class="c1"&gt;// Tested against Playwright 1.61. Run with: npx playwright test&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;expect&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@playwright/test&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="nf"&gt;test&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;user signs in with a passkey&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;browser&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// A fresh, clean browser session for this test.&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;newContext&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

  &lt;span class="c1"&gt;// STEP 1 — Seed a passkey for our site.&lt;/span&gt;
  &lt;span class="c1"&gt;// 'example.com' is the site domain (the "relying party id").&lt;/span&gt;
  &lt;span class="c1"&gt;// With only the domain, Playwright makes a fresh key for us.&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;credentials&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;example.com&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="c1"&gt;// STEP 2 — Turn the virtual key on.&lt;/span&gt;
  &lt;span class="c1"&gt;// From now on, the page's sign-in calls are answered by our key,&lt;/span&gt;
  &lt;span class="c1"&gt;// not by real hardware. Call this before the page loads.&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;credentials&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;install&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

  &lt;span class="c1"&gt;// STEP 3 — Let the page use it.&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;newPage&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;goto&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;https://example.com/login&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="c1"&gt;// The page calls the browser to sign in. Our key answers.&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getByRole&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;button&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Sign in with a passkey&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;}).&lt;/span&gt;&lt;span class="nf"&gt;click&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

  &lt;span class="c1"&gt;// Check the user is in.&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getByText&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Welcome back&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="nf"&gt;toBeVisible&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The three steps map to the three method calls. First you &lt;code&gt;create()&lt;/code&gt; a passkey for your site. Then you &lt;code&gt;install()&lt;/code&gt; the virtual key, which makes the page's sign-in calls run through it. Then the page does its normal login, and your key answers in place of hardware.&lt;/p&gt;

&lt;p&gt;One note on order. Call &lt;code&gt;install()&lt;/code&gt; before the page touches sign-in. The virtual key only answers calls that happen after you turn it on.&lt;/p&gt;

&lt;h2&gt;
  
  
  Re-using a passkey across tests
&lt;/h2&gt;

&lt;p&gt;Often you want to register a passkey once, then reuse it in many tests. A passkey holds a private key (a secret only your device knows). You can read that secret back with &lt;code&gt;get()&lt;/code&gt; and seed it into a later test.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// In a setup test: register once, then read the passkey back.&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;created&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;credentials&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;rpId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;example.com&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="c1"&gt;// `created` holds the passkey fields, including its keys.&lt;/span&gt;
&lt;span class="c1"&gt;// Save them, then seed an identical passkey in a later test:&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;otherContext&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;credentials&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;example.com&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;created&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;userHandle&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;created&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;userHandle&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;privateKey&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;created&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;privateKey&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;publicKey&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;created&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;publicKey&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is how you keep one stable test user across a whole suite. You do not re-register on every test. You seed the same passkey each run. See the &lt;a href="https://playwright.dev/docs/api/class-credentials" rel="noopener noreferrer"&gt;Credentials docs&lt;/a&gt; for the full field list.&lt;/p&gt;
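
&lt;p&gt;If you save the passkey to a fixture file, it is worth validating it when you load it. The sketch below assumes the four fields from the snippet above are stored as strings; that is an assumption, so check the Credentials docs for the real types. &lt;code&gt;SavedPasskey&lt;/code&gt; and &lt;code&gt;parseSavedPasskey&lt;/code&gt; are hypothetical names.&lt;/p&gt;

```typescript
// Assumed shape: the four fields used in the snippet above, stored as strings.
// Check the Credentials docs for the real types before relying on this.
type SavedPasskey = { id: string; userHandle: string; privateKey: string; publicKey: string };

const REQUIRED_FIELDS = ["id", "userHandle", "privateKey", "publicKey"];

// Parse saved JSON and fail loudly if any field is missing or empty,
// so a bad fixture breaks the setup step instead of the login test.
function parseSavedPasskey(json: string): SavedPasskey {
  const data = JSON.parse(json);
  for (const field of REQUIRED_FIELDS) {
    if (typeof data[field] !== "string" || data[field] === "") {
      throw new Error(`Saved passkey is missing "${field}"`);
    }
  }
  return { id: data.id, userHandle: data.userHandle, privateKey: data.privateKey, publicKey: data.publicKey };
}

// Placeholder values only; a real fixture would hold what get() returned.
const saved = JSON.stringify({ id: "cred-1", userHandle: "user-1", privateKey: "PRIVATE", publicKey: "PUBLIC" });
console.log(parseSavedPasskey(saved).id); // cred-1
```

&lt;p&gt;The saved file contains a private key. Use it for test accounts only, and treat it like any other test secret.&lt;/p&gt;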

&lt;h2&gt;
  
  
  The honest part: what this does not test
&lt;/h2&gt;

&lt;p&gt;A virtual key is not a real key. So this approach tests your login flow, not the physical hardware. It will not catch a bug in a specific phone's secure chip or a real fingerprint reader. It tests the part you own: the page, the back-and-forth, the server check.&lt;/p&gt;

&lt;p&gt;For most teams that is the right line. The browser and the operating system test the hardware path for you. Your job is to test that your app asks the right question and trusts the right answer. That is exactly what the virtual key lets you do, in CI, on every push.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this matters for your CI pipeline
&lt;/h2&gt;

&lt;p&gt;Before 1.61, most teams had two options for testing passkey login: skip it, or test it by hand. Both are bad. A skipped test means a silent break. A by-hand test runs once a release, not once a push.&lt;/p&gt;

&lt;p&gt;Now it runs like any other test. It sits in your suite. It runs on every pull request. If someone changes the login and breaks the passkey path, the build goes red before the change ships. That is the whole point of a test. Catch the break in seconds, not from an angry user.&lt;/p&gt;

&lt;p&gt;If your app supports passkeys, this is the test you write this week. The excuse is gone. The login everyone ships and nobody verifies is now testable.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Related reads on anton.qa:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://anton.qa/blog/posts/playwright-best-practices" rel="noopener noreferrer"&gt;Playwright best practices that keep tests stable&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://anton.qa/blog/posts/score-ai-test-agents-offline-evaluation" rel="noopener noreferrer"&gt;How to score your AI test agents&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Official sources:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/microsoft/playwright/releases/tag/v1.61.0" rel="noopener noreferrer"&gt;Playwright v1.61.0 release notes&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://playwright.dev/docs/api/class-credentials" rel="noopener noreferrer"&gt;Playwright Credentials API docs&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Anton Gulin is the AI QA Architect — the first person to claim this title on LinkedIn. He builds AI-powered test automation systems where AI agents and human engineers collaborate on quality. Former Apple SDET (Apple.com / Apple Card pre-release testing). Find him at &lt;a href="https://anton.qa" rel="noopener noreferrer"&gt;anton.qa&lt;/a&gt; or on &lt;a href="https://linkedin.com/in/antongulin" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Internal References
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;(Reviewer-only. The dev.to publisher strips this section — it lives after the bio and is not part of the published article. Same source receipts as the canonical blog.)&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Canonical: &lt;a href="https://anton.qa/blog/posts/test-passkey-login-playwright" rel="noopener noreferrer"&gt;https://anton.qa/blog/posts/test-passkey-login-playwright&lt;/a&gt; (live Wed 2026-06-24; dev.to publishes +72h once canonical is live).&lt;/li&gt;
&lt;li&gt;All API claims (&lt;code&gt;browserContext.credentials&lt;/code&gt;, &lt;code&gt;create()&lt;/code&gt; / &lt;code&gt;install()&lt;/code&gt; / &lt;code&gt;get()&lt;/code&gt;, positional &lt;code&gt;rpId&lt;/code&gt;, options fields, "works in all browsers", v1.61.0 shipped 2026-06-15) verified in the source blog against the Playwright v1.61.0 release notes + Credentials docs (2026-06-22). See &lt;code&gt;2026-06-24-blog-test-passkey-login-playwright.md&lt;/code&gt; § Internal References.&lt;/li&gt;
&lt;li&gt;Cover is the 1000×420 dev.to-native export of the blog d-variant hero (same visual asset, dev.to size — AGENTS.md Rule 9 research-reuse clause; no fresh Subscribr needed). Source HTML: &lt;code&gt;assets/devto-covers/var-test-passkey-login-playwright.html&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;dev.to publishes via &lt;code&gt;scripts/devto_publish.py --drain&lt;/code&gt; (hourly GitHub Action) at &lt;code&gt;publish-at&lt;/code&gt;; &lt;code&gt;status: scheduled&lt;/code&gt; + &lt;code&gt;publish-at&lt;/code&gt; are the trigger. Tags ≤4, canonical globally unique.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>playwright</category>
      <category>testing</category>
      <category>webauthn</category>
      <category>webdev</category>
    </item>
    <item>
      <title>Playwright Best Practices: 10 Rules AI Agents Get Wrong (2026)</title>
      <dc:creator>Anton Gulin</dc:creator>
      <pubDate>Sat, 20 Jun 2026 15:51:36 +0000</pubDate>
      <link>https://dev.to/aiwithanton/playwright-best-practices-10-rules-ai-agents-get-wrong-2026-5cii</link>
      <guid>https://dev.to/aiwithanton/playwright-best-practices-10-rules-ai-agents-get-wrong-2026-5cii</guid>
      <description>&lt;p&gt;&lt;strong&gt;Playwright best practices&lt;/strong&gt; are the rules that keep browser tests stable and easy to read. Use role-based locators (find by what users see), web-first assertions that auto-wait, and isolated tests. Seed data through the API (direct requests), not the UI. Avoid hard waits, conditional logic, and tests tied to your HTML. Turn on traces and run in parallel.&lt;/p&gt;

&lt;p&gt;An AI agent can write 50 Playwright tests in a minute. That feels fast.&lt;/p&gt;

&lt;p&gt;Then those tests fail at random, and nobody knows why. The agent copied old patterns from its training data. It does not know the run failed last night.&lt;/p&gt;

&lt;p&gt;This guide lists the 10 best practices that keep tests stable. For each one, I show a small correct example. I also show what AI code agents get wrong. Tools like Copilot, Cursor, and even Playwright codegen (the test recorder) lean on stale habits. Someone has to fix that.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. Find elements the way a user sees them
&lt;/h2&gt;

&lt;p&gt;A locator (a pointer to an element) should match what a person sees on screen. Use &lt;code&gt;getByRole&lt;/code&gt;, &lt;code&gt;getByLabel&lt;/code&gt;, or &lt;code&gt;getByText&lt;/code&gt;. These read like the page. They also survive a redesign of your HTML.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;expect&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@playwright/test&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="nf"&gt;test&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;user can sign in&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;goto&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/login&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getByLabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Email&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;fill&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;ada@example.com&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getByRole&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;button&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Sign in&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;}).&lt;/span&gt;&lt;span class="nf"&gt;click&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What AI agents get wrong:&lt;/strong&gt; they reach for CSS or XPath (brittle path selectors) like &lt;code&gt;page.locator('div.btn-primary &amp;gt; span')&lt;/code&gt;. Change one class name and the test breaks.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. Use web-first assertions that wait for you
&lt;/h2&gt;

&lt;p&gt;A web-first assertion (a check that auto-waits) retries until the condition holds or the timeout runs out (5 seconds by default). &lt;code&gt;await expect(locator).toBeVisible()&lt;/code&gt; waits on its own. You never add a fixed sleep.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;expect&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@playwright/test&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="nf"&gt;test&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;welcome message appears&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;goto&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/dashboard&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getByText&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Welcome back&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="nf"&gt;toBeVisible&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What AI agents get wrong:&lt;/strong&gt; they add &lt;code&gt;await page.waitForTimeout(3000)&lt;/code&gt; (a hard pause). Hard waits are a common cause of flaky tests (tests that fail at random). Too short, and the test fails. Too long, and the suite crawls.&lt;/p&gt;




&lt;h2&gt;
  
  
  3. Keep every test isolated
&lt;/h2&gt;

&lt;p&gt;Isolated means each test starts clean. No shared login. No leftover data from the test before. Playwright gives each test a fresh browser context (a clean session). Set up state in a hook, not across tests.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;expect&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@playwright/test&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="nx"&gt;test&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;beforeEach&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;goto&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/login&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getByLabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Email&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;fill&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;ada@example.com&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getByRole&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;button&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Sign in&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;}).&lt;/span&gt;&lt;span class="nf"&gt;click&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="nf"&gt;test&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;shows the account name&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getByRole&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;heading&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Ada Lovelace&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;})).&lt;/span&gt;&lt;span class="nf"&gt;toBeVisible&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What AI agents get wrong:&lt;/strong&gt; they chain tests, where test 2 needs test 1 to run first. One failure then breaks the whole file.&lt;/p&gt;




&lt;h2&gt;
  
  
  4. Seed state through the API, not the UI
&lt;/h2&gt;

&lt;p&gt;To test a page, you often need data first. A user, an order, a draft. Do not click through ten screens to make it. Send the data straight to your backend with the &lt;code&gt;request&lt;/code&gt; fixture (a built-in HTTP client). It is faster and steadier.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;expect&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@playwright/test&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="nf"&gt;test&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;opens an existing project&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;request&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/api/projects&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Apollo&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ok&lt;/span&gt;&lt;span class="p"&gt;()).&lt;/span&gt;&lt;span class="nf"&gt;toBeTruthy&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;goto&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/projects&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getByText&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Apollo&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="nf"&gt;toBeVisible&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What AI agents get wrong:&lt;/strong&gt; they build the data through the UI every time. The test gets long and slow, and a setup step fails for reasons that have nothing to do with the real check.&lt;/p&gt;




&lt;h2&gt;
  
  
  5. Do not lean on test IDs by default
&lt;/h2&gt;

&lt;p&gt;A test ID (a tag added just for tests, like &lt;code&gt;data-testid&lt;/code&gt;) works as a fallback. But reach for &lt;code&gt;getByRole&lt;/code&gt; and &lt;code&gt;getByLabel&lt;/code&gt; first. Those test what a real user can do. A test ID only proves an attribute exists.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;expect&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@playwright/test&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="nf"&gt;test&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;cart shows one item&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;goto&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/cart&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="c1"&gt;// Prefer a real role over a test id.&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getByRole&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;listitem&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="nf"&gt;toHaveCount&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What AI agents get wrong:&lt;/strong&gt; they paste &lt;code&gt;data-testid&lt;/code&gt; on everything. The tests pass even when the button has no label and a screen reader (assistive software) cannot find it. The test misses a real bug.&lt;/p&gt;




&lt;h2&gt;
  
  
  6. Turn on traces for the first retry
&lt;/h2&gt;

&lt;p&gt;A trace (a full recording of the run) shows every step, the DOM, and the network. Set it to record only on the first retry of a failed test, which requires &lt;code&gt;retries&lt;/code&gt; to be at least 1. You get the evidence for failures, and clean runs stay fast.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// playwright.config.ts&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;defineConfig&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@playwright/test&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="k"&gt;default&lt;/span&gt; &lt;span class="nf"&gt;defineConfig&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;retries&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;use&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;on-first-retry&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What AI agents get wrong:&lt;/strong&gt; they leave tracing off, or set &lt;code&gt;trace: 'on'&lt;/code&gt; for every run. Off means no evidence when a test fails. Always-on slows the suite and fills your disk.&lt;/p&gt;




&lt;h2&gt;
  
  
  7. Run tests in parallel and shard them
&lt;/h2&gt;

&lt;p&gt;Parallel means many tests run at once. Playwright runs test files in parallel by default, while tests inside a single file run in order. For one big file of independent tests, set parallel mode. To split a slow suite across machines, use sharding (run a slice per machine).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;expect&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@playwright/test&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="nx"&gt;test&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;describe&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;configure&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;parallel&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="nf"&gt;test&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;loads home&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;goto&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toHaveTitle&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;/Home/&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



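&lt;p&gt;Split across three machines on CI (your build server). Each machine runs its own slice: &lt;code&gt;1/3&lt;/code&gt;, &lt;code&gt;2/3&lt;/code&gt;, or &lt;code&gt;3/3&lt;/code&gt;. This is the command for the first machine:&lt;br&gt;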
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx playwright &lt;span class="nb"&gt;test&lt;/span&gt; &lt;span class="nt"&gt;--shard&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1/3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
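
&lt;p&gt;A minimal sketch of generating the command for every machine, assuming a hypothetical helper named &lt;code&gt;shardCommands&lt;/code&gt; (not part of Playwright) that wraps the real &lt;code&gt;--shard&lt;/code&gt; flag:&lt;/p&gt;

```typescript
// Hypothetical helper, not a Playwright API.
// Builds one `--shard=index/total` command per CI machine.
function shardCommands(total: number): string[] {
  if (!Number.isInteger(total) || !(total >= 1)) {
    throw new Error(`total must be a positive integer, got ${total}`);
  }
  // Shard indexes are 1-based: 1/3, 2/3, 3/3.
  return Array.from(
    { length: total },
    (_, i) => `npx playwright test --shard=${i + 1}/${total}`,
  );
}

// Print the command each of three machines would run.
console.log(shardCommands(3).join('\n'));
```

&lt;p&gt;Most CI systems can express the same split with a job matrix, so a helper like this is only useful when you script the split yourself.&lt;/p&gt;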



&lt;p&gt;&lt;strong&gt;What AI agents get wrong:&lt;/strong&gt; they write tests that share a database row or a single user. Run those in parallel and they fight each other, so you get flaky tests.&lt;/p&gt;
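
&lt;p&gt;One way to avoid that clash is to give each parallel worker its own account. A minimal sketch, assuming a hypothetical helper named &lt;code&gt;workerEmail&lt;/code&gt; and a made-up &lt;code&gt;qa-worker-N@example.com&lt;/code&gt; naming scheme. In a Playwright test, the index would come from &lt;code&gt;testInfo.parallelIndex&lt;/code&gt;:&lt;/p&gt;

```typescript
// Hypothetical helper: one test account per parallel worker.
// In a Playwright test, pass testInfo.parallelIndex as the index.
// The address pattern is an assumption for illustration only.
function workerEmail(parallelIndex: number, domain: string = 'example.com'): string {
  if (!Number.isInteger(parallelIndex) || !(parallelIndex >= 0)) {
    throw new Error(`parallelIndex must be a non-negative integer, got ${parallelIndex}`);
  }
  return `qa-worker-${parallelIndex}@${domain}`;
}

// Two workers get two different accounts.
console.log(workerEmail(0));
console.log(workerEmail(1));
```

&lt;p&gt;Each worker then seeds and signs in as its own user, so tests running side by side do not share an account.&lt;/p&gt;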




&lt;h2&gt;
  
  
  8. Keep &lt;code&gt;if&lt;/code&gt; and &lt;code&gt;try&lt;/code&gt; out of your tests
&lt;/h2&gt;

&lt;p&gt;A test should walk one clear path. No branching. If a test asks "is the button there? if so click it," it hides a bug. The button should always be there. Assert it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;expect&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@playwright/test&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="nf"&gt;test&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;checkout button works&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;goto&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/cart&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="c1"&gt;// Assert the state. Do not guess it with an if.&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;checkout&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getByRole&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;button&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Checkout&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;checkout&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toBeEnabled&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;checkout&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;click&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What AI agents get wrong:&lt;/strong&gt; they wrap clicks in &lt;code&gt;if (await locator.isVisible())&lt;/code&gt; to stop errors. That hides the real failure. A test that skips its own check still goes green.&lt;/p&gt;




&lt;h2&gt;
  
  
  9. Test what users see, not how it is built
&lt;/h2&gt;

&lt;p&gt;Test the behavior, not the internals. Check the visible result. Do not check a CSS class, a state variable, or a function name. Those change when you refactor (rewrite the code), even though the app still works.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;expect&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@playwright/test&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="nf"&gt;test&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;shows a success message after submit&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;goto&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/contact&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getByLabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Message&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;fill&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Hello&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getByRole&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;button&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Send&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;}).&lt;/span&gt;&lt;span class="nf"&gt;click&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="c1"&gt;// Check the user-facing result, not an internal class.&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getByText&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Thanks, we got your message&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="nf"&gt;toBeVisible&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What AI agents get wrong:&lt;/strong&gt; they assert on &lt;code&gt;class="is-active"&lt;/code&gt; or an exact HTML shape. The test breaks on every redesign, even when nothing real changed.&lt;/p&gt;




&lt;h2&gt;
  
  
  10. Define projects in your config
&lt;/h2&gt;

&lt;p&gt;A project (a named test setup) in &lt;code&gt;playwright.config.ts&lt;/code&gt; runs the same tests under different settings. Use projects to cover Chromium, Firefox, and WebKit (the three main browser engines). One config, full coverage.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// playwright.config.ts&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;defineConfig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;devices&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@playwright/test&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="k"&gt;default&lt;/span&gt; &lt;span class="nf"&gt;defineConfig&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;testDir&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;./tests&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;fullyParallel&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;retries&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;use&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;on-first-retry&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="na"&gt;projects&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;chromium&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;use&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="nx"&gt;devices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Desktop Chrome&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;firefox&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;use&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="nx"&gt;devices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Desktop Firefox&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;webkit&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;use&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="nx"&gt;devices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Desktop Safari&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What AI agents get wrong:&lt;/strong&gt; they hard-code one browser, or copy a config with no &lt;code&gt;projects&lt;/code&gt; array. The suite then tests Chrome only, and a Safari-only bug ships to users.&lt;/p&gt;




&lt;h2&gt;
  
  
  The human still owns the standard
&lt;/h2&gt;

&lt;p&gt;AI writes the first draft fast. That part is real, and it is useful. But the first draft copies patterns from old code on the internet. It adds hard waits. It clicks through the UI to seed data. It wraps fragile steps in &lt;code&gt;if&lt;/code&gt; blocks so the run stays green.&lt;/p&gt;

&lt;p&gt;A green suite that proves nothing is worse than no suite. It buys false trust.&lt;/p&gt;

&lt;p&gt;So the workflow is simple. Let the agent write the draft. Then a human reads it against these 10 rules and fixes what the agent got wrong. The agent moves fast. The human keeps the tests honest. That is the job of an AI QA Architect.&lt;/p&gt;

&lt;p&gt;Build the tests with AI. Then make them stable yourself.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Anton Gulin is the AI QA Architect — the first person to claim this title on LinkedIn. He builds AI-powered test automation systems where AI agents and human engineers collaborate on quality. Former Apple SDET (Apple.com / Apple Card pre-release testing). Find him at &lt;a href="https://anton.qa" rel="noopener noreferrer"&gt;anton.qa&lt;/a&gt; or on &lt;a href="https://linkedin.com/in/antongulin" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>playwright</category>
      <category>testing</category>
      <category>ai</category>
      <category>webdev</category>
    </item>
    <item>
      <title>I measure how fast 42 LLMs actually answer. Here's the honest method.</title>
      <dc:creator>Anton Gulin</dc:creator>
      <pubDate>Tue, 16 Jun 2026 05:06:22 +0000</pubDate>
      <link>https://dev.to/aiwithanton/i-measure-how-fast-42-llms-actually-answer-heres-the-honest-method-3gap</link>
      <guid>https://dev.to/aiwithanton/i-measure-how-fast-42-llms-actually-answer-heres-the-honest-method-3gap</guid>
      <description>&lt;p&gt;I test software for a living. So when a vendor calls an AI model "fast," I don't trust the word. I measure it.&lt;/p&gt;

&lt;p&gt;Most leaderboards rank how smart a model is. Almost none rank how fast it answers. You pick a model because it scored well, ship it, and then your users sit and wait.&lt;/p&gt;

&lt;p&gt;Speed is two different numbers. People mix them up constantly.&lt;/p&gt;

&lt;h2&gt;
  
  
  The two numbers
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Time to first token (TTFT).&lt;/strong&gt; The wait before the first word appears. You feel this every time a chatbot "thinks" before replying.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tokens per second (TPS).&lt;/strong&gt; How fast the model writes once it starts. A token is a chunk of a word.&lt;/p&gt;

&lt;p&gt;A model can be great at one and terrible at the other. You need both.&lt;/p&gt;

&lt;h2&gt;
  
  
  How I measure it
&lt;/h2&gt;

&lt;p&gt;I run an independent tracker called &lt;a href="https://ollamatps.com" rel="noopener noreferrer"&gt;ollamatps.com&lt;/a&gt;. It benchmarks 42 Ollama Cloud models. Here is the exact method, because a benchmark you cannot inspect is just a claim.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;One fixed prompt.&lt;/strong&gt; Every run asks the model to write a 400-word explanation of HTTP routing. Same prompt, every model, every time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fixed output cap.&lt;/strong&gt; &lt;code&gt;max_tokens&lt;/code&gt; is capped at 300. It never changes between runs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Every ~10 minutes.&lt;/strong&gt; Each model is re-tested continuously, not once.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TTFT&lt;/strong&gt; is measured from the moment the request is sent to the first non-empty content chunk. It includes network round-trip and prompt-processing time. That is the honest number, because it is the wait your users actually sit through.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TPS&lt;/strong&gt; is generation throughput only: &lt;code&gt;eval_count / (total_duration - time_to_first_token)&lt;/code&gt;. The startup wait is removed, so TPS measures pure writing speed.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Same prompt, same cap, same schedule. That is what makes two models comparable.&lt;/p&gt;
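
&lt;p&gt;A minimal sketch of the two formulas in TypeScript. The &lt;code&gt;Sample&lt;/code&gt; shape, its field names, and the millisecond timestamps are assumptions for illustration, not the tracker's real code.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Hypothetical record of one benchmark run. Field names are assumptions.
interface Sample {
  requestSentMs: number; // timestamp when the request was sent
  firstChunkMs: number;  // timestamp of the first non-empty content chunk
  finishedMs: number;    // timestamp when the stream ended
  evalCount: number;     // number of tokens the model generated
}

function measure(s: Sample): { ttftSeconds: number; tps: number } {
  // TTFT: request sent until first content chunk, network time included.
  const ttftSeconds = (s.firstChunkMs - s.requestSentMs) / 1000;
  const totalSeconds = (s.finishedMs - s.requestSentMs) / 1000;
  // TPS: tokens divided by generation time only, with the startup wait removed.
  const generationSeconds = totalSeconds - ttftSeconds;
  const tps = generationSeconds &gt; 0 ? s.evalCount / generationSeconds : 0;
  return { ttftSeconds, tps };
}

const sample = { requestSentMs: 0, firstChunkMs: 500, finishedMs: 2000, evalCount: 300 };
console.log(measure(sample)); // { ttftSeconds: 0.5, tps: 200 }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The sample waits 0.5 seconds, then writes 300 tokens in 1.5 seconds. Dividing by the full 2 seconds would report 150 tokens per second and hide the real writing speed.&lt;/p&gt;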

&lt;p&gt;Building this was a testing job, not a coding job. Retries on failure. A reliability score per model. A circuit breaker for models that keep failing. If you cannot trust the measurement, the number is noise. That part is the same work I do on any test system.&lt;/p&gt;

&lt;h2&gt;
  
  
  What surprised me
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Bigger is not faster.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The fastest model on the board is one of the smallest: a 30B model at over 200 tokens per second. The model literally named "ultra" is dead last, under 8 tokens per second.&lt;/p&gt;

&lt;p&gt;And the wait varies wildly. TTFT ranges from about 0.3 seconds to 23 seconds across the 42 models. Same cloud. Roughly 80x difference in how long you wait for the first word.&lt;/p&gt;

&lt;p&gt;If you picked your model on a benchmark score alone, you have no idea which of these you are getting.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why I rebuilt it
&lt;/h2&gt;

&lt;p&gt;The first version tracked fewer models and was less robust. I rebuilt the engine this month (v2) to be multi-provider and to test continuously. The live board updates every 10 minutes.&lt;/p&gt;

&lt;p&gt;Watch it run: &lt;strong&gt;&lt;a href="https://ollamatps.com" rel="noopener noreferrer"&gt;ollamatps.com&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I'm Anton. I build AI tools and measure how fast they really run. Former Apple engineer. I make &lt;a href="https://ollamatps.com" rel="noopener noreferrer"&gt;ollamatps.com&lt;/a&gt; and write about building and measuring AI at &lt;a href="https://anton.qa" rel="noopener noreferrer"&gt;anton.qa&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>llm</category>
      <category>ai</category>
      <category>benchmarking</category>
      <category>performance</category>
    </item>
    <item>
      <title>How to Score Your AI Test Agents: Offline Evaluation with Trajectories (2026)</title>
      <dc:creator>Anton Gulin</dc:creator>
      <pubDate>Sat, 13 Jun 2026 20:37:36 +0000</pubDate>
      <link>https://dev.to/aiwithanton/how-to-score-your-ai-test-agents-offline-evaluation-with-trajectories-2026-dil</link>
      <guid>https://dev.to/aiwithanton/how-to-score-your-ai-test-agents-offline-evaluation-with-trajectories-2026-dil</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdmsavzz8yegf78u3u4zj.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdmsavzz8yegf78u3u4zj.webp" alt=" " width="800" height="420"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AI test agent evaluation&lt;/strong&gt; is the practice of scoring the tests an AI agent writes, instead of trusting that they pass. You record the agent's run as a trajectory (a saved log of every step), replay it offline, and grade each step for correctness and relevance. Offline scoring needs no live API calls, so you can check agent quality on every pull request.&lt;/p&gt;

&lt;p&gt;An AI agent can write 200 tests before lunch. That feels like progress.&lt;/p&gt;

&lt;p&gt;Then a real bug ships, and not one of those tests caught it. The agent was confident, and it was wrong.&lt;/p&gt;

&lt;p&gt;This guide shows how to stop guessing and start scoring. Stagehand 3.5.0 made the method first-class on June 3, 2026, but the pattern works for any agent.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. "It passed" is not a score
&lt;/h2&gt;

&lt;p&gt;A green test suite tells you the tests ran. It does not tell you the tests were right.&lt;/p&gt;

&lt;p&gt;An AI agent makes three mistakes a human reviewer would catch:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It checks the wrong thing. The test passes, but it never asserts the real behavior.&lt;/li&gt;
&lt;li&gt;It writes flaky tests (tests that fail at random). They go green often enough to look fine.&lt;/li&gt;
&lt;li&gt;It tests a happy path and skips the edge case that actually breaks.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You cannot fix what you cannot measure. So the first job is a number, not a vibe.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. Record the run as a trajectory
&lt;/h2&gt;

&lt;p&gt;A trajectory is a saved recording of an agent's run. It captures each step: what the agent saw, what it decided, and what code it produced.&lt;/p&gt;

&lt;p&gt;You capture it once, during the agent's normal run.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Illustrative pattern — confirm the exact Stagehand 3.5 API before use.&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;trajectory&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;task&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;record&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;saveTrajectory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;trajectory&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;runs/checkout-flow.json&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The recording is the receipt. Now you can study the run after it finishes, as many times as you want.&lt;/p&gt;




&lt;h2&gt;
  
  
  3. Replay it offline
&lt;/h2&gt;

&lt;p&gt;Offline means you grade the saved run without calling the live model again. No new API cost. No flaky network. Same input every time.&lt;/p&gt;

&lt;p&gt;This matters for two reasons. It makes scoring cheap, so you can run it on every pull request. It makes scoring repeatable, so two engineers get the same result.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Replay the saved run and score it, with no live API calls.&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;run&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;loadTrajectory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;runs/checkout-flow.json&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;run&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;rubric&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  4. Score each step with evaluation types
&lt;/h2&gt;

&lt;p&gt;A single pass/fail hides too much. Grade the run on a few clear axes instead.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Correctness&lt;/strong&gt;: did the test assert the behavior the task asked for?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Relevance&lt;/strong&gt;: does each step move toward the goal, or wander?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stability&lt;/strong&gt;: would this test pass on a clean re-run, or is it flaky?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Coverage&lt;/strong&gt;: did the agent test the edge case, or only the happy path?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Stagehand 3.5.0 added evaluation types for exactly this kind of offline scoring. You define the rubric once and apply it to every saved run.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Illustrative rubric: the run and task fields are placeholders for your own trajectory format.&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;rubric&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;correctness&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;run&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;run&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;asserts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;some&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;a&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;target&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="nx"&gt;task&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;goal&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="na"&gt;relevance&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;   &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;run&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;run&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;steps&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;every&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;s&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;onTask&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="na"&gt;stability&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;   &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;run&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;run&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;reruns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;every&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;passed&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="na"&gt;coverage&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;run&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;run&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;asserts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;some&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;a&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;edgeCase&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A run that scores &lt;code&gt;correctness 7/10, relevance pass, flaky tests 0&lt;/code&gt; is a run you can talk about. "It passed" is not.&lt;/p&gt;
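
&lt;p&gt;To get a number like "7/10", you count how many saved runs pass each axis. Here is a self-contained sketch. The &lt;code&gt;Run&lt;/code&gt; shape and the &lt;code&gt;summarize&lt;/code&gt; helper are hypothetical names for illustration, not part of Stagehand.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Hypothetical shape: each saved run is reduced to one flag per axis.
type Run = {
  assertsGoal: boolean;
  stepsOnTask: boolean;
  rerunsPassed: boolean;
  coversEdgeCase: boolean;
};
type Rubric = Record&lt;string, (run: Run) =&gt; boolean&gt;;

const rubric: Rubric = {
  correctness: (run) =&gt; run.assertsGoal,
  relevance: (run) =&gt; run.stepsOnTask,
  stability: (run) =&gt; run.rerunsPassed,
  coverage: (run) =&gt; run.coversEdgeCase,
};

// Count how many runs pass each axis, so a result reads like "7/10".
function summarize(runs: Run[], checks: Rubric): Record&lt;string, string&gt; {
  const summary: Record&lt;string, string&gt; = {};
  for (const [axis, check] of Object.entries(checks)) {
    const passed = runs.filter(check).length;
    summary[axis] = `${passed}/${runs.length}`;
  }
  return summary;
}

const runs: Run[] = [
  { assertsGoal: true, stepsOnTask: true, rerunsPassed: true, coversEdgeCase: false },
  { assertsGoal: false, stepsOnTask: true, rerunsPassed: true, coversEdgeCase: false },
];
console.log(summarize(runs, rubric));
// { correctness: '1/2', relevance: '2/2', stability: '2/2', coverage: '0/2' }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Both sample runs look stable and on task, yet neither covers an edge case. A single pass/fail would hide that.&lt;/p&gt;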




&lt;h2&gt;
  
  
  5. Wire the score into CI
&lt;/h2&gt;

&lt;p&gt;A score you read once and forget changes nothing. Turn it into a gate.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
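&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# CI step: fail the build if the agent's tests score too low.&lt;/span&gt;
&lt;span class="c1"&gt;# The command is illustrative; point it at your own scoring script.&lt;/span&gt;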
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;npx evaluate runs/ --min-correctness 0.8 --max-flaky &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now the agent earns trust the same way a junior engineer does. It ships work, the work gets graded, and only graded work reaches production.&lt;/p&gt;
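
&lt;p&gt;The gate itself can be a few lines. This is a sketch, not a real CLI. The &lt;code&gt;Scores&lt;/code&gt; and &lt;code&gt;Limits&lt;/code&gt; shapes and the &lt;code&gt;gate&lt;/code&gt; function are hypothetical, and the limits mirror the flags in the CI step above.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Hypothetical shapes. Correctness is a fraction between 0 and 1.
interface Scores { correctness: number; flakyTests: number; }
interface Limits { minCorrectness: number; maxFlaky: number; }

// Return one message per broken limit. An empty list means the gate passes.
function gate(scores: Scores, limits: Limits): string[] {
  const failures: string[] = [];
  if (scores.correctness &lt; limits.minCorrectness) {
    failures.push(`correctness ${scores.correctness} is below ${limits.minCorrectness}`);
  }
  if (scores.flakyTests &gt; limits.maxFlaky) {
    failures.push(`${scores.flakyTests} flaky tests, limit is ${limits.maxFlaky}`);
  }
  return failures;
}

const failures = gate(
  { correctness: 0.7, flakyTests: 0 },
  { minCorrectness: 0.8, maxFlaky: 0 },
);
console.log(failures); // one entry: correctness is below the limit
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;In a real CI script, print the messages and exit with a non-zero code when the list is not empty. Returning reasons instead of a bare boolean tells the reviewer which limit broke.&lt;/p&gt;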




&lt;h2&gt;
  
  
  6. Where this sits: the Evidence Layer
&lt;/h2&gt;

&lt;p&gt;I design AI test systems on a 3-Layer System:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Orchestration&lt;/strong&gt;: decides what to test.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Execution&lt;/strong&gt;: runs the tests, where the agent writes code.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Evidence&lt;/strong&gt;: proves the work is right.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most teams build the first two layers and stop. They let the agent write tests and assume the green check means quality.&lt;/p&gt;

&lt;p&gt;Offline evaluation is the Evidence Layer. It is the difference between an agent you hope works and an agent you can prove works.&lt;/p&gt;




&lt;h2&gt;
  
  
  The 5-line checklist
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Record every agent run as a trajectory.&lt;/li&gt;
&lt;li&gt;Replay it offline, with no live API calls.&lt;/li&gt;
&lt;li&gt;Score it on correctness, relevance, stability, and coverage.&lt;/li&gt;
&lt;li&gt;Gate your build on the score.&lt;/li&gt;
&lt;li&gt;Keep the trajectory, so you can re-grade when the rubric improves.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Build the agent. Then prove it works.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Anton Gulin is the AI QA Architect — the first person to claim this title on LinkedIn. He builds AI-powered test automation systems where AI agents and human engineers collaborate on quality. Former Apple SDET (Apple.com / Apple Card pre-release testing). Find him at &lt;a href="https://anton.qa" rel="noopener noreferrer"&gt;anton.qa&lt;/a&gt; or on &lt;a href="https://linkedin.com/in/antongulin" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>testing</category>
      <category>ai</category>
      <category>automation</category>
      <category>playwright</category>
    </item>
    <item>
      <title>Playwright Codegen: The Complete Guide (2026)</title>
      <dc:creator>Anton Gulin</dc:creator>
      <pubDate>Mon, 08 Jun 2026 05:38:38 +0000</pubDate>
      <link>https://dev.to/aiwithanton/playwright-codegen-the-complete-guide-2026-3kf0</link>
      <guid>https://dev.to/aiwithanton/playwright-codegen-the-complete-guide-2026-3kf0</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhdaabap7z1zkbj7w3zkc.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhdaabap7z1zkbj7w3zkc.webp" alt=" " width="800" height="420"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Playwright Codegen&lt;/strong&gt; is a built-in CLI (command-line interface) tool that generates test scripts as you interact with a browser. It records your actions, such as clicks, form inputs, and page navigation, and translates them into TypeScript or JavaScript test code.&lt;/p&gt;

&lt;p&gt;For most developers, writing test locators (how tests find buttons) takes up 60% of test-writing time.&lt;/p&gt;

&lt;p&gt;Codegen cuts most of that, because it writes the locators for you as you click.&lt;/p&gt;

&lt;p&gt;Here is how to use it, and how to scale it from a simple draft tool to a full production architecture.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. How to Launch Playwright Codegen
&lt;/h2&gt;

&lt;p&gt;To start the generator, run this command in your terminal:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx playwright codegen demo.playwright.dev/todomvc
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This launch command opens two windows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;A browser window&lt;/strong&gt;: This is where you click, type, and record your test steps.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Playwright Inspector&lt;/strong&gt;: This is a tool window that displays the generated code in real time.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;As you click on the page, the tool writes the test code automatically.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. Capturing Assertions
&lt;/h2&gt;

&lt;p&gt;A test without assertions (checks to verify behavior) is just a script. &lt;/p&gt;

&lt;p&gt;Codegen allows you to record checks directly from the UI. &lt;/p&gt;

&lt;p&gt;In the browser window, click one of the assertion buttons in the toolbar, then click the element you want to check:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Assert Visibility&lt;/strong&gt;: Verifies that an element is visible on the screen.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Assert Text&lt;/strong&gt;: Verifies that an element contains specific text.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Assert Value&lt;/strong&gt;: Verifies the input value of a form field.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This generates standard assertions like &lt;code&gt;await expect(locator).toBeVisible()&lt;/code&gt; instantly.&lt;/p&gt;




&lt;h2&gt;
  
  
  3. Playwright Codegen Best Practices
&lt;/h2&gt;

&lt;p&gt;Generated code is a draft. &lt;/p&gt;

&lt;p&gt;To make it production-ready, apply these three rules:&lt;/p&gt;

&lt;h3&gt;
  
  
  Avoid Hardcoded Wait Times
&lt;/h3&gt;

&lt;p&gt;Codegen does not generate sleep statements. &lt;/p&gt;

&lt;p&gt;Playwright uses auto-waiting (waiting for elements to be ready). &lt;/p&gt;

&lt;p&gt;Keep it that way. &lt;/p&gt;

&lt;p&gt;Do not add manual timeouts.&lt;/p&gt;

&lt;h3&gt;
  
  
  Use Semantic Locators
&lt;/h3&gt;

&lt;p&gt;Playwright prefers locators based on what users see: roles, labels, and visible text.&lt;/p&gt;

&lt;p&gt;Codegen generates these by default:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Good: accessible locator&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getByRole&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;button&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Submit&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;}).&lt;/span&gt;&lt;span class="nf"&gt;click&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="c1"&gt;// Bad: fragile CSS selector&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;locator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;#submit-btn-2&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;click&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Keep the accessible versions. &lt;/p&gt;

&lt;p&gt;They prevent flaky tests (tests that fail randomly).&lt;/p&gt;

&lt;h3&gt;
  
  
  Isolate Your Auth State
&lt;/h3&gt;

&lt;p&gt;Do not record login steps in every single test. &lt;/p&gt;

&lt;p&gt;Use Codegen to save your authentication state once. &lt;/p&gt;

&lt;p&gt;Run Codegen with this save option:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx playwright codegen &lt;span class="nt"&gt;--save-storage&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;auth.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, configure your tests to load &lt;code&gt;auth.json&lt;/code&gt; before running. &lt;/p&gt;

&lt;p&gt;This skips the login flow in every test, which saves real run time in CI (continuous integration servers).&lt;/p&gt;
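
&lt;p&gt;One way to load the saved state is the &lt;code&gt;storageState&lt;/code&gt; option in your config. This sketch assumes &lt;code&gt;auth.json&lt;/code&gt; sits in the project root and that the saved session has not expired.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// playwright.config.ts
import { defineConfig } from '@playwright/test';

export default defineConfig({
  use: {
    // Start every test with the cookies and local storage saved by codegen.
    // Assumes auth.json is in the project root.
    storageState: 'auth.json',
  },
});
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Keep &lt;code&gt;auth.json&lt;/code&gt; out of version control. It holds live session cookies.&lt;/p&gt;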




&lt;h2&gt;
  
  
  4. The Architectural View: From Draft to System
&lt;/h2&gt;

&lt;p&gt;As an AI QA Architect, I view Codegen as a helper. &lt;/p&gt;

&lt;p&gt;It is the entry point of the &lt;strong&gt;Execution Layer&lt;/strong&gt; in the 3-Layer System:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Orchestration&lt;/strong&gt;: Decides when to run tests.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Execution&lt;/strong&gt;: The code that runs (where Codegen helps).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Evidence&lt;/strong&gt;: Gathers logs and traces.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Codegen writes the initial code. &lt;/p&gt;

&lt;p&gt;But it cannot design the framework. &lt;/p&gt;

&lt;p&gt;It cannot handle API mocks (fake servers). &lt;/p&gt;

&lt;p&gt;It cannot govern agentic testing systems (where AI agents write and heal tests).&lt;/p&gt;

&lt;p&gt;Use Codegen to build the first block. &lt;/p&gt;

&lt;p&gt;Then build the architecture around it.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Anton Gulin is the AI QA Architect — the first person to claim this title on LinkedIn. He builds AI-powered test automation systems where AI agents and human engineers collaborate on quality. Former Apple SDET (Apple.com / Apple Card pre-release testing). Find him at &lt;a href="https://anton.qa" rel="noopener noreferrer"&gt;anton.qa&lt;/a&gt; or on &lt;a href="https://linkedin.com/in/antongulin" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>playwright</category>
      <category>testing</category>
      <category>automation</category>
      <category>devops</category>
    </item>
    <item>
      <title>Playwright vs Cypress vs Selenium in 2026</title>
      <dc:creator>Anton Gulin</dc:creator>
      <pubDate>Sun, 24 May 2026 06:59:50 +0000</pubDate>
      <link>https://dev.to/aiwithanton/playwright-vs-cypress-vs-selenium-in-2026-35fg</link>
      <guid>https://dev.to/aiwithanton/playwright-vs-cypress-vs-selenium-in-2026-35fg</guid>
      <description>&lt;p&gt;Playwright is the best default for new browser test automation in 2026. It gives cross-browser runs, parallel CI, API checks, and AI-agent evidence in one tool. Cypress still fits JavaScript-heavy teams that want fast local feedback. Selenium still fits legacy grids and strict browser labs.&lt;/p&gt;

&lt;p&gt;That is the short answer.&lt;/p&gt;

&lt;p&gt;The better answer depends on your system.&lt;/p&gt;

&lt;p&gt;If AI agents will read your failures, the question changes.&lt;br&gt;
You are no longer picking only a test runner.&lt;br&gt;
You are picking the evidence layer.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Changed In 2026
&lt;/h2&gt;

&lt;p&gt;Most comparison posts still ask old questions.&lt;/p&gt;

&lt;p&gt;They ask which tool has cleaner syntax.&lt;br&gt;
They ask which tool is easier to learn.&lt;br&gt;
They ask which tool starts faster.&lt;/p&gt;

&lt;p&gt;Those questions still matter.&lt;br&gt;
They are no longer enough.&lt;/p&gt;

&lt;p&gt;AI agents need proof they can inspect.&lt;br&gt;
Proof means screenshots, traces, browser state, and readable failures.&lt;/p&gt;

&lt;p&gt;The human reviewer still owns the decision.&lt;br&gt;
The agent only helps when the evidence is clear.&lt;/p&gt;

&lt;p&gt;That is why Playwright now has the default seat.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pick Playwright When Evidence Matters
&lt;/h2&gt;

&lt;p&gt;Pick Playwright for new end-to-end test systems.&lt;br&gt;
End-to-end means checks that drive the whole app through a real browser.&lt;/p&gt;

&lt;p&gt;Playwright gives you one model across Chromium, Firefox, and WebKit.&lt;br&gt;
Those are browser engines.&lt;br&gt;
An engine is the part of a browser that renders and runs pages.&lt;/p&gt;

&lt;p&gt;That matters for real product risk.&lt;/p&gt;

&lt;p&gt;It also matters for AI-agent workflows.&lt;br&gt;
AI agents are tools that act on their own.&lt;/p&gt;

&lt;p&gt;Playwright now documents Test Agents.&lt;br&gt;
Those agents plan, generate, and repair tests.&lt;/p&gt;

&lt;p&gt;The tool also has strong receipts.&lt;br&gt;
Traces show what happened.&lt;br&gt;
Screenshots show where it happened.&lt;br&gt;
Reports help humans review the failure.&lt;/p&gt;

&lt;p&gt;Use Playwright when you need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;browser coverage across engines&lt;/li&gt;
&lt;li&gt;parallel CI at scale&lt;/li&gt;
&lt;li&gt;trace-based debugging&lt;/li&gt;
&lt;li&gt;API and UI checks together&lt;/li&gt;
&lt;li&gt;AI-agent review paths&lt;/li&gt;
&lt;li&gt;long-term framework ownership&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;CI means continuous integration: automated test runs on a build server.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pick Cypress When Local Feedback Matters Most
&lt;/h2&gt;

&lt;p&gt;Cypress is still useful.&lt;/p&gt;

&lt;p&gt;That sentence matters.&lt;br&gt;
Tool debates get lazy when one side becomes a villain.&lt;/p&gt;

&lt;p&gt;Cypress can be a strong fit for frontend teams.&lt;br&gt;
It works well when developers want quick local feedback.&lt;br&gt;
It also fits teams already built around Cypress Cloud.&lt;/p&gt;

&lt;p&gt;Cypress documents cross-browser testing.&lt;br&gt;
It also documents parallel runs through Cypress Cloud.&lt;/p&gt;

&lt;p&gt;That can be enough for many product teams.&lt;/p&gt;

&lt;p&gt;Use Cypress when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;your app is JavaScript-first&lt;/li&gt;
&lt;li&gt;developers own most browser checks&lt;/li&gt;
&lt;li&gt;fast local debugging is the main goal&lt;/li&gt;
&lt;li&gt;Cypress Cloud is already approved&lt;/li&gt;
&lt;li&gt;browser coverage needs are narrow&lt;/li&gt;
&lt;li&gt;the suite is not agent-driven yet&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The risk appears later.&lt;/p&gt;

&lt;p&gt;As the suite grows, evidence gets more important.&lt;br&gt;
That is where Playwright usually wins.&lt;/p&gt;

&lt;h2&gt;
  
  
  Keep Selenium When Migration Risk Is Higher
&lt;/h2&gt;

&lt;p&gt;Selenium is not dead.&lt;/p&gt;

&lt;p&gt;It is still the right answer for some teams.&lt;/p&gt;

&lt;p&gt;Keep Selenium when a grid already exists.&lt;br&gt;
Keep it when policy requires it.&lt;br&gt;
Keep it when migration risk is higher than tool value.&lt;/p&gt;

&lt;p&gt;But do not choose Selenium by default for new AI QA work.&lt;/p&gt;

&lt;p&gt;You will spend too much time rebuilding the evidence layer.&lt;br&gt;
You will also carry older suite habits forward.&lt;/p&gt;

&lt;p&gt;Selenium can be stable.&lt;br&gt;
The question is whether it helps the next system.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Decision Table
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Need&lt;/th&gt;
&lt;th&gt;Best default&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;New AI-agent test system&lt;/td&gt;
&lt;td&gt;Playwright&lt;/td&gt;
&lt;td&gt;Best evidence path&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Broad browser engine coverage&lt;/td&gt;
&lt;td&gt;Playwright&lt;/td&gt;
&lt;td&gt;One model across major engines&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fast frontend feedback&lt;/td&gt;
&lt;td&gt;Cypress&lt;/td&gt;
&lt;td&gt;Strong local developer loop&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Existing Cypress investment&lt;/td&gt;
&lt;td&gt;Cypress&lt;/td&gt;
&lt;td&gt;Migration may not pay yet&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Legacy grid policy&lt;/td&gt;
&lt;td&gt;Selenium&lt;/td&gt;
&lt;td&gt;Use what the organization can run&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Greenfield QA architecture&lt;/td&gt;
&lt;td&gt;Playwright&lt;/td&gt;
&lt;td&gt;Better long-term receipts&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  My Four-Question Test
&lt;/h2&gt;

&lt;p&gt;I use four questions before I choose.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Who reads the failure first?&lt;/li&gt;
&lt;li&gt;What proof do they need?&lt;/li&gt;
&lt;li&gt;Where will the suite run?&lt;/li&gt;
&lt;li&gt;What happens when the UI changes?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If AI agents read the failure first, I lean Playwright.&lt;/p&gt;

&lt;p&gt;If one frontend team reads it, Cypress can fit.&lt;/p&gt;

&lt;p&gt;If legacy policy decides where the suite runs, keep Selenium.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Practical Recommendation
&lt;/h2&gt;

&lt;p&gt;Start new projects with Playwright.&lt;/p&gt;

&lt;p&gt;Keep Cypress when it already serves the team.&lt;/p&gt;

&lt;p&gt;Keep Selenium when migration would create more risk.&lt;/p&gt;

&lt;p&gt;Then apply the same rule across all three:&lt;/p&gt;

&lt;p&gt;Every failed test needs a receipt.&lt;/p&gt;

&lt;p&gt;That receipt should show:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;what action ran&lt;/li&gt;
&lt;li&gt;what page state existed&lt;/li&gt;
&lt;li&gt;what assertion failed&lt;/li&gt;
&lt;li&gt;what changed before failure&lt;/li&gt;
&lt;li&gt;what a human must decide&lt;/li&gt;
&lt;/ul&gt;
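
&lt;p&gt;One way to make the receipt concrete is to treat it as a record with one field per bullet. This is a minimal sketch, not tied to any of the three tools. The class name, field names, and sample values are all illustrative assumptions.&lt;/p&gt;

```python
from dataclasses import dataclass, fields


@dataclass
class FailureReceipt:
    """One record per failed test; field names are illustrative."""
    action: str                 # what action ran
    page_state: str             # what page state existed (e.g. a trace path)
    failed_assertion: str       # what assertion failed
    change_before_failure: str  # what changed before failure
    human_decision: str         # what a human must decide


def missing_fields(receipt):
    """Return the names of receipt fields that were left empty."""
    return [f.name for f in fields(receipt) if not getattr(receipt, f.name).strip()]


# Hypothetical receipt for a failed checkout test
receipt = FailureReceipt(
    action="click Submit on checkout form",
    page_state="traces/checkout-failure.zip",
    failed_assertion="expected 'Order placed' to be visible",
    change_before_failure="",
    human_decision="product bug or stale selector?",
)
print(missing_fields(receipt))
```

&lt;p&gt;A check like this could run after each failed test and flag receipts that are incomplete before a human or an agent reviews them.&lt;/p&gt;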

&lt;p&gt;The tool only matters because the evidence matters.&lt;/p&gt;

&lt;p&gt;In 2026, that is the real comparison.&lt;/p&gt;

&lt;h2&gt;
  
  
  Author Bio
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Anton Gulin is the AI QA Architect — the first person to claim this title on LinkedIn. He builds AI-powered test automation systems where AI agents and human engineers collaborate on quality. Former Apple SDET (Apple.com / Apple Card pre-release testing). Find him at &lt;a href="https://anton.qa" rel="noopener noreferrer"&gt;anton.qa&lt;/a&gt; or on &lt;a href="https://linkedin.com/in/antongulin" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>automation</category>
      <category>testing</category>
      <category>tooling</category>
    </item>
    <item>
      <title>Playwright v1.60 Turns Test Failures Into Evidence</title>
      <dc:creator>Anton Gulin</dc:creator>
      <pubDate>Mon, 18 May 2026 04:54:00 +0000</pubDate>
      <link>https://dev.to/aiwithanton/playwright-v160-turns-test-failures-into-evidence-1ban</link>
      <guid>https://dev.to/aiwithanton/playwright-v160-turns-test-failures-into-evidence-1ban</guid>
      <description>&lt;p&gt;Playwright v1.60 makes failure evidence easier to capture during the run.&lt;/p&gt;

&lt;p&gt;The main change is scoped HAR recording.&lt;/p&gt;

&lt;p&gt;HAR means HTTP Archive, a file that logs network requests.&lt;/p&gt;

&lt;p&gt;It shows what the browser sent and received.&lt;/p&gt;

&lt;p&gt;The release also adds file drops, ARIA boxes, and hard test aborts.&lt;/p&gt;

&lt;p&gt;ARIA means Accessible Rich Internet Applications, the roles and labels that make up a page's accessibility map.&lt;/p&gt;

&lt;p&gt;Together, these changes help CI failures explain themselves.&lt;/p&gt;

&lt;p&gt;CI means continuous integration, the automated build and test server.&lt;/p&gt;

&lt;h2&gt;
  
  
  The practical update
&lt;/h2&gt;

&lt;p&gt;Use &lt;code&gt;context.tracing.startHar()&lt;/code&gt; when network failures waste review time.&lt;/p&gt;

&lt;p&gt;It records a HAR file inside Playwright tracing.&lt;/p&gt;

&lt;p&gt;Tracing means run evidence capture.&lt;/p&gt;

&lt;p&gt;Use &lt;code&gt;locator.drop()&lt;/code&gt; when upload tests use custom events.&lt;/p&gt;

&lt;p&gt;Drop API means file drop simulation.&lt;/p&gt;

&lt;p&gt;Use &lt;code&gt;page.ariaSnapshot({ boxes: true })&lt;/code&gt; when AI tools inspect pages.&lt;/p&gt;

&lt;p&gt;Boxes mean element positions.&lt;/p&gt;

&lt;p&gt;Use &lt;code&gt;test.abort()&lt;/code&gt; when a fixture finds unsafe state.&lt;/p&gt;

&lt;p&gt;Fixtures mean shared test setup.&lt;/p&gt;

&lt;h2&gt;
  
  
  Code example
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;expect&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@playwright/test&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="nf"&gt;test&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;upload records network evidence&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;context&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;using&lt;/span&gt; &lt;span class="nx"&gt;har&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;tracing&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;startHar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;upload.har&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;embed&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;minimal&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;urlFilter&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sr"&gt;/&lt;/span&gt;&lt;span class="se"&gt;\/&lt;/span&gt;&lt;span class="sr"&gt;api&lt;/span&gt;&lt;span class="se"&gt;\/&lt;/span&gt;&lt;span class="sr"&gt;upload/&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;newPage&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;goto&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/upload&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;locator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;#dropzone&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;drop&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;files&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;note.txt&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;mimeType&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;text/plain&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;buffer&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Buffer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;from&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;hello&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getByText&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Upload complete&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="nf"&gt;toBeVisible&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The HAR starts before the page opens.&lt;/p&gt;

&lt;p&gt;The drop step sends an in-memory file.&lt;/p&gt;

&lt;p&gt;When the test scope ends, Playwright finalizes the HAR.&lt;/p&gt;

&lt;h2&gt;
  
  
  The rule
&lt;/h2&gt;

&lt;p&gt;Do not treat this release as a feature list.&lt;/p&gt;

&lt;p&gt;Treat it as an evidence upgrade.&lt;/p&gt;

&lt;p&gt;Better tests do not just pass or fail.&lt;/p&gt;

&lt;p&gt;They explain what happened.&lt;/p&gt;

&lt;p&gt;Read the canonical version:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.anton.qa/blog/posts/playwright-v1-60-evidence-first-testing" rel="noopener noreferrer"&gt;https://www.anton.qa/blog/posts/playwright-v1-60-evidence-first-testing&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Anton Gulin is the AI QA Architect — the first person to claim this title on LinkedIn. He builds AI-powered test automation systems where AI agents and human engineers collaborate on quality. Former Apple SDET (Apple.com / Apple Card pre-release testing). Find him at &lt;a href="https://anton.qa" rel="noopener noreferrer"&gt;anton.qa&lt;/a&gt; or on &lt;a href="https://linkedin.com/in/antongulin" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>automation</category>
      <category>news</category>
      <category>testing</category>
      <category>tooling</category>
    </item>
    <item>
      <title>AI Test Automation Architecture: The 3-Layer System</title>
      <dc:creator>Anton Gulin</dc:creator>
      <pubDate>Sun, 17 May 2026 23:53:47 +0000</pubDate>
      <link>https://dev.to/aiwithanton/ai-test-automation-architecture-the-3-layer-system-2078</link>
      <guid>https://dev.to/aiwithanton/ai-test-automation-architecture-the-3-layer-system-2078</guid>
      <description>&lt;p&gt;AI test automation architecture is the system that tells AI what to test.&lt;/p&gt;

&lt;p&gt;It also defines how to run tests and prove the result.&lt;/p&gt;

&lt;p&gt;I split it into three layers: orchestration, execution, and evidence.&lt;/p&gt;

&lt;p&gt;Without all three, AI testing becomes prompt output with no production gate.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why tool lists fail
&lt;/h2&gt;

&lt;p&gt;Most AI testing content starts with tools.&lt;/p&gt;

&lt;p&gt;That is backwards.&lt;/p&gt;

&lt;p&gt;AI means software that predicts.&lt;/p&gt;

&lt;p&gt;Predictions can help QA teams move faster.&lt;/p&gt;

&lt;p&gt;But predictions do not prove quality.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 3-layer model
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Plain meaning&lt;/th&gt;
&lt;th&gt;Main question&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Orchestration&lt;/td&gt;
&lt;td&gt;test control plan&lt;/td&gt;
&lt;td&gt;What risk should this cover?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Execution&lt;/td&gt;
&lt;td&gt;actual test run&lt;/td&gt;
&lt;td&gt;Did it run in the real pipeline?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Evidence&lt;/td&gt;
&lt;td&gt;proof from runs&lt;/td&gt;
&lt;td&gt;Can a human review it?&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  The practical gate
&lt;/h2&gt;

&lt;p&gt;Use this before AI-generated tests ship:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Gate&lt;/th&gt;
&lt;th&gt;Pass condition&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Scope&lt;/td&gt;
&lt;td&gt;The test maps to one named risk&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data&lt;/td&gt;
&lt;td&gt;Test data setup is explicit&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;State&lt;/td&gt;
&lt;td&gt;Browser state is controlled&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Run&lt;/td&gt;
&lt;td&gt;The test passes in CI&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Evidence&lt;/td&gt;
&lt;td&gt;Trace or equivalent proof exists&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Review&lt;/td&gt;
&lt;td&gt;A human can explain the failure mode&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
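
&lt;p&gt;The gate can be sketched as a simple all-or-nothing check. The gate names come from the table. The function names and the boolean inputs are hypothetical, and your own pipeline or reviewer would supply the real values.&lt;/p&gt;

```python
# Gate names follow the table; results are supplied by your own pipeline.
GATES = ("scope", "data", "state", "run", "evidence", "review")


def failed_gates(results):
    """Return gates that are missing or not passed, in table order."""
    return [gate for gate in GATES if not results.get(gate, False)]


def may_ship(results):
    """An AI-generated test ships only when every gate passes."""
    return not failed_gates(results)


# Hypothetical results for one generated test
candidate = {
    "scope": True,
    "data": True,
    "state": True,
    "run": True,
    "evidence": False,  # no trace was attached
    "review": True,
}
print(failed_gates(candidate))
print(may_ship(candidate))
```

&lt;p&gt;A gate that is absent from the results counts as failed, so a test cannot ship by skipping a check.&lt;/p&gt;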

&lt;p&gt;CI means continuous integration, the automated build and test server.&lt;/p&gt;

&lt;p&gt;MCP means Model Context Protocol, a standard for connecting AI agents to tools.&lt;/p&gt;

&lt;p&gt;Playwright is a browser test tool.&lt;/p&gt;

&lt;p&gt;Together, they can help AI agents run useful tests.&lt;/p&gt;

&lt;p&gt;But the architecture must prove each run.&lt;/p&gt;

&lt;h2&gt;
  
  
  The rule
&lt;/h2&gt;

&lt;p&gt;Never ask AI to expand test coverage first.&lt;/p&gt;

&lt;p&gt;Build the proof system before that.&lt;/p&gt;

&lt;p&gt;Generation is cheap.&lt;/p&gt;

&lt;p&gt;Evidence is the architecture.&lt;/p&gt;

&lt;p&gt;Read the canonical version:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.anton.qa/blog/posts/ai-test-automation-architecture-3-layer-system" rel="noopener noreferrer"&gt;https://www.anton.qa/blog/posts/ai-test-automation-architecture-3-layer-system&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Anton Gulin is the AI QA Architect — the first person to claim this title on LinkedIn. He builds AI-powered test automation systems where AI agents and human engineers collaborate on quality. Former Apple SDET (Apple.com / Apple Card pre-release testing). Find him at &lt;a href="https://anton.qa" rel="noopener noreferrer"&gt;anton.qa&lt;/a&gt; or on &lt;a href="https://linkedin.com/in/antongulin" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>architecture</category>
      <category>automation</category>
      <category>testing</category>
    </item>
    <item>
      <title>How to Test MCP Servers Before They Break Your CI</title>
      <dc:creator>Anton Gulin</dc:creator>
      <pubDate>Mon, 11 May 2026 18:15:50 +0000</pubDate>
      <link>https://dev.to/aiwithanton/how-to-test-mcp-servers-before-they-break-your-ci-p7b</link>
      <guid>https://dev.to/aiwithanton/how-to-test-mcp-servers-before-they-break-your-ci-p7b</guid>
      <description>&lt;p&gt;Most teams install an MCP server and hope it works.&lt;/p&gt;

&lt;p&gt;That is how you get 3 AM pages.&lt;/p&gt;

&lt;p&gt;An MCP server is a bridge between AI agents and your tools. It can crash, leak data, or silently return garbage. If your AI agent relies on it, one failure can break your whole pipeline.&lt;/p&gt;

&lt;p&gt;MCP means Model Context Protocol, a standard for connecting AI agents to tools.&lt;/p&gt;

&lt;p&gt;Do not only test startup. Test behavior and permissions too.&lt;/p&gt;

&lt;p&gt;This post is the checklist I run on every MCP server before it touches production.&lt;/p&gt;

&lt;h2&gt;
  
  
  The three-layer test stack
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;What it catches&lt;/th&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Discovery&lt;/td&gt;
&lt;td&gt;Missing tools, broken metadata&lt;/td&gt;
&lt;td&gt;MCP Inspector&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Behavior&lt;/td&gt;
&lt;td&gt;Silent failures, wrong output&lt;/td&gt;
&lt;td&gt;pytest smoke tests&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Security&lt;/td&gt;
&lt;td&gt;Over-permissions, data leaks&lt;/td&gt;
&lt;td&gt;Permission audit&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Layer 1: Discovery with MCP Inspector
&lt;/h2&gt;

&lt;p&gt;MCP Inspector is the official debugging tool. Start it with:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx @modelcontextprotocol/inspector node dist/server.js
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check three things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Does the server start without errors?&lt;/li&gt;
&lt;li&gt;Does it list the tools it promises?&lt;/li&gt;
&lt;li&gt;Does a sample request return the right shape?&lt;/li&gt;
&lt;/ol&gt;
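
&lt;p&gt;The second check can be partly automated. This is a minimal sketch, assuming you have already captured the server's JSON-RPC &lt;code&gt;tools/list&lt;/code&gt; response as a dict. The function name, the promised tool names, and the sample response are illustrative. Requiring a description on every tool is a stricter rule chosen for this example.&lt;/p&gt;

```python
def check_tools(response, promised):
    """Compare a JSON-RPC tools/list response against the promised tool names."""
    tools = response.get("result", {}).get("tools", [])
    listed = {tool.get("name") for tool in tools}
    missing = sorted(set(promised) - listed)
    # Flag tools that lack a description or an input schema
    incomplete = sorted(
        tool.get("name", "?")
        for tool in tools
        if not tool.get("description") or "inputSchema" not in tool
    )
    return {"missing": missing, "incomplete": incomplete}


# Hypothetical response captured from a server under test
sample = {
    "jsonrpc": "2.0",
    "id": 2,
    "result": {
        "tools": [
            {"name": "read_file", "description": "Read a file", "inputSchema": {"type": "object"}},
            {"name": "list_directory", "description": ""},
        ]
    },
}
report = check_tools(sample, ["read_file", "write_file", "list_directory"])
print(report)
```

&lt;p&gt;An empty report means the server lists everything it promises and each tool carries usable metadata.&lt;/p&gt;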

&lt;h2&gt;
  
  
  Layer 2: Behavior with pytest
&lt;/h2&gt;

&lt;p&gt;Here is a minimal smoke test. It checks that initialization returns valid JSON-RPC:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;subprocess&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_mcp_server_responds&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;proc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;subprocess&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Popen&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;npx&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-y&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;@modelcontextprotocol/server-filesystem&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/tmp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;stdin&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;subprocess&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;PIPE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;stdout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;subprocess&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;PIPE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;proc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stdin&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;jsonrpc&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2.0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;method&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;initialize&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;params&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;protocolVersion&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2024-11-05&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;capabilities&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:{},&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;clientInfo&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;test&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;version&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1.0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}}&lt;/span&gt;
        &lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;proc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stdin&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;flush&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;proc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stdout&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;readline&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
        &lt;span class="c1"&gt;# A valid reply echoes the request id and carries a result, not an error&lt;/span&gt;
        &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;jsonrpc&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2.0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
        &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;result&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;
    &lt;span class="k"&gt;finally&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Stop the server even when an assertion fails&lt;/span&gt;
        &lt;span class="n"&gt;proc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;terminate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Layer 3: Security with a permission audit
&lt;/h2&gt;

&lt;p&gt;Check three things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Does it need file system access (disk read/write)? Which paths?&lt;/li&gt;
&lt;li&gt;Does it make network calls (external requests)? To which hosts?&lt;/li&gt;
&lt;li&gt;Does it run shell commands (terminal execution)? Under which user?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If the answers are "all files, any host, root user," block it.&lt;/p&gt;
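
&lt;p&gt;The audit can be written down as a small check so the rule is applied the same way each time. This is a sketch under stated assumptions: the manifest shape, the key names, and the list of values treated as too broad are all hypothetical, and MCP itself does not define a manifest like this.&lt;/p&gt;

```python
# Values treated as too broad; this list is an assumption for illustration.
TOO_BROAD = {
    "paths": {"/", "*"},
    "hosts": {"*", "0.0.0.0/0"},
    "users": {"root"},
}


def audit(manifest):
    """Return a list of findings; an empty list means nothing was flagged."""
    findings = []
    for key, banned in TOO_BROAD.items():
        declared = set(manifest.get(key, []))
        hits = sorted(declared.intersection(banned))
        if hits:
            findings.append(f"{key}: too broad ({', '.join(hits)})")
    return findings


# Hypothetical permissions written down for one server during review
manifest = {"paths": ["/tmp/mcp"], "hosts": ["*"], "users": ["root"]}
for finding in audit(manifest):
    print(finding)
```

&lt;p&gt;Any finding is a reason to block the server until its permissions are narrowed.&lt;/p&gt;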

&lt;h2&gt;
  
  
  Where to find servers worth testing
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Official MCP Registry&lt;/strong&gt; — &lt;a href="https://registry.modelcontextprotocol.io" rel="noopener noreferrer"&gt;https://registry.modelcontextprotocol.io&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GitHub&lt;/strong&gt; — Search &lt;code&gt;modelcontextprotocol&lt;/code&gt; topics&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;npm / pip&lt;/strong&gt; — Search &lt;code&gt;@modelcontextprotocol/server-*&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Red flags: no commits in 6+ months, no tests, no README, permission requests that are too broad.&lt;/p&gt;

&lt;h2&gt;
  
  
  Verdict
&lt;/h2&gt;

&lt;p&gt;Testing MCP servers is not optional. An untested server is a bug waiting to become an incident.&lt;/p&gt;

&lt;p&gt;The three-layer stack catches common failure modes. Use MCP Inspector for manual checks, pytest for CI gates, and the permission audit as the last line of defense.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Anton Gulin is an AI QA Architect — the first person to claim this title on LinkedIn. He builds AI-powered test automation systems where AI agents and human engineers collaborate on quality. Former Apple SDET, now Lead Software Engineer in Test. &lt;a href="https://anton.qa" rel="noopener noreferrer"&gt;anton.qa&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>cicd</category>
      <category>mcp</category>
      <category>testing</category>
    </item>
    <item>
      <title>Playwright MCP v0.0.73: How to Configure Browser Paths via Environment Variables</title>
      <dc:creator>Anton Gulin</dc:creator>
      <pubDate>Mon, 04 May 2026 21:37:17 +0000</pubDate>
      <link>https://dev.to/aiwithanton/playwright-mcp-v0073-how-to-configure-browser-paths-via-environment-variables-3fap</link>
      <guid>https://dev.to/aiwithanton/playwright-mcp-v0073-how-to-configure-browser-paths-via-environment-variables-3fap</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;This post was originally published on &lt;a href="https://www.anton.qa/blog/posts/playwright-mcp-v0-0-73" rel="noopener noreferrer"&gt;anton.qa&lt;/a&gt;. The canonical version lives there.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Playwright MCP v0.0.73 fixes a critical gap where extension channels and executable paths could not be resolved from CI/CD environment variables.&lt;/p&gt;

&lt;p&gt;If you run Playwright MCP in Docker, Kubernetes, or ephemeral CI workers, this release removes a class of environment-specific debugging that typically consumes 15–30 minutes per incident.&lt;/p&gt;

&lt;h2&gt;
  
  
  What changed
&lt;/h2&gt;

&lt;p&gt;Two interconnected bug fixes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Extension &lt;code&gt;channel&lt;/code&gt; and &lt;code&gt;executablePath&lt;/code&gt; now resolve from CLI flags and environment variables&lt;/strong&gt; (&lt;a href="https://github.com/microsoft/playwright/pull/40572" rel="noopener noreferrer"&gt;#40572&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;--browser&lt;/code&gt; channel flags now propagate on &lt;code&gt;--extension&lt;/code&gt; paths&lt;/strong&gt; (&lt;a href="https://github.com/microsoft/playwright/pull/40567" rel="noopener noreferrer"&gt;#40567&lt;/a&gt;)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Combined, these changes mean your Playwright MCP setup can now be fully environment-driven.&lt;/p&gt;

&lt;h2&gt;
  
  
  The pattern
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;PLAYWRIGHT_BROWSERS_CHANNEL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;chromium
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;PLAYWRIGHT_EXTENSION_PATH&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/path/to/browser-extension
npx playwright &lt;span class="nb"&gt;test&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The resolution hierarchy is now:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;CLI flags (highest priority)&lt;/li&gt;
&lt;li&gt;Environment variables&lt;/li&gt;
&lt;li&gt;Config file defaults&lt;/li&gt;
&lt;li&gt;Built-in channel defaults&lt;/li&gt;
&lt;/ol&gt;
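
&lt;p&gt;To make the ordering concrete, here is a minimal sketch of that precedence as a small resolver. The &lt;code&gt;BrowserSettings&lt;/code&gt; type and the &lt;code&gt;resolveBrowserSettings&lt;/code&gt; function are hypothetical names used for illustration only; they are not Playwright MCP's internal code, and the sample values are made up.&lt;/p&gt;

```typescript
// Hypothetical shape for one source of browser settings (illustration only,
// not Playwright MCP's real types).
type BrowserSettings = {
  channel?: string;
  executablePath?: string;
};

// Return the first defined value, scanning from highest to lowest priority.
function firstDefined(values: (string | undefined)[]): string | undefined {
  for (const value of values) {
    if (value !== undefined) return value;
  }
  return undefined;
}

// Merge the four layers in the order listed above: CLI, env, config, defaults.
function resolveBrowserSettings(
  cli: BrowserSettings,
  env: BrowserSettings,
  config: BrowserSettings,
  defaults: BrowserSettings,
): BrowserSettings {
  const layers = [cli, env, config, defaults];
  return {
    channel: firstDefined(layers.map((layer) => layer.channel)),
    executablePath: firstDefined(layers.map((layer) => layer.executablePath)),
  };
}

const resolved = resolveBrowserSettings(
  {}, // no CLI flags given
  { channel: "chrome" }, // value read from the environment
  { channel: "msedge", executablePath: "/opt/browser/chrome" }, // config file
  { channel: "chromium" }, // built-in default
);
console.log(resolved);
```

&lt;p&gt;In this sketch the environment value wins for &lt;code&gt;channel&lt;/code&gt; because no CLI flag is set, while &lt;code&gt;executablePath&lt;/code&gt; falls through to the config file because neither higher layer defines it.&lt;/p&gt;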

&lt;h2&gt;
  
  
  MCP Registry listing
&lt;/h2&gt;

&lt;p&gt;Playwright MCP is now published to the official &lt;a href="https://registry.modelcontextprotocol.io" rel="noopener noreferrer"&gt;MCP Registry&lt;/a&gt; on each release. This simplifies enterprise procurement and governance for teams evaluating AI-assisted testing infrastructure.&lt;/p&gt;

&lt;h2&gt;
  
  
  The gotcha
&lt;/h2&gt;

&lt;p&gt;Environment variables set in your shell may not propagate to the MCP process spawned by your AI tool. Test this before deploying to production.&lt;/p&gt;
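
&lt;p&gt;One way to test this is to stop relying on shell inheritance and build the child environment explicitly, failing early when a required variable is absent. The sketch below assumes your AI tool lets you pass an explicit environment map to the MCP server it spawns; &lt;code&gt;buildChildEnv&lt;/code&gt; and &lt;code&gt;EnvMap&lt;/code&gt; are hypothetical helper names, and the variable names are the ones from the pattern above.&lt;/p&gt;

```typescript
// A plain string map standing in for process.env, so the sketch has no imports.
type EnvMap = { [name: string]: string | undefined };

// Copy only the required variables into an explicit child environment and
// report the ones that are missing or empty in the parent.
function buildChildEnv(parentEnv: EnvMap, requiredNames: string[]) {
  const childEnv: { [name: string]: string } = {};
  const missing: string[] = [];
  for (const name of requiredNames) {
    const value = parentEnv[name];
    if (value === undefined || value === "") {
      missing.push(name);
    } else {
      childEnv[name] = value;
    }
  }
  return { childEnv, missing };
}

const check = buildChildEnv(
  { PLAYWRIGHT_BROWSERS_CHANNEL: "chromium" }, // what the parent actually has
  ["PLAYWRIGHT_BROWSERS_CHANNEL", "PLAYWRIGHT_EXTENSION_PATH"],
);
console.log(check.missing);
```

&lt;p&gt;Running a check like this at startup turns a silent fallback to defaults into an explicit list of missing variables, which is easier to debug than a browser that launches from the wrong path.&lt;/p&gt;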

&lt;p&gt;For the full breakdown — including CI/CD examples and the subprocess propagation fix — read the canonical post at &lt;a href="https://www.anton.qa/blog/posts/playwright-mcp-v0-0-73" rel="noopener noreferrer"&gt;anton.qa&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Anton Gulin is an AI QA Architect. Former Apple SDET, now Lead Software Engineer in Test. &lt;a href="https://anton.qa" rel="noopener noreferrer"&gt;anton.qa&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>automation</category>
      <category>cicd</category>
      <category>mcp</category>
    </item>
    <item>
      <title>Native Drag-and-Drop Automation Arrives in Playwright MCP</title>
      <dc:creator>Anton Gulin</dc:creator>
      <pubDate>Tue, 28 Apr 2026 18:43:12 +0000</pubDate>
      <link>https://dev.to/aiwithanton/native-drag-and-drop-automation-arrives-in-playwright-mcp-3e16</link>
      <guid>https://dev.to/aiwithanton/native-drag-and-drop-automation-arrives-in-playwright-mcp-3e16</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;Playwright MCP v0.0.71 ships &lt;code&gt;browser_drop&lt;/code&gt;. It gives you native drag-and-drop from any MCP client. No more &lt;code&gt;evaluate&lt;/code&gt; scripts. No more &lt;code&gt;mouse.move&lt;/code&gt; chains. Grid reordering, file drop zones, text editor drags — all work the same way a real user does.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Release Matters
&lt;/h2&gt;

&lt;p&gt;QA teams either abandon drag-and-drop testing or hack around it. But sortable grids, file uploads, and rich text editors are everywhere. And they have been painful to test forever.&lt;/p&gt;

&lt;p&gt;I ran into this firsthand on one project. Solid Playwright coverage for clicks, typing, and navigation. But drag-and-drop? We used &lt;code&gt;evaluate&lt;/code&gt; scripts. Or we tested it by hand. Both paths broke across browsers. Both were impossible to keep working.&lt;/p&gt;

&lt;p&gt;Playwright MCP v0.0.71 fixes this with &lt;code&gt;browser_drop&lt;/code&gt;. It uses Playwright's own &lt;code&gt;Locator.drop&lt;/code&gt; — the same API your tests already use. Now any MCP client can call it.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Use browser_drop
&lt;/h2&gt;

&lt;p&gt;Here's an example combining &lt;code&gt;browser_drop&lt;/code&gt; with the new response body capture from &lt;code&gt;browser_network_requests&lt;/code&gt; and the simplified expression support in &lt;code&gt;browser_evaluate&lt;/code&gt;. This pipeline automates a file upload scenario, validates the server response, and confirms the UI state update. Read it as a sketch of the flow: how tools are registered and invoked depends on your MCP client.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;McpServer&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;@modelcontextprotocol/sdk/server/mcp.js&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;server&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;McpServer&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;file-upload-automation&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;1.0.0&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="c1"&gt;// Drop zone and file item selectors for a document management UI&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;dropZoneSelector&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;[data-testid="upload-zone"]&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;fileItemSelector&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;[data-testid="file-item"]&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;uploadedStatusSelector&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;[data-testid="upload-status"]&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;// Tool: Simulate file drag-and-drop onto upload zone&lt;/span&gt;
&lt;span class="nx"&gt;server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addTool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;upload_document_flow&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Upload a document via drag-and-drop and validate response&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;inputSchema&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;object&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;properties&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;fileName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;string&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Name of file to upload&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="na"&gt;fileId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;string&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Unique file identifier&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;fileName&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;fileId&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Navigate to upload interface&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;browser_navigate&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;https://internal-docs.example.com/upload&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;

    &lt;span class="c1"&gt;// Locate drag source and drop target&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;dragSource&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;`text=&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;fileName&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;dropTarget&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;dropZoneSelector&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="c1"&gt;// Execute native drag-and-drop operation&lt;/span&gt;
    &lt;span class="c1"&gt;// browser_drop wraps Locator.drop - no evaluate or mouse.move workarounds&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;dropResult&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;browser_drop&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;source&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;dragSource&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;target&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;dropTarget&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;

    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;dropResult&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;success&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;error&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`Drop operation failed: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;dropResult&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;// Inspect server response body with mime-type detection&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;networkCapture&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;browser_network_requests&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;urlPattern&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;**/api/upload**&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;responseBody&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;responseHeaders&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;

    &lt;span class="c1"&gt;// Extract upload confirmation&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;uploadResponse&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;networkCapture&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;?.[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;uploadResponse&lt;/span&gt;&lt;span class="p"&gt;?.&lt;/span&gt;&lt;span class="nx"&gt;responseBody&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;error&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;No upload response captured&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;// Validate response using plain expression (no function wrapper needed)&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;validationResult&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;browser_evaluate&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;expression&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`JSON.parse(arguments[0]).status === "success"`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;uploadResponse&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;responseBody&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;

    &lt;span class="c1"&gt;// Confirm UI reflects successful upload&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;statusText&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;browser_evaluate&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;expression&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`document.querySelector("&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;uploadedStatusSelector&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;")?.textContent`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;uploaded&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;serverResponse&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;uploadResponse&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;responseBody&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;uiStatus&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;statusText&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;validationPassed&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;validationResult&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;};&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three tools working together: &lt;code&gt;browser_drop&lt;/code&gt; handles the drag. &lt;code&gt;browser_network_requests&lt;/code&gt; captures the server response (full body, not just status codes). &lt;code&gt;browser_evaluate&lt;/code&gt; runs plain JavaScript — no function wrapper needed.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Gotcha Nobody Is Talking About
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;browser_drop&lt;/code&gt; needs both elements to be on screen. That's correct Playwright behavior. But here's the catch: if you navigate to a page and the drag target sits below the fold, the drop fails.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix&lt;/strong&gt;: Call &lt;code&gt;browser_evaluate&lt;/code&gt; to scroll the target into view before calling &lt;code&gt;browser_drop&lt;/code&gt;, or use the scroll option if your Playwright version supports it.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Before browser_drop: ensure target is in viewport&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;browser_evaluate&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;expression&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`document.querySelector("&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;dropTarget&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;").scrollIntoView()`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is not a bug. It's how Playwright works. But it catches teams when they test on a big screen and deploy to CI. CI viewports are smaller. The element you tested locally is off screen in the pipeline.&lt;/p&gt;
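
&lt;p&gt;If you want the guard applied every time, one option is to plan the scroll and the drop as a pair. The sketch below only builds the ordered list of tool calls; it does not talk to a browser. &lt;code&gt;ToolCall&lt;/code&gt;, &lt;code&gt;scrollIntoViewExpression&lt;/code&gt;, and &lt;code&gt;planGuardedDrop&lt;/code&gt; are hypothetical names, and the argument keys (&lt;code&gt;expression&lt;/code&gt;, &lt;code&gt;source&lt;/code&gt;, &lt;code&gt;target&lt;/code&gt;) are copied from the example earlier in this post, so check them against the tool schema your server reports.&lt;/p&gt;

```typescript
// One planned MCP tool call: the tool name plus its string arguments.
type ToolCall = { tool: string; args: { [key: string]: string } };

// Build the scroll expression with the selector embedded as a JSON string
// literal, so selectors containing double quotes do not break the expression.
function scrollIntoViewExpression(selector: string): string {
  return `document.querySelector(${JSON.stringify(selector)})?.scrollIntoView({ block: "center" })`;
}

// Plan the guarded drop: scroll the target into view first, then drop.
function planGuardedDrop(source: string, target: string): ToolCall[] {
  return [
    { tool: "browser_evaluate", args: { expression: scrollIntoViewExpression(target) } },
    { tool: "browser_drop", args: { source, target } },
  ];
}

const plan = planGuardedDrop("text=report.pdf", '[data-testid="upload-zone"]');
console.log(plan.map((step) => step.tool).join(", then "));
```

&lt;p&gt;Keeping the plan as plain data makes the ordering easy to unit test without a browser. Whatever code executes the calls stays a thin loop over the list.&lt;/p&gt;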

&lt;h2&gt;
  
  
  What This Changes in Your CI Pipeline
&lt;/h2&gt;

&lt;p&gt;With &lt;code&gt;browser_drop&lt;/code&gt;, you can test drag-and-drop flows through MCP. Not by hand. Not with broken scripts.&lt;/p&gt;

&lt;p&gt;On one project, migrating from Selenium to Playwright gave us 40% faster tests. But drag-and-drop still broke in headless mode. We wrote &lt;code&gt;evaluate&lt;/code&gt; scripts. They stopped working every sprint. &lt;code&gt;browser_drop&lt;/code&gt; puts native drag-and-drop into MCP. No scripts. No workarounds.&lt;/p&gt;

&lt;p&gt;What this actually means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Fewer flaky tests.&lt;/strong&gt; Native drag-and-drop is tested across browsers. &lt;code&gt;evaluate&lt;/code&gt; + &lt;code&gt;mouse.move&lt;/code&gt; sequences are not.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Simpler AI test generation.&lt;/strong&gt; AI tools call &lt;code&gt;browser_drop&lt;/code&gt; directly. No fragile mouse chains.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Faster CI.&lt;/strong&gt; Native operations run faster than JavaScript-injected drag scripts.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Verdict
&lt;/h2&gt;

&lt;p&gt;Playwright MCP v0.0.71 is worth upgrading for &lt;code&gt;browser_drop&lt;/code&gt; alone. The response body capture and plain expression support make it better. But drag-and-drop was the missing piece. Now it's there.&lt;/p&gt;

&lt;p&gt;The catch is real but small. Scroll your target into view before you drop. One line. Add it to your tool definitions and move on.&lt;/p&gt;

&lt;p&gt;If you run MCP-based test infrastructure, this kills the last reason to fall back to &lt;code&gt;evaluate&lt;/code&gt; for drag-and-drop. Upgrade. Add the scroll guard. Ship.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reference&lt;/strong&gt;: &lt;a href="https://playwright.dev/docs/api/class-locator#locator-drop" rel="noopener noreferrer"&gt;Playwright Locator.drop API documentation&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Anton Gulin is an AI QA Architect — the first person to claim this title on LinkedIn. He builds AI-powered test automation systems where AI agents and human engineers collaborate on quality. Former Apple SDET, now Lead Software Engineer in Test. Find him at &lt;a href="https://anton.qa" rel="noopener noreferrer"&gt;anton.qa&lt;/a&gt; or on &lt;a href="https://linkedin.com/in/antongulin" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>automation</category>
      <category>mcp</category>
      <category>testing</category>
    </item>
  </channel>
</rss>
