<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: ksgisang</title>
    <description>The latest articles on DEV Community by ksgisang (@ksgisang).</description>
    <link>https://dev.to/ksgisang</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3791322%2F03c314e0-9367-4e2d-aec0-288d86148c99.png</url>
      <title>DEV Community: ksgisang</title>
      <link>https://dev.to/ksgisang</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ksgisang"/>
    <language>en</language>
    <item>
      <title>I built an open-source AI agent that writes and runs E2E tests — here's what I learned</title>
      <dc:creator>ksgisang</dc:creator>
      <pubDate>Wed, 25 Feb 2026 08:42:38 +0000</pubDate>
      <link>https://dev.to/ksgisang/i-built-an-open-source-ai-agent-that-writes-and-runs-e2e-tests-heres-what-i-learned-17bj</link>
      <guid>https://dev.to/ksgisang/i-built-an-open-source-ai-agent-that-writes-and-runs-e2e-tests-heres-what-i-learned-17bj</guid>
      <description>&lt;h2&gt;The Problem&lt;/h2&gt;

&lt;p&gt;Every new project, same story: write login tests, write form validation tests, write navigation tests. Copy-paste from the last project, tweak selectors, pray nothing breaks.&lt;/p&gt;

&lt;p&gt;After 25 years in IT, I decided to automate the boring part. I built &lt;strong&gt;AWT (AI Watch Tester)&lt;/strong&gt; — an open-source tool where you enter a URL, and AI writes the tests for you.&lt;/p&gt;

&lt;h2&gt;How It Works&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Enter a URL&lt;/strong&gt; — that's your only input&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI scans the page&lt;/strong&gt; — analyzes DOM structure + takes screenshots&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Generates test scenarios&lt;/strong&gt; — login flows, form validation, navigation checks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Runs them with Playwright&lt;/strong&gt; — real browser, real clicks, real screenshots&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;No selectors to write. No test scripts to maintain. AI handles the planning, Playwright handles the execution.&lt;/p&gt;
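
&lt;p&gt;The split above can be sketched in a few lines of Python. This is a minimal illustration of the generate-then-execute idea, not AWT's actual internals — the scenario schema, field names, and &lt;code&gt;plan_scenarios&lt;/code&gt; helper are all hypothetical:&lt;/p&gt;

```python
# Hypothetical sketch: an AI pass plans test scenarios once,
# then a deterministic runner replays them step by step.
from dataclasses import dataclass, field

@dataclass
class Step:
    action: str            # "goto" | "fill" | "click" | "expect_text"
    target: str = ""       # CSS selector or URL
    value: str = ""        # text to type or expect

@dataclass
class Scenario:
    name: str
    steps: list = field(default_factory=list)

def plan_scenarios(page_summary: dict) -> list:
    """Stand-in for the AI planning pass: turn a DOM summary into scenarios."""
    scenarios = []
    if "login_form" in page_summary.get("features", []):
        scenarios.append(Scenario("login flow", [
            Step("goto", page_summary["url"]),
            Step("fill", "#username", "standard_user"),
            Step("fill", "#password", "secret"),
            Step("click", "button[type=submit]"),
            Step("expect_text", "body", "Welcome"),
        ]))
    return scenarios

def run(scenario: Scenario, executor) -> bool:
    """Deterministic replay: hand each step to a browser executor (e.g. Playwright)."""
    return all(executor(step) for step in scenario.steps)
```

&lt;p&gt;The key point is that the expensive AI call happens once, at planning time; the &lt;code&gt;executor&lt;/code&gt; that actually drives the browser never touches the AI.&lt;/p&gt;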

&lt;h2&gt;"Can't Claude/GPT Just Do This with Computer Use?"&lt;/h2&gt;

&lt;p&gt;Fair question. I get it a lot.&lt;/p&gt;

&lt;p&gt;Computer Use is a general-purpose GUI agent — it can click buttons and type text. But for E2E testing, you'd still need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Docker environment setup&lt;/li&gt;
&lt;li&gt;Screenshot pipeline management&lt;/li&gt;
&lt;li&gt;Result parsing and storage&lt;/li&gt;
&lt;li&gt;CI/CD integration&lt;/li&gt;
&lt;li&gt;Scenario tracking across runs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And each test costs &lt;strong&gt;$0.50–2.00&lt;/strong&gt; because the AI processes every screenshot.&lt;/p&gt;

&lt;p&gt;AWT uses AI &lt;strong&gt;only for test generation&lt;/strong&gt; (analyzing what to test), then runs tests with Playwright — no per-screenshot AI cost. A typical scan costs &lt;strong&gt;$0.002–0.03&lt;/strong&gt;, which at these figures works out to &lt;strong&gt;roughly 17–1000x cheaper&lt;/strong&gt; per run.&lt;/p&gt;

&lt;p&gt;Think of it this way: &lt;strong&gt;Computer Use is the hammer. AWT is the furniture store.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;What Makes It Different&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;AWT&lt;/th&gt;
&lt;th&gt;Playwright/Cypress&lt;/th&gt;
&lt;th&gt;testRigor/Applitools&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Test writing&lt;/td&gt;
&lt;td&gt;AI writes them&lt;/td&gt;
&lt;td&gt;You write them&lt;/td&gt;
&lt;td&gt;AI assists&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost&lt;/td&gt;
&lt;td&gt;Free (MIT) + BYOK&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;td&gt;$800+/mo&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AI provider&lt;/td&gt;
&lt;td&gt;Your choice (OpenAI, Anthropic, Ollama*)&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;Locked in&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Local mode&lt;/td&gt;
&lt;td&gt;Yes (Ollama, experimental)&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;*Ollama adapter is included but experimental — works best with larger models (70B+). Results may vary with smaller models.&lt;/p&gt;

&lt;h2&gt;The Honest Limitations&lt;/h2&gt;

&lt;p&gt;This is v1.0 by a solo developer. Let me be upfront:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Works well on &lt;strong&gt;simple login/form pages&lt;/strong&gt; (SauceDemo, standard auth flows)&lt;/li&gt;
&lt;li&gt;⚠️ Complex SPAs with heavy dynamic content — still improving&lt;/li&gt;
&lt;li&gt;⚠️ No cancel button for long scans yet&lt;/li&gt;
&lt;li&gt;⚠️ Free plan is limited (5 pages per scan)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Tech Stack&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Backend&lt;/strong&gt;: Python, FastAPI, Playwright&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Frontend&lt;/strong&gt;: Next.js, TypeScript&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Database&lt;/strong&gt;: PostgreSQL (Supabase)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI&lt;/strong&gt;: OpenAI / Anthropic / Ollama adapters&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;License&lt;/strong&gt;: MIT&lt;/li&gt;
&lt;/ul&gt;
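
&lt;p&gt;The multi-provider setup can be sketched as a small adapter layer. The class and model names below are illustrative, not AWT's actual code:&lt;/p&gt;

```python
# Illustrative provider-agnostic adapter layer (names are hypothetical,
# not AWT's actual classes or defaults).
from abc import ABC, abstractmethod

class AIAdapter(ABC):
    @abstractmethod
    def generate_scenarios(self, page_summary: str) -> str:
        """Return test scenarios as a JSON string for the given page summary."""

class OpenAIAdapter(AIAdapter):
    def __init__(self, api_key: str, model: str = "gpt-4o-mini"):
        self.api_key, self.model = api_key, model
    def generate_scenarios(self, page_summary: str) -> str:
        raise NotImplementedError  # would call the OpenAI API here

class OllamaAdapter(AIAdapter):
    def __init__(self, model: str = "llama3.1:70b"):
        self.model = model  # larger local models work markedly better
    def generate_scenarios(self, page_summary: str) -> str:
        raise NotImplementedError  # would POST to the local Ollama server here

def make_adapter(provider: str, **kwargs) -> AIAdapter:
    """Pick an adapter by name; unknown providers fail loudly."""
    registry = {"openai": OpenAIAdapter, "ollama": OllamaAdapter}
    if provider not in registry:
        raise ValueError(f"unknown provider: {provider}")
    return registry[provider](**kwargs)
```

&lt;p&gt;BYOK then just means the key is a constructor argument rather than a vendor account — swapping providers is a one-line config change.&lt;/p&gt;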

&lt;h2&gt;Try It&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;🌐 &lt;strong&gt;Cloud&lt;/strong&gt;: &lt;a href="https://ai-watch-tester.vercel.app" rel="noopener noreferrer"&gt;https://ai-watch-tester.vercel.app&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;💻 &lt;strong&gt;GitHub&lt;/strong&gt;: &lt;a href="https://github.com/ksgisang/AI-Watch-Tester" rel="noopener noreferrer"&gt;https://github.com/ksgisang/AI-Watch-Tester&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Sign up → Settings → Enter your OpenAI key → Start scanning.&lt;/p&gt;

&lt;p&gt;Ollama adapter is also included for local execution, though it's still experimental — best results with larger models.&lt;/p&gt;

&lt;h2&gt;What I Learned Building This&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;AI is great at generating test plans, bad at executing them.&lt;/strong&gt; That's why I separated generation (AI) from execution (Playwright). Trying to do both with AI is expensive and fragile.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Language detection matters.&lt;/strong&gt; My first users got Korean test scenarios on English sites. Lesson: always detect the target site's language before generating.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Assertion validation is critical.&lt;/strong&gt; AI sometimes generates structurally invalid assertions. A post-processing validator that auto-corrects the schema saved me from shipping broken tests.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
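
&lt;p&gt;That third lesson is worth a sketch. Here is a minimal version of what a post-processing validator might look like — the assertion schema and the particular corrections are hypothetical, chosen to show the shape of the idea:&lt;/p&gt;

```python
# Hypothetical post-processing validator: coerce AI-generated assertions
# into a known schema, auto-correcting near-misses and dropping the rest.
ALLOWED_OPS = {"equals", "contains", "visible"}

def normalize_assertion(raw: dict):
    """Return a corrected assertion dict, or None if unrecoverable."""
    fixed = {
        "selector": raw.get("selector") or raw.get("target"),  # common AI slip
        "op": str(raw.get("op", "")).lower(),
        "expected": raw.get("expected", ""),
    }
    if fixed["op"] == "equal":  # near-miss op names get auto-corrected
        fixed["op"] = "equals"
    if not fixed["selector"] or fixed["op"] not in ALLOWED_OPS:
        return None  # structurally broken: better dropped than shipped
    return fixed

def validate_plan(assertions: list) -> list:
    """Keep only assertions that survive normalization."""
    out = [normalize_assertion(a) for a in assertions]
    return [a for a in out if a is not None]
```

&lt;p&gt;Running every generated plan through a pass like this before execution means a malformed assertion degrades to a skipped check instead of a crashed test run.&lt;/p&gt;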

&lt;p&gt;Bug reports, feedback, and PRs are all welcome. What edge cases should I try next?&lt;/p&gt;

</description>
      <category>testing</category>
      <category>opensource</category>
      <category>ai</category>
      <category>webdev</category>
    </item>
  </channel>
</rss>
