<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: ksgisang</title>
    <description>The latest articles on DEV Community by ksgisang (@ksgisang).</description>
    <link>https://dev.to/ksgisang</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3791322%2F03c314e0-9367-4e2d-aec0-288d86148c99.png</url>
      <title>DEV Community: ksgisang</title>
      <link>https://dev.to/ksgisang</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ksgisang"/>
    <language>en</language>
    <item>
      <title>I built an open-source AI agent that writes and runs E2E tests — here's what I learned</title>
      <dc:creator>ksgisang</dc:creator>
      <pubDate>Wed, 25 Feb 2026 08:42:38 +0000</pubDate>
      <link>https://dev.to/ksgisang/i-built-an-open-source-ai-agent-that-writes-and-runs-e2e-tests-heres-what-i-learned-17bj</link>
      <guid>https://dev.to/ksgisang/i-built-an-open-source-ai-agent-that-writes-and-runs-e2e-tests-heres-what-i-learned-17bj</guid>
      <description>&lt;h2&gt;The Problem&lt;/h2&gt;

&lt;p&gt;Every new project, same story: write login tests, write form validation tests, write navigation tests. Copy-paste from the last project, tweak selectors, pray nothing breaks.&lt;/p&gt;

&lt;p&gt;After 25 years in IT, I decided to automate the boring part. I built &lt;strong&gt;AWT (AI Watch Tester)&lt;/strong&gt; — an open-source tool where you enter a URL, and AI writes the tests for you.&lt;/p&gt;

&lt;h2&gt;How It Works&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Enter a URL&lt;/strong&gt; — that's your only input&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI scans the page&lt;/strong&gt; — analyzes DOM structure + takes screenshots&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Generates test scenarios&lt;/strong&gt; — login flows, form validation, navigation checks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Runs them with Playwright&lt;/strong&gt; — real browser, real clicks, real screenshots&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;No selectors to write. No test scripts to maintain. AI handles the planning, Playwright handles the execution.&lt;/p&gt;
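
&lt;p&gt;The split above can be sketched in a few lines of Python. This is a minimal illustration of the generate-then-execute idea, not AWT's actual internals — the scenario schema, field names, and &lt;code&gt;plan_scenarios&lt;/code&gt; helper are all hypothetical:&lt;/p&gt;

```python
# Hypothetical sketch: an AI pass plans test scenarios once,
# then a deterministic runner replays them step by step.
from dataclasses import dataclass, field

@dataclass
class Step:
    action: str            # "goto" | "fill" | "click" | "expect_text"
    target: str = ""       # CSS selector or URL
    value: str = ""        # text to type or expect

@dataclass
class Scenario:
    name: str
    steps: list = field(default_factory=list)

def plan_scenarios(page_summary: dict) -> list:
    """Stand-in for the AI planning pass: turn a DOM summary into scenarios."""
    scenarios = []
    if "login_form" in page_summary.get("features", []):
        scenarios.append(Scenario("login flow", [
            Step("goto", page_summary["url"]),
            Step("fill", "#username", "standard_user"),
            Step("fill", "#password", "secret"),
            Step("click", "button[type=submit]"),
            Step("expect_text", "body", "Welcome"),
        ]))
    return scenarios

def run(scenario: Scenario, executor) -> bool:
    """Deterministic replay: hand each step to a browser executor (e.g. Playwright)."""
    return all(executor(step) for step in scenario.steps)
```

&lt;p&gt;The key point is that the expensive AI call happens once, at planning time; the &lt;code&gt;executor&lt;/code&gt; that actually drives the browser never touches the AI.&lt;/p&gt;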

&lt;h2&gt;"Can't Claude/GPT Just Do This with Computer Use?"&lt;/h2&gt;

&lt;p&gt;Fair question. I get it a lot.&lt;/p&gt;

&lt;p&gt;Computer Use is a general-purpose GUI agent — it can click buttons and type text. But for E2E testing, you'd still need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Docker environment setup&lt;/li&gt;
&lt;li&gt;Screenshot pipeline management&lt;/li&gt;
&lt;li&gt;Result parsing and storage&lt;/li&gt;
&lt;li&gt;CI/CD integration&lt;/li&gt;
&lt;li&gt;Scenario tracking across runs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And each test costs &lt;strong&gt;$0.50–2.00&lt;/strong&gt; because the AI processes every screenshot.&lt;/p&gt;

&lt;p&gt;AWT uses AI &lt;strong&gt;only for test generation&lt;/strong&gt; (analyzing what to test), then runs tests with Playwright — no per-screenshot AI cost. A typical scan costs &lt;strong&gt;$0.002–0.03&lt;/strong&gt;, which at these figures works out to &lt;strong&gt;roughly 17–1000x cheaper&lt;/strong&gt; per run.&lt;/p&gt;

&lt;p&gt;Think of it this way: &lt;strong&gt;Computer Use is the hammer. AWT is the furniture store.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;What Makes It Different&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;AWT&lt;/th&gt;
&lt;th&gt;Playwright/Cypress&lt;/th&gt;
&lt;th&gt;testRigor/Applitools&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Test writing&lt;/td&gt;
&lt;td&gt;AI writes them&lt;/td&gt;
&lt;td&gt;You write them&lt;/td&gt;
&lt;td&gt;AI assists&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost&lt;/td&gt;
&lt;td&gt;Free (MIT) + BYOK&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;td&gt;$800+/mo&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AI provider&lt;/td&gt;
&lt;td&gt;Your choice (OpenAI, Anthropic, Ollama*)&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;Locked in&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Local mode&lt;/td&gt;
&lt;td&gt;Yes (Ollama, experimental)&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;*Ollama adapter is included but experimental — works best with larger models (70B+). Results may vary with smaller models.&lt;/p&gt;

&lt;h2&gt;The Honest Limitations&lt;/h2&gt;

&lt;p&gt;This is v1.0 by a solo developer. Let me be upfront:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Works well on &lt;strong&gt;simple login/form pages&lt;/strong&gt; (SauceDemo, standard auth flows)&lt;/li&gt;
&lt;li&gt;⚠️ Complex SPAs with heavy dynamic content — still improving&lt;/li&gt;
&lt;li&gt;⚠️ No cancel button for long scans yet&lt;/li&gt;
&lt;li&gt;⚠️ Free plan is limited (5 pages per scan)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Tech Stack&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Backend&lt;/strong&gt;: Python, FastAPI, Playwright&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Frontend&lt;/strong&gt;: Next.js, TypeScript&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Database&lt;/strong&gt;: PostgreSQL (Supabase)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI&lt;/strong&gt;: OpenAI / Anthropic / Ollama adapters&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;License&lt;/strong&gt;: MIT&lt;/li&gt;
&lt;/ul&gt;
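
&lt;p&gt;The multi-provider setup can be sketched as a small adapter layer. The class and model names below are illustrative, not AWT's actual code:&lt;/p&gt;

```python
# Illustrative provider-agnostic adapter layer (names are hypothetical,
# not AWT's actual classes or defaults).
from abc import ABC, abstractmethod

class AIAdapter(ABC):
    @abstractmethod
    def generate_scenarios(self, page_summary: str) -> str:
        """Return test scenarios as a JSON string for the given page summary."""

class OpenAIAdapter(AIAdapter):
    def __init__(self, api_key: str, model: str = "gpt-4o-mini"):
        self.api_key, self.model = api_key, model
    def generate_scenarios(self, page_summary: str) -> str:
        raise NotImplementedError  # would call the OpenAI API here

class OllamaAdapter(AIAdapter):
    def __init__(self, model: str = "llama3.1:70b"):
        self.model = model  # larger local models work markedly better
    def generate_scenarios(self, page_summary: str) -> str:
        raise NotImplementedError  # would POST to the local Ollama server here

def make_adapter(provider: str, **kwargs) -> AIAdapter:
    """Pick an adapter by name; unknown providers fail loudly."""
    registry = {"openai": OpenAIAdapter, "ollama": OllamaAdapter}
    if provider not in registry:
        raise ValueError(f"unknown provider: {provider}")
    return registry[provider](**kwargs)
```

&lt;p&gt;BYOK then just means the key is a constructor argument rather than a vendor account — swapping providers is a one-line config change.&lt;/p&gt;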

&lt;h2&gt;Try It&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;🌐 &lt;strong&gt;Cloud&lt;/strong&gt;: &lt;a href="https://ai-watch-tester.vercel.app" rel="noopener noreferrer"&gt;https://ai-watch-tester.vercel.app&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;💻 &lt;strong&gt;GitHub&lt;/strong&gt;: &lt;a href="https://github.com/ksgisang/AI-Watch-Tester" rel="noopener noreferrer"&gt;https://github.com/ksgisang/AI-Watch-Tester&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Sign up → Settings → Enter your OpenAI key → Start scanning.&lt;/p&gt;

&lt;p&gt;Ollama adapter is also included for local execution, though it's still experimental — best results with larger models.&lt;/p&gt;

&lt;h2&gt;What I Learned Building This&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;AI is great at generating test plans, bad at executing them.&lt;/strong&gt; That's why I separated generation (AI) from execution (Playwright). Trying to do both with AI is expensive and fragile.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Language detection matters.&lt;/strong&gt; My first users got Korean test scenarios on English sites. Lesson: always detect the target site's language before generating.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Assertion validation is critical.&lt;/strong&gt; AI sometimes generates structurally invalid assertions. A post-processing validator that auto-corrects the schema saved me from shipping broken tests.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
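
&lt;p&gt;That third lesson is worth a sketch. Here is a minimal version of what a post-processing validator might look like — the assertion schema and the particular corrections are hypothetical, chosen to show the shape of the idea:&lt;/p&gt;

```python
# Hypothetical post-processing validator: coerce AI-generated assertions
# into a known schema, auto-correcting near-misses and dropping the rest.
ALLOWED_OPS = {"equals", "contains", "visible"}

def normalize_assertion(raw: dict):
    """Return a corrected assertion dict, or None if unrecoverable."""
    fixed = {
        "selector": raw.get("selector") or raw.get("target"),  # common AI slip
        "op": str(raw.get("op", "")).lower(),
        "expected": raw.get("expected", ""),
    }
    if fixed["op"] == "equal":  # near-miss op names get auto-corrected
        fixed["op"] = "equals"
    if not fixed["selector"] or fixed["op"] not in ALLOWED_OPS:
        return None  # structurally broken: better dropped than shipped
    return fixed

def validate_plan(assertions: list) -> list:
    """Keep only assertions that survive normalization."""
    out = [normalize_assertion(a) for a in assertions]
    return [a for a in out if a is not None]
```

&lt;p&gt;Running every generated plan through a pass like this before execution means a malformed assertion degrades to a skipped check instead of a crashed test run.&lt;/p&gt;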

&lt;p&gt;Bug reports, feedback, and PRs are all welcome. What edge cases should I try next?&lt;/p&gt;

</description>
      <category>testing</category>
      <category>opensource</category>
      <category>ai</category>
      <category>webdev</category>
    </item>
  </channel>
</rss>
