DEV Community

ksgisang
I built an open-source AI agent that writes and runs E2E tests — here's what I learned

The Problem

Every new project, same story: write login tests, write form validation tests, write navigation tests. Copy-paste from the last project, tweak selectors, pray nothing breaks.

After 25 years in IT, I decided to automate the boring part. I built AWT (AI Watch Tester) — an open-source tool where you enter a URL, and AI writes the tests for you.

How It Works

  1. Enter a URL — that's your only input
  2. AI scans the page — analyzes DOM structure + takes screenshots
  3. Generates test scenarios — login flows, form validation, navigation checks
  4. Runs them with Playwright — real browser, real clicks, real screenshots

No selectors to write. No test scripts to maintain. AI handles the planning, Playwright handles the execution.
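
The generate/execute split above can be sketched in a few lines. This is an illustrative shape only, not AWT's actual schema: the AI's planning phase emits a JSON scenario, which gets parsed into typed steps that a Playwright runner could then execute without any further AI calls.

```python
import json
from dataclasses import dataclass

# Hypothetical step shape for an AI-generated scenario; AWT's real schema may differ.
@dataclass
class Step:
    action: str      # "goto", "fill", "click", "expect_visible"
    target: str      # CSS selector or URL
    value: str = ""  # text to type, if any

def parse_scenario(raw: str) -> list[Step]:
    """Turn the AI's JSON plan into executable steps (planning phase only)."""
    plan = json.loads(raw)
    return [Step(s["action"], s["target"], s.get("value", "")) for s in plan["steps"]]

# Example plan the model might emit for a login flow
raw = json.dumps({
    "name": "login happy path",
    "steps": [
        {"action": "goto", "target": "https://example.com/login"},
        {"action": "fill", "target": "#username", "value": "standard_user"},
        {"action": "fill", "target": "#password", "value": "secret"},
        {"action": "click", "target": "button[type=submit]"},
        {"action": "expect_visible", "target": ".inventory_list"},
    ],
})

steps = parse_scenario(raw)
print(len(steps))  # 5
```

Because the plan is plain data, the execution side only needs a dumb loop that maps each `action` onto the corresponding Playwright call.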

"Can't Claude/GPT Just Do This with Computer Use?"

Fair question. I get it a lot.

Computer Use is a general-purpose GUI agent — it can click buttons and type text. But for E2E testing, you'd still need:

  • Docker environment setup
  • Screenshot pipeline management
  • Result parsing and storage
  • CI/CD integration
  • Scenario tracking across runs

And each test costs $0.50–2.00 because the AI processes every screenshot.

AWT uses AI only for test generation (analyzing what to test), then runs tests with Playwright — no per-screenshot AI cost. A typical scan costs $0.002–0.03. That's 10–100x cheaper.
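
The arithmetic behind that ratio is simple. The numbers below are illustrative assumptions, not measured AWT figures: a Computer Use-style agent pays a vision-model call per step, while a plan-then-execute design pays once per scan.

```python
# Illustrative cost model for the 10-100x claim (assumed prices, not benchmarks).
steps_per_test = 10
cost_per_screenshot_call = 0.08  # assumed per-call cost of a vision model

# Computer Use style: the model inspects a screenshot at every step.
computer_use_cost = steps_per_test * cost_per_screenshot_call

# AWT style: one AI call to generate the plan; Playwright executes for free.
awt_generation_cost = 0.01

print(f"per-screenshot agent: ${computer_use_cost:.2f} per run")
print(f"plan-then-execute:    ${awt_generation_cost:.2f} per run")
print(f"ratio: {computer_use_cost / awt_generation_cost:.0f}x")
```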

Think of it this way: Computer Use is the hammer. AWT is the furniture store.

What Makes It Different

|  | AWT | Playwright/Cypress | testRigor/Applitools |
|---|---|---|---|
| Test writing | AI writes them | You write them | AI assists |
| Cost | Free (MIT) + BYOK | Free | $800+/mo |
| AI provider | Your choice (OpenAI, Anthropic, Ollama*) | N/A | Locked in |
| Local mode | Yes (Ollama, experimental) | Yes | No |

*Ollama adapter is included but experimental — works best with larger models (70B+). Results may vary with smaller models.
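
Supporting several providers behind one interface is a standard adapter pattern. Here's a minimal sketch of the idea; the class and function names are hypothetical, not AWT's actual API:

```python
from abc import ABC, abstractmethod

class LLMAdapter(ABC):
    """Common interface every provider adapter implements (illustrative)."""
    @abstractmethod
    def generate_scenarios(self, page_summary: str) -> str:
        """Return a JSON test plan for the analyzed page."""

class FakeAdapter(LLMAdapter):
    """Stand-in for the OpenAI/Anthropic/Ollama adapters, usable in tests."""
    def generate_scenarios(self, page_summary: str) -> str:
        return '{"steps": []}'

def make_adapter(provider: str) -> LLMAdapter:
    # A real registry would map "openai", "anthropic", "ollama" to their adapters.
    registry = {"fake": FakeAdapter}
    return registry[provider]()

adapter = make_adapter("fake")
print(adapter.generate_scenarios("login page"))  # {"steps": []}
```

With this shape, swapping OpenAI for a local Ollama model is a one-line config change rather than a code change.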

The Honest Limitations

This is v1.0 by a solo developer. Let me be upfront:

  • ✅ Works well on simple login/form pages (SauceDemo, standard auth flows)
  • ⚠️ Complex SPAs with heavy dynamic content — still improving
  • ⚠️ No cancel button for long scans yet
  • ⚠️ Free plan is limited (5 pages per scan)

Tech Stack

  • Backend: Python, FastAPI, Playwright
  • Frontend: Next.js, TypeScript
  • Database: PostgreSQL (Supabase)
  • AI: OpenAI / Anthropic / Ollama adapters
  • License: MIT

Try It

Sign up → Settings → Enter your OpenAI key → Start scanning.

Ollama adapter is also included for local execution, though it's still experimental — best results with larger models.

What I Learned Building This

  1. AI is great at generating test plans, bad at executing them. That's why I separated generation (AI) from execution (Playwright). Trying to do both with AI is expensive and fragile.

  2. Language detection matters. My first users got Korean test scenarios on English sites. Lesson: always detect the target site's language before generating.

  3. Assert validation is critical. AI sometimes generates structurally invalid assertions. A post-processing validator that auto-corrects the schema saved me from shipping broken tests.
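
Lesson 3 can be made concrete with a small sketch. Field names and the repair rules are hypothetical, but the idea matches the post: normalize the model's malformed assertion output into a valid schema before anything runs.

```python
# Sketch of a post-processing validator that repairs structurally invalid
# assertions before tests ship (illustrative schema, not AWT's).
VALID_OPS = {"visible", "text_equals", "url_contains"}

def fix_assertion(a: dict) -> dict:
    fixed = dict(a)
    # Models sometimes emit "assert": "is_visible" instead of "op": "visible".
    if "op" not in fixed and "assert" in fixed:
        fixed["op"] = fixed.pop("assert")
    if fixed.get("op") == "is_visible":
        fixed["op"] = "visible"
    if fixed.get("op") not in VALID_OPS:
        raise ValueError(f"unfixable assertion: {a}")
    fixed.setdefault("target", "body")  # fall back to a selector that always exists
    return fixed

broken = {"assert": "is_visible"}  # structurally invalid model output
print(fix_assertion(broken))  # {'op': 'visible', 'target': 'body'}
```

Anything the validator can't normalize gets rejected loudly instead of shipping as a silently broken test.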

Bug reports, feedback, and PRs are all welcome. What edge cases should I try next?
