The Problem
Every new project, same story: write login tests, write form validation tests, write navigation tests. Copy-paste from the last project, tweak selectors, pray nothing breaks.
After 25 years in IT, I decided to automate the boring part. I built AWT (AI Watch Tester) — an open-source tool where you enter a URL, and AI writes the tests for you.
How It Works
- Enter a URL — that's your only input
- AI scans the page — analyzes DOM structure + takes screenshots
- Generates test scenarios — login flows, form validation, navigation checks
- Runs them with Playwright — real browser, real clicks, real screenshots
No selectors to write. No test scripts to maintain. AI handles the planning, Playwright handles the execution.
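The flow above can be sketched as plain data plus a thin executor. This is a hypothetical illustration, not AWT's actual internals: the `Step`/`Scenario` shapes and the dry-run mapping onto Playwright calls are invented for the example. In a real run the trace entries would be executed against a live `playwright.sync_api.Page`.

```python
# Hypothetical sketch: AI emits scenarios as data; execution maps them
# onto Playwright calls. Here we dry-run and record what would execute.
from dataclasses import dataclass, field

@dataclass
class Step:
    action: str          # "goto", "fill", "click", "expect_visible"
    target: str = ""     # selector or URL
    value: str = ""      # text to type, if any

@dataclass
class Scenario:
    name: str
    steps: list = field(default_factory=list)

def dry_run(scenario: Scenario) -> list:
    """Translate scenario steps into the Playwright calls they imply."""
    trace = []
    for s in scenario.steps:
        if s.action == "goto":
            trace.append(f"page.goto({s.target!r})")
        elif s.action == "fill":
            trace.append(f"page.fill({s.target!r}, {s.value!r})")
        elif s.action == "click":
            trace.append(f"page.click({s.target!r})")
        elif s.action == "expect_visible":
            trace.append(f"expect(page.locator({s.target!r})).to_be_visible()")
        else:
            raise ValueError(f"unknown action: {s.action}")
    return trace

login = Scenario("login", [
    Step("goto", "https://example.com/login"),
    Step("fill", "#username", "standard_user"),
    Step("fill", "#password", "secret"),
    Step("click", "button[type=submit]"),
    Step("expect_visible", ".inventory-list"),
])
trace = dry_run(login)
```

Keeping scenarios as data is what makes the split work: the AI only has to produce this structure once, and the executor replays it for free on every run.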
"Can't Claude/GPT Just Do This with Computer Use?"
Fair question. I get it a lot.
Computer Use is a general-purpose GUI agent — it can click buttons and type text. But for E2E testing, you'd still need:
- Docker environment setup
- Screenshot pipeline management
- Result parsing and storage
- CI/CD integration
- Scenario tracking across runs
And each test costs $0.50–2.00 because the AI processes every screenshot.
AWT uses AI only for test generation (analyzing what to test), then runs tests with Playwright — no per-screenshot AI cost. A typical scan costs $0.002–0.03. That's 10–100x cheaper.
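The quoted figures can be sanity-checked with a quick ratio. The dollar amounts come straight from the paragraph above; everything else is arithmetic.

```python
# Back-of-envelope comparison using the figures quoted in the post.
# Computer Use pays the model for every screenshot, so cost scales with
# step count; AWT pays for one generation call per scan.
computer_use_per_test = (0.50, 2.00)   # USD range, per the post
awt_per_scan = (0.002, 0.03)           # USD range, per the post

# Worst case for AWT: cheapest Computer Use run vs. priciest AWT scan.
low = computer_use_per_test[0] / awt_per_scan[1]    # ~16.7x
# Best case for AWT: priciest Computer Use run vs. cheapest AWT scan.
high = computer_use_per_test[1] / awt_per_scan[0]   # 1000x
```

Even the unfavorable pairing lands above 16x, so a hedged "10-100x cheaper" holds up.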
Think of it this way: Computer Use is the hammer. AWT is the furniture store.
What Makes It Different
| | AWT | Playwright/Cypress | testRigor/Applitools |
|---|---|---|---|
| Test writing | AI writes them | You write them | AI assists |
| Cost | Free (MIT) + BYOK | Free | $800+/mo |
| AI provider | Your choice (OpenAI, Anthropic, Ollama*) | N/A | Locked in |
| Local mode | Yes (Ollama, experimental) | Yes | No |
*Ollama adapter is included but experimental — works best with larger models (70B+). Results may vary with smaller models.
The Honest Limitations
This is v1.0 by a solo developer. Let me be upfront:
- ✅ Works well on simple login/form pages (SauceDemo, standard auth flows)
- ⚠️ Complex SPAs with heavy dynamic content — still improving
- ⚠️ No cancel button for long scans yet
- ⚠️ Free plan is limited (5 pages per scan)
Tech Stack
- Backend: Python, FastAPI, Playwright
- Frontend: Next.js, TypeScript
- Database: PostgreSQL (Supabase)
- AI: OpenAI / Anthropic / Ollama adapters
- License: MIT
Try It
- 🌐 Cloud: https://ai-watch-tester.vercel.app
- 💻 GitHub: https://github.com/ksgisang/AI-Watch-Tester
Sign up → Settings → Enter your OpenAI key → Start scanning.
An Ollama adapter is also included for fully local runs; it's still experimental, with best results on larger models.
What I Learned Building This
AI is great at generating test plans, bad at executing them. That's why I separated generation (AI) from execution (Playwright). Trying to do both with AI is expensive and fragile.
Language detection matters. My first users got Korean test scenarios on English sites. Lesson: always detect the target site's language before generating.
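A minimal sketch of that lesson, assuming the cheapest first signal is the page's `<html lang>` attribute. The function name and the `"en"` fallback are invented for the example; pages that omit the attribute would need a content-based language detector.

```python
# Read the target site's declared language before generating scenarios,
# so an English site gets English test descriptions (and vice versa).
import re

def detect_site_language(html: str, default: str = "en") -> str:
    """Return the primary language subtag from <html lang=...>, else default."""
    m = re.search(r'<html[^>]*\blang=["\']?([A-Za-z-]+)', html, re.IGNORECASE)
    return m.group(1).split("-")[0].lower() if m else default

lang = detect_site_language('<html lang="ko-KR"><body>로그인</body></html>')
# "ko-KR" reduces to "ko", so scenario text should be generated in Korean
```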
Assert validation is critical. AI sometimes generates structurally invalid assertions. A post-processing validator that auto-corrects the schema saved me from shipping broken tests.
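A sketch of what such a validator might look like. The schema (`type`/`selector`/`expected`), the synonym table, and the drop rules are all invented for illustration; AWT's actual validator may differ.

```python
# Hypothetical post-processing pass: normalize AI-generated assertions to
# a known schema, auto-correcting common synonyms and dropping anything
# structurally invalid before it reaches the Playwright runner.
ALLOWED = {"visible", "text_equals", "url_contains"}
SYNONYMS = {"is_visible": "visible", "text": "text_equals", "url": "url_contains"}

def validate_assertions(raw: list) -> list:
    fixed = []
    for a in raw:
        t = SYNONYMS.get(a.get("type"), a.get("type"))
        if t not in ALLOWED:
            continue  # unknown assertion type: drop rather than ship broken
        if t != "visible" and "expected" not in a:
            continue  # value-comparing asserts need an expected value
        entry = {"type": t, "selector": a.get("selector", "body")}
        if "expected" in a:
            entry["expected"] = a["expected"]
        fixed.append(entry)
    return fixed

out = validate_assertions([
    {"type": "is_visible", "selector": ".nav"},                  # synonym: corrected
    {"type": "text", "selector": "h1", "expected": "Welcome"},   # synonym: corrected
    {"type": "screenshot_matches"},                              # unknown: dropped
])
```

The design choice here is lenient-in, strict-out: accept the model's near-misses, but never pass the runner anything outside the schema.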
Bug reports, feedback, and PRs are all welcome. What edge cases should I try next?