Muhammad Taqi

Posted on Jun 1

I built an AI QA agent in one week that tests your app like a real user

#opensource #testing #webdev #ai

I hate writing test scripts.

Not because testing is unimportant. The opposite. I care about quality. But every time I sit down to write Playwright or Cypress tests, I spend three hours fighting selectors that break the moment someone renames a CSS class. The tests become a second codebase that needs maintaining. And the worst part? They only test the paths I already thought of.

Real bugs don't come from paths you thought of. They come from the user who clicks the wrong button first, the one who skips the instructions, the one who types something unexpected into a form field. You can't script for behavior you haven't imagined.

So I started thinking about this differently.

What if instead of writing scripts, you just described what you want to test? And something else figured out how to test it?

That's what I built. It's called Crawlix.

How it actually works

You give it a URL and a goal:

crawlix run --url https://myapp.com --goal "complete the signup flow"

Crawlix spawns multiple AI agents simultaneously. Each one has a distinct personality, a behavioral profile that shapes how it thinks and acts.

There's the First-Timer. Never seen your app before, reads nothing, clicks whatever looks most obvious. Gets confused by jargon. Tries the big button first.

There's the Impatient user. Skips all instructions. Submits forms before filling them out. Clicks the submit button three times because nothing happened immediately.

There's the Adversarial agent. Tries SQL injection in your input fields. Uploads wrong file types. Modifies IDs in the URL to try accessing other users' data.

And three more. A Power User, a Non-Native Speaker, and a Slow Network user.

Each one opens your app in a real browser, reads the actual UI elements on screen, and decides what to do next based on their personality. No hardcoded selectors. No predefined flows. Just an AI persona making decisions the same way a real person would.
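To make the persona idea concrete, here is a minimal sketch of how a behavioral profile could be turned into the instructions an agent sends to the LLM on each step. The `Persona` shape, the `buildPrompt` helper, and the prompt wording are all assumptions for illustration, not taken from the Crawlix source.

```typescript
// Hypothetical shape of a persona profile; field names are illustrative.
interface Persona {
  name: string;
  traits: string[];
}

const firstTimer: Persona = {
  name: "First-Timer",
  traits: [
    "has never seen the app before",
    "reads nothing",
    "clicks whatever looks most obvious",
  ],
};

// Build the text sent to the LLM for one step: who the agent is,
// what it is trying to do, and what it can currently interact with.
function buildPrompt(persona: Persona, goal: string, elements: string[]): string {
  return [
    `You are the ${persona.name} persona.`,
    `Traits: ${persona.traits.join("; ")}.`,
    `Goal: ${goal}`,
    "Interactive elements on screen:",
    ...elements,
    "Reply with one JSON action.",
  ].join("\n");
}
```

Keeping the persona as plain data means adding a seventh personality is a new object, not new control flow.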

When they find something broken, confusing, or unexpected, they record it as a finding with a severity of critical, warning, or info. At the end, an AI generates a full report with patterns across agents, suggested fixes, and prioritized recommendations.
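The report itself is written by an LLM, but the "patterns across agents" idea can be sketched as a deterministic grouping step. The `Finding` and `Pattern` types and the `groupFindings` function below are hypothetical, and matching findings by identical message text is a deliberate simplification.

```typescript
type Severity = "critical" | "warning" | "info";

// Hypothetical record an agent might emit when it hits a problem.
interface Finding {
  agent: string;
  severity: Severity;
  message: string;
}

// One issue, with every agent that ran into it.
interface Pattern {
  message: string;
  severity: Severity;
  agents: string[];
}

const severityRank: Record<Severity, number> = { critical: 0, warning: 1, info: 2 };

// Group findings that share a message, keep the most severe level reported,
// then sort by severity and by how many agents hit the issue.
function groupFindings(findings: Finding[]): Pattern[] {
  const byMessage = new Map<string, Pattern>();
  for (const f of findings) {
    const existing = byMessage.get(f.message);
    if (!existing) {
      byMessage.set(f.message, { message: f.message, severity: f.severity, agents: [f.agent] });
      continue;
    }
    if (!existing.agents.includes(f.agent)) existing.agents.push(f.agent);
    if (severityRank[f.severity] < severityRank[existing.severity]) {
      existing.severity = f.severity;
    }
  }
  return Array.from(byMessage.values()).sort(
    (a, b) =>
      severityRank[a.severity] - severityRank[b.severity] ||
      b.agents.length - a.agents.length,
  );
}
```

An issue that several personas hit independently is usually a stronger signal than one a single agent stumbled on, which is why agent count is the tie-breaker here.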

What surprised me building this

The hardest part wasn't the AI integration. It was making the agents actually understand the page.

My first approach was feeding the full DOM to the LLM. Bad idea. 50KB of HTML on every step, expensive and slow. I switched to extracting just the interactive elements as a clean numbered list:

[01] button: "Sign Up"
[02] input: "Email" (email) name="email"
[03] input: "Password" (password) name="password"
[04] link: "Already have an account?"

That's maybe 2KB. The LLM reads it, decides what to interact with, returns a JSON action. The adapter executes it in Playwright. Loop.

Simple loop, but getting it right took most of the week.
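Here is a minimal sketch of that loop, assuming a few things the post doesn't spell out: the `UiElement` and `Action` types, the exact JSON shape of an action, and the three callbacks standing in for the Playwright adapter and the LLM client are all hypothetical. The real Crawlix code may look quite different.

```typescript
// Hypothetical description of one interactive element on the page.
interface UiElement {
  role: "button" | "input" | "link";
  label: string;
  inputType?: string; // e.g. "email", "password"
  name?: string;
}

// Hypothetical action format the LLM is asked to return as JSON.
type Action =
  | { type: "click"; target: number }
  | { type: "fill"; target: number; value: string }
  | { type: "done" };

// Render elements as the compact numbered list shown above.
function formatElements(elements: UiElement[]): string {
  return elements
    .map((el, i) => {
      const index = String(i + 1).padStart(2, "0");
      const type = el.inputType ? ` (${el.inputType})` : "";
      const name = el.name ? ` name="${el.name}"` : "";
      return `[${index}] ${el.role}: "${el.label}"${type}${name}`;
    })
    .join("\n");
}

// Validate the model's reply before acting on it; LLM output is untrusted.
function parseAction(raw: string, elementCount: number): Action {
  const data: unknown = JSON.parse(raw);
  if (typeof data !== "object" || data === null) {
    throw new Error("Action must be a JSON object");
  }
  const { type, target, value } = data as Record<string, unknown>;
  if (type === "done") return { type: "done" };
  if (
    typeof target !== "number" ||
    !Number.isInteger(target) ||
    target < 1 ||
    target > elementCount
  ) {
    throw new Error(`Target out of range: ${String(target)}`);
  }
  if (type === "click") return { type: "click", target };
  if (type === "fill" && typeof value === "string") return { type: "fill", target, value };
  throw new Error(`Unsupported action: ${String(type)}`);
}

// One agent run: observe the page, ask the model, act, repeat.
// Returns the number of steps taken.
async function runAgent(
  observe: () => Promise<UiElement[]>,
  decide: (elementList: string) => Promise<string>,
  execute: (action: Action, elements: UiElement[]) => Promise<void>,
  maxSteps = 20,
): Promise<number> {
  for (let step = 1; step <= maxSteps; step++) {
    const elements = await observe();
    const action = parseAction(await decide(formatElements(elements)), elements.length);
    if (action.type === "done") return step;
    await execute(action, elements);
  }
  return maxSteps;
}
```

Addressing elements by their list number rather than by selector is what keeps the prompt small, and checking the number against the current list catches replies that point at something no longer on screen.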

The report that came out of testing my own app

I ran it on my own project and it found something I genuinely hadn't noticed.

A parent div with z-index: 10 was intercepting pointer events on several buttons. Real users were clicking those buttons and nothing was happening, silently failing. I had tested the app myself dozens of times and never caught it because I knew where to click. The adversarial agent, who knows nothing about the app, found it in 14 steps.

That's the thing about testing with agents who don't know your app. They find what you stopped seeing.
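A bug like that overlay comes down to a hit test: is the element under the click point actually the button? Here is a rough sketch of that check. The `ElementLike` and `DocumentLike` interfaces and the `receivesClick` helper are hypothetical and only mirror the small part of the browser DOM the check needs (`getBoundingClientRect`, `elementFromPoint`, `contains`). This is not how Crawlix detected the problem, just one way to confirm it.

```typescript
// Minimal structural types so the check can also run outside a browser.
interface ElementLike {
  contains(other: ElementLike | null): boolean;
  getBoundingClientRect(): { left: number; top: number; width: number; height: number };
}

interface DocumentLike {
  elementFromPoint(x: number, y: number): ElementLike | null;
}

// True when a click at the element's center would land on the element itself
// or one of its descendants; false when something else sits on top of it.
function receivesClick(doc: DocumentLike, el: ElementLike): boolean {
  const rect = el.getBoundingClientRect();
  const hit = doc.elementFromPoint(
    rect.left + rect.width / 2,
    rect.top + rect.height / 2,
  );
  return hit !== null && el.contains(hit);
}
```

In a real page you would pass `document` and the button element. If the function returns false, whatever `elementFromPoint` returned is the element swallowing the click.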

The stack

TypeScript, Playwright for browser automation, Commander for the CLI. Supports 8 LLM providers: Groq, Gemini, Cerebras, Mistral, OpenRouter, Ollama, OpenAI, and Anthropic.

Where it is now

It's v0.1. It works. I've tested it on multiple real apps including DuckDuckGo and it navigates, fills forms, clicks buttons, and generates reports correctly. But there are rough edges. Element resolution sometimes fails on apps with custom components that don't have proper accessibility attributes. Rate limiting on very long runs can still cause issues.

I'm building it in public. The web adapter is done. Next is an API adapter for testing REST endpoints without a UI, then mobile via Appium.

If you try it and hit something broken, open a GitHub issue. Early users who find bugs are the most valuable people in any open source project.

npm install -g crawlix
crawlix setup
crawlix run --url https://yourapp.com --goal "your goal here"

GitHub: github.com/m-taqii/crawlix

Top comments (2)

xulingfeng • Jun 1

The "adversarial agent tries SQL injection in your input fields" part — that's exactly why we built multiple persona profiles into our test framework too. We found that a single "generic tester" agent misses most edge cases because it stays polite.

One thing I'd be curious about: how do you handle the state between agents? If the First-Timer creates an account and the Impatient user comes next, do they share the same session or each start fresh? We ran into this where one agent's actions polluted the state for the next one.

Muhammad Taqi • Jun 2

Actually, each agent has its own session; that's how they can test simultaneously. Each agent opens the app in a separate browser tab (same browser instance, though) and records its own findings. At the end, the reporter receives all of these findings and compiles a clean report.