
Originally published at pub.towardsai.net

How to Generate Cypress, Playwright, and WebdriverIO Tests From Natural Language Using AI

A step-by-step breakdown of an open-source platform that converts plain English requirements into runnable E2E tests — no manual coding required

Writing end-to-end tests is one of those things every developer knows they should do well and almost nobody actually enjoys. You spend an hour getting a Playwright spec to click the right button, another hour figuring out why the selector breaks in CI, and by then the feature has already been redesigned anyway.

So when I came across a project that lets you describe what you want to test in plain English — and then generates the actual test code — I had to dig in.

The project is called AI Natural Language Tests, built under AI Quality Lab. It is open source on GitHub, has a published academic DOI on Zenodo, and shipped v5.0.0 just this week. You can also try it right now in your browser on Hugging Face Spaces — no installation needed.

Here is what it does, how it works, and why it deserves a spot in your QA toolkit.

The Core Idea

Instead of writing:

cy.get('#username').type('admin')
cy.get('#password').type('secret')
cy.get('[type=submit]').click()
cy.contains('Dashboard').should('be.visible')

You just say:

"Test login with valid credentials"

The platform reads that sentence, visits the URL you point it at, analyzes the live HTML to find the actual form fields and selectors, then generates a complete runnable test — in Cypress, Playwright, or WebdriverIO, whichever you prefer.

That is the pitch. But the internals are more interesting than the pitch.

What Is Actually Happening Under the Hood

This is not a thin wrapper around a ChatGPT call. It runs a structured five-step workflow built with LangGraph — and each step has a clear purpose.

Step 1 — Understand the page. When you pass a --url, the system fetches the live HTML and extracts real selectors, form fields, and interactive elements. This is what prevents it from hallucinating IDs that do not exist on your page.
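
The README does not publish the extractor itself, so treat this as a minimal sketch of the idea, assuming requests and BeautifulSoup; the function and field names here are mine, not the project's:

```python
# Sketch of step 1: fetch a live page and extract real selectors.
# Assumes requests + BeautifulSoup; the project's actual extractor may differ.
import requests
from bs4 import BeautifulSoup

def extract_page_context(url: str) -> dict:
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")

    # Form inputs with their real ids/names/types
    fields = [
        {"tag": el.name, "id": el.get("id"),
         "name": el.get("name"), "type": el.get("type")}
        for el in soup.select("input, select, textarea")
    ]
    # Clickable elements a generated test might target
    buttons = [
        {"text": el.get_text(strip=True), "id": el.get("id")}
        for el in soup.select("button, [type=submit]")
    ]
    return {"url": url, "fields": fields, "buttons": buttons}

context = extract_page_context("https://the-internet.herokuapp.com/login")
# context["fields"] now holds real ids like "username", which get fed to the
# LLM prompt so the generator cannot invent selectors that do not exist.
```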

Step 2 — Check memory. The system keeps a vector database (FAISS + SQLite) of patterns from every test it has previously generated. Before writing anything new, it searches for similar past tests using semantic similarity. If it has seen a login flow before, it reuses what worked.
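
The exact schema is not documented, but a FAISS-backed pattern memory typically reduces to a few lines. A rough sketch, where the embedding dimension, function names, and the SQLite side are assumptions:

```python
# Sketch of step 2: semantic lookup over previously generated tests.
# A FAISS index holds requirement embeddings; SQLite (not shown) would
# map row ids back to the stored test code.
import faiss
import numpy as np

DIM = 384                          # depends on the embedding model
index = faiss.IndexFlatL2(DIM)     # exact L2 search over past requirements

def remember(embedding: np.ndarray) -> None:
    """Store one requirement embedding in the index."""
    index.add(embedding.reshape(1, DIM).astype("float32"))

def recall(embedding: np.ndarray, k: int = 3):
    """Return distances and row ids of the k nearest past requirements."""
    distances, ids = index.search(embedding.reshape(1, DIM).astype("float32"), k)
    return distances[0], ids[0]
```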

Step 3 — Generate with an LLM. The actual test code is produced by your choice of LLM — OpenAI, Anthropic Claude, or Google Gemini. LangChain handles prompt templating and output parsing, while LangGraph turns the multi-step flow into a repeatable, auditable pipeline rather than a single prompt-and-pray call.
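
In LangChain terms, the generation step is roughly a prompt template piped into a chat model. A sketch, where the prompt wording and model choice are my stand-ins rather than the project's actual prompt:

```python
# Sketch of step 3: prompt templating + LLM call via LangChain.
# Requires langchain-openai and an OPENAI_API_KEY; prompt text is illustrative.
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

prompt = ChatPromptTemplate.from_messages([
    ("system", "You write {framework} E2E tests. Use ONLY these selectors: {selectors}"),
    ("human", "{requirement}"),
])

chain = prompt | ChatOpenAI(model="gpt-4o-mini") | StrOutputParser()

test_code = chain.invoke({
    "framework": "playwright",
    "selectors": "#username, #password, [type=submit]",
    "requirement": "Test login with valid credentials",
})
```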

Step 4 — Optional human review. There is a --approve flag that pauses execution before saving the generated test and asks a human to confirm. This Human-in-the-Loop gate is especially useful when running the tool against production-critical flows where you want a set of eyes before anything gets committed.

Step 5 — Run it. Pass --run and the tool immediately executes the generated test through the framework runner. If it fails, an AI-assisted failure analyzer categorizes the error and suggests a fix — more on that below.
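
To make the flow concrete, here is what a five-node LangGraph pipeline of this shape looks like; the node names, state fields, and stub bodies are illustrative, not the project's actual graph:

```python
# Sketch of the five-step flow as a LangGraph state machine.
from typing import TypedDict
from langgraph.graph import StateGraph, END

class TestGenState(TypedDict, total=False):
    url: str
    requirement: str
    page_context: dict
    similar_tests: list
    test_code: str
    approved: bool

def analyze_page(state):  return {"page_context": {}}   # step 1: extract selectors
def search_memory(state): return {"similar_tests": []}  # step 2: FAISS lookup
def generate_test(state): return {"test_code": ""}      # step 3: LLM generation
def human_review(state):  return {"approved": True}     # step 4: --approve gate
def run_test(state):      return {}                     # step 5: framework runner

graph = StateGraph(TestGenState)
for name, fn in [("analyze_page", analyze_page), ("search_memory", search_memory),
                 ("generate_test", generate_test), ("human_review", human_review),
                 ("run_test", run_test)]:
    graph.add_node(name, fn)

graph.set_entry_point("analyze_page")
graph.add_edge("analyze_page", "search_memory")
graph.add_edge("search_memory", "generate_test")
graph.add_edge("generate_test", "human_review")
graph.add_edge("human_review", "run_test")
graph.add_edge("run_test", END)

app = graph.compile()
# app.invoke({"url": "...", "requirement": "Test login with valid credentials"})
```

Because each step is a named node with typed state, every run is inspectable step by step, which is what makes the pipeline auditable rather than a single opaque prompt.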

Getting Started Takes About Five Minutes

git clone https://github.com/aiqualitylab/ai-natural-language-tests.git
cd ai-natural-language-tests
python -m venv .venv && source .venv/bin/activate # macOS/Linux
pip install -r requirements.txt
npm ci
npx playwright install chromium

Add your API key to a .env file:

OPENAI_API_KEY=your_key

Then generate and immediately run a test:

python qa_automation.py "Test login with valid credentials" \
  --url https://the-internet.herokuapp.com/login \
  --framework playwright \
  --run

That single command fetches the page, generates a .spec.ts file, and runs it through Playwright — without you writing a line of test code.

If you just want to see it work before installing anything, the live Hugging Face Spaces demo lets you paste in a requirement and watch the generation happen in real time.

Three Frameworks, One Workflow

The tool supports all three major E2E frameworks with the same natural language interface. You switch between them with a single flag: --framework cypress, --framework playwright, or --framework webdriverio.

The Cypress integration is worth noting specifically — it supports two distinct modes. The traditional mode generates standard Cypress code. The prompt-powered mode uses cy.prompt() to keep natural language embedded directly in the test, which is useful for teams exploring the newer AI-native Cypress APIs.

If your team is mid-migration from Cypress to Playwright, you can generate equivalent tests in both frameworks from the same requirement and compare them side by side.

Writing Prompts That Actually Work

The output quality depends heavily on how specific you are. A few patterns that work well:

Name the expected outcome. “Test login fails with wrong password and shows an error message” produces a far more precise test than “Test login.”

Chain multiple requirements. You can pass several prompts in one run, for example "Test login" "Test logout" with a single --url, and each requirement gets its own generated file.

Always use --url. Giving the tool a real page means it reads actual HTML instead of guessing selector names. This is the single biggest factor in test quality, because the generator extracts real element IDs and attributes from the live DOM.

More practical examples are in the usage section of the README: https://github.com/aiqualitylab/ai-natural-language-tests#usage

When Tests Fail: AI-Assisted Diagnosis

One of the more practical features is the failure analyzer. Instead of staring at a cryptic Cypress error, you pass it to the tool:

python qa_automation.py --analyze "CypressError: Element not found"

The analyzer categorizes the error into one of ten types — SELECTOR, TIMING, ASSERTION, NETWORK, STATE, NAVIGATION, INTERACTION, CONFIGURATION, ENVIRONMENT, or DYNAMIC_URL — then gives you a plain-English explanation of the root cause and a concrete suggestion for fixing it.
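
The README lists the categories but not the classification logic (the real analyzer is LLM-assisted), but conceptually it maps error text to one of those buckets before suggesting a fix. A deliberately naive keyword version, just to illustrate the shape:

```python
# Toy sketch of error categorization. The real analyzer uses an LLM;
# only the category names below come from the tool's documentation.
CATEGORIES = {
    "SELECTOR": ["element not found", "no element matched"],
    "TIMING": ["timed out", "timeout"],
    "ASSERTION": ["expected", "assert"],
    "NETWORK": ["net::", "failed to fetch", "econnrefused"],
    "NAVIGATION": ["navigation", "page crashed"],
}

def categorize(error: str) -> str:
    lower = error.lower()
    for category, needles in CATEGORIES.items():
        if any(n in lower for n in needles):
            return category
    return "ENVIRONMENT"  # arbitrary fallback; the real tool has 10 categories

print(categorize("CypressError: Element not found"))  # -> SELECTOR
```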

You can also point it at a full log file: python qa_automation.py --analyze -f error.log

The Quality Evaluation Layer

This is the part most people skip over in the README, but it is arguably the most important piece for teams that care about reliability.

Generating test code is only valuable if the generated tests are actually correct. The project includes two evaluation scripts that measure whether the output is grounded in the real page content.

Offline evaluation (no API key needed). The ragas_nlp_evaluator.py script compares generated output against a reference dataset using ROUGE and string similarity metrics. It runs entirely offline, exits with a non-zero code if quality drops below a configurable threshold, and is designed to run as a fast CI gate.
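
To picture what such a gate does, here is a minimal ROUGE-L check that fails the build below a threshold; the metric mix, threshold value, and dataset handling in the real script may differ:

```python
# Minimal sketch of an offline quality gate using ROUGE-L.
# Requires the rouge-score package; threshold is illustrative.
import sys
from rouge_score import rouge_scorer

THRESHOLD = 0.5

def gate(generated: str, reference: str) -> None:
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    score = scorer.score(reference, generated)["rougeL"].fmeasure
    print(f"ROUGE-L F1: {score:.3f}")
    if score < THRESHOLD:
        sys.exit(1)  # non-zero exit blocks the CI pipeline

gate(generated="await page.fill('#username', 'admin')",
     reference="await page.fill('#username', 'tomsmith')")
```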

LLM-based evaluation (requires OpenAI key). The ragas_evaluator.py script goes further. It fetches the live page HTML, uses GPT-4o-mini to answer the test requirement using that HTML, then scores the generated test on four dimensions: faithfulness to the page, relevance to the requirement, context precision, and context recall.
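
Those four dimensions correspond to Ragas's standard metrics, so the evaluation plausibly reduces to something like the following. The dataset fields here are illustrative, the actual script's wiring may differ, and this path requires an OpenAI key:

```python
# Sketch of an LLM-based evaluation with the ragas library (v0.1-style API).
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (faithfulness, answer_relevancy,
                           context_precision, context_recall)

data = Dataset.from_dict({
    "question": ["Test login with valid credentials"],
    "answer": ["await page.fill('#username', 'tomsmith')"],      # generated test
    "contexts": [["<form><input id='username'></form>"]],        # live page HTML
    "ground_truth": ["Fill #username and #password, submit, expect success"],
})

result = evaluate(data, metrics=[faithfulness, answer_relevancy,
                                 context_precision, context_recall])
print(result)  # per-metric scores you can gate CI on
```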

Both evaluators are wired into the GitHub Actions CI pipeline. The offline script runs first as a baseline check. If it passes, three parallel jobs spin up — one per framework — each generating tests, evaluating them with the LLM evaluator, and then executing them. If the score drops below the threshold, the pipeline blocks before the tests even run.

You are not shipping generated tests blindly. You have a measurable, automated quality signal at every stage.

Docker and CI/CD

The project ships pre-built Docker images on GitHub Container Registry. You can skip the clone entirely:

docker pull ghcr.io/aiqualitylab/ai-natural-language-tests:latest

docker run --rm \
  -e OPENAI_API_KEY=your_key \
  ghcr.io/aiqualitylab/ai-natural-language-tests:latest \
  "Test login" --url [https://the-internet.herokuapp.com/login](https://the-internet.herokuapp.com/login)

For CI/CD, pin to a specific release tag (v5.0.0) rather than latest for reproducibility. The recommended pipeline stages cover dependency installation, NLP baseline evaluation, test generation, LLM evaluation, test execution, and optional telemetry export to Grafana Tempo and Loki.

Observability (Optional but Thoughtful)

If your team runs Grafana, the project has native OpenTelemetry integration that exports traces to Grafana Tempo and ships logs to Loki. This is entirely optional — leaving the relevant environment variables unset disables it completely. But for teams that already operate a Grafana stack, having AI test generation traces alongside your application traces is a genuinely useful debugging surface.
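
For reference, a minimal OTLP trace exporter in Python has this shape; the endpoint, service name, and span name below are placeholders, and the project itself drives this through environment variables rather than code:

```python
# Minimal OpenTelemetry tracing setup exporting to an OTLP endpoint
# (e.g. Grafana Tempo). Endpoint and names are placeholders.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider(resource=Resource.create({"service.name": "ai-test-gen"}))
provider.add_span_processor(BatchSpanProcessor(
    OTLPSpanExporter(endpoint="http://tempo:4317", insecure=True)))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("generate_test"):
    pass  # each pipeline step would emit a span like this
```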

What It Does Not Do Yet

To be fair about the limits: the current CLI works through URL-driven generation. A --data flag for passing raw JSON specifications directly is not implemented yet. If your tests target APIs or non-rendered content, you will need to adapt. Given the active release cadence — nine releases with v5.0.0 landing this week — that gap may close soon.

Why This Matters Beyond the Tool Itself

The bottleneck in most QA pipelines is not running tests — it is writing them. Engineers skip test authoring because it is slow, tedious, and breaks constantly as UIs change. This tool makes the first draft essentially free, which lowers the activation energy enough that more tests actually get written.

The pattern memory design compounds the value over time. Every test the system generates gets stored as a vector embedding. Future generations for similar requirements pull from those patterns, so the output becomes more consistent and more project-specific as usage grows. It is not just generating tests in isolation — it is building institutional knowledge about how your application is structured.

The Ragas evaluation layer means you can measure whether that knowledge is accurate, and block on it in CI if it is not.

Try It

The project is open source at github.com/aiqualitylab/ai-natural-language-tests.

Want to experiment without installing anything? The live demo is on Hugging Face Spaces.



Are you using AI-assisted test generation in your pipeline?

Share what has worked — and what has not.

