
Anna Karnaukh


AI Testing Isn’t Magic — How We Built an AI-Powered E2E Testing Tool (Part 1)

We set out to build a system that could generate and run end-to-end tests automatically with predictable results and minimal input from the user. No recorded flows, no screenshots, no manual setup, just a URL. The system scans the page, builds a structured view using a serialized DOM, and generates test cases based on what it finds.
In this article, we’re openly sharing the engineering side of that journey - what we tried, what didn’t work, and how the system evolved. We started with reinforcement learning, moved on to transformer-based models, and eventually landed on a mix of self-hosted LLMs, external APIs, and RAG patterns. And yes, we ended up building a multi-agent system - though not because it’s trendy, but because the architecture naturally led us there.
If you're curious about integrating AI into test automation, or just want to see how this kind of system comes together, read on.

1. Early Experiments: Reinforcement Learning on Visual Inputs

We initially approached UI testing as a reinforcement learning (RL) problem. The idea was to train an agent to “see” a rendered web page: locate input fields and buttons based on screenshots, and act as a human user would. We fed screenshots into an RL model so it could learn where fields and click targets were. A few iterations into the RL research, we began building a visual dataset and training a multimodal model that combined raw HTML (as text) with corresponding screenshots. The hope was that the model would use HTML structure and pixel data together to detect interactive elements. In practice, this setup was:

  • Huge (15 GB+ models). The RL models we experimented with quickly ballooned in size.
  • Slow to train/infer. Every RL step required rendering a screenshot, feeding it into a massive vision+text network, and running a full forward pass—way too expensive for anything like CI.
  • Unstable. We never got to a “production-ready” state. After a couple of weeks of labeling a few hundred sites for object detection (e.g., headers, footers, forms), we hit dead ends when generalizing to arbitrary pages.

That effort taught us two things: purely vision-based RL for E2E testing was “prehistoric” for our use case and not something we wanted to keep investing in, and we needed lighter, text-centric approaches to move forward.

2. Pivot to Text-First Models & NLP Tasks

After experimenting with RL-based visual parsing, we shifted to an NLP-based approach, starting with a few simple language tasks:

1. Language Detection. We needed to know what language a page’s text was in. We pulled in a few proven, lightweight language detection models - self-hosted, < 100 MB each, MIT/Apache-licensed - since we try to keep things cost-effective and avoid paying for third-party services when possible (a rough sketch of this step appears after the figure below).
2. Page Summarization. We wanted a quick way to classify a page’s purpose. For example, if we click “Contact Us” and the model summarizes “This is the Contact Us page,” we know the navigation worked. If it summarizes “Login,” that signals a bug.

[Image: Example of page summarization]
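As a rough sketch of the first task: the MIT-licensed `franc` package stands in here for the small self-hosted detectors we actually run, and the English fallback is just an assumption for the example.

```typescript
// Rough sketch of the language-detection step. `franc` (MIT-licensed)
// stands in for the small self-hosted detectors we actually run.
import { franc } from 'franc';

export function detectPageLanguage(visibleText: string): string {
  // franc returns an ISO 639-3 code, or 'und' when it cannot decide
  const code = franc(visibleText);
  return code === 'und' ? 'eng' : code; // fall back to English (assumption for this sketch)
}
```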

At first, we tried the specialized summarization models available online. Most of them weighed in at around 1.5 GB - too large for what they delivered, outdated, and underwhelming in quality. Then we discovered that the smallest LLMs (1–2 GB) often outperformed those older summarizers.

3. Self-Hosted LLM Era: Finding the Right Model

We hunted on Hugging Face for small, open-source transformer models with MIT or Apache licenses - something that could handle summarization and navigation testing cost-effectively, and that we could eventually reuse for test case generation. Here’s what we tried:

  • Microsoft PHI (1.5 GB). The model didn’t meet our quality bar.
  • QWEN Series (Alibaba). They offer a tiered lineup of 0.5 B, 1.5 B, 3 B, and 7 B parameters. The 1.5 B “QWEN-tiny” was our first real winner - its weights fit easily on a single 16 GB GPU, inference latency was reasonable, and summarization quality was solid.

How we used it: We ran the QWEN-1.5B model in Docker and used our DOM serializer for E2E flows. When navigating (e.g., “Click Contact Us”), the serializer generated a clean, structured JSON of the page. For summarization, we sent only the visible text—not the full JSON - to the LLM. If the summary matched our expectations, the step passed; if not, we flagged a bug.
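A minimal sketch of that flow is below. The OpenAI-compatible endpoint, the model name, the naive truncation, and the regex-based assertion are all assumptions for the example - the real pipeline works off the serializer output and has more robust validation.

```typescript
// Minimal sketch of the summarization check. The endpoint, model name,
// and the regex-based assertion are assumptions for this example.
import { chromium } from 'playwright';

const LLM_URL = 'http://localhost:8000/v1/chat/completions'; // assumed local endpoint

async function summarize(visibleText: string): Promise<string> {
  const res = await fetch(LLM_URL, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      model: 'qwen-1.5b',    // placeholder model name
      temperature: 0,        // deterministic output
      messages: [
        { role: 'system', content: 'Summarize the purpose of this web page in one sentence.' },
        { role: 'user', content: visibleText.slice(0, 6000) }, // naive truncation
      ],
    }),
  });
  const data = await res.json();
  return data.choices[0].message.content;
}

async function checkContactUsNavigation(url: string) {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  await page.goto(url);
  await page.click('text=Contact Us');        // the navigation step under test
  const text = await page.innerText('body');  // visible text only, not the full DOM
  const summary = await summarize(text);
  const passed = /contact/i.test(summary);    // did we land where we expected?
  await browser.close();
  return { summary, passed };
}
```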

Why JSON over raw HTML? Raw HTML includes a lot of unnecessary data like styles and classes, which wastes tokens. Our serializer kept only the useful parts—tags, types, key attributes - and reduced token usage by over 50%. We later added a nested structure so the model could better understand the page layout.
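To make that concrete, here is a stripped-down sketch of the serializer idea: walk the DOM, keep only interesting tags and a few key attributes, preserve nesting, and prune the rest. The tag whitelist and attribute set below are simplified examples, not the real serializer.

```typescript
// Simplified sketch of the DOM serializer: keep tags, types, and a few
// key attributes, preserve nesting, drop styles/classes and other noise.
import type { Page } from 'playwright';

interface SerializedNode {
  tag: string;
  type?: string;
  id?: string;
  name?: string;
  text?: string;
  children?: SerializedNode[];
}

export async function serializeDom(page: Page): Promise<SerializedNode> {
  return page.evaluate(() => {
    // Simplified whitelist of "interesting" tags for this sketch
    const KEEP = ['a', 'button', 'input', 'select', 'textarea', 'form',
                  'label', 'nav', 'header', 'footer', 'main'];
    const walk = (el: Element): any => {
      const children = Array.from(el.children).map(walk).filter(Boolean);
      const tag = el.tagName.toLowerCase();
      // Prune leaf nodes that carry no interactive value
      if (!KEEP.includes(tag) && children.length === 0) return null;
      return {
        tag,
        type: el.getAttribute('type') ?? undefined,
        id: el.id || undefined,
        name: el.getAttribute('name') ?? undefined,
        text: KEEP.includes(tag) ? (el as HTMLElement).innerText?.slice(0, 80) : undefined,
        children: children.length ? children : undefined,
      };
    };
    return walk(document.body);
  });
}
```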

4. From Summaries to Element-Finding & Test Case Generation and Validation

Once summarization was reliable, we shifted to the next core tasks:
1. Test Case Suggestions. The system generates test cases using only a URL as input - no need for users to record and play back actions, provide documentation, or upload baseline screenshots. It has to understand the page, identify fields and forms, figure out how to interact with them, and generate meaningful test cases with concrete steps from there.

Think of it as asking the LLM: “Given these elements, generate basic E2E test cases for login.” It produced 4-5 JSON-formatted test scenarios (valid login, invalid credentials, missing fields, etc.).
It’s important to highlight that element discovery played a crucial role in generating these suggestions. We first prompt the LLM with the JSON DOM and ask it to list the form fields and buttons related to login; it returns a JSON array of elements (e.g., “email,” “password,” “submit button”). A sketch of both prompts follows this list.

2. Test Execution. The system runs the generated test cases, validates the results against the expected assertions, and provides a clear conclusion to the user.
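Here is the sketch promised above: the two prompts behind test case suggestions, with the wording and the JSON shapes heavily simplified for illustration.

```typescript
// Conceptual sketch of the two prompts behind test case suggestions.
// Prompt wording and the JSON shapes are simplified assumptions.

// Step 1: element discovery - the LLM gets the serialized DOM and returns
// a JSON array of login-related elements.
const elementDiscoveryPrompt = (serializedDom: object) =>
  `Given this serialized DOM, list the form fields and buttons related to login ` +
  `as a JSON array of {role, selector, label}:\n${JSON.stringify(serializedDom)}`;

// Illustrative shape of what comes back for a typical login page:
const discoveredElements = [
  { role: 'input', selector: '#email', label: 'email' },
  { role: 'input', selector: '#password', label: 'password' },
  { role: 'button', selector: 'button[type=submit]', label: 'submit button' },
];

// Step 2: test case generation - the discovered elements go back in,
// concrete scenarios come out.
const testGenerationPrompt = (elements: object[]) =>
  `Given these elements: ${JSON.stringify(elements)}, generate 4-5 basic E2E ` +
  `test cases for login (valid login, invalid credentials, missing fields) ` +
  `as a JSON array of {title, steps, expected}.`;
```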

For these tasks, we needed a reliable interaction between the back-end and the LLM - a proper validation loop. It’s very much a “ping-pong” process:

  • The backend (Node.js + Playwright) sends the DOM to the LLM.
  • The LLM returns test cases → Playwright executes them → grabs the new DOM.
  • The LLM compares the old vs. new DOM summary → returns pass/fail.

A simplified version of the process is shown in the diagram below.

[Diagram: simplified version of the ping-pong process]

We set temperature=0 to force determinism, so the LLM produced the same output for identical inputs.
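In code, the loop looks roughly like this. The step format, the prompt, and the OpenAI-compatible endpoint are assumptions for the sketch; the real system exchanges the serialized DOM and summaries rather than raw HTML from page.content().

```typescript
// Minimal sketch of the ping-pong loop. The step format, the prompt, and
// the endpoint are assumptions; the real system exchanges the serialized
// DOM and summaries rather than raw HTML from page.content().
import type { Page } from 'playwright';

interface Step { action: 'click' | 'fill'; selector: string; value?: string; }
interface GeneratedTestCase { title: string; steps: Step[]; expectedOutcome: string; }

async function askLlm(prompt: string): Promise<string> {
  const res = await fetch('http://localhost:8000/v1/chat/completions', { // assumed endpoint
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      model: 'qwen-7b',  // placeholder model name
      temperature: 0,    // identical input -> identical output
      messages: [{ role: 'user', content: prompt }],
    }),
  });
  return (await res.json()).choices[0].message.content;
}

export async function runTestCase(page: Page, testCase: GeneratedTestCase): Promise<boolean> {
  const domBefore = await page.content();

  // Playwright executes the steps the LLM generated
  for (const step of testCase.steps) {
    if (step.action === 'click') await page.click(step.selector);
    else await page.fill(step.selector, step.value ?? '');
  }

  const domAfter = await page.content();

  // The LLM compares the before/after state against the expected outcome
  const verdict = await askLlm(
    `Expected outcome: ${testCase.expectedOutcome}\n` +
    `Page before:\n${domBefore.slice(0, 4000)}\n` +
    `Page after:\n${domAfter.slice(0, 4000)}\n` +
    `Answer strictly "pass" or "fail".`
  );
  return verdict.trim().toLowerCase().startsWith('pass');
}
```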

5. Technical Hurdles & Micro-Optimizations

Context Window Overflows: Pages with a large number of elements and a complex layout still overflowed the LLM’s token limit. On top of that, every front-end developer writes markup in their own way; there is no strict typing the way there is on the back end, so page structure varies wildly from site to site. And we're not even talking about complex pages with tables, charts, pagination, or filters: sometimes we hit the token limit even on simple login pages.

  • DOM Splitting: We split the serialized JSON into blocks (header, content, footer), summarized each, and merged the results—partial win.
  • ID Compression: Replaced long UUIDs with small integers before sending the DOM to the LLM, then remapped them after inference. Saved ~30% of tokens when element counts were high.
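The ID compression itself is a tiny bidirectional map; here is a sketch of the idea (class and method names are illustrative, not our actual code):

```typescript
// Sketch of the ID-compression trick: swap long UUIDs for small integers
// before sending the serialized DOM to the LLM, then map them back when
// translating the model's answer into Playwright actions.
export class IdCompressor {
  private toShort = new Map<string, number>();
  private toLong = new Map<number, string>();
  private next = 1;

  compress(uuid: string): number {
    if (!this.toShort.has(uuid)) {
      this.toShort.set(uuid, this.next);
      this.toLong.set(this.next, uuid);
      this.next += 1;
    }
    return this.toShort.get(uuid)!;
  }

  // Remap the model's answer (which references short ids) back to real UUIDs
  decompress(shortId: number): string {
    const uuid = this.toLong.get(shortId);
    if (!uuid) throw new Error(`Unknown element id: ${shortId}`);
    return uuid;
  }
}
```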

Memory & Latency: To keep the platform cost-effective, we opted to use a self-hosted model running in our internal cluster, without relying on external APIs.

  • QWEN-1.5 B (~3 GB VRAM): quality was limited.
  • QWEN-3 B (8 GB): improved output, but came with a “Research” license, which we couldn’t use.
  • QWEN-7 B (14 GB): required quantization to fit within our memory budget when running on the vLLM engine.

To improve speed and reduce memory usage, we switched from Hugging Face’s Python runner to vLLM, which supports attention caching. Even with vLLM, the full 7B model occasionally ran out of memory, so in the end, we ran a quantized version of QWEN-7B on vLLM. This reduced VRAM usage, improved inference speed by about 30%, and delivered acceptable accuracy.

Agent Architecture (Ping-Pong): The challenge was stability: a single user could have dozens of projects, each with hundreds of pages. The system had to handle that volume while keeping the “ping-pong” flow intact. As described earlier, the LLM received a serialized DOM, identified elements, and generated actions; the back end executed them with Playwright, captured the updated page, and sent it back for validation. But REST calls became unstable under load, and vLLM didn’t play well with uWSGI, so we switched to a Redis-based queue for more reliable communication. Even with these tweaks, latency remained as high as ~50 seconds per test generation or validation run - too slow for large suites. Multi-user CI would have required multiple GPU instances, which wasn’t cost-effective.
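The queue itself is nothing exotic. The sketch below uses BullMQ purely as an illustrative stand-in for our Redis-based setup; the queue and job names are hypothetical.

```typescript
// Illustrative Redis-backed queue between the back end and the LLM worker.
// BullMQ, the queue/job names, and the Redis location are assumptions.
import { Queue, Worker } from 'bullmq';

const connection = { host: 'localhost', port: 6379 }; // assumed Redis location

// Producer side: the back end enqueues an inference job instead of
// calling the LLM service over REST directly.
export const llmQueue = new Queue('llm-jobs', { connection });

export async function enqueueGeneration(serializedDom: object) {
  return llmQueue.add('generate-tests', { serializedDom });
}

// Consumer side: a worker colocated with the GPU instance pulls jobs
// and forwards them to the self-hosted LLM server.
new Worker(
  'llm-jobs',
  async (job) => {
    // callLlm() would wrap the HTTP call shown in earlier sketches
    // return callLlm(job.data.serializedDom);
    return job.data; // placeholder for the sketch
  },
  { connection, concurrency: 1 } // one job at a time per GPU
);
```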

So the next step was to:
– optimize the input sent to the LLM,
– review the entire flow,
– and consider switching to external APIs like OpenAI or Gemini.

That’s exactly what we’ll cover in the next article - stay tuned!
Give it a spin: paste any URL into Treegress and see how it works in action.
