<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Anna Karnaukh</title>
    <description>The latest articles on DEV Community by Anna Karnaukh (@annakarnaukh).</description>
    <link>https://dev.to/annakarnaukh</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3155684%2Fb747aa5e-d143-4dfa-8110-a15ec0f3dbc2.jpeg</url>
      <title>DEV Community: Anna Karnaukh</title>
      <link>https://dev.to/annakarnaukh</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/annakarnaukh"/>
    <language>en</language>
    <item>
      <title>DOM Serialization for AI Testing: Why We Bet on Structure Over Screenshots</title>
      <dc:creator>Anna Karnaukh</dc:creator>
      <pubDate>Mon, 14 Jul 2025 20:05:01 +0000</pubDate>
      <link>https://dev.to/annakarnaukh/dom-serialization-for-ai-testing-why-we-bet-on-structure-over-screenshots-1egj</link>
      <guid>https://dev.to/annakarnaukh/dom-serialization-for-ai-testing-why-we-bet-on-structure-over-screenshots-1egj</guid>
      <description>&lt;p&gt;Modern LLMs offer powerful capabilities for test automation. But there’s a catch: they can only reason as well as the data you give them. And when it comes to testing the web, messy or noisy data leads directly to hallucinations, broken flows, and unusable test cases.&lt;/p&gt;

&lt;p&gt;As we described in the &lt;a href="https://dev.to/annakarnaukh/ai-powered-end-to-end-testing-part-1-37ed"&gt;first&lt;/a&gt; and &lt;a href="https://dev.to/annakarnaukh/how-we-built-an-ai-powered-e2e-testing-tool-part-2-1l35"&gt;second&lt;/a&gt; articles about AI-powered end-to-end testing, we are building an automated test case generation system that software engineers can rely on. It's not record-and-playback, but full-fledged generation of meaningful test cases.&lt;/p&gt;

&lt;p&gt;In the diagram below, you can see a simplified scheme of how end-to-end testing works in Treegress. In this article, we cover the first block in that scheme (DOM serialization) - the foundation that lets us say our system is reliable and produces predictable, stable results. We explain why we chose the DOM serialization approach, what it took to get it right, and how it compares to alternatives like screenshots and visual models.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fceh6gvaq5daqjwel8g5w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fceh6gvaq5daqjwel8g5w.png" alt=" " width="800" height="846"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Business Value: Accuracy, Predictability, and Cost Reduction
&lt;/h2&gt;

&lt;p&gt;AI testing is only as good as the data that powers it. Visual-based systems that send screenshots or videos to multimodal models like GPT-4o quickly run into three serious business problems:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. It’s expensive&lt;/strong&gt; — thousands of tokens per interaction.&lt;br&gt;
&lt;strong&gt;2. It’s unreliable&lt;/strong&gt; — visuals miss semantic context and lead to errors in test generation.&lt;br&gt;
&lt;strong&gt;3. It’s uncontrollable&lt;/strong&gt; - raw screenshots don’t let us guide the model. LLMs rely on vague visual input and general pretraining, without structure or context. &lt;a href="https://arxiv.org/html/2411.07457v1" rel="noopener noreferrer"&gt;Studies&lt;/a&gt; show that well-structured prompts drastically reduce hallucinations and improve reliability, something raw images can’t offer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In other words:&lt;/strong&gt; the less structured and domain-aware your prompt is, the worse your outcome. Screenshots are the least structured prompt possible.&lt;br&gt;
A smarter alternative is DOM serialization. By feeding the model only the meaningful elements of the interface, we:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Filter out noise with a pipeline that embeds two decades of empirical knowledge about real-world frontend development&lt;/li&gt;
&lt;li&gt;Reduce costs by filtering irrelevant UI components&lt;/li&gt;
&lt;li&gt;Dramatically improve test accuracy&lt;/li&gt;
&lt;li&gt;Eliminate LLM hallucinations&lt;/li&gt;
&lt;li&gt;Build stable, repeatable test flows&lt;/li&gt;
&lt;/ul&gt;
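&lt;p&gt;To make the filtering step concrete, here is a minimal Python sketch of the idea. The tag list and kept attributes are illustrative stand-ins, not the actual rule set described later in this article:&lt;/p&gt;

```python
from html.parser import HTMLParser

# Illustrative rule set -- NOT the 18 empirical rules used in production.
INTERACTIVE_TAGS = {"a", "button", "input", "select", "textarea"}
KEEP_ATTRS = {"id", "name", "type", "href", "placeholder", "role", "aria-label"}

class DomSerializer(HTMLParser):
    """Collects only the elements an LLM would need to act on."""

    def __init__(self):
        super().__init__()
        self.elements = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Keep native interactive tags, plus anything carrying role= or onclick.
        if tag in INTERACTIVE_TAGS or "role" in attrs or "onclick" in attrs:
            self.elements.append({
                "tag": tag,
                # Drop styling noise; keep only semantically useful attributes.
                "attrs": {k: v for k, v in attrs.items() if k in KEEP_ATTRS},
            })

def serialize(html: str) -> list[dict]:
    parser = DomSerializer()
    parser.feed(html)
    return parser.elements

page = """
<div class="hero" style="color:red"><h1>Welcome</h1></div>
<form><input type="email" name="email"><button id="go">Send</button></form>
<div role="button" aria-label="Close">x</div>
"""
print(serialize(page))
```

&lt;p&gt;Styling attributes like &lt;code&gt;class&lt;/code&gt; and &lt;code&gt;style&lt;/code&gt; never reach the model, which is where most of the token savings come from.&lt;/p&gt;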

&lt;p&gt;Here is a comparison of the DOM serialization and visual-model approaches:&lt;br&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy4io5b0k6y7ahcb5amhi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy4io5b0k6y7ahcb5amhi.png" alt=" " width="800" height="360"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In short:&lt;/strong&gt; cleaner input, smarter output, better ROI.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Journey: From Simple Snapshots to Precision DOM Maps
&lt;/h2&gt;

&lt;p&gt;Our first version of a DOM serializer was direct and naive. It treated every visible element as potentially interactive, and we quickly drowned in noise: only ~3% of its output was actually useful - the other 97% was noise.&lt;br&gt;
We tried other tools like Browser Use, which performed much better (roughly half of its output was usable), but still fell short. Complex CRMs, layered modals, and custom dropdowns broke every generic solution we tested.&lt;/p&gt;

&lt;p&gt;So we built our own.&lt;/p&gt;

&lt;p&gt;We applied our 20+ years of frontend experience and created a new kind of DOM serializer - one that doesn’t just parse HTML, but deeply understands what’s actionable on the screen.&lt;/p&gt;

&lt;h2&gt;
  
  
  Our Approach: Building a DOM That Makes Sense
&lt;/h2&gt;

&lt;p&gt;We don’t just serialize the DOM. We reconstruct a representation of the UI that includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Interactive elements, detected by 18 empirical rules&lt;/strong&gt; (rules collected over 20 years of front-end development).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Analysis of 20+ JS frameworks&lt;/strong&gt; (React, Angular, Vue, Hotwire, etc.).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Semantic and technical specifics&lt;/strong&gt;, such as label + input, data-attributes, class naming + semantic markup, accessibility trees, and shadow DOM.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CSS and z-index layers&lt;/strong&gt;, to determine what’s visible and actionable.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DOM mutation tracking&lt;/strong&gt;, to account for JavaScript hydration and dynamic components.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We filter out invisible or irrelevant layers, group related components (like dropdowns or calendars), and assign roles to elements (button, input, datepicker, etc.).&lt;/p&gt;
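&lt;p&gt;As an illustration (not our production classifier), role assignment can be thought of as a cascade of simple rules over a serialized element:&lt;/p&gt;

```python
def assign_role(el: dict) -> str:
    """Map a serialized element to a functional role.
    These rules are illustrative, not the production rule set."""
    tag = el.get("tag", "")
    attrs = el.get("attrs", {})
    if tag == "input":
        input_type = attrs.get("type", "text")
        if input_type in ("date", "datetime-local"):
            return "datepicker"
        if input_type in ("checkbox", "radio", "submit"):
            return input_type
        return "input"
    # Custom components often signal intent via ARIA roles rather than tags.
    if tag == "button" or attrs.get("role") == "button":
        return "button"
    if tag == "select":
        return "dropdown"
    if tag == "a":
        return "link"
    return "unknown"

print(assign_role({"tag": "input", "attrs": {"type": "date"}}))
```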

&lt;p&gt;This gives us a high-resolution map of the screen that an LLM can understand. No hallucinations, no fake buttons, no broken flows.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Not Just Use Visual Models?
&lt;/h2&gt;

&lt;p&gt;Visual models sound great in theory. They "see" the screen like a human. But in practice, they’re expensive and incomplete.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;High cost:&lt;/strong&gt; every screenshot costs thousands of tokens.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lack of semantic insight:&lt;/strong&gt; visuals can’t detect div vs. button vs. custom components.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Poor generalization:&lt;/strong&gt; visual models often fail on non-standard layouts or layered modals.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;DOM serialization gives us control. We decide what goes into the model. We reduce the noise. And we build a flow that can be reproduced and debugged.&lt;/p&gt;

&lt;p&gt;The only category of UI we can't support is canvas-based rendering (like Flutter for Web), which lacks a DOM entirely. Even then, we’re planning to train a model for visual fallback. But that comes later.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Result: 95% Precision, Scalable AI Testing
&lt;/h2&gt;

&lt;p&gt;Our current serializer identifies ~95% of real interactive elements across a diverse dataset of old, modern, and complex web apps.&lt;/p&gt;

&lt;p&gt;That includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;modern React/Angular/Vue apps,&lt;/li&gt;
&lt;li&gt;legacy systems,&lt;/li&gt;
&lt;li&gt;CRM platforms with complex flows,&lt;/li&gt;
&lt;li&gt;Shadow DOM-heavy interfaces.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With this foundation, we can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Generate reliable test cases using LLMs and a rule-based approach developed by professional QA engineers&lt;/li&gt;
&lt;li&gt;Run tests without LLMs (for predictability and cost)&lt;/li&gt;
&lt;li&gt;Use self-healing to recover from failures&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Future: Visual Models on Our Terms
&lt;/h2&gt;

&lt;p&gt;Once we lock in our primary flow, our next step is training a visual model tailored to the DOM maps we’ve built. Because our structure includes visual anchors, we can mark up thousands of real sites, create a dataset, and train a lightweight object detector.&lt;/p&gt;

&lt;p&gt;This model will be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ultra-fast,&lt;/li&gt;
&lt;li&gt;cost-efficient,&lt;/li&gt;
&lt;li&gt;and tuned to real-world UI patterns.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That’s how we combine the best of both worlds: structure and vision. But it all starts with one thing done well: clean, smart DOM serialization.&lt;/p&gt;

&lt;p&gt;DOM serialization isn’t just cheaper than visual models — it’s better. It gives AI models the right data to make the right decisions, prevents hallucinations, and creates predictable, scalable automation. At &lt;a href="https://www.treegress.com/" rel="noopener noreferrer"&gt;Treegress&lt;/a&gt;, we’ve invested heavily in getting this right, and we believe it’s the key to making AI-powered testing truly production-ready.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>ai</category>
      <category>testing</category>
      <category>automation</category>
    </item>
    <item>
      <title>How We Built an AI-Powered E2E Testing Tool (Part 2)</title>
      <dc:creator>Anna Karnaukh</dc:creator>
      <pubDate>Fri, 04 Jul 2025 15:33:29 +0000</pubDate>
      <link>https://dev.to/annakarnaukh/how-we-built-an-ai-powered-e2e-testing-tool-part-2-1l35</link>
      <guid>https://dev.to/annakarnaukh/how-we-built-an-ai-powered-e2e-testing-tool-part-2-1l35</guid>
      <description>&lt;p&gt;We’re still on a mission to build a true QA autopilot - one that needs only a URL to get going. In &lt;a href="https://dev.to/annakarnaukh/ai-powered-end-to-end-testing-part-1-37ed"&gt;Part 1&lt;/a&gt;, we covered our shift from vision-based RL to self-hosted LLMs and all the tweaks that made it reliable. Now in Part 2, we’re scaling up: testing OpenAI and Gemini APIs, plugging in RAG to keep tests focused, and evolving our multi-agent flow. Let’s jump in!&lt;/p&gt;

&lt;h2&gt;
  
  
  1. API Era: OpenAI &amp;amp; Gemini
&lt;/h2&gt;

&lt;p&gt;By early 2025, self-hosting small LLMs couldn’t meet scale or latency needs. We prototyped:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;OpenAI GPT-3.5/4 APIs&lt;/strong&gt;: Good quality, big context windows, ~0.5–1 s per call. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Google Gemini API&lt;/strong&gt;:

&lt;ul&gt;
&lt;li&gt;Latency: ~0.8 s per 1k tokens.&lt;/li&gt;
&lt;li&gt;Quality: Superior in complex navigation and multi-step wizards.&lt;/li&gt;
&lt;li&gt;Cost: Comparable to GPT-3.5, with more generous quotas.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;We decided to continue experimenting with Gemini.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why Gemini?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Bypassed local VRAM constraints.&lt;/li&gt;
&lt;li&gt;Handled thousands of tokens (full-page JSON) without major trimming.&lt;/li&gt;
&lt;li&gt;Scaled horizontally: we spun up multiple API workers in Docker Swarm.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That enabled us to test:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Multi-step Forms.&lt;/strong&gt; “Part 1 on Step 1 → Click Next → Part 2 on Step 2 → Submit” flows worked reliably, since Gemini tracked state across contexts.&lt;br&gt;
&lt;strong&gt;2. CRM CRUD Workflows.&lt;/strong&gt; Complex user flows - like “Create record → Edit → Delete” - across nested tables and modals can now be executed in a single prompt. This is a crucial step forward: for example, to properly test the delete functionality for a "Lead" in a CRM, you first need to create a Lead, then delete it, ensuring you're not interfering with any existing data. Treegress now makes this kind of isolated, end-to-end scenario possible with one command.&lt;/p&gt;
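&lt;p&gt;To illustrate (the field names and selectors here are hypothetical, not our actual schema), such an isolated scenario might come back from the model as a structured list of steps, which the backend can sanity-check before running:&lt;/p&gt;

```python
# Hedged sketch of a generated, self-contained CRUD scenario.
# Step fields and selectors are hypothetical, not Treegress's schema.
lead_delete_scenario = {
    "name": "Delete a Lead without touching existing data",
    "steps": [
        {"action": "click", "target": "button#new-lead"},
        {"action": "fill", "target": "input[name=name]", "value": "QA Temp Lead"},
        {"action": "click", "target": "button#save"},
        {"action": "assert_visible", "target": "text=QA Temp Lead"},
        {"action": "click", "target": "button#delete"},
        {"action": "assert_hidden", "target": "text=QA Temp Lead"},
    ],
}

def is_isolated(scenario: dict) -> bool:
    """An isolated delete test must create its own record before deleting it."""
    actions = [s["action"] for s in scenario["steps"]]
    return "fill" in actions and actions.index("fill") < actions.index("assert_hidden")
```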

&lt;h2&gt;
  
  
  2. Introducing RAG for Larger Pages
&lt;/h2&gt;

&lt;p&gt;We understand that generating a large number of test cases is pointless if they aren’t clear and meaningful. That’s why we rely on our team’s deep experience in software engineering and subject matter expertise in QA to ensure every test adds real value. So we built a basic RAG pipeline:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;QA-Curated Test Templates.&lt;/strong&gt; QA engineers wrote generic test-case templates for common page types (Login, Contact Us, CRUD lists). &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Retrieval at Runtime.&lt;/strong&gt; When our system knows it’s on a Contact Us page, it pulls that template from our document store.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LLM calls → Tests.&lt;/strong&gt; Because we lock onto QA templates, hallucinations drop, and coverage improves.&lt;/li&gt;
&lt;/ul&gt;
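&lt;p&gt;A toy version of the retrieval step might look like this - the template store and the page-type heuristic are deliberately simplified assumptions:&lt;/p&gt;

```python
# Minimal stand-in for the template store; the real QA-curated templates
# and page classifier are far richer than this sketch.
TEMPLATES = {
    "login": ["valid credentials", "invalid password", "empty fields"],
    "contact_us": ["submit valid form", "missing email", "invalid email format"],
}

def classify_page(serialized_dom: list[dict]) -> str:
    """Guess the page type from the fields present (illustrative heuristic)."""
    names = {el.get("name", "") for el in serialized_dom}
    if "password" in names:
        return "login"
    if "message" in names or "subject" in names:
        return "contact_us"
    return "generic"

def build_prompt(serialized_dom: list[dict]) -> str:
    page_type = classify_page(serialized_dom)
    template = TEMPLATES.get(page_type, ["smoke test"])
    return f"Page type: {page_type}. Cover at least: {', '.join(template)}."

dom = [{"tag": "input", "name": "email"}, {"tag": "input", "name": "password"}]
print(build_prompt(dom))
```

&lt;p&gt;Locking the prompt onto a retrieved template is what keeps the model from inventing tests the page can’t support.&lt;/p&gt;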

&lt;p&gt;Result:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Predictability.&lt;/li&gt;
&lt;li&gt;No fluff—just real, meaningful test cases.&lt;/li&gt;
&lt;li&gt;Easier to cover edge cases without relying solely on LLM's creative power.&lt;/li&gt;
&lt;li&gt;Templates + full-page JSON give LLM enough context to customize tests per page.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqi4gpvoomzuoqbp7315x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqi4gpvoomzuoqbp7315x.png" alt="Image description" width="728" height="1059"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Multistep support, work with intentions&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  3. What Failed &amp;amp; Why
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Vision+Text RL Models:&lt;/strong&gt; Massive and slow.&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Flat List JSON:&lt;/strong&gt; Lost parent/child relations—poor element detection.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fix: Switch to nested JSON reflecting the true DOM tree.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Local VLLM + Large Models:&lt;/strong&gt; OOMs, crashes, high latency.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Lesson: If you need low latency and high concurrency, self-hosting big models can be a dead end—APIs often win.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
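&lt;p&gt;The flat-list fix above - restoring parent/child relations - can be sketched in a few lines. The &lt;code&gt;parent&lt;/code&gt; field is assumed bookkeeping, not our actual schema:&lt;/p&gt;

```python
def nest(flat: list[dict]) -> list[dict]:
    """Rebuild the DOM tree from a flat element list.
    Each element carries its own id and its parent's id (None for roots)."""
    by_id = {el["id"]: {**el, "children": []} for el in flat}
    roots = []
    for el in flat:
        node = by_id[el["id"]]
        parent = el.get("parent")
        if parent is None:
            roots.append(node)
        else:
            by_id[parent]["children"].append(node)
    return roots

flat = [
    {"id": 1, "tag": "form", "parent": None},
    {"id": 2, "tag": "input", "parent": 1},
    {"id": 3, "tag": "button", "parent": 1},
]
tree = nest(flat)
```

&lt;p&gt;With the tree restored, the model can tell that the input and the button belong to the same form - exactly the context the flat list threw away.&lt;/p&gt;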

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;We set out to build an AI agent that “sees” a page, “understands” its purpose, and “generates” or “validates” test cases with minimal manual work. Our path led us from massive vision+RL efforts to lean JSON+LLM setups, and ultimately to a multi-agent architecture: a hybrid of self-hosted LLMs, external APIs, and RAG. Each dead end taught us something crucial: raw vision is heavy (but can still be helpful), flat JSON loses context, and local LLM hosting runs into VRAM walls.&lt;/p&gt;

&lt;p&gt;Today, our production flow looks like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Website Crawler &amp;amp; Analyzer&lt;/strong&gt; – Parses and serializes each page into nested JSON. Thanks to our unique DOM handling, it includes a built-in self-healing engine (a topic we’ll cover in a separate article).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI Agent for Test Case Generation&lt;/strong&gt; – Uses a RAG layer to fetch QA templates from a file store based on the page type. These are combined into a prompt and sent to the Gemini API, which returns structured test cases or direct verdicts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI Agent for Test Case Verification&lt;/strong&gt; – Reviews the generated test cases for accuracy and relevance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Backend Services with Playwright&lt;/strong&gt; – Executes the tests against the live environment.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It’s a long journey - not just to generate test cases, but to build a platform that truly understands your website, generates meaningful, context-aware test cases, and ensures reliability with predictable outcomes. But let us be clear: we didn’t set out to build a Copilot for QA automation — we’re building an Autopilot for testing.&lt;/p&gt;

&lt;p&gt;If you’d like to be part of that journey, we’d love for you to try &lt;a href="https://www.treegress.com/" rel="noopener noreferrer"&gt;Treegress&lt;/a&gt; and share your thoughts.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>ai</category>
      <category>testing</category>
      <category>automation</category>
    </item>
    <item>
      <title>AI Testing Isn’t Magic — How We Built an AI-Powered E2E Testing Tool (Part 1)</title>
      <dc:creator>Anna Karnaukh</dc:creator>
      <pubDate>Tue, 17 Jun 2025 21:22:35 +0000</pubDate>
      <link>https://dev.to/annakarnaukh/ai-powered-end-to-end-testing-part-1-37ed</link>
      <guid>https://dev.to/annakarnaukh/ai-powered-end-to-end-testing-part-1-37ed</guid>
      <description>&lt;p&gt;We set out to build a system that could generate and run end-to-end tests automatically with predictable results and minimal input from the user. No recorded flows, no screenshots, no manual setup, just a URL. The system scans the page, builds a structured view using a serialized DOM, and generates test cases based on what it finds.&lt;br&gt;
In this article, we’re openly sharing the engineering side of that journey - what we tried, what didn’t work, and how the system evolved. We started with reinforcement learning, moved on to transformer-based models, and eventually landed on a mix of self-hosted LLMs, external APIs, and RAG patterns. And yes, we ended up building a multi-agent system - though not because it’s trendy, but because the architecture naturally led us there.&lt;br&gt;
If you're curious about integrating AI into test automation or just curious how this kind of system comes together, read on.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Early Experiments: Reinforcement Learning on Visual Inputs
&lt;/h2&gt;

&lt;p&gt;We initially approached UI testing as a reinforcement learning (RL) problem. The idea was to train an agent to “see” a rendered web page: locate input fields and buttons based on screenshots, and act as a human user would. We fed screenshots into an RL model so it could learn where fields and click targets were. After a few iterations into the RL research, we began building a visual dataset and training a multimodal model that combined raw HTML (as text) with corresponding screenshots. The hope was that the model would use HTML structure and pixel data together to detect interactive elements. In practice, this setup was:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Huge (15 GB+ models). The RL models we experimented with quickly ballooned in size.&lt;/li&gt;
&lt;li&gt;Slow to train/infer. Every RL step required rendering a screenshot, feeding it into a massive vision+text network, and running a full forward pass—way too expensive for anything like CI.&lt;/li&gt;
&lt;li&gt;Unstable. We never got to a “production-ready” state. After a couple of weeks of labeling a few hundred sites for object detection (e.g., headers, footers, forms), we hit dead ends when generalizing to arbitrary pages.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That effort taught us two things: purely vision-based RL for E2E testing was a dead end in our context, and we needed lighter, text-centric approaches to move forward.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Pivot to Text-First Models &amp;amp; NLP Tasks
&lt;/h2&gt;

&lt;p&gt;After experimenting with RL-based visual parsing, we shifted to an NLP-based approach, starting with a few simple language tasks:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Language Detection.&lt;/strong&gt; We needed to know what language a page’s text was in. We pulled in a few proven, lightweight language detection models - self-hosted, &amp;lt; 100 MB each, MIT/Apache-licensed - since we try to keep things cost-effective and avoid paying for third-party services when possible.&lt;br&gt;
&lt;strong&gt;2. Page Summarization.&lt;/strong&gt; We wanted a quick way to classify a page’s purpose. For example, if we click “Contact Us” and the model summarizes “This is the Contact Us page,” we know the navigation worked. If it summarizes “Login,” that signals a bug.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpgjv3zv75xqmfxp6ux1g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpgjv3zv75xqmfxp6ux1g.png" alt="Image description" width="800" height="462"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Example of page summarization&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;At first, we tried specialized summarization models available online. Each of them weighed in at around 1.5 GB - too large, outdated, and underwhelming in quality. Then we discovered that the smallest LLMs (1–2 GB) often outperformed older summarizers.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Self-Hosted LLM Era: Finding the Right Model
&lt;/h2&gt;

&lt;p&gt;We hunted on Hugging Face for small, open-source transformer models with MIT or Apache licenses to enable cost-effective summarization, navigation testing, and eventually reuse the same model for test case generation. Here’s what we tried:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Microsoft PHI (1.5 GB).&lt;/strong&gt; The model didn’t meet our quality bar.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;QWEN Series (Alibaba).&lt;/strong&gt; They offer a tiered lineup: 0.5 B, 1.5 B, 3 B, 7 B parameters. The 1.5 B “QWEN-tiny” was our first real winner - weights could fit on a single 16 GB GPU, inference latency was reasonable, and summarization quality was solid.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;How we used it:&lt;/strong&gt; We ran the QWEN-1.5B model in Docker and used our DOM serializer for E2E flows. When navigating (e.g., “Click Contact Us”), the serializer generated a clean, structured JSON of the page. For summarization, we sent only the visible text—not the full JSON - to the LLM. If the summary matched our expectations, the step passed; if not, we flagged a bug.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why JSON over raw HTML?&lt;/strong&gt; Raw HTML includes a lot of unnecessary data like styles and classes, which wastes tokens. Our serializer kept only the useful parts—tags, types, key attributes - and reduced token usage by over 50%. We later added a nested structure so the model could better understand the page layout.&lt;/p&gt;
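&lt;p&gt;A rough, self-contained illustration of the savings (the token count here is a crude proxy, not a real tokenizer, and the example markup is invented):&lt;/p&gt;

```python
import json
import re

raw_html = (
    '<button class="btn btn-lg rounded shadow-sm text-white bg-gradient" '
    'style="margin:4px;padding:10px 24px;font-weight:600" id="submit" '
    'type="submit">Sign in</button>'
)

# Keep only what the model needs to act on: tag, id, type, visible text.
serialized = {"tag": "button", "id": "submit", "type": "submit", "text": "Sign in"}
compact = json.dumps(serialized)

def tokens(s: str) -> int:
    """Very rough token proxy: word and punctuation chunks."""
    return len(re.findall(r"\w+|[^\w\s]", s))

print(tokens(raw_html), tokens(compact))
```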

&lt;h2&gt;
  
  
  4. From Summaries to Element-Finding &amp;amp; Test Case Generation and Validation
&lt;/h2&gt;

&lt;p&gt;Once summarization was reliable, we shifted to the next core tasks:&lt;br&gt;
&lt;strong&gt;1. Test Case Suggestions.&lt;/strong&gt; The system generates test cases using only a URL as input  - no need for users to record and play back actions, provide documentation, or upload baseline screenshots. It has to understand the page, identify fields and forms, figure out how to interact with them, and generate meaningful test cases with concrete steps from there.&lt;/p&gt;

&lt;p&gt;Imagine it like asking the LLM: “Given these elements, generate basic E2E test cases for login.” It produced 4-5 JSON-formatted test scenarios (valid login, invalid credentials, missing fields, etc.).&lt;br&gt;
It’s important to highlight that element discovery played a crucial role in generating test case suggestions. We prompted the LLM with the JSON DOM and asked it to list the form fields and buttons related to login; it returned a JSON array of elements (e.g., “email,” “password,” “submit button”).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Test Execution.&lt;/strong&gt; The system runs the generated test cases, validates the results against the expected assertions, and provides a clear conclusion to the user.&lt;/p&gt;

&lt;p&gt;For these tasks, we needed a reliable interaction between the back-end and the LLM - a proper validation loop. It’s very much a “ping-pong” process:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Backend (Node.js + Playwright) sends the DOM to LLM.&lt;/li&gt;
&lt;li&gt;LLM returns test cases → Playwright executes them → grabs new DOM.&lt;/li&gt;
&lt;li&gt;LLM compares old vs. new DOM summary → returns pass/fail.&lt;/li&gt;
&lt;/ul&gt;
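&lt;p&gt;Here is the loop sketched with stubs standing in for the LLM and the Playwright runner (the stub classes and their return values are assumptions for illustration):&lt;/p&gt;

```python
def ping_pong(llm, browser, page_dom: str) -> list[dict]:
    """One generation/validation round trip. `llm` and `browser` are stubs
    standing in for the model call and the Playwright runner."""
    results = []
    for case in llm.generate_cases(page_dom):
        new_dom = browser.execute(case)                   # run steps, grab new DOM
        verdict = llm.validate(page_dom, new_dom, case)   # compare old vs. new
        results.append({"case": case["name"], "passed": verdict})
    return results

class StubLLM:
    def generate_cases(self, dom):
        return [{"name": "valid login"}, {"name": "missing password"}]

    def validate(self, old, new, case):
        # Trivial stand-in verdict: did the page change at all?
        return old != new

class StubBrowser:
    def execute(self, case):
        return "<dashboard>" if case["name"] == "valid login" else "<login>"

print(ping_pong(StubLLM(), StubBrowser(), "<login>"))
```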

&lt;p&gt;A simplified version of the process is shown in the diagram below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmmbjklpituhk3j7k8b98.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmmbjklpituhk3j7k8b98.png" alt="Image description" width="728" height="1059"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We set temperature=0 to force determinism, so the LLM produced the same output for identical inputs.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Technical Hurdles &amp;amp; Micro-Optimizations
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Context Window Overflows:&lt;/strong&gt; Pages with many elements and complex layouts still overflowed the LLM’s token limit. On top of that, every front-end developer codes in their own way - there is no strict typing discipline like back-end developers have. And that’s before getting to complex pages with tables, charts, pagination, or filters: sometimes we hit the token limit even with simple login pages.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;DOM Splitting:&lt;/strong&gt; We split the serialized JSON into blocks (header, content, footer), summarized each, and merged the results—partial win. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ID Compression:&lt;/strong&gt; Replaced long UUIDs with small integers before sending to the LLM, then remapped after inference. Saved ~30% of tokens when element counts were high.&lt;/li&gt;
&lt;/ul&gt;
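&lt;p&gt;The ID compression trick is simple enough to show in full - a sketch of the pattern, not our exact implementation:&lt;/p&gt;

```python
import uuid

def compress_ids(elements: list[dict]):
    """Swap long UUIDs for small ints before prompting; keep a map to restore."""
    mapping = {}
    compact = []
    for el in elements:
        short = len(mapping) + 1
        mapping[short] = el["id"]
        compact.append({**el, "id": short})
    return compact, mapping

def restore_id(short: int, mapping: dict) -> str:
    """Map the LLM's short id back to the real element id after inference."""
    return mapping[short]

elements = [
    {"id": str(uuid.uuid4()), "tag": "input"},
    {"id": str(uuid.uuid4()), "tag": "button"},
]
compact, mapping = compress_ids(elements)
print(compact)
```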

&lt;p&gt;&lt;strong&gt;Memory &amp;amp; Latency:&lt;/strong&gt; To keep the platform cost-effective, we opted to use a self-hosted model running in our internal cluster, without relying on external APIs.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;QWEN-1.5 B (~3 GB VRAM):&lt;/strong&gt; quality was limited. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;QWEN-3 B (8 GB):&lt;/strong&gt; improved output, but came with a “Research” license, which we couldn’t use.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;QWEN-7 B (14 GB):&lt;/strong&gt; required quantization to work with the vLLM engine to fit our memory space.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To improve speed and reduce memory usage, we switched from Hugging Face’s Python runner to vLLM, which supports attention caching. Even with vLLM, the full 7B model occasionally ran out of memory, so in the end, we ran a quantized version of QWEN-7B on vLLM. This reduced VRAM usage, improved inference speed by about 30%, and delivered acceptable accuracy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent Architecture (Ping-Pong):&lt;/strong&gt; The challenge was stability: a single user could have dozens of projects, each with hundreds of pages. The system had to handle that volume while keeping the “ping-pong” flow intact. As described earlier, the LLM received a serialized DOM, identified elements, and generated actions. The back end executed them using Playwright, captured the updated page, and returned it for validation. But REST calls became unstable under load, and vLLM didn’t work well with uWSGI, so we switched to a Redis-based queue for more reliable communication. Even with these tweaks, latency remained as high as 50 s per test generation or validation - too slow for large suites. Multi-user CI would require multiple GPU instances, which wasn’t cost-effective.&lt;/p&gt;
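&lt;p&gt;The queue pattern itself is straightforward; here it is sketched with Python’s in-process &lt;code&gt;queue.Queue&lt;/code&gt; as a stand-in for Redis (the real workers pop jobs from a Redis list instead):&lt;/p&gt;

```python
import queue
import threading

# In-memory stand-in for the Redis queue that replaced unstable REST calls
# between the backend and the LLM workers (pattern only, not the real stack).
jobs: queue.Queue = queue.Queue()
results: queue.Queue = queue.Queue()

def llm_worker():
    """Drains jobs one at a time, like a worker popping from Redis."""
    while True:
        job = jobs.get()
        if job is None:  # sentinel: shut down
            break
        # A real worker would prompt the model here; we return a canned verdict.
        results.put({"job_id": job["id"], "verdict": "pass"})
        jobs.task_done()

worker = threading.Thread(target=llm_worker)
worker.start()
for i in range(3):
    jobs.put({"id": i, "dom": "<login>"})
jobs.put(None)
worker.join()
```

&lt;p&gt;Because jobs queue up instead of hammering an HTTP endpoint, a burst of requests degrades into longer wait times rather than dropped calls.&lt;/p&gt;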

&lt;p&gt;So the next step was to:&lt;br&gt;
 – optimize the input sent to the LLM,&lt;br&gt;
 – review the entire flow,&lt;br&gt;
 – and consider switching to external APIs like OpenAI or Gemini.&lt;/p&gt;

&lt;p&gt;That’s exactly what we’ll cover in the next article - stay tuned!&lt;br&gt;
Give it a spin: paste any URL into &lt;a href="https://www.treegress.com/" rel="noopener noreferrer"&gt;Treegress&lt;/a&gt; and see how it works in action.  &lt;/p&gt;

</description>
      <category>ai</category>
      <category>softwaredevelopment</category>
      <category>testing</category>
      <category>automation</category>
    </item>
    <item>
      <title>How We Built UI Bug Detection from Scratch: What Worked and What Didn't</title>
      <dc:creator>Anna Karnaukh</dc:creator>
      <pubDate>Thu, 22 May 2025 14:13:29 +0000</pubDate>
      <link>https://dev.to/annakarnaukh/how-we-built-ui-bug-detection-from-scratch-what-worked-and-what-didnt-ll6</link>
      <guid>https://dev.to/annakarnaukh/how-we-built-ui-bug-detection-from-scratch-what-worked-and-what-didnt-ll6</guid>
      <description>&lt;p&gt;When we first started planning our own test automation product, our core goal was full end-to-end testing — a system that could test any website automatically, with minimal manual setup. Ideally, it would be as simple as providing a link and letting the system handle the rest. To move faster, we decided to start with what looked like low-hanging fruit: UI bug detection.&lt;/p&gt;

&lt;p&gt;It sounded simple enough. But once we got into it, we realized just how tricky it was. We explored multiple approaches, ran into licensing and model limitations, spent weeks generating datasets, and rebuilt parts of the system more than once.&lt;/p&gt;

&lt;p&gt;This is a step-by-step look behind the scenes at how we designed and developed our UI bug detection system — and what we learned along the way.&lt;/p&gt;

&lt;h2&gt;
  
  
  Challenges on Our Path to UI Bug Detection
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Designing the System Architecture
&lt;/h3&gt;

&lt;p&gt;From the beginning, we aimed to keep the architecture lightweight - a modular system made of small, simple functional pieces. Cloud APIs showed potential, but they were too expensive for production use. So the next logical step was to train an object detection model on our custom dataset to reduce costs while keeping performance.&lt;/p&gt;

&lt;h3&gt;
  
  
  Licensing: The Roadblock We Didn’t Expect
&lt;/h3&gt;

&lt;p&gt;We chose YOLO as our base model for detecting visual bugs. It was fast, well-documented, and great for object detection tasks — exactly what we needed.&lt;/p&gt;

&lt;p&gt;At first, we focused on YOLO NAS, since it was one of the few variants with a business-friendly license (Apache 2.0). Everything looked good, so we integrated it into our pipeline and started working with the provided pre-trained model.&lt;/p&gt;

&lt;p&gt;Later, when we took a closer look at the full licensing terms, things got tricky.&lt;/p&gt;

&lt;p&gt;While the core YOLO NAS framework had a permissive license, the pre-trained model weights were licensed differently. According to the terms, using them in a product required us to open-source our own code — something we clearly couldn’t do.&lt;/p&gt;

&lt;p&gt;This wasn’t obvious at the start, and it wasn’t mentioned front and center. But once we read the fine print in the documentation, the problem became clear: we couldn’t legally use those pretrained models in a commercial product.&lt;/p&gt;

&lt;p&gt;So we had to change direction. We retrained the model from scratch using only our own data and infrastructure, with no third-party weights involved. We expected that switching from the pretrained model to one trained entirely by us would degrade output quality, but the results turned out slightly better.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Takeaway:&lt;/strong&gt; When working with any open-source model, check every layer — not just the framework, but the weights, datasets, and any dependencies. Licensing issues can sneak in where you least expect them.&lt;/p&gt;

&lt;h3&gt;
  
  
  Building the Dataset from Scratch
&lt;/h3&gt;

&lt;p&gt;Once we decided to train the model on our own dataset, we faced the hardest part: creating a dataset from scratch. No ready-to-use data existed for the types of UI bugs we planned to detect.&lt;/p&gt;

&lt;p&gt;At first, we built our own crawler. It could automatically browse websites, inject scripts, modify elements, and generate labeled screenshots. We even made sure it could pick up where it left off if something crashed, which happened a lot. Still, the entire process was slow and fragile.&lt;/p&gt;

&lt;p&gt;We thought it would be simple: just break some styles, take screenshots, and start training. But of course, it turned out to be much more complex.&lt;/p&gt;

&lt;p&gt;We wrote scripts that used JavaScript and Selenium to manipulate live websites. We disabled images, shifted elements so they overlapped, and tweaked layouts in weird ways. After that, we captured screenshots and recorded the exact coordinates of each visual change. That gave us the raw materials for training, but it was painfully slow. There were also many errors: we tried to modify pages randomly with JS, but each site's markup mutated differently.&lt;/p&gt;
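&lt;p&gt;The mutation-and-label step above can be sketched in a few lines of Python. This is a hypothetical illustration, not our production code: &lt;code&gt;build_shift_script&lt;/code&gt;, the selector, and the label format are invented for the example, and the generated snippet would be passed to something like Selenium's &lt;code&gt;driver.execute_script&lt;/code&gt;.&lt;/p&gt;

```python
def build_shift_script(selector, dx, dy):
    """Build a JS snippet that offsets one element and reports its new rect."""
    return (
        "const el = document.querySelector('" + selector + "');"
        "el.style.position = 'relative';"
        "el.style.left = '" + str(dx) + "px';"
        "el.style.top = '" + str(dy) + "px';"
        "const r = el.getBoundingClientRect();"
        "return [r.x, r.y, r.width, r.height];"
    )

def to_label(rect, bug_type="overlapping_content"):
    """Turn the reported rect into a training label with absolute corners."""
    x, y, w, h = rect
    return {"type": bug_type, "bbox": [round(x), round(y), round(x + w), round(y + h)]}
```

&lt;p&gt;With a real driver this would look like &lt;code&gt;to_label(driver.execute_script(build_shift_script("#nav", 40, -10)))&lt;/code&gt;, yielding one screenshot-plus-bounding-box sample per mutation.&lt;/p&gt;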

&lt;p&gt;Each sample took about three seconds to generate. And we needed thousands, then tens of thousands. The more we tried to scale, the more we realized this was going to eat up time, memory, and patience.&lt;/p&gt;

&lt;h3&gt;
  
  
  Synthetic Data Generation
&lt;/h3&gt;

&lt;p&gt;Eventually, we hit a wall. Scraping real websites and breaking them on the fly was too slow and inconsistent, and with real screenshots we had to filter out unusable samples manually. That’s when we added synthetic data.&lt;/p&gt;

&lt;p&gt;In addition to using existing websites, we began creating simple UI layouts on canvas from scratch. We manually placed overlapping texts, “broke” images by overlaying error graphics, and created fake popups with randomized elements. We started simulating UI bugs ourselves, in a fully controlled environment.&lt;/p&gt;
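&lt;p&gt;A minimal, runnable sketch of the overlapping-elements case: place one box, then a second one shifted by less than the box size so an intersection is guaranteed. All names and dimensions here are illustrative; the actual drawing step (rendering the text or images onto the canvas) is omitted.&lt;/p&gt;

```python
import random

def overlap_area(a, b):
    """Intersection area of two (x1, y1, x2, y2) boxes; 0 if disjoint."""
    w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    return w * h

def make_overlap_sample(rng=None):
    """Place a 200x80 base box, then a copy shifted by under one box size."""
    rng = rng or random.Random(0)
    x, y = rng.randint(0, 500), rng.randint(0, 400)
    dx, dy = rng.randint(20, 150), rng.randint(10, 60)  # smaller than 200x80
    base = (x, y, x + 200, y + 80)
    moved = (x + dx, y + dy, x + dx + 200, y + dy + 80)
    return {"bug": "overlapping_content", "boxes": [base, moved]}
```

&lt;p&gt;Because the shift is bounded by the box size, every generated sample is a true positive by construction, which is exactly what made the synthetic path cheaper than filtering real screenshots.&lt;/p&gt;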

&lt;p&gt;With synthetic data, we didn’t have to worry about waiting for page loads, dealing with broken links, or handling unpredictable website structures. We could generate examples quickly, with the exact bugs we wanted, and in the right format for training.&lt;/p&gt;

&lt;p&gt;It wasn’t just faster — it was also cleaner. The model got better training inputs, and we spent way less time cleaning up bad screenshots or fixing crawler bugs.&lt;br&gt;
The issue was that synthetic data covered only a limited range of UI distortions, about 60% of what we wanted. The dataset elements were also too similar to one another; we needed more variety.&lt;/p&gt;

&lt;p&gt;From then on, we used a mixed approach: part real websites, part modified sites, and part synthetic data. And that’s when we finally started making real progress.&lt;/p&gt;

&lt;h2&gt;
  
  
  Making the Model Smarter About UI Bugs
&lt;/h2&gt;

&lt;p&gt;As the project developed, our definition of a "UI bug" naturally expanded. We started with the most visible problems — unreadable text, overlapping elements, and broken layouts. But soon we realized that there were many other subtle yet impactful issues worth catching.&lt;/p&gt;

&lt;p&gt;Things like inconsistent letter spacing, unnecessary scrollbars caused by layout shifts, or mismatched font sizes across components began to surface as meaningful categories. Popups — such as cookie banners and modal dialogs — also became part of our scope, since they often interfere with user interaction.&lt;br&gt;
To detect these, we generated custom synthetic data. We built simplified UI layouts, layered elements in different ways, and added visual details like shadows to mimic real-world styles. This gave the model a wide variety of examples to learn from.&lt;br&gt;
We also recognized that not all bugs need bounding boxes. Some problems, like missing content or font inconsistencies, affect the entire page rather than a specific area. These worked better as image classification tasks, assigning a single label to the whole screenshot.&lt;/p&gt;

&lt;p&gt;In the end, we built a two-track system:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Object detection&lt;/strong&gt; for localized, visual issues like overlapping elements or broken images;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Page-level classification&lt;/strong&gt; for broader layout or content problems.&lt;/li&gt;
&lt;/ul&gt;
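&lt;p&gt;Conceptually, the two-track routing is just a lookup from bug type to model track. The sketch below is illustrative, using snake_case names for a subset of our bug categories:&lt;/p&gt;

```python
# Illustrative subsets of the bug categories; not the full production taxonomy.
DETECTION_BUGS = {"broken_image", "overlapping_content", "broken_layout"}
CLASSIFICATION_BUGS = {"missing_content", "inconsistent_font_size", "unnecessary_scroll"}

def route(bug_type):
    """Pick the model track that handles a given bug category."""
    if bug_type in DETECTION_BUGS:
        return "object_detection"     # localized issue: needs a bounding box
    if bug_type in CLASSIFICATION_BUGS:
        return "page_classification"  # page-wide issue: one label per screenshot
    return "unknown"
```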

&lt;p&gt;This combined approach gave us more flexibility and accuracy. It allowed us to match the right detection method to each bug type, which turned out to be crucial for building something reliable. So, the final list of UI bugs looks as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Broken image&lt;/li&gt;
&lt;li&gt;Missing content&lt;/li&gt;
&lt;li&gt;Unnecessary scroll&lt;/li&gt;
&lt;li&gt;Letter spacing issue&lt;/li&gt;
&lt;li&gt;Inconsistent font size&lt;/li&gt;
&lt;li&gt;Outdated style&lt;/li&gt;
&lt;li&gt;Inconsistent color scheme&lt;/li&gt;
&lt;li&gt;Empty layout&lt;/li&gt;
&lt;li&gt;Broken layout&lt;/li&gt;
&lt;li&gt;Overlapping content&lt;/li&gt;
&lt;li&gt;Unnecessary horizontal scroll&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Lessons Learned
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Data is the real challenge&lt;/strong&gt; — Model training is easy compared to generating a diverse, high-quality dataset. Most of our time went into building and refining the data pipeline.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Licensing matters more than you think&lt;/strong&gt; — Even with open-source tools, you can run into restrictions. Always check licenses for models, weights, and datasets before integrating them.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Synthetic + real = best results&lt;/strong&gt; — A mix of real websites, synthetic layouts, and manual edge cases gave us the most reliable coverage and flexibility.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;UI bug detection isn’t just one feature&lt;/strong&gt; — it’s a system. Without the right data, even the best model won’t help.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Curious how this all turned out?&lt;/strong&gt;&lt;br&gt;
We’re turning these ideas into a real tool at &lt;a href="https://www.treegress.com/?utm_source=dev.to&amp;amp;utm_medium=article&amp;amp;utm_campaign=lead%20generation&amp;amp;utm_content=UI%20Bug%20Detection"&gt;Treegress&lt;/a&gt; — a no-code platform for end-to-end testing with built-in visual bug detection.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>ai</category>
      <category>testing</category>
      <category>ui</category>
    </item>
  </channel>
</rss>
