Modern LLMs offer powerful capabilities for test automation. But there’s a catch: they can only reason as well as the data you give them. And when it comes to testing the web, messy or noisy data leads directly to hallucinations, broken flows, and unusable test cases.
As we described in the first and second articles about AI-powered end-to-end testing, we are working on an automated test case generation system that software engineers can rely on. This isn't record-and-playback, but full-fledged generation of meaningful test cases.
In the picture below, you can see a simplified scheme of how end-to-end testing works in Treegress. In this article, we describe the first block outlined in the scheme, DOM serialization: the foundation that lets us say our system is reliable and produces predictable, stable results. This article explains why we chose the DOM serialization approach, what it took to get it right, and how it compares to alternatives like screenshots and visual models.
The Business Value: Accuracy, Predictability, and Cost Reduction
AI testing is only as good as the data that powers it. Visual-based systems that send screenshots or videos to multimodal models like GPT-4o quickly run into three serious business problems:
1. It's expensive: every interaction costs thousands of tokens.
2. It's unreliable: visuals miss semantic context and lead to errors in test generation.
3. It's uncontrollable: raw screenshots don't let us guide the model. The LLM relies on vague visual input and general pretraining, with no structure or context. Studies show that well-structured prompts drastically reduce hallucinations and improve reliability, something raw images can't offer.
In other words: the less structured and domain-aware your prompt is, the worse your outcome. Screenshots are the least structured prompt possible.
A smarter alternative is DOM serialization. By feeding the model only the meaningful elements of the interface (a sketch of such a snapshot follows this list), we:
- Filter noise through a serialization pipeline that embeds two decades of empirical knowledge about real-world frontend development
- Reduce costs by filtering irrelevant UI components
- Dramatically improve test accuracy
- Eliminate LLM hallucinations
- Build stable, repeatable test flows
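To make this concrete, here is a minimal sketch of what a serialized snapshot might look like. The schema, field names, and values below are illustrative assumptions, not our exact internal format:

```ts
// Illustrative shape of one serialized interactive element.
// All field names here are hypothetical, not the real Treegress schema.
interface SerializedElement {
  id: string;       // stable handle the LLM can reference in a test step
  role: string;     // "button", "input", "datepicker", ...
  label: string;    // visible text or accessible name
  selector: string; // locator used when the generated test actually runs
  visible: boolean; // survived the visibility / z-index filtering
  group?: string;   // logical cluster, e.g. the form or dropdown it belongs to
}

// A login page collapses from thousands of raw DOM nodes
// to a handful of meaningful records:
const snapshot: SerializedElement[] = [
  { id: "e1", role: "input",  label: "Email",    selector: "#email",              visible: true, group: "login-form" },
  { id: "e2", role: "input",  label: "Password", selector: "#password",           visible: true, group: "login-form" },
  { id: "e3", role: "button", label: "Sign in",  selector: "button[type=submit]", visible: true, group: "login-form" },
];
```

An LLM reasoning over a compact list like this has far less room to invent elements than one squinting at a screenshot.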
Here is how the DOM serialization and visual model approaches compare:
In short: cleaner input, smarter output, better ROI.
The Journey: From Simple Snapshots to Precision DOM Maps
Our first version of a DOM serializer was direct and naive, or, frankly, pretty dumb. It treated every visible element as potentially interactive, and we quickly ran into a noise problem: only ~3% of its output was actually useful. The other 97% was noise.
We tried other tools like Browser Use, which performed much better (roughly half of its output was usable) but still fell short. Complex CRMs, layered modals, and custom dropdowns broke every generic solution we tested.
So we built our own.
We applied our 20+ years of frontend experience and created a new kind of DOM serializer. One that doesn't just parse HTML, but deeply understands what’s actionable on the screen.
Our Approach: Building a DOM That Makes Sense
We don’t just serialize the DOM. We reconstruct a representation of the UI that includes:
- Interactive elements, detected by 18 empirical rules we have collected over 20 years of front-end development (a simplified sketch follows this list),
- Framework-specific analysis of 20+ JS frameworks (React, Angular, Vue, Hotwire, etc.),
- Semantic and technical specifics, such as label + input pairs, data-attributes, class naming, semantic markup, accessibility trees, and shadow DOM,
- CSS and z-index layers, to determine what's visible and actionable,
- DOM mutation tracking, to account for JavaScript hydration and dynamic components.
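We won't enumerate all 18 rules here, but the following simplified sketch (using standard browser DOM APIs) shows the flavor of the heuristics involved:

```ts
// Simplified sketch of a few interactivity heuristics.
// The real serializer applies 18 such rules plus framework-specific checks.
function looksInteractive(el: Element): boolean {
  // Rule: natively interactive tags are always candidates.
  if (["BUTTON", "A", "INPUT", "SELECT", "TEXTAREA"].includes(el.tagName)) return true;

  // Rule: explicit ARIA roles signal interactivity even on plain divs.
  const role = el.getAttribute("role");
  if (role && ["button", "link", "checkbox", "menuitem", "tab"].includes(role)) return true;

  // Rule: a pointer cursor usually means "clickable", even on custom components.
  if (getComputedStyle(el).cursor === "pointer") return true;

  // Rule: inline handlers or tabindex hint at hand-rolled widgets.
  if (el.hasAttribute("onclick") || el.hasAttribute("tabindex")) return true;

  return false;
}
```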
We filter out invisible or irrelevant layers, group related components (like dropdowns or calendars), and assign roles to elements (button, input, datepicker, etc.).
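A simplified sketch of that visibility-and-occlusion check, again with standard browser APIs; the production version additionally walks stacking contexts and handles partial occlusion:

```ts
// Simplified visibility check: is the element rendered, on-screen,
// and not covered by another layer (e.g. a modal backdrop)?
function isActionable(el: Element): boolean {
  const style = getComputedStyle(el);
  if (style.display === "none" || style.visibility === "hidden" || style.opacity === "0") {
    return false;
  }

  const rect = el.getBoundingClientRect();
  if (rect.width === 0 || rect.height === 0) return false;

  // Hit-test the element's center: if another element wins,
  // this one is occluded by a higher z-index layer.
  const cx = rect.left + rect.width / 2;
  const cy = rect.top + rect.height / 2;
  const topmost = document.elementFromPoint(cx, cy);
  return topmost === el || el.contains(topmost);
}
```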
This gives us a high-resolution map of the screen that an LLM can understand. No hallucinations, no fake buttons, no broken flows.
Why Not Just Use Visual Models?
Visual models sound great in theory. They "see" the screen like a human. But in practice, they’re expensive and incomplete.
- High cost: every screenshot costs thousands of tokens.
- Lack of semantic insight: visuals can't distinguish a div from a button, let alone identify custom components.
- Poor generalization: visual models often fail on non-standard layouts or layered modals.
DOM serialization gives us control. We decide what goes into the model. We reduce the noise. And we build a flow that can be reproduced and debugged.
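As a hypothetical illustration of that control, a prompt could be assembled from the filtered snapshot (the parameter type mirrors the illustrative SerializedElement shape from earlier), so the model only ever sees vetted, role-tagged elements:

```ts
// Hypothetical prompt assembly: only filtered, role-tagged elements
// reach the model, so the LLM never "sees" decorative markup.
function buildPrompt(
  url: string,
  elements: { id: string; role: string; label: string; selector: string; visible: boolean }[]
): string {
  const lines = elements
    .filter((e) => e.visible)
    .map((e) => `- [${e.id}] ${e.role} "${e.label}" (${e.selector})`);
  return [
    `Page: ${url}`,
    `Interactive elements:`,
    ...lines,
    `Generate end-to-end test steps referencing elements by their [id].`,
  ].join("\n");
}
```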
The only category of UI we can't support is canvas-based rendering (like Flutter for Web), which lacks a DOM entirely. Even then, we’re planning to train a model for visual fallback. But that comes later.
Final Result: 95% Precision, Scalable AI Testing
Our current serializer identifies ~95% of real interactive elements across a diverse dataset of old, modern, and complex web apps.
That includes:
- modern React/Angular/Vue apps,
- legacy systems,
- CRM platforms with complex flows,
- Shadow DOM-heavy interfaces.
With this foundation, we can:
- Generate reliable test cases using LLMs and a rule-based approach developed by professional QA engineers
- Run tests without LLMs (for predictability and cost)
- Use self-healing to recover from failures (a simplified sketch follows this list)
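To illustrate the self-healing idea, here is a minimal, hypothetical sketch: when a recorded selector stops matching after a markup change, fall back to the element's semantic identity (role plus label) from the DOM map. findByRoleAndLabel is an invented helper, not a real library call:

```ts
// Minimal self-healing sketch: if the recorded selector no longer
// matches, fall back to the element's role + label from the DOM map.
function resolveTarget(
  doc: Document,
  step: { selector: string; role: string; label: string }
): Element | null {
  const direct = doc.querySelector(step.selector);
  if (direct) return direct;

  // Selector broke (markup changed): heal by semantic identity instead.
  return findByRoleAndLabel(doc, step.role, step.label);
}

// Naive matcher for illustration; a real implementation would use the
// accessible name and fuzzy matching rather than exact text equality.
function findByRoleAndLabel(doc: Document, role: string, label: string): Element | null {
  for (const el of Array.from(doc.querySelectorAll(`[role="${role}"], ${role}`))) {
    if ((el.textContent ?? "").trim() === label) return el;
  }
  return null;
}
```

The design choice here is that the serialized DOM map, not the raw selector, is the source of truth, which is what makes recovery possible at all.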
The Future: Visual Models on Our Terms
Once we lock in our primary flow, our next step is training a visual model tailored to the DOM maps we've built. Because our structure includes visual anchors, we can mark up thousands of real sites, create a dataset, and train a lightweight object detector (a sketch of that dataset generation follows the list below).
This model will be:
- ultra-fast,
- cost-efficient,
- and tuned to real-world UI patterns.
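Because the serializer already knows each element's role and bounding box, producing labeled training samples could be as simple as pairing a screenshot with boxes read straight from the DOM map. A hypothetical sketch:

```ts
// Hypothetical training-sample generator: pair a page screenshot with
// bounding boxes taken straight from the DOM map. Because the serializer
// already knows each element's role and position, labeling is free.
interface DetectionSample {
  screenshotPath: string;
  boxes: { role: string; x: number; y: number; width: number; height: number }[];
}

function toDetectionSample(screenshotPath: string, elements: Element[]): DetectionSample {
  return {
    screenshotPath,
    boxes: elements.map((el) => {
      const r = el.getBoundingClientRect();
      return {
        role: el.getAttribute("role") ?? el.tagName.toLowerCase(),
        x: r.left,
        y: r.top,
        width: r.width,
        height: r.height,
      };
    }),
  };
}
```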
That’s how we combine the best of both worlds: structure and vision. But it all starts with one thing done well: clean, smart DOM serialization.
DOM serialization isn’t just cheaper than visual models — it’s better. It gives AI models the right data to make the right decisions, prevents hallucinations, and creates predictable, scalable automation. At Treegress, we’ve invested heavily in getting this right, and we believe it’s the key to making AI-powered testing truly production-ready.