LLMs don’t fail because they’re dumb. They fail because they guess.
When you ask a model to “analyze a form” or “describe what the user sees,” it doesn’t really see anything. It works from whatever you feed it — code, markup, or a vague screenshot.
That’s like asking someone to fix a car by describing the engine over the phone.
Most of today’s AI “agents” try to understand user interfaces through fragments: HTML snippets, component trees, accessibility tags. But what they miss is the state — the real, runtime version of the page: what’s visible, disabled, active, hovered, collapsed, or hidden behind a conditional render.
The problem: context ≠ state
LLMs can handle structured text, but they can’t interpret UI behavior unless you tell them what actually happened.
Let’s say you have this login form:
<input type="text" name="username" />
<input type="password" name="password" disabled />
<button>Login</button>
From the static DOM, the model knows there’s a disabled password input — but it doesn’t know why. Maybe it’s waiting for a username, maybe it’s a security timer.
The only way to know is to look at the runtime DOM — after JavaScript runs, after frameworks render, after CSS hides things. That’s the true UI state.
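You can see that gap directly by querying the live DOM after the scripts have run. A minimal browser-side sketch; the `isVisible` helper and the selectors are illustrative, matching the sample form above:

```typescript
// Read the *runtime* state of the login form (e.g. from the DevTools console
// or an injected content script). Selectors match the sample markup above.
const isVisible = (el: HTMLElement | null): boolean =>
  el !== null && el.offsetParent !== null; // display:none elements have no offsetParent

const username = document.querySelector<HTMLInputElement>('input[name="username"]');
const password = document.querySelector<HTMLInputElement>('input[name="password"]');
const loginButton = document.querySelector<HTMLButtonElement>('button');

const runtimeState = {
  username: { visible: isVisible(username), value: username?.value ?? "" },
  password: { visible: isVisible(password), disabled: password?.disabled ?? true },
  button: {
    label: loginButton?.textContent?.trim() ?? "",
    enabled: loginButton !== null && !loginButton.disabled,
  },
};

console.log(JSON.stringify(runtimeState, null, 2));
```

The same markup can produce very different output here depending on what the user has already done — that difference is the state the static source never shows.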
Why “DOM → JSON” is the missing link
Instead of screenshots or raw HTML, a JSON snapshot of the current DOM gives the LLM real structure and semantics — the kind it can reason over.
Example:
{
  "form": {
    "username": { "visible": true, "value": "alex" },
    "password": { "visible": true, "disabled": false },
    "button": { "label": "Login", "enabled": true }
  },
  "context": {
    "url": "/auth/login",
    "timestamp": 1730185200
  }
}
This is not “data extraction.”
This is state capture — a contextual DOM snapshot that the LLM can process directly.
Once you give it this structure, the model no longer guesses. It reasons. It can generate tests, verify flows, and even write meaningful feedback like:
“Password field unlocked after username entered — validation succeeded.”
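Getting the snapshot in front of the model is mostly prompt assembly. A minimal sketch; `askLLM` and `reviewUiState` are hypothetical names standing in for whatever LLM client you actually use:

```typescript
// Hypothetical helper: wraps whatever chat-completion endpoint you call.
declare function askLLM(prompt: string): Promise<string>;

async function reviewUiState(snapshot: unknown): Promise<string> {
  const prompt = [
    "You are verifying the runtime state of a web UI.",
    "Here is a JSON snapshot of the current DOM state:",
    JSON.stringify(snapshot, null, 2),
    "Describe what the user can do next and flag anything that looks broken.",
  ].join("\n\n");

  return askLLM(prompt);
}
```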
What “UI state for AI agents” really means
When agents manipulate interfaces, they don’t need full vision.
They need contextual grounding — to know what’s on screen and how it changed.
UI-state JSON acts as a context provider:
- for testing – verify what changed after a user action (see the diff sketch after this list);
- for assistive AI – explain screen layouts for accessibility;
- for automation – generate flows dynamically based on what’s rendered;
- for design QA – ensure the UI matches the expected visual logic.
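For the testing case, most of the value comes from diffing snapshots taken before and after an action. A rough sketch, assuming the nested JSON shape shown earlier:

```typescript
type Snapshot = Record<string, unknown>;

// Flatten a nested snapshot into "form.password.disabled"-style leaf paths.
function flatten(obj: Snapshot, prefix = ""): Record<string, unknown> {
  const out: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(obj)) {
    const path = prefix ? `${prefix}.${key}` : key;
    if (value !== null && typeof value === "object") {
      Object.assign(out, flatten(value as Snapshot, path));
    } else {
      out[path] = value;
    }
  }
  return out;
}

// List every leaf that changed between two snapshots.
function diffSnapshots(before: Snapshot, after: Snapshot): string[] {
  const a = flatten(before);
  const b = flatten(after);
  const keys = new Set([...Object.keys(a), ...Object.keys(b)]);
  return [...keys]
    .filter((key) => JSON.stringify(a[key]) !== JSON.stringify(b[key]))
    .map((key) => `${key}: ${JSON.stringify(a[key])} -> ${JSON.stringify(b[key])}`);
}

// Example: after typing a username, the password field should unlock:
// diffSnapshots(before, after) -> ['form.password.disabled: true -> false']
```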
Capturing the real state — how to do it
Capturing the DOM at runtime isn’t trivial:
- Frameworks (React, Vue, Angular) mutate nodes constantly.
- Attributes like `hidden`, `aria-*`, and `data-*` carry logic that plain parsers miss.
- CSS can hide or alter visibility without changing the markup.
That’s why we built tools that hook into the browser itself — not the build pipeline — to capture the live DOM. The output is structured JSON that any LLM can read.
You don’t need a massive pipeline.
You just need a clean snapshot: one click → one JSON file with everything the model should “see.”
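For intuition only, here’s a stripped-down version of that idea: walk the live DOM, keep the state-bearing attributes, and record what CSS actually shows. A real capture tool handles much more (shadow DOM, iframes, framework internals); this sketch just illustrates the shape of the output:

```typescript
interface NodeState {
  tag: string;
  visible: boolean;
  disabled?: boolean;
  value?: string;
  text?: string;
  attributes: Record<string, string>;
  children: NodeState[];
}

// Serialize the runtime DOM into plain JSON that an LLM can read.
function captureState(el: Element): NodeState {
  const style = window.getComputedStyle(el);
  const visible = style.display !== "none" && style.visibility !== "hidden";

  // Keep only the attributes that usually carry UI logic.
  const attributes: Record<string, string> = {};
  for (const attr of Array.from(el.attributes)) {
    if (/^(aria-|data-|hidden|disabled|role|name|type)/.test(attr.name)) {
      attributes[attr.name] = attr.value;
    }
  }

  return {
    tag: el.tagName.toLowerCase(),
    visible,
    disabled:
      el instanceof HTMLInputElement || el instanceof HTMLButtonElement
        ? el.disabled
        : undefined,
    value: el instanceof HTMLInputElement ? el.value : undefined,
    text: el.children.length === 0 ? el.textContent?.trim() || undefined : undefined,
    attributes,
    children: Array.from(el.children).map(captureState),
  };
}

// One click -> one JSON file:
// const snapshot = JSON.stringify(captureState(document.body), null, 2);
```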
When JSON beats screenshots
| Approach | Pros | Cons | 
|---|---|---|
| Screenshot | Visual accuracy | No structure, model must “guess” | 
| HTML source | Lightweight | Missing runtime state | 
| DOM → JSON | Structured, contextual | Needs browser access | 
JSON is the middle ground — readable by both humans and machines.
You can diff it, validate it, and embed it directly in an LLM prompt.
Why this matters now
As LLMs become agents that interact with real products, context becomes currency.
Every prompt that fails due to “missing context” wastes tokens, time, and trust.
Giving models structured context — not guesses — is the next layer after prompt engineering.
DOM → JSON snapshots are just one implementation, but the principle applies broadly:
Don’t describe reality. Serialize it.
Element to LLM. Try it now.
TL;DR
- LLMs don’t need screenshots. They need state.
- Runtime DOM → JSON turns a web page into structured, readable context.
- This enables testing, automation, and reasoning with real UI data.
- Tools like Element to LLM already make this one-click.