# Runtime Snapshots #7 — Inside SiFR: The Schema That Makes LLMs See Web UIs

Raw HTML is noise. Screenshots burn tokens. Accessibility trees lose visual context.

So we built SiFR — a structured format that gives LLMs usable runtime UI context.

This post explains what's inside.


## What is SiFR?

SiFR stands for Semantic Information for Representation.

(And yes — it's also meant to sound like "see far".)

SiFR is a JSON schema that captures the runtime state of a web page in a way that's:

  • Token-efficient (often 10–50× smaller than raw HTML on complex pages)
  • Semantically structured (models can reason over it without reconstructing the UI from markup)
  • Visually aware (preserves layout relationships without pixels)

It's not a scraper. It's not an accessibility tree.

It's a preprocessing layer that sits between the DOM and your AI — turning "what the browser rendered" into "what the model can reason about".


## Why not just send HTML?

Let's use a real-world example: large e-commerce pages.

Raw HTML commonly contains:

  • deeply nested layout wrappers
  • duplicated markup for responsive layouts
  • client-side frameworks with non-semantic containers
  • hidden / disabled / off-screen elements that still exist in the DOM

So when you send HTML to an LLM, you're asking it to do two jobs:

  1. reconstruct runtime UI state
  2. then solve the task

That's where most failures happen.

Here's what a typical "find the button" path looks like in raw markup:

```
div > div > div > div > div > div > ... > button
```

With SiFR, the same interface becomes "structure first, then the important elements".

For example:

```json
{
  "id": "btn042",
  "text": "Add to Cart",
  "actions": ["clickable"],
  "salience": "high",
  "cluster": "product-actions"
}
```

The LLM sees what it is, how it behaves, and which part of the page it belongs to — without reverse-engineering UI meaning from markup.


## Anatomy of a SiFR Document

Every SiFR snapshot has five sections:

### 1) METADATA

Page-level context: URL, viewport size, capture timestamp, and capture stats.

```json
{
  "url": "https://www.costco.com/...",
  "viewport": { "width": 1920, "height": 1080 },
  "stats": {
    "totalNodes": 2847,
    "salienceCounts": { "high": 12, "med": 89, "low": 2746 }
  }
}
```

This is the "frame" the model needs before it reads anything else.


### 2) NODES

The structural skeleton — hierarchy without heavy details.

Think of it as the page's table of contents: what regions exist, what contains what, and what the high-level UI shape is.
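
The other sections below each come with a concrete example, so here is a hypothetical one for NODES too. The field names are illustrative (the actual schema may differ); the point is a bare parent/child hierarchy with no payload:

```typescript
// Hypothetical NODES section: ids and containment only, with no text,
// styles, or selectors. Field names are illustrative, not the v2 spec.
const nodes = {
  root:   { children: ["header01", "sidebar01", "main01"] },
  main01: { children: ["breadcrumbs01", "grid01", "pager01"] },
  grid01: { children: ["card-product-123" /* ... */] },
};
```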


### 3) SUMMARY

High-level layout blocks. This is where SiFR becomes "structure-first".

```json
{
  "layoutBlocks": [
    { "role": "header", "contains": ["logo", "nav", "search"] },
    { "role": "sidebar", "contains": ["filters", "categories"] },
    { "role": "main", "contains": ["product-grid"] }
  ]
}
```

Before the model sees thousands of elements, it already has the page skeleton:
header at top, sidebar on the side, main content in the center.


### 4) DETAILS

Element-specific data: selectors, text, runtime visibility, interaction state, and relevant computed info.

```json
{
  "btn042": {
    "selector": "button.add-to-cart",
    "text": "Add to Cart",
    "actions": ["clickable"],
    "styles": { "visible": true, "disabled": false }
  }
}
```

This is where "runtime truth" matters: visible vs hidden, enabled vs disabled, actual text content, etc.


### 5) RELATIONS

Spatial relationships between important elements.

Not pixel coordinates — semantic positioning.

```json
{
  "btn042": {
    "inside": "card-product-123",
    "below": "price-display",
    "rightOf": "quantity-selector"
  }
}
```

The model can reason: "the Add to Cart button is inside the product card, below the price" — without seeing a single pixel.
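
Putting the five sections together, the overall snapshot looks roughly like this. This is a TypeScript sketch inferred from the examples above; any field not shown in this post is an assumption, not the v2 spec:

```typescript
// Rough overall shape of a SiFR snapshot, reconstructed from the
// examples in this post. Fields not shown above are assumptions.
interface SiFRSnapshot {
  metadata: {
    url: string;
    viewport: { width: number; height: number };
    stats: {
      totalNodes: number;
      salienceCounts: { high: number; med: number; low: number };
    };
  };
  // Structural skeleton: ids and containment only.
  nodes: Record<string, { children: string[] }>;
  summary: {
    layoutBlocks: { role: string; contains: string[] }[];
  };
  details: Record<
    string,
    {
      selector: string;
      text?: string;
      actions?: string[];
      styles?: { visible: boolean; disabled: boolean };
    }
  >;
  // Semantic positioning, e.g. { "inside": "card-product-123", "below": "price-display" }
  relations: Record<string, Record<string, string>>;
}
```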


## Key Concepts

### Visual Salience

Not all nodes matter equally.

SiFR assigns salience so the model focuses on signal:

  • High: primary actions, main content, user inputs
  • Medium: secondary nav, supporting info
  • Low: wrappers, containers, decorative elements

This is one of the biggest reasons SiFR stays usable on very large pages.
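
To make that concrete, here is one way a consumer could exploit salience: spend a fixed token budget on high-salience entries first. This is a sketch, not the extension's implementation; `estimateTokens` stands in for whatever tokenizer you use.

```typescript
// Sketch: keep the most salient DETAILS entries within a token budget.
// Not the extension's implementation; estimateTokens is a placeholder.
type Salience = "high" | "med" | "low";
const rank: Record<Salience, number> = { high: 0, med: 1, low: 2 };

function trimDetails<T extends { salience: Salience }>(
  details: Record<string, T>,
  budget: number,
  estimateTokens: (entry: T) => number,
): Record<string, T> {
  const kept: Record<string, T> = {};
  let used = 0;
  // Highest salience first; stop once the budget is spent.
  const ids = Object.keys(details).sort(
    (a, b) => rank[details[a].salience] - rank[details[b].salience],
  );
  for (const id of ids) {
    const cost = estimateTokens(details[id]);
    if (used + cost > budget) break;
    kept[id] = details[id];
    used += cost;
  }
  return kept;
}
```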


### Layout Block Summarization

Instead of listing 3000 elements immediately, SiFR begins with a map:

```
PAGE STRUCTURE:
├── Header (logo, nav, search, cart)
├── Sidebar (filters)
└── Main
    ├── Breadcrumbs
    ├── Product Grid (24 items)
    └── Pagination
```

Models don't "scan HTML". They build mental structure.
This gives them the structure up front.


### Adaptive Complexity

A simple blog post doesn't need the same capture density as a complex dashboard.

SiFR adjusts automatically — more detail where it matters, less where it doesn't.

The goal is stable signal-to-noise, not maximal completeness.
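
As a mental model only (these option names are invented for the sketch, not the extension's real settings), you can picture adaptive complexity as a capture profile picked from page density:

```typescript
// Hypothetical capture profiles illustrating adaptive density.
// Option names are invented for this sketch, not a real API.
const profiles = {
  article:   { salienceFloor: "low" as const,  detailBudget: 4_000 },  // keep nearly everything
  dashboard: { salienceFloor: "med" as const,  detailBudget: 24_000 }, // drop low-salience noise
};

// More nodes on the page means stricter filtering with a bigger budget.
function pickProfile(totalNodes: number) {
  return totalNodes > 1_000 ? profiles.dashboard : profiles.article;
}
```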


## Real Numbers

Here are representative examples from our internal benchmarks (token counts vary by capture options and page state):

| Site   | HTML Tokens | SiFR Tokens | Compression |
| ------ | ----------- | ----------- | ----------- |
| Costco | ~1,280,000  | ~24,000     | ~53×        |
| Amazon | ~600,000    | ~50,000     | ~12×        |

On complex pages, SiFR makes LLM workflows practical where raw HTML often doesn't fit in context.


## Try It Yourself

SiFR is implemented in Element to LLM, a free browser extension.

If you want to stress-test the format, try these two pages:

  1. costco.com — a realistic, framework-heavy enterprise UI
  2. arngren.net — extreme visual density and chaotic layout

Capture a snapshot and share:

  • what compression ratio did you get?
  • could your LLM reason about the structure?
  • did you find a site where SiFR struggles?

If it breaks — that's useful data. Seriously.


## What SiFR Enables

With structured runtime UI context, LLMs can:

  • Debug layouts — paste JSON → spot z-index / visibility / layout issues
  • Generate selectors — Playwright/Cypress tests based on real DOM structure (see the sketch after this list)
  • Navigate autonomously — agents that understand "where to click" without screenshots
  • Recreate components — translate UI structure into React/Tailwind scaffolds
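
For instance, here is a minimal sketch of the selector case: driving Playwright from a DETAILS entry. This is one way to consume the format, not part of the extension; the entry shape follows the DETAILS example above.

```typescript
import { Page, expect } from "@playwright/test";

// Shape of a DETAILS entry, following the example earlier in the post.
interface DetailEntry {
  selector: string;
  text: string;
  actions: string[];
  styles: { visible: boolean; disabled: boolean };
}

// Sketch: act on the page from snapshot data instead of raw HTML.
async function clickFromSnapshot(page: Page, entry: DetailEntry) {
  if (!entry.actions.includes("clickable") || entry.styles.disabled) {
    throw new Error(`"${entry.text}" is not actionable in this snapshot`);
  }
  const locator = page.locator(entry.selector);
  await expect(locator).toHaveText(entry.text); // guard against a stale snapshot
  await locator.click();
}
```

An agent or a generated test would call something like `clickFromSnapshot(page, snapshot.details["btn042"])` after capturing the SiFR document.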

## The Standard Question

We're actively developing SiFR as an open specification. Current version: v2.

The schema is strict and versioned, designed for automation pipelines — not just one-off prompt experiments.

If you're building LLM-powered UI tools, I'd love feedback on the format:

  • What feels missing?
  • What feels redundant?
  • What would make this more useful in your workflow?

## Series Index

Previous posts:

  1. Taking a "fine" signup form and making it work
  2. a11y starts with runtime context
  3. QA That Speaks JSON

Found a site that breaks SiFR? Drop it in the comments. That's the fastest way to improve the spec.
