Your AI agent can write code, analyze data, summarize documents, and debate philosophy.
It cannot look at a web page.
Not really. Not the way you do when you open a browser tab and see what's there — the layout, the buttons, the form that's half-loaded, the modal blocking the CTA.
Claude, ChatGPT, Cursor, Gemini — they're powerful. And in the browser, they're blind.
Three ways we've tried to give AI sight. All broken.
Screenshots.
The most common workaround. Take a screenshot, paste it into the chat. The AI "sees" pixels. But pixels have no element IDs, no computed styles, no z-index, no ARIA roles. The AI can't tell you which button is covered — just that something looks off. And vision tokens aren't cheap.
Raw HTML.
Dump the page source. 2MB of scripts, nav menus, analytics tags, third-party widgets. The context window fills up before the AI reads anything useful. The signal is buried under 600K tokens of noise.
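The order of magnitude checks out with a back-of-the-envelope estimate. Assuming a rough average of 3–4 characters per token (a common heuristic; the exact ratio depends on the tokenizer):

```python
# Back-of-the-envelope token cost of dumping raw page source.
# chars_per_token is an illustrative average, not a tokenizer constant.
page_bytes = 2_000_000      # a 2 MB page source
chars_per_token = 3.3       # rough heuristic for markup-heavy text
tokens = page_bytes / chars_per_token
print(f"~{tokens / 1000:.0f}K tokens")
```

A 2MB dump lands in the hundreds of thousands of tokens before the AI has read a single useful element.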
Accessibility trees.
Better in theory. Structured, semantic. But AXTrees miss computed styles, miss positions, miss z-index stacks. What assistive tech perceives is not what gets rendered. In benchmarks, AXTree-based agents hit about 60% task accuracy.
None of these give AI what it needs: a structured, token-efficient representation of what the user actually sees.
What we've been building
If you've followed this series, you know about SiFR — Structured Interaction Format. A semantic representation of the live DOM that scores every element by importance, tags actions ([clickable], [fillable], [hoverable]), maps spatial relationships, and fits in 5–15KB instead of 200–400KB.
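To make that concrete, here is a hypothetical sketch of what one SiFR snapshot might carry. The field names (`id`, `role`, `actions`, `importance`, `box`) and the 0–1 importance scale are my illustrative assumptions, not the actual format:

```python
# A simplified, hypothetical SiFR-style snapshot.
# Field names and the importance scale are illustrative assumptions.
snapshot = {
    "url": "https://example.com/checkout",
    "elements": [
        {"id": "e17", "role": "button", "label": "Place order",
         "actions": ["clickable"], "importance": 0.94,
         "box": {"x": 960, "y": 612, "w": 180, "h": 44}},
        {"id": "e03", "role": "textbox", "label": "Email",
         "actions": ["fillable"], "importance": 0.81,
         "box": {"x": 320, "y": 240, "w": 420, "h": 40}},
        {"id": "e42", "role": "link", "label": "Privacy policy",
         "actions": ["clickable"], "importance": 0.12,
         "box": {"x": 40, "y": 1180, "w": 110, "h": 18}},
    ],
}

# Importance ranking lets an agent read the page most-relevant-first.
ranked = sorted(snapshot["elements"], key=lambda e: e["importance"], reverse=True)
print([e["label"] for e in ranked])  # ['Place order', 'Email', 'Privacy policy']
```

Even this toy version shows the trade: a handful of scored, typed, positioned elements instead of megabytes of markup.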
For the past few months, SiFR has lived inside a browser extension. You click, it captures, you paste the JSON into whatever LLM you use. Manual. And yet: real active users, zero marketing budget. People find it, use it, and don't uninstall it.
That part works. But there's a ceiling.
Copy-paste is friction. The AI can see a snapshot — but only when you hand it one. It can't ask for another angle. It can't say "now scroll down and show me the footer." It can't click the button and see what happens.
We're removing that ceiling.
AI gets eyes. And hands.
We're building an MCP Server that connects AI tools directly to your browser.
Model Context Protocol is the open standard that Claude, ChatGPT, Cursor, and others use to connect to external tools. Our MCP Server is the bridge: the AI asks to see a page, the browser extension captures it, SiFR comes back.
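On the wire, MCP speaks JSON-RPC 2.0, and a tool invocation uses the spec's `tools/call` method. The tool name `sifr_snapshot` and its arguments below are hypothetical stand-ins, not the server's actual API:

```python
import json

# A JSON-RPC 2.0 "tools/call" request, the framing MCP defines for
# invoking a tool. Tool name and arguments are hypothetical placeholders.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "sifr_snapshot",         # hypothetical tool name
        "arguments": {"tab": "active"},  # hypothetical arguments
    },
}
wire = json.dumps(request)
print(wire)
```

The AI client sends something shaped like this; the server routes it to the extension, and the SiFR payload comes back in the result.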
But it's not just capture anymore.
The AI will be able to:
- See the page — structured SiFR snapshot, importance-ranked, 10–50x smaller than raw HTML
- Find elements — "show me all buttons on this form" returns IDs, labels, states, positions
- Click — the extension clicks, the AI sees the result
- Fill forms — type values, submit, confirm what changed
- Scroll — navigate the page, capture new state
- Detect changes — "what's different after that click?" → SiFR diff
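The diff idea in particular is simple to sketch. Assuming each element carries a stable `id` (an assumption on my part; the real format may key elements differently), "what changed after that click?" reduces to a set comparison:

```python
def sifr_diff(before, after):
    """Naive diff of two SiFR-like snapshots keyed by element id.
    The "elements"/"id" field names are illustrative assumptions."""
    b = {e["id"]: e for e in before["elements"]}
    a = {e["id"]: e for e in after["elements"]}
    return {
        "added":   sorted(a.keys() - b.keys()),
        "removed": sorted(b.keys() - a.keys()),
        "changed": sorted(i for i in a.keys() & b.keys() if a[i] != b[i]),
    }

before = {"elements": [
    {"id": "e1", "label": "Submit", "state": "enabled"},
    {"id": "e2", "label": "Cancel", "state": "enabled"},
]}
after = {"elements": [
    {"id": "e1", "label": "Submit", "state": "disabled"},          # mutated
    {"id": "e3", "label": "Order confirmed", "state": "visible"},  # appeared
]}
print(sifr_diff(before, after))
# {'added': ['e3'], 'removed': ['e2'], 'changed': ['e1']}
```

Instead of re-sending the whole page after every action, the AI gets only the delta: what appeared, what vanished, what mutated.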
Your AI agent stops guessing. It sees. And now it acts.
Why this matters
The era of AI agents is here. Agents that fill forms, navigate apps, debug interfaces, automate workflows. They all need one thing: accurate perception of the UI.
The current approach — raw HTML dumps, pixel screenshots, brittle CSS selectors — doesn't scale. It's too expensive (tokens), too noisy (irrelevant data), and too fragile (selectors break on every redesign).
SiFR through MCP is a different architecture:
Before: AI → raw HTML (200–400KB) → guess → hallucinate selectors
After: AI → SiFR (5–15KB) → see → act on what's actually there
The extension does the heavy lifting in your browser. The MCP Server routes. The AI gets structured, actionable context.
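A "find elements" query in this architecture can be as simple as a filter over the structured snapshot. This is a sketch over assumed field names, not the actual implementation:

```python
def find_elements(snapshot, role=None, action=None):
    """Return id/label/box for matching elements in a SiFR-like snapshot.
    Field names ("role", "actions", "box") are illustrative assumptions."""
    hits = []
    for e in snapshot["elements"]:
        if role is not None and e.get("role") != role:
            continue
        if action is not None and action not in e.get("actions", ()):
            continue
        hits.append({"id": e["id"], "label": e["label"], "box": e["box"]})
    return hits

snapshot = {"elements": [
    {"id": "e5", "role": "button", "label": "Sign up",
     "actions": ["clickable"], "box": {"x": 600, "y": 80, "w": 120, "h": 40}},
    {"id": "e6", "role": "textbox", "label": "Email",
     "actions": ["fillable"], "box": {"x": 300, "y": 80, "w": 280, "h": 40}},
]}

# "show me all buttons on this form"
print(find_elements(snapshot, role="button"))
```

No CSS selectors to hallucinate, no pixels to squint at: the query runs against data the extension already captured.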
What's next
The MCP Server is in testing. The extension is live — Chrome, Firefox, and every Chromium browser.
When MCP launches, I'll post the full walkthrough here.
Follow if you want the announcement. Star the prompts repo if you want to start experimenting now — we're adding MCP-specific prompt templates.
Your tests check if code works. This checks if users can use it.
This is part 15 of the Runtime Snapshots series — exploring how structured browser data changes the way we build, test, and ship software.