Why I Built a Filesystem for the Browser
Browser automation for AI agents has an impedance mismatch problem. We feed agents high-entropy noise — raw HTML, pixel screenshots, brittle CSS selectors — and expect them to produce low-entropy, structured actions. The result is fragile, expensive, and full of silent failure modes.
DOMShell fixes this by exposing the browser's Accessibility Tree as a virtual filesystem. Agents navigate pages with ls, cd, grep, click — the same commands they already know from every Unix system in their training data. No new API to learn. No screenshots to parse. No selectors to break.
The Impedance Mismatch
Three approaches dominate browser automation today. All three mismatch the representation to the task.
Screenshots + vision models. The agent takes a screenshot, sends it to a vision model, gets back pixel coordinates, clicks, takes another screenshot. This burns vision tokens on every action, adds a full round-trip per interaction, and fails silently when coordinates shift. A button at (450, 320) moves to (450, 380) because a cookie banner loaded. The agent clicks empty space and doesn't know it. You debug ghost interactions.
CSS selectors / XPath. Trading pixel fragility for structural fragility. #main > div:nth-child(3) > ul > li:first-child > a breaks when a wrapper div gets added. Even semantic selectors like [data-testid="submit"] depend on the site's developers having added test IDs. Most haven't. And the agent still needs to reason over raw HTML to write the query — thousands of tokens of noise for a single interaction.
Coordinate-based clicks. Resolution-dependent. Viewport-dependent. Zoom-dependent. Responsive-layout-dependent. Every variable the agent can't control becomes a failure mode.
The common problem: all three approaches force the agent to work with a representation that wasn't designed for programmatic navigation.
The Accessibility Tree Is the Right Abstraction
Browsers already solved "navigate this page without looking at it." The Accessibility Tree — the internal representation that screen readers consume — is deterministic, semantic, and compact. Every button knows it's a button. Every link carries its href. Every input has a label and type. No invisible wrapper divs, no CSS noise, no layout-dependent coordinates.
The AX tree is the low-entropy, structured signal that agents need. The question was how to expose it.
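To make "compact and semantic" concrete, here is a minimal sketch of what an AX tree node carries. The type and field names are illustrative assumptions, not DOMShell's actual types:

```typescript
// Hypothetical minimal shape of an Accessibility Tree node.
// Names are illustrative, not taken from DOMShell's source.
interface AXNode {
  role: string;          // "button", "link", "textbox", ...
  name: string;          // accessible name ("Submit", "Contact us")
  value?: string;        // current value, for inputs
  children: AXNode[];
}

// One compact line per node: the entire signal an agent needs,
// versus the hundreds of HTML tokens behind it.
function renderLine(n: AXNode): string {
  return `${n.role}: "${n.name}"` + (n.value !== undefined ? ` = ${n.value}` : "");
}
```

A button renders as `button: "Submit"`; the surrounding divs, classes, and inline styles never reach the agent at all.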
The Filesystem Metaphor
The AX tree has a natural hierarchy. Container elements (navigation, main content, sidebars, forms) map to directories. Interactive elements (buttons, links, inputs) map to files. The whole structure maps cleanly to a filesystem — and every LLM already knows how to operate a filesystem.
ls to see what's on a page. cd to scope into a section. cat to inspect an element. grep to search. find to discover by type. click to interact. text to bulk-extract. These commands are in every model's training data. Zero-shot usability.
dom@shell:$ cd %here%
✓ Entered tab 386872589
Title: Wikipedia
URL: https://www.wikipedia.org/
dom@shell:$ ls
[d] main/
[d] contentinfo/
dom@shell:$ cd main
dom@shell:$ tree 2
main/
├── [d] top_languages/
│ ├── [x] english_7141000_articles_link
│ ├── [x] deutsch_3099000_artikel_link
│ ├── [x] français_2740000_articles_link
│ └── ...
├── [d] search/
│ └── [x] search_input
└── [x] read_wikipedia_in_your_language_btn
The page is a directory tree. submit search_input "Artificial intelligence" navigates. The tree auto-refreshes. You're now looking at the article's filesystem.
No screenshots. No coordinates. No selectors.
Aggressive Flattening
The raw AX tree is noisy. Hundreds of wrapper nodes — role=generic, role=none, unnamed divs — exist for CSS layout, not semantics. Without filtering, you get listings of generic_1, generic_2, generic_3 with no indication of what anything is.
DOMShell's VFS mapper (vfs_mapper.ts) recursively flattens through non-semantic nodes, promoting their children up. If a role=generic node has one child, the child replaces it. The result is a clean tree where every visible element has a name derived from its accessible name and role: submit_btn, contact_us_link, email_input. Duplicates get disambiguated with _2, _3.
This is the core design decision. Minimizing node bloat maximizes agent signal-to-noise. Every flattened wrapper node is a token the model doesn't waste reasoning about.
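The flattening pass can be sketched in a few lines. This is my reconstruction of the idea, not DOMShell's actual vfs_mapper.ts: the node shape, the set of non-semantic roles, and the role-suffix abbreviations are assumptions.

```typescript
// Sketch of the flattening pass described above (assumed node shape).
interface AXNode {
  role: string;
  name: string;
  children: AXNode[];
}

// Roles that exist for CSS layout, not semantics (assumed list).
const NON_SEMANTIC = new Set(["generic", "none", "presentation"]);

// Recursively drop wrapper nodes, promoting their children upward.
function flatten(node: AXNode): AXNode[] {
  const kids = node.children.flatMap(flatten);
  if (NON_SEMANTIC.has(node.role)) {
    return kids;                       // wrapper: children replace it
  }
  return [{ ...node, children: kids }];
}

// Role abbreviations matching the example names in the post (assumed).
const SUFFIX: Record<string, string> = { button: "btn", textbox: "input", link: "link" };

// Derive filesystem names from accessible name + role,
// disambiguating duplicates with _2, _3, ...
function toFsNames(nodes: AXNode[]): string[] {
  const seen = new Map<string, number>();
  return nodes.map((n) => {
    const slug = n.name.toLowerCase().replace(/\W+/g, "_");
    const base = `${slug}_${SUFFIX[n.role] ?? n.role}`;
    const count = (seen.get(base) ?? 0) + 1;
    seen.set(base, count);
    return count === 1 ? base : `${base}_${count}`;
  });
}
```

A `generic` wrapping a "Submit" button and a "Contact us" link flattens to two named entries, `submit_btn` and `contact_us_link`, with nothing left of the wrapper.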
Architecture
Three components, cleanly separated.
Chrome Extension (the kernel). A background service worker runs the shell: command parsing, AX tree traversal via CDP, filesystem mapping, DOM change detection. The side panel is a thin terminal (React + Xterm.js) — no logic, just I/O. The service worker reads the AX tree through chrome.debugger (Chrome DevTools Protocol 1.3), including cross-iframe discovery via Page.getFrameTree.
MCP Server (the bridge). A standalone Node.js HTTP server on localhost:3001 that any MCP-compatible client connects to — Claude Desktop, Claude Code, Cursor, Windsurf, Gemini CLI. Translates MCP tool calls into shell commands, pipes them to the extension over WebSocket (localhost:9876), streams results back. Multiple clients can connect simultaneously.
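The translation step in the bridge is conceptually a pure function from MCP tool call to shell command string. A sketch, with the caveat that the tool names mirror the shell commands in this post while the actual MCP tool schema is my assumption:

```typescript
// Hypothetical MCP-call-to-shell-command translation.
// Argument shapes are illustrative; DOMShell's real schema may differ.
interface ToolCall {
  tool: string;
  args: Record<string, string>;
}

function toShellCommand(call: ToolCall): string {
  switch (call.tool) {
    case "ls":    return `ls ${call.args.path ?? "."}`;
    case "click": return `click ${call.args.target}`;
    case "grep":  return `grep ${JSON.stringify(call.args.pattern)} ${call.args.path ?? "."}`;
    default:      throw new Error(`unknown tool: ${call.tool}`);
  }
}
```

In the real server, the resulting string would be piped over the WebSocket to the extension and the output streamed back as the tool result.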
Security tiers. Read-only by default — agents can browse but not act. Write commands (click, type, scroll, js) require --allow-write. Cookie access (whoami) requires --allow-sensitive. Domain allowlists restrict which sites agents can operate on. Every command is audit-logged with timestamps. Auth tokens gate the WebSocket bridge.
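The tier check reduces to a small gate function. The flag names mirror the CLI flags above; the exact command-to-tier assignment here is an assumption:

```typescript
// Sketch of the tiered permission check (assumed command sets).
interface Flags {
  allowWrite: boolean;
  allowSensitive: boolean;
  allowedDomains?: string[];   // undefined = no domain restriction
}

const WRITE_CMDS = new Set(["click", "type", "submit", "scroll", "js"]);
const SENSITIVE_CMDS = new Set(["whoami"]);

function isAllowed(cmd: string, domain: string, flags: Flags): boolean {
  if (flags.allowedDomains && !flags.allowedDomains.includes(domain)) return false;
  if (WRITE_CMDS.has(cmd) && !flags.allowWrite) return false;
  if (SENSITIVE_CMDS.has(cmd) && !flags.allowSensitive) return false;
  return true;                 // read-only commands pass by default
}
```

With default flags, `ls` and `grep` pass while `click` is refused, which is exactly the read-only posture described above.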
The separation is deliberate. You can use DOMShell interactively without the MCP server. You can let an agent browse your tabs without it being able to click "Delete Account."
Benchmark: 50% Fewer Tool Calls
I ran 8 trials across 4 tasks using Claude Opus 4.6 with both DOMShell and Anthropic's built-in browser automation, Claude in Chrome (CiC). The metric was tool call count, which is directly proportional to latency and API cost.
DOMShell averaged 4.3 calls per task vs Claude in Chrome's 8.6 — a 50% reduction.
The biggest win was content extraction: DOMShell completed it in 2 calls (navigate + extract) where CiC needed 5-6. The filesystem metaphor lets the agent scope to the right section (cd main/article) and bulk-extract (text) in a single call, rather than navigating through read_page results iteratively.
Where CiC holds an edge is raw JavaScript execution: javascript_exec can batch multiple DOM operations into a single call. DOMShell's answer is the for + script + each pipeline, which collapses multi-page workflows into 1-2 calls by iterating over command output and replaying saved scripts across URLs.
The 50% reduction translates directly to cost and latency. For agents running in production — where every tool call is an API round-trip — halving the call count is a meaningful operational improvement.
What's Next
DOMShell is open source (MIT) and free. On the roadmap: a headless mode — a self-contained Chromium process that agents launch directly for CI pipelines and server-side automation where no visible browser is needed.
The for + script + each pipeline is where the compound efficiency gains live. Save a command sequence as a script, then replay it across N URLs in a single call: roughly 2N tool calls become 2. For any agent doing research, extraction, or monitoring across multiple pages, that's a step change.
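In session form, the idea looks something like this. The command syntax below is illustrative only; I'm guessing at the for/script/each grammar, so consult the repo for the real one:

dom@shell:$ script save extract_title "cd main; text heading"
dom@shell:$ for url in results.txt each extract_title

Two tool calls total, regardless of how many URLs are in the list.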
The browser is your filesystem. ls it.
GitHub: github.com/apireno/DOMShell
Project Page: DOMShell
Built by Pireno