I am building a browser that will never render a single pixel.
No address bar. No tabs. No bookmarks. No window at all. Nobody will ever "open" Plasmate the way they open Chrome or Firefox or Safari. It has no visual interface because its consumer has no eyes.
This sounds absurd until you think about who is actually browsing the web in 2026.
The absurd premise that turned out to be obvious
When I first described Plasmate to other developers, the reaction was usually some version of: "Why not just use Playwright? Or Puppeteer? Or headless Chrome?" These are reasonable questions. Those tools have headless modes. They can fetch pages and return HTML. The tools exist.
But consider what headless Chrome actually does when you ask it to "browse" a page. It launches a full rendering engine. It constructs a layout tree. It calculates pixel positions for every element. It composites layers. It rasterizes text into bitmap glyphs. It computes box shadows, border radii, gradient interpolations, and subpixel antialiasing. Then, if you are using it for an AI agent, you throw all of that away and extract the text.
This is like hiring a portrait painter to read you the newspaper. The painter's skills are extraordinary and completely irrelevant to the task.
Chrome was designed to turn HTML into pixels for human eyes. Every architectural decision, every optimization, every feature in Chrome serves that purpose. When you repurpose it for AI agents, you are paying the full cost of pixel rendering for a consumer that will never see a pixel.
The cost is not trivial. Chrome uses 200MB to 500MB of memory per page. It takes 1 to 3 seconds to render a complex page. At scale (an agent system monitoring hundreds of pages), this translates to gigabytes of RAM and minutes of compute spent on rendering that serves no purpose.
The question is not "can Chrome work for agents?" The answer is obviously yes. The question is "should agents pay the cost of pixel rendering when they need text comprehension?" The answer is obviously no.
That is why I am building a browser designed for a consumer that has no eyes.
What agents actually need from a browser
When I sat down to design Plasmate, I started by listing what an AI agent actually needs from a web browsing tool. The list was surprisingly short:
Fetch the page. Make an HTTP request, follow redirects, handle TLS, manage cookies. This is table stakes.
Execute JavaScript. Modern web pages are applications. The HTML document that arrives over the network is often a shell that loads and executes JavaScript to produce the actual content. An agent browser must execute this JavaScript to see what a human would see.
Understand the structure. This is where every existing tool falls short. The agent does not need a pixel grid. It needs to know: what regions exist on this page (navigation, main content, sidebar, footer)? What elements are in each region? What type is each element (heading, paragraph, link, button, form field)? What can the agent do with each element (click, type, select, toggle)?
Produce structured output. The agent's downstream consumer is a language model. The output must be structured, typed, and token-efficient. Raw HTML fails all three criteria.
Notice what is not on this list: rendering pixels, computing layout, displaying fonts, animating transitions, playing audio or video, painting gradients, or any of the hundreds of other things a visual browser does. These capabilities represent the majority of Chrome's complexity and the majority of its resource consumption.
The architecture: what Plasmate actually does
Plasmate is written in Rust. This was a deliberate choice for performance and memory safety, but also because Rust's ecosystem has excellent HTML parsing (html5ever, the same parser Firefox uses) and a mature V8 binding for JavaScript execution.
The pipeline has five stages:
Stage 1: Network fetch
An HTTP client fetches the page with full TLS, redirect, and cookie support. This is straightforward and shared with every other browser. The difference is that Plasmate's HTTP client does not load CSS files, font files, or image files. It loads the HTML document and JavaScript files only. Everything visual is irrelevant.
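As a sketch, that fetch policy reduces to a content-type gate. The names and the exact type list below are illustrative, not Plasmate's actual code:

```rust
// Decide whether a subresource is worth fetching. Documents, scripts,
// and the data an SPA may request survive; everything visual is skipped.
#[derive(Debug, PartialEq)]
enum FetchDecision {
    Fetch,
    Skip,
}

fn should_fetch(content_type: &str) -> FetchDecision {
    // Drop parameters like "; charset=utf-8" and normalize case.
    let ct = content_type
        .split(';')
        .next()
        .unwrap_or("")
        .trim()
        .to_ascii_lowercase();
    match ct.as_str() {
        "text/html" | "application/xhtml+xml" => FetchDecision::Fetch,
        "text/javascript" | "application/javascript" => FetchDecision::Fetch,
        "application/json" => FetchDecision::Fetch, // data a script may load
        "text/css" => FetchDecision::Skip,
        _ if ct.starts_with("image/") => FetchDecision::Skip,
        _ if ct.starts_with("font/") => FetchDecision::Skip,
        _ if ct.starts_with("audio/") || ct.starts_with("video/") => FetchDecision::Skip,
        // Default closed: an unknown type is assumed to be presentation noise.
        _ => FetchDecision::Skip,
    }
}
```

The default-closed branch is the point: a browser for eyes must fetch everything it might paint, while a browser for agents can refuse anything it cannot read.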
Stage 2: JavaScript execution
The HTML document is parsed with html5ever, and JavaScript is executed via V8. This is necessary because many pages generate their content dynamically. React, Vue, Angular, and Next.js applications produce an empty <div id="root"></div> in the initial HTML and populate it entirely through JavaScript.
JavaScript execution is where most of the complexity lives. V8 is a large, sophisticated engine. But even V8's execution is faster and lighter than Chrome's full pipeline because we skip the rendering, layout, and painting phases that normally follow DOM construction.
In v0.5.0, we added ICU data loading for Intl API support and raised script fetch limits to handle large SPA bundles (up to 3MB per script, 10MB total). We also added graceful degradation: when JavaScript execution fails, Plasmate compiles the pre-JavaScript HTML and returns a partial SOM. Partial structured output is better than no output.
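The degradation path can be sketched like this. `run_scripts` is a toy stand-in that fails whenever a script tag is present, purely to exercise the fallback branch; none of these names are Plasmate's real API:

```rust
// A SOM result that records whether JavaScript execution succeeded.
struct Som {
    partial: bool, // true when only the static, pre-JavaScript HTML was compiled
    html: String,  // the DOM (as HTML) that the SOM was compiled from
}

// Stand-in for V8 execution: returns the post-JavaScript DOM as HTML,
// or an error. Simulated here by failing on any page with a script tag.
fn run_scripts(html: &str) -> Result<String, String> {
    if html.contains("<script") {
        Err("script error (simulated)".to_string())
    } else {
        Ok(html.to_string())
    }
}

// Compile a SOM, degrading to the pre-JavaScript HTML on failure.
// Partial structured output beats no output.
fn compile_with_fallback(raw_html: &str) -> Som {
    match run_scripts(raw_html) {
        Ok(dom) => Som { partial: false, html: dom },
        Err(_) => Som { partial: true, html: raw_html.to_string() },
    }
}
```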
Stage 3: Region detection
Once the DOM is constructed (with JavaScript applied), Plasmate identifies semantic regions on the page. The detection uses a precedence chain:
First, ARIA roles. If an element has role="navigation" or role="main", that is definitive.
Second, HTML5 landmark elements. <nav>, <main>, <aside>, <header>, <footer>, <dialog>, and <form> map directly to region roles.
Third, class and ID heuristics. A <div class="main-content"> is likely the main region. A <div id="sidebar"> is likely an aside.
Fourth, link density analysis. A container with many links and few other elements is likely navigation, even without explicit markup.
Fifth, content heuristics. A container with copyright notices and privacy links is likely a footer.
Sixth, fallback. Anything not assigned to a specific region goes into a generic "content" region.
This detection produces a structured map of the page that no flat text extraction can replicate. An agent reading Plasmate output can go directly to the main region without scanning the entire page.
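The precedence chain can be sketched as a single function. The struct fields, thresholds, and keyword lists below are illustrative stand-ins; the real detector walks the full DOM:

```rust
// A DOM element reduced to the signals the precedence chain inspects.
struct Element<'a> {
    aria_role: Option<&'a str>, // explicit role="..." attribute, if any
    tag: &'a str,               // lowercased tag name
    class_id: &'a str,          // class and id attributes, concatenated
    text: &'a str,              // visible text content
    link_count: usize,          // descendant links
    child_count: usize,         // total descendant elements
}

fn detect_region(el: &Element) -> &'static str {
    // 1. ARIA roles are definitive.
    match el.aria_role {
        Some("navigation") => return "navigation",
        Some("main") => return "main",
        Some("complementary") => return "aside",
        _ => {}
    }
    // 2. HTML5 landmark elements map directly to region roles.
    match el.tag {
        "nav" => return "navigation",
        "main" => return "main",
        "aside" => return "aside",
        "header" => return "header",
        "footer" => return "footer",
        _ => {}
    }
    // 3. Class and ID heuristics.
    let hint = el.class_id.to_ascii_lowercase();
    if hint.contains("main") {
        return "main";
    }
    if hint.contains("sidebar") {
        return "aside";
    }
    // 4. Link density: mostly links and little else suggests navigation.
    if el.child_count > 2 && el.link_count * 2 > el.child_count {
        return "navigation";
    }
    // 5. Content heuristics: copyright and privacy boilerplate suggests a footer.
    let text = el.text.to_ascii_lowercase();
    if text.contains("copyright") || text.contains("privacy") {
        return "footer";
    }
    // 6. Fallback: everything else lands in a generic content region.
    "content"
}
```

Note the ordering: each step only runs if every stronger signal was absent, which is what makes explicit markup (ARIA, landmarks) always win over guesswork.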
Stage 4: Element classification
Within each region, elements are classified by semantic role. Plasmate recognizes 15 element types: link, button, text_input, textarea, select, checkbox, radio, heading, image, list, table, paragraph, section, separator, and details (disclosure widgets).
Each element receives:
A stable identifier derived from SHA-256 hashing of the element's origin, role, accessible name, and DOM path. The same element on the same page always produces the same ID.
An html_id field preserving the original HTML id attribute (when present), enabling agents to resolve back to the DOM for interaction.
An actions array declaring what the agent can do: click, type, clear, select, or toggle.
An attrs object with role-specific data: href for links, level for headings, options for selects, headers and rows for tables, open state and summary text for details widgets.
An aria sub-object capturing dynamic widget state: expanded, selected, checked, disabled, current, pressed, hidden.
Semantic hints inferred from CSS class names: "primary," "danger," "disabled," "active." These are not visual styles but semantic signals that agents can use to understand element importance.
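Stable ID derivation can be sketched as follows. Plasmate hashes with SHA-256; since that needs an external crate, this sketch substitutes the standard library's `DefaultHasher`, which preserves the property that matters here: the same inputs always yield the same ID.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Derive a stable element identifier from origin, role, accessible name,
/// and DOM path. DefaultHasher is a stand-in for the SHA-256 used in
/// Plasmate; what the sketch demonstrates is determinism, not the digest.
fn stable_id(origin: &str, role: &str, name: &str, dom_path: &str) -> String {
    let mut h = DefaultHasher::new();
    for field in [origin, role, name, dom_path] {
        // Hash the field length too, so field boundaries stay unambiguous.
        field.len().hash(&mut h);
        field.hash(&mut h);
    }
    format!("el-{:016x}", h.finish())
}
```

Because the ID is a pure function of origin, role, name, and path, an agent can cache a plan ("click `el-…`") and replay it on a later visit to the same page.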
Stage 5: Serialization
The classified regions and elements are serialized as JSON conforming to the SOM Spec v1.0. The output is deterministic: the same page always produces the same JSON (modulo dynamic content changes).
The compilation step itself, from HTML string to SOM JSON, takes microseconds. The bottleneck is network fetch and JavaScript execution, not SOM compilation.
What this means in practice
The practical impact of this architecture is measurable. Across 50 real websites in our WebTaskBench evaluation:
Token reduction. SOM uses 8,301 tokens per page on average versus 33,181 for raw HTML. That is a 4x reduction. For navigation-heavy pages, the ratio reaches 5.4x. For adversarial pages (heavy ads, cookie banners, JavaScript noise), it reaches 6.0x.
Latency improvement. On Claude Sonnet 4, SOM is the fastest representation at 8.5 seconds average, compared to 16.2 seconds for HTML and 25.2 seconds for Markdown. Structured input reduces model reasoning time even compared to smaller unstructured input.
Memory efficiency. Plasmate uses approximately 30MB for 100 pages. Headless Chrome uses approximately 20GB for the same workload. The difference is the rendering pipeline that Plasmate skips entirely.
Speed. With daemon mode (a persistent process that keeps the browser warm), subsequent fetches complete in 200 to 400 milliseconds. Cold start is 2 to 3 seconds. This is competitive with simple HTTP-fetch-plus-readability tools while providing dramatically richer output.
Why not just use Markdown?
I get this question frequently, and it deserves a thorough answer.
Markdown extraction (via tools like Jina Reader, Firecrawl, or basic readability libraries) is the most common alternative to raw HTML for agent consumption. It works well for text extraction tasks. In our benchmark, Markdown uses 4,542 tokens per page, which is smaller than SOM's 8,301.
But Markdown has a fundamental limitation: it cannot represent interactivity. A Markdown document cannot tell an agent which text is a button, which is a link, which is a form field, or what actions are available. For an agent that needs to read an article and summarize it, this does not matter. For an agent that needs to fill a form, navigate a multi-step workflow, or click through search results, Markdown is blind.
The latency data reinforces this. On Claude, Markdown is slower than SOM despite being smaller. Our interpretation is that Claude spends additional reasoning time trying to reconstruct page structure from ambiguous text. When the task requires understanding what is interactive and what is not, the model has to guess from context rather than reading explicit declarations.
SOM occupies the middle ground: smaller than HTML, structured unlike Markdown, and fast for models to process because the semantic work is done at compile time rather than inference time.
The decisions that shaped the architecture
Several architectural decisions in Plasmate deserve explanation because they diverge from what most people would expect from a browser project.
Why Rust?
The obvious choice for a browser project is C++ (what Chrome and Firefox use) or JavaScript/TypeScript (what most developer tools use). Rust is unusual.
I chose Rust for three reasons. First, memory safety without garbage collection. A browser engine processes untrusted input (HTML, JavaScript, CSS) from arbitrary websites. Memory safety bugs in browser engines are the largest category of security vulnerabilities in Chrome and Firefox. Rust eliminates entire classes of these bugs at compile time.
Second, performance. The SOM compilation pipeline processes every element in the DOM tree, computes SHA-256 hashes for stable IDs, runs heuristic analysis on class names and content patterns, and serializes the result as JSON. In Rust, this entire pipeline runs in microseconds per page. In a garbage-collected language, the memory allocation patterns would introduce pauses.
Third, Rust's ecosystem has exactly the libraries needed. html5ever (the HTML parser from Mozilla's Servo project) and the V8 crate (Rust bindings for Google's JavaScript engine) provide production-quality foundations. The serde library provides zero-cost JSON serialization. These are not wrappers or bindings with impedance mismatches. They are native Rust libraries designed for high-performance text processing.
Why not use an existing rendering engine?
Blink (Chrome's rendering engine) and Gecko (Firefox's) are extraordinary pieces of engineering. They handle every edge case in CSS layout, every quirk in the HTML specification, and every performance optimization needed for smooth visual rendering.
But they are designed around a fundamental assumption: the output is a pixel grid on a screen. Every data structure, every caching strategy, every parallelization decision in these engines optimizes for that output target. Repurposing them for structured text output means carrying all of that complexity while using none of it.
Plasmate uses html5ever for DOM construction and V8 for JavaScript execution, but it builds its own pipeline for everything after that. Region detection, element classification, stable ID generation, ARIA state capture, and SOM serialization are all custom. This is not because existing engines cannot do these things. It is because doing them well requires different architectural assumptions than visual rendering demands.
Why is the output JSON and not something else?
SOM is serialized as JSON because JSON is the lingua franca of agent frameworks. Every programming language can parse JSON. Every LLM API accepts text that includes JSON. Every agent framework stores and transmits structured data as JSON.
We considered alternatives. Protocol Buffers would be smaller on the wire but harder for agents to read inline. XML would be semantically richer but token-heavy. YAML would be human-readable but ambiguous. MessagePack would be compact but binary.
JSON won because the primary consumer is a language model reading text. The JSON representation of a SOM document is directly readable by the model as context. No deserialization step is needed. The model sees the structure, the types, and the values in the same stream.
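To make that concrete, here is a hand-rolled sketch of serializing one classified element to SOM-style JSON. The real compiler uses serde, and the exact SOM Spec v1.0 field names may differ; the fields below follow the ones described earlier (id, role, name, actions, attrs):

```rust
// One classified element, reduced to a few of the fields described above.
struct SomElement {
    id: String,
    role: String,
    name: String,
    actions: Vec<String>,
    href: Option<String>, // role-specific attr, present for links
}

/// Minimal JSON string escaping (backslashes first, then quotes).
fn esc(s: &str) -> String {
    s.replace('\\', "\\\\").replace('"', "\\\"")
}

impl SomElement {
    /// Deterministic serialization: fields are emitted in a fixed order,
    /// so the same element always produces byte-identical JSON.
    fn to_json(&self) -> String {
        let actions = self
            .actions
            .iter()
            .map(|a| format!("\"{}\"", esc(a)))
            .collect::<Vec<_>>()
            .join(",");
        let mut out = format!(
            "{{\"id\":\"{}\",\"role\":\"{}\",\"name\":\"{}\",\"actions\":[{}]",
            esc(&self.id), esc(&self.role), esc(&self.name), actions
        );
        if let Some(href) = &self.href {
            out.push_str(&format!(",\"attrs\":{{\"href\":\"{}\"}}", esc(href)));
        }
        out.push('}');
        out
    }
}
```

The fixed field order is what makes the output deterministic, and the resulting text is exactly what the model reads: structure, types, and values in one stream.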
The contrarian bet
Building Plasmate required a contrarian belief: that the web needs a new browser, not a new wrapper around Chrome.
The wrapper approach is popular. Playwright, Puppeteer, Browserbase, Steel, and many others wrap Chrome and add convenience APIs on top. This works, and it is a reasonable strategy for tools that need pixel-perfect browser fidelity.
But wrapping Chrome means accepting Chrome's architectural assumptions. You pay for pixel rendering even when you do not need pixels. You accept Chrome's memory model, Chrome's process architecture, Chrome's security sandbox, and Chrome's update cycle. These are excellent design decisions for a visual browser. They are unnecessary constraints for an agent browser.
Plasmate does not wrap Chrome. It uses the same HTML parser (html5ever, from Mozilla) and the same JavaScript engine (V8, from Google), but it constructs its own pipeline around them. The pipeline is designed for a specific consumer (AI agents) with specific needs (structured output, token efficiency, semantic understanding) that Chrome's pipeline was never intended to serve.
This is the same bet I made with Mautic. The marketing automation tools existed (Marketo, HubSpot, Pardot). But they were designed for a different consumer with different constraints. Building for the underserved consumer, rather than wrapping the existing tools, produced a fundamentally better product for that audience.
What comes next
Plasmate today handles the compilation side: HTML in, SOM out. The next challenges are:
JavaScript coverage. Some heavily dynamic sites still fail during JavaScript execution: Khan Academy, certain React applications with complex hydration, and sites with aggressive anti-bot measures. Each failure is a reason for an agent to fall back to a simpler tool. Closing these gaps is engineering work, not architectural work, and it is ongoing.
WASM distribution. We recently published the SOM compiler as a WebAssembly module (npm install plasmate-wasm). This allows SOM compilation in any JavaScript runtime without a native binary. It is the first step toward making Plasmate's compilation available everywhere JavaScript runs: serverless functions, edge workers, browsers, CI pipelines.
Publisher adoption. The plasmate compile command accepts HTML from files or stdin without any network requests. Publishers can integrate SOM generation into their build pipelines and serve structured representations alongside HTML. Six properties already do this. Growing that number is as important as improving the compiler.
Standards. The SOM Spec, the Agent Web Protocol, and the robots.txt extension proposal are all published openly. We are participating in the W3C Community Group for Web Content and Browser AI. If SOM becomes a web standard rather than a Plasmate feature, the entire ecosystem benefits.
The point
I am building a browser no human will ever use because humans are no longer the only consumers of the web. AI agents browse billions of pages per day, and every one of those pages is served in a format designed for human eyes. The waste is staggering: we estimated in a recent paper that HTML presentation noise costs the agent ecosystem $1 billion to $5 billion per year in unnecessary token consumption.
The solution is not to make agents better at reading HTML. The solution is to give agents a format designed for them, the same way the web gave search engines sitemaps and applications gave consumers APIs.
Plasmate is that format's compiler. A browser built for a consumer that will never see a pixel, because it does not need to.
David Hurley is the founder of Plasmate Labs. Previously, he founded Mautic, the world's first open source marketing automation platform. He writes at dbhurley.com/blog and publishes research at dbhurley.com/papers.