<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: David Hurley</title>
    <description>The latest articles on DEV Community by David Hurley (@dbhurley).</description>
    <link>https://dev.to/dbhurley</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3847728%2F4c1a7baf-362d-482d-b456-9cce4d70afa0.jpg</url>
      <title>DEV Community: David Hurley</title>
      <link>https://dev.to/dbhurley</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/dbhurley"/>
    <language>en</language>
    <item>
      <title>The Publisher's Dilemma: Block, Tolerate, or Cooperate</title>
      <dc:creator>David Hurley</dc:creator>
      <pubDate>Mon, 30 Mar 2026 15:09:44 +0000</pubDate>
      <link>https://dev.to/dbhurley/the-publishers-dilemma-block-tolerate-or-cooperate-4640</link>
      <guid>https://dev.to/dbhurley/the-publishers-dilemma-block-tolerate-or-cooperate-4640</guid>
      <description>&lt;p&gt;If you run a website with meaningful traffic, you have already encountered this problem, whether you know it or not. AI agents are crawling your pages. They are consuming your content. And you are getting almost nothing back.&lt;/p&gt;

&lt;p&gt;The numbers are stark. &lt;a href="https://blog.cloudflare.com/crawlers-click-ai-bots-training/" rel="noopener noreferrer"&gt;Cloudflare reported in August 2025&lt;/a&gt; that Anthropic's crawlers fetch approximately 38,000 pages for every visit they refer back to publishers. OpenAI's ratio is better but still lopsided. Perplexity's ratio actually got worse over the course of 2025, with crawling volume increasing while referral traffic decreased.&lt;/p&gt;

&lt;p&gt;Meanwhile, Google referrals to news sites have been declining since February 2025, coinciding with the expansion of AI Overviews. The Pew Research Center &lt;a href="https://www.pewresearch.org/short-reads/2025/07/22/google-users-are-less-likely-to-click-on-links-when-an-ai-summary-appears-in-the-results/" rel="noopener noreferrer"&gt;found&lt;/a&gt; that users are less likely to click through to a source when Google displays an AI summary at the top of search results.&lt;/p&gt;

&lt;p&gt;The traffic pipeline that has sustained the web publishing model for two decades is weakening. And the new consumers that are replacing human visitors are not playing by the same economic rules.&lt;/p&gt;

&lt;h2&gt;The two options publishers have today&lt;/h2&gt;

&lt;p&gt;Right now, publishers face a binary choice.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Option A: Block everything.&lt;/strong&gt; Add &lt;code&gt;Disallow: /&lt;/code&gt; for GPTBot, ClaudeBot, and every other AI user agent in your robots.txt. The New York Times, The Atlantic, and hundreds of other publishers have taken this approach. It stops the crawling, eliminates the infrastructure cost, and sends a clear signal about content rights.&lt;/p&gt;
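&lt;p&gt;As a minimal sketch, the blocking approach looks like this (GPTBot and ClaudeBot are the crawler user agents documented by OpenAI and Anthropic; a complete blocklist covers many more agents and needs ongoing maintenance):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight conf"&gt;&lt;code&gt;User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;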

&lt;p&gt;But it also makes you invisible to AI. When someone asks ChatGPT or Claude about a topic you cover deeply, your content will not be in the answer. As AI assistants become a primary interface for information discovery (and the trajectory is unmistakable), publishers who block agents are opting out of an increasingly important distribution channel. It is the equivalent of blocking Googlebot in 2005.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Option B: Allow everything.&lt;/strong&gt; Do nothing. Let the crawlers in. Serve them the same HTML you serve to browsers, with all the CSS, JavaScript, tracking scripts, ad markup, and layout containers. Hope that being included in AI training data and retrieval results translates to some indirect benefit.&lt;/p&gt;

&lt;p&gt;The cost of this option is concrete and measurable. Every agent request hits your CDN, your origin server, and your rendering pipeline. For sites with server-side rendering, each request triggers a full page build. The agent receives your 2.5MB page, extracts maybe 25% of the content, discards the rest, and sends no referral traffic in return.&lt;/p&gt;

&lt;h2&gt;What "cooperate" means&lt;/h2&gt;

&lt;p&gt;There is a third option that most publishers have not considered, because until recently it did not exist as a practical choice. Instead of blocking agents or serving them raw HTML, you can serve them a format designed for their consumption model.&lt;/p&gt;

&lt;p&gt;The idea is content negotiation for the agent era. When a browser requests your page, you serve HTML (as you always have). When an AI agent requests your page, you serve a structured semantic representation: smaller, faster to process, and containing only the information the agent actually needs.&lt;/p&gt;

&lt;p&gt;This is not a new concept in web architecture. HTTP content negotiation has existed since the 1990s. Servers already serve different content types based on the &lt;code&gt;Accept&lt;/code&gt; header (JSON for APIs, HTML for browsers, images in different formats based on browser support). Extending this pattern to serve structured representations for AI agents is a natural evolution.&lt;/p&gt;
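&lt;p&gt;The client side of that negotiation is a single header. A hypothetical sketch (the &lt;code&gt;application/som+json&lt;/code&gt; media type is the one proposed in this article, not an IANA-registered type, and the URL is illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Browser-style request: the server returns HTML as usual&lt;/span&gt;
curl https://example.com/article/xyz

&lt;span class="c"&gt;# Agent-style request: the server returns the structured representation&lt;/span&gt;
curl &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Accept: application/som+json"&lt;/span&gt; https://example.com/article/xyz
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;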

&lt;p&gt;The structured format we have been developing is called SOM (Semantic Object Model). It is a JSON document that organizes page content into typed regions (navigation, main content, sidebar, footer) containing typed elements (headings, paragraphs, links, buttons, form fields) with explicit declarations of available actions. It preserves what agents need and discards what they do not.&lt;/p&gt;

&lt;p&gt;But the specific format matters less than the principle: publishers can choose what agents receive, rather than letting agents extract whatever they can from HTML.&lt;/p&gt;
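&lt;p&gt;To make that concrete, here is a hand-written sketch of what a SOM-style document might contain. The field names are illustrative, not the published schema:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;{
  "url": "https://example.com/article/xyz",
  "regions": [
    {
      "type": "main",
      "elements": [
        { "id": "e_1", "type": "heading", "level": 1, "text": "Article title" },
        { "id": "e_2", "type": "paragraph", "text": "First paragraph of the article." },
        { "id": "e_3", "type": "link", "text": "Full report", "href": "/report", "actions": ["click"] }
      ]
    },
    {
      "type": "navigation",
      "elements": [
        { "id": "e_4", "type": "link", "text": "Home", "href": "/", "actions": ["click"] }
      ]
    }
  ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;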

&lt;h2&gt;The economics of cooperation&lt;/h2&gt;

&lt;p&gt;I published a &lt;a href="https://dbhurley.com/papers" rel="noopener noreferrer"&gt;detailed cost-benefit analysis&lt;/a&gt; as a research paper, but here is the summary.&lt;/p&gt;

&lt;p&gt;We modeled four publisher strategies across three tiers (small blog at 10K agent requests/month, mid-size news site at 1M/month, and large publisher at 50M/month) and compared the annual infrastructure costs of each.&lt;/p&gt;

&lt;p&gt;The savings come from three sources:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bandwidth reduction.&lt;/strong&gt; A structured representation is 5-15KB compared to 30-100KB+ for a full HTML page with its associated resources. When you are serving millions of agent requests per month, the bandwidth difference is substantial.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Compute reduction.&lt;/strong&gt; This is the big one for dynamic sites. Server-side rendered pages require a full render cycle per request: template compilation, database queries, component rendering, HTML serialization. A cached SOM representation is a static JSON file served from CDN with zero origin compute. For sites where SSR costs dominate (which is most modern web applications), this is where the majority of savings come from.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bot management simplification.&lt;/strong&gt; Publishers currently spend significant effort on WAF rules, rate limiting, and bot detection to manage agent traffic. Cooperative serving through a dedicated endpoint reduces the adversarial dynamic. You are explicitly offering content for agents through a channel you control, rather than playing defense against crawlers hitting your production infrastructure.&lt;/p&gt;

&lt;h2&gt;Beyond cost: what cooperation gets you&lt;/h2&gt;

&lt;p&gt;Cost savings are the easiest benefit to quantify, but they are not the most important one.&lt;/p&gt;

&lt;h3&gt;Content control&lt;/h3&gt;

&lt;p&gt;When you serve raw HTML to agents, you have no control over what they extract. They might grab a sidebar advertisement and present it as your editorial content. They might pull a number from a related-articles widget and attribute it to your reporting. The Air Canada chatbot &lt;a href="https://dbhurley.com/blog/when-ai-reads-the-web-wrong" rel="noopener noreferrer"&gt;got a bereavement policy wrong&lt;/a&gt; because it could not distinguish policy text from surrounding content.&lt;/p&gt;

&lt;p&gt;When you serve a structured representation, you define exactly what the agent receives. The main article is explicitly labeled as main content. The sidebar is labeled as complementary. Advertisements and tracking are excluded entirely. You are curating the agent's view of your page, the same way you curate the search engine's view through structured data and the API consumer's view through endpoint design.&lt;/p&gt;

&lt;h3&gt;Attribution&lt;/h3&gt;

&lt;p&gt;This is the piece that excites me most about the cooperative model. A structured representation can include provenance metadata: unique identifiers for each content element that enable agents to cite specific sources. Not just "according to nytimes.com," but "according to paragraph 3 of the main content region on nytimes.com/article/xyz, published March 15, 2026."&lt;/p&gt;

&lt;p&gt;This creates the foundation for an attribution system that does not exist today. When agents cite structured sources with element-level provenance, publishers can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Track which content elements are most frequently cited by agents&lt;/li&gt;
&lt;li&gt;Measure their influence in the agent-mediated information ecosystem&lt;/li&gt;
&lt;li&gt;Build a case for licensing based on demonstrated usage data&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;Future-proofing&lt;/h3&gt;

&lt;p&gt;Agent traffic is growing at 18-30% year over year. Some individual crawlers are growing 300%+ annually. The publishers who build cooperative infrastructure now will have it in place when agent-mediated content discovery becomes the primary channel. The publishers who wait will scramble to catch up, the same way many scrambled to adopt SEO, social sharing, and mobile-responsive design after those transitions were already well underway.&lt;/p&gt;

&lt;h2&gt;How it actually works&lt;/h2&gt;

&lt;p&gt;There are three levels of implementation, from simple to comprehensive:&lt;/p&gt;

&lt;h3&gt;Level 1: Static SOM files&lt;/h3&gt;

&lt;p&gt;Generate SOM representations at build time and serve them as static files alongside your HTML. This works for any site with a build pipeline (Hugo, Next.js, Astro, Jekyll, etc.).&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install the SOM compiler&lt;/span&gt;
npm &lt;span class="nb"&gt;install &lt;/span&gt;plasmate-wasm

&lt;span class="c"&gt;# In your build script, after generating HTML:&lt;/span&gt;
&lt;span class="k"&gt;for &lt;/span&gt;file &lt;span class="k"&gt;in &lt;/span&gt;public/&lt;span class="k"&gt;**&lt;/span&gt;/&lt;span class="k"&gt;*&lt;/span&gt;.html&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
  &lt;/span&gt;plasmate compile &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$file&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;file&lt;/span&gt;&lt;span class="p"&gt;%.html&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;.som.json"&lt;/span&gt;
&lt;span class="k"&gt;done&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Agents discover SOM files through a &lt;code&gt;.well-known/som.json&lt;/code&gt; manifest or &lt;code&gt;&amp;lt;link rel="alternate"&amp;gt;&lt;/code&gt; tags in your HTML. No server-side logic required.&lt;/p&gt;
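&lt;p&gt;A discovery tag in the page head might look like this sketch (the path follows the &lt;code&gt;/.well-known/som/&lt;/code&gt; pattern and is illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&amp;lt;link rel="alternate" type="application/som+json"
      href="/.well-known/som/articles/example.json"&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;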

&lt;h3&gt;Level 2: Content negotiation&lt;/h3&gt;

&lt;p&gt;Add a middleware that checks the &lt;code&gt;Accept&lt;/code&gt; header or user agent and serves SOM to agents, HTML to browsers. This is a few lines in any web framework:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Express/Next.js middleware&lt;/span&gt;
&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;use&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;next&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;accepts&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;application/som+json&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nf"&gt;isAgentUA&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sendFile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;somPathFor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;path&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="nf"&gt;next&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt; &lt;span class="c1"&gt;// serve HTML as normal&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;Level 3: Robots.txt directives&lt;/h3&gt;

&lt;p&gt;Declare your SOM endpoint in robots.txt so agents know to request it:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight conf"&gt;&lt;code&gt;&lt;span class="n"&gt;User&lt;/span&gt;-&lt;span class="n"&gt;agent&lt;/span&gt;: *
&lt;span class="n"&gt;SOM&lt;/span&gt;-&lt;span class="n"&gt;Endpoint&lt;/span&gt;: /.&lt;span class="n"&gt;well&lt;/span&gt;-&lt;span class="n"&gt;known&lt;/span&gt;/&lt;span class="n"&gt;som&lt;/span&gt;/{&lt;span class="n"&gt;path&lt;/span&gt;}.&lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="n"&gt;SOM&lt;/span&gt;-&lt;span class="n"&gt;Version&lt;/span&gt;: &lt;span class="m"&gt;1&lt;/span&gt;.&lt;span class="m"&gt;0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the cooperative content negotiation model described in our &lt;a href="https://dbhurley.com/papers" rel="noopener noreferrer"&gt;robots.txt extension proposal&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;The licensing question&lt;/h2&gt;

&lt;p&gt;I would be dishonest if I did not address the elephant in the room. Some publishers are pursuing licensing deals with AI companies (OpenAI has signed agreements with several major publishers, and Perplexity has launched a revenue-sharing program). If you can get paid directly for your content, that changes the calculus.&lt;/p&gt;

&lt;p&gt;But licensing deals are available to only a handful of the largest publishers. The vast majority of websites (the millions of blogs, documentation sites, small news outlets, government portals, community forums, and niche publications that collectively make up the long tail of the web) will never get a licensing call from OpenAI.&lt;/p&gt;

&lt;p&gt;For those publishers, the choice is not "license or cooperate." It is "serve raw HTML to agents that extract your content without attribution or compensation" versus "serve structured content that costs you less, gives you more control, and positions you for attribution when the infrastructure matures."&lt;/p&gt;

&lt;p&gt;Cooperation is not a substitute for licensing. It is the practical path for publishers who will never get a licensing deal but still want to reduce costs, maintain control, and participate in the agent-mediated web.&lt;/p&gt;

&lt;h2&gt;The precedent&lt;/h2&gt;

&lt;p&gt;Every major transition in how the web is consumed has followed the same pattern:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A new consumer class emerges (search engines, applications, agents)&lt;/li&gt;
&lt;li&gt;Publishers initially resist or ignore the new consumer&lt;/li&gt;
&lt;li&gt;An economic incentive emerges (search visibility, API integrations, agent inclusion)&lt;/li&gt;
&lt;li&gt;Publishers adopt purpose-built infrastructure (sitemaps, APIs, ???)&lt;/li&gt;
&lt;li&gt;Early adopters gain structural advantages that late adopters struggle to overcome&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We are between steps 2 and 3 right now. Most publishers are either resisting (blocking agents) or ignoring (serving raw HTML). The economic incentive is forming: agent traffic is growing, agent-mediated discovery is replacing some search traffic, and the cost of serving raw HTML to agents is quantifiable.&lt;/p&gt;

&lt;p&gt;The infrastructure for step 4 is being built: SOM, AWP, the robots.txt extensions, the content negotiation patterns. Whether these specific technologies become the standards or something else does, the directional trend is clear: publishers will eventually serve structured content to agents, because the economics and the incentives demand it.&lt;/p&gt;

&lt;p&gt;The publishers who engage now, while the infrastructure is still forming, get to shape those terms. The publishers who wait will accept whatever standards emerge without their input.&lt;/p&gt;

&lt;p&gt;I know which position I would rather be in. And I suspect most publishers, once they see the numbers, will agree.&lt;/p&gt;




&lt;p&gt;The full cost-benefit analysis is published as a &lt;a href="https://dbhurley.com/papers" rel="noopener noreferrer"&gt;research paper&lt;/a&gt; with worked examples for three publisher tiers, sensitivity analysis across traffic growth scenarios, and case studies for news, e-commerce, and documentation sites.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;David Hurley is the founder of &lt;a href="https://plasmatelabs.com" rel="noopener noreferrer"&gt;Plasmate Labs&lt;/a&gt;. Previously, he founded Mautic, the world's first open source marketing automation platform. He writes at &lt;a href="https://dbhurley.com/blog" rel="noopener noreferrer"&gt;dbhurley.com/blog&lt;/a&gt; and publishes research at &lt;a href="https://dbhurley.com/papers" rel="noopener noreferrer"&gt;dbhurley.com/papers&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>opensource</category>
      <category>contentcreators</category>
    </item>
    <item>
      <title>When AI Reads the Web Wrong</title>
      <dc:creator>David Hurley</dc:creator>
      <pubDate>Mon, 30 Mar 2026 14:27:37 +0000</pubDate>
      <link>https://dev.to/dbhurley/when-ai-reads-the-web-wrong-23c5</link>
      <guid>https://dev.to/dbhurley/when-ai-reads-the-web-wrong-23c5</guid>
      <description>&lt;p&gt;In February 2024, the Civil Resolution Tribunal of British Columbia ordered Air Canada to pay a passenger named Jake Moffatt a partial refund. The reason: Air Canada's AI chatbot had told Moffatt he could book a full-fare flight and apply for a bereavement discount afterward. That was wrong. The airline's actual policy required the discount to be requested before booking. The chatbot read the airline's own website and got the policy backward.&lt;/p&gt;

&lt;p&gt;Air Canada tried to argue that the chatbot was "a separate legal entity that is responsible for its own actions." The tribunal was not persuaded.&lt;/p&gt;

&lt;p&gt;This story gets cited as a cautionary tale about AI hallucination. But I think it illustrates something more specific and more fixable. The chatbot did not hallucinate out of thin air. It read a real web page containing real policy information and then misinterpreted what it found. The page almost certainly had the correct policy buried somewhere in the HTML, surrounded by navigation menus, footer links, promotional banners, cookie consent dialogs, and whatever else airlines pack into their web pages these days.&lt;/p&gt;

&lt;p&gt;The chatbot's error was not a failure of intelligence. It was a failure of comprehension, caused by the format of the input.&lt;/p&gt;



&lt;h2&gt;The format problem nobody talks about&lt;/h2&gt;

&lt;p&gt;When an AI agent "reads" a web page, it does not see what you see. You see a clean layout with headings, paragraphs, buttons, and images, all arranged in a visual hierarchy that your brain processes in milliseconds. The AI agent sees something very different: tens of thousands of tokens of raw HTML markup, most of which has nothing to do with the actual content of the page.&lt;/p&gt;

&lt;p&gt;I have been measuring this for the past several months as part of building Plasmate. Across 50 real websites, the average web page contains about 33,000 tokens of HTML. Of those, roughly 25% is actual content: the text, the headings, the links, the things you would want an AI to understand. The remaining 75% is presentation markup: CSS class names, inline styles, JavaScript, tracking pixels, layout containers, ad slots, data attributes, SVG icons, and structural dividers that exist only to make the page look right in a browser.&lt;/p&gt;

&lt;p&gt;The AI agent processes all of it. It pays for all of it (literally, token by token). And it has to figure out, on its own, which parts matter and which parts are noise.&lt;/p&gt;

&lt;p&gt;Imagine handing someone a 400-page book and telling them that 300 of those pages are blank or contain random numbers, but the 100 pages with actual content are scattered throughout and not marked in any way. Then ask them to summarize the book accurately. They might get it right most of the time. But sometimes they will latch onto a random number from a blank page and present it as a fact. That is more or less what we are doing to AI agents every time they browse the web.&lt;/p&gt;



&lt;h2&gt;Four ways agents get web pages wrong&lt;/h2&gt;

&lt;p&gt;After spending months studying how AI agents interact with web content, I have started to see patterns in their errors. Not all mistakes are the same. They fall into distinct categories, and the category tells you something about what caused the mistake.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Structural errors.&lt;/strong&gt; The agent invents page elements that do not exist. "Click the Subscribe button in the sidebar" when there is no such button. This happens most often when the agent is working from a text-only version of the page (like markdown) that strips out all structural information. The agent knows buttons probably exist on most pages, so it guesses. Sometimes it guesses wrong.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Content errors.&lt;/strong&gt; The agent reports facts that are not on the page, or reports the wrong version of a fact that is. A price of $49.99 when the page shows $59.99. This is the classic hallucination, and it can be triggered by noisy input. When the agent is processing 33,000 tokens and trying to find a single number, the probability of grabbing the wrong one goes up with the amount of noise in the input.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Attribution errors.&lt;/strong&gt; The agent finds the right information but attributes it to the wrong part of the page. It reports a statistic from the sidebar as if it were from the main article. It confuses a promotional offer with the actual product price. This happens because the agent cannot distinguish between page regions. Everything arrives as one flat stream of text, and the agent has to guess where the main content ends and the sidebar begins.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Inference errors.&lt;/strong&gt; The agent draws conclusions that the page does not support. A product is described as "popular" and the agent reports it as "the best-selling item." This type of error is less about the input format and more about the model's tendency to fill in gaps. But noisier, more confusing input makes inference errors more likely, because the agent has less confidence in what it actually found and more temptation to extrapolate.&lt;/p&gt;




&lt;p&gt;The Air Canada case was probably a combination of content and attribution errors. The correct bereavement policy was on the page, but surrounded by enough other content that the chatbot either found the wrong paragraph or misattributed a condition from one section to another.&lt;/p&gt;

&lt;h2&gt;Why markdown is not the answer&lt;/h2&gt;

&lt;p&gt;The most common response to the "HTML is too noisy" problem is to convert web pages to markdown before feeding them to an AI agent. Strip out the HTML tags, keep the text. This is what most AI agent frameworks do by default: LangChain, LlamaIndex, CrewAI, and others all convert web pages to plain text or markdown before processing.&lt;/p&gt;

&lt;p&gt;Markdown does solve the noise problem. A page that was 33,000 tokens of HTML becomes about 4,500 tokens of markdown. That is a huge improvement in efficiency. But it creates a new problem: markdown throws away all the structural information along with the noise.&lt;/p&gt;

&lt;p&gt;A markdown representation of a web page cannot tell you which text is a button and which is a heading. It cannot tell you which elements are interactive and which are static. It cannot distinguish between the main content area and the sidebar. It cannot tell you what form fields are available or what options are in a dropdown menu.&lt;/p&gt;

&lt;p&gt;For simple reading tasks, this does not matter. If you just need to extract the text of an article, markdown is great. But for anything that requires understanding the structure of the page, knowing what you can click, figuring out how to fill a form, or navigating a multi-step workflow, markdown is blind.&lt;/p&gt;

&lt;p&gt;This creates an awkward situation for AI agent developers. They use one format (markdown) when they need to read pages, and a completely different format (raw HTML with DOM selectors) when they need to interact with pages. Two systems, two sets of failure modes, no unified understanding of the page.&lt;/p&gt;

&lt;h2&gt;What if the page told the agent what it needed to know?&lt;/h2&gt;

&lt;p&gt;This is the question I have been working on. Not "how do we make AI smarter at reading HTML," but "how do we give AI a format that is actually designed for it?"&lt;/p&gt;

&lt;p&gt;The idea behind the Semantic Object Model (SOM) is straightforward: take a web page and compile it into a structured representation that preserves what an agent needs (the content, the element types, the interactive affordances, the page regions) while discarding what it does not (the CSS, the scripts, the tracking, the layout containers, the visual presentation).&lt;/p&gt;

&lt;p&gt;The output is a JSON document that organizes the page into typed regions (navigation, main content, sidebar, footer) containing typed elements (headings, paragraphs, links, buttons, form fields) with explicit declarations of what actions are available (click, type, select, toggle). An agent reading a SOM document knows exactly what is on the page, where it is, what type of element it is, and what it can do with it.&lt;/p&gt;
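&lt;p&gt;A hand-written sketch of that shape, with illustrative field names rather than the published schema:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;{
  "regions": [
    {
      "type": "main",
      "elements": [
        { "id": "e_7f21", "type": "heading", "level": 2, "text": "Refund policy" },
        { "id": "e_7f22", "type": "paragraph", "text": "Requests must be submitted before booking." },
        { "id": "e_7f23", "type": "button", "text": "Start a request", "actions": ["click"] },
        { "id": "e_7f24", "type": "form_field", "label": "Booking reference", "actions": ["type"] }
      ]
    },
    {
      "type": "sidebar",
      "elements": [
        { "id": "e_9a01", "type": "paragraph", "text": "Promotional content, explicitly labeled as such." }
      ]
    }
  ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;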

&lt;p&gt;This is not a summary or an extraction. It is a compiled representation of the full page, preserving all the semantic information while eliminating the visual presentation layer. Think of it as the difference between giving someone a blueprint of a building versus giving them a photograph. The photograph is richer in visual detail, but the blueprint tells you what every room is for, where the doors are, and which ones are locked.&lt;/p&gt;

&lt;h2&gt;The numbers&lt;/h2&gt;

&lt;p&gt;We have been running benchmarks across 50 real websites with two different AI models (GPT-4o and Claude Sonnet 4). The efficiency numbers are dramatic:&lt;/p&gt;

&lt;p&gt;SOM uses about 8,300 tokens per page compared to 33,000 for raw HTML. That is a 4x reduction. For navigation-heavy pages, the ratio reaches 5.4x. For pages with heavy advertising, 6x.&lt;/p&gt;

&lt;p&gt;But efficiency is not the point of this post. The point is correctness.&lt;/p&gt;

&lt;p&gt;In our latest research, we set up 150 tasks across six categories: extracting specific facts, comparing information across page sections, identifying navigation structure, summarizing content, handling noisy pages with lots of ads, and identifying interactive elements. We tested each task with all three formats (HTML, markdown, SOM) across four different AI models.&lt;/p&gt;

&lt;p&gt;The results are still being finalized, but the patterns are clear. For tasks that require understanding page structure (navigation, interactive elements, adversarial pages with lots of noise), structured representations produce measurably better results. The agent makes fewer structural errors because it does not have to guess about page structure. It makes fewer attribution errors because regions are explicitly labeled. It makes fewer content errors on noisy pages because the noise has been filtered out at compile time rather than at inference time.&lt;/p&gt;

&lt;p&gt;Perhaps the most interesting finding is about speed. On Claude, SOM is faster than markdown despite using nearly twice as many tokens.&lt;/p&gt;


&lt;p&gt;Claude Sonnet 4 with HTML input: &lt;strong&gt;16.2 seconds&lt;/strong&gt; per task. With SOM input: &lt;strong&gt;8.5 seconds&lt;/strong&gt;. Nearly 2x faster than HTML, and faster than markdown as well, despite SOM carrying nearly twice as many tokens.&lt;/p&gt;


&lt;p&gt;Our interpretation is that structured input reduces the amount of work the model has to do to understand the page. When the structure is explicit, the model spends less time reasoning about "what is this element?" and "where does the sidebar end?" and more time reasoning about the actual question.&lt;/p&gt;

&lt;h2&gt;The part that surprised me&lt;/h2&gt;

&lt;p&gt;One thing we built into SOM that I initially considered a nice-to-have turned out to be one of the most important features: provenance tracking. Every element in a SOM document has a stable identifier. When an agent extracts a fact from a SOM page, the system can record which specific element that fact came from.&lt;/p&gt;

&lt;p&gt;This means you can programmatically verify an agent's claims. If the agent says "the product costs $49.99," you can check whether element &lt;code&gt;e_a3f2b1&lt;/code&gt; in the &lt;code&gt;main&lt;/code&gt; region actually contains that price. If it does, the claim is verified. If it does not, you know the agent made an error, and you know it before the claim reaches a user.&lt;/p&gt;
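&lt;p&gt;The verification record for a claim like that might look like this sketch (field names illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;{
  "claim": "The product costs $49.99",
  "source": { "element": "e_a3f2b1", "region": "main" },
  "element_text": "Price: $49.99",
  "verified": true
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;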

&lt;p&gt;Compare this to verifying a claim against raw HTML or markdown. The best you can do is search for the string "$49.99" somewhere in the document. If it appears in an ad or in a "customers also bought" section rather than the product listing, you cannot tell the difference. The claim looks verified even though the agent found the number in the wrong place.&lt;/p&gt;

&lt;p&gt;For high-stakes applications (medical information, financial advice, legal policy lookup), this is the difference between an AI assistant you can trust and one you cannot.&lt;/p&gt;

&lt;h2&gt;
  
  
  Back to Air Canada
&lt;/h2&gt;

&lt;p&gt;Would structured representations have prevented the Air Canada chatbot error? I think they would have helped significantly. The bereavement policy was on the page. The issue was that the chatbot could not reliably distinguish the policy text from the surrounding content, or conflated it with conditions from adjacent sections.&lt;/p&gt;

&lt;p&gt;A SOM representation of that page would have placed the bereavement policy in a clearly labeled content region, with each policy condition as a distinct paragraph element. The chatbot would have received a clean, structured view of the policy rather than a wall of HTML that happened to contain the policy somewhere within it.&lt;/p&gt;

&lt;p&gt;Would that guarantee a correct answer? No. AI agents can still make inference errors regardless of input format. But it would eliminate the structural and attribution errors that are caused by noisy, ambiguous input. And for a chatbot handling customer-facing policy questions, eliminating those error categories is worth a lot.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this means going forward
&lt;/h2&gt;

&lt;p&gt;The web was built for human eyes. Every page on the internet is designed to be rendered as pixels on a screen. That made perfect sense when humans were the only consumers of web content.&lt;/p&gt;

&lt;p&gt;They are not anymore. AI agents now account for a growing share of web traffic. Cloudflare reported that crawler traffic grew 18% in a single year, with some AI crawlers growing over 300%. These agents are browsing billions of pages, and every one of those pages is served in a format designed for someone who will never use it.&lt;/p&gt;

&lt;p&gt;The fix is not to make agents better at reading HTML. It is to give agents a format designed for them, the same way we gave search engines sitemaps and gave applications APIs. Each new class of web consumer has eventually gotten infrastructure designed for its consumption model. AI agents are the next consumer class, and they need the same.&lt;/p&gt;

&lt;p&gt;I published a &lt;a href="https://dbhurley.com/papers" rel="noopener noreferrer"&gt;detailed research paper&lt;/a&gt; on information fidelity that goes deep on the methodology, the hallucination taxonomy, and the experimental framework. If you are building agent systems or evaluating web content representations, the paper has the technical depth. This post is the accessible version: AI agents read the web wrong because we are feeding them the wrong format, and there is a better way.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;David Hurley is the founder of &lt;a href="https://plasmatelabs.com" rel="noopener noreferrer"&gt;Plasmate Labs&lt;/a&gt;. Previously, he founded Mautic, the world's first open source marketing automation platform. He writes at &lt;a href="https://dbhurley.com/blog" rel="noopener noreferrer"&gt;dbhurley.com/blog&lt;/a&gt; and publishes research at &lt;a href="https://dbhurley.com/papers" rel="noopener noreferrer"&gt;dbhurley.com/papers&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>machinelearning</category>
      <category>programming</category>
    </item>
    <item>
      <title>The Billion Dollar Tax on AI Agents</title>
      <dc:creator>David Hurley</dc:creator>
      <pubDate>Mon, 30 Mar 2026 14:21:43 +0000</pubDate>
      <link>https://dev.to/dbhurley/the-billion-dollar-tax-on-ai-agents-15pm</link>
      <guid>https://dev.to/dbhurley/the-billion-dollar-tax-on-ai-agents-15pm</guid>
      <description>&lt;p&gt;I want to walk you through a number that I keep coming back to, one that I think deserves more attention than it gets. The number is somewhere between one billion and five billion dollars per year. That is our estimate of how much the AI industry spends, collectively, processing web page markup that no agent will ever use.&lt;/p&gt;

&lt;p&gt;Not because the agents are inefficient. Not because the models are wasteful. Because the web serves content in a format designed for human eyes, and agents are paying the full cost of that visual presentation layer every time they read a page.&lt;/p&gt;

&lt;p&gt;Let me show you where this number comes from.&lt;/p&gt;



&lt;h2&gt;
  
  
  Start with a single web page
&lt;/h2&gt;

&lt;p&gt;Pick any web page. Go to your favorite news site, an e-commerce product page, a documentation site, a government portal. Right-click, view source. What you see is HTML: a mix of content (the text, the links, the headings) and presentation (CSS classes, inline styles, layout containers, tracking scripts, ad markup, SVG icons, data attributes).&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://almanac.httparchive.org/en/2025/page-weight" rel="noopener noreferrer"&gt;HTTP Archive's 2025 Web Almanac&lt;/a&gt; reports that the median home page now weighs 2.86 MB on desktop and 2.56 MB on mobile. But what matters for AI agents is not the total page weight (which includes images, fonts, and videos that agents do not load). What matters is the HTML document itself and the JavaScript that populates it.&lt;/p&gt;

&lt;p&gt;Across our 50-site benchmark, the average rendered web page contains about 33,000 tokens when fed to a language model tokenizer. That is the number the agent actually processes: 33,000 tokens of HTML markup shoved into the model's context window.&lt;/p&gt;

&lt;p&gt;How much of that is actual content? We measured this by comparing raw HTML to SOM (a structured representation that preserves semantic content while stripping presentation). The SOM version of the same pages averages about 8,300 tokens.&lt;/p&gt;
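&lt;p&gt;You can approximate this measurement yourself with the standard library alone. The paper's numbers come from a real LLM tokenizer over the 50-site benchmark; the sketch below uses a crude chars-per-four token heuristic and a tiny inline page, just to show the shape of the comparison:&lt;/p&gt;

```python
from html.parser import HTMLParser

# Rough sketch: strip a page to its character data, then compare estimated
# token counts. The chars/4 heuristic and the inline page are illustrative;
# the benchmark figures in the text came from a real tokenizer on real pages.
class TextOnly(HTMLParser):
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())

html_page = (
    '<div class="flex items-center justify-between px-4 py-2 bg-white">'
    '<span class="text-sm font-medium text-gray-900">Price: $49.99</span>'
    '<svg viewBox="0 0 24 24"><path d="M12 2L2 22h20z"/></svg></div>'
)

parser = TextOnly()
parser.feed(html_page)
content = " ".join(parser.chunks)

def est_tokens(s):
    return max(1, len(s) // 4)  # crude chars-per-token heuristic

html_tok, content_tok = est_tokens(html_page), est_tokens(content)
waste = 1 - content_tok / html_tok
print(f"html~{html_tok} tok, content~{content_tok} tok, overhead~{waste:.0%}")
```

Even on this toy snippet, the markup overhead dwarfs the content; on real pages the benchmark put the overhead around 75%.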

&lt;p&gt;That means roughly 24,900 tokens per page, about 75%, encode nothing that the agent needs. CSS class names like &lt;code&gt;flex items-center justify-between px-4 py-2 bg-white dark:bg-gray-900&lt;/code&gt;. Tracking scripts. Layout dividers. Ad containers. Cookie consent dialogs. SVG path data for icons. The visual scaffolding that makes a page look right in a browser but contributes nothing to an agent's understanding of what the page says or what you can do on it.&lt;/p&gt;

&lt;p&gt;The agent processes all of it. Every token. And somebody pays for every token.&lt;/p&gt;

&lt;h2&gt;
  
  
  Scale it up
&lt;/h2&gt;

&lt;p&gt;Now take that per-page waste and multiply it by the number of pages AI agents browse every day. This is where the math gets interesting and, I will be honest, where it requires some estimation. The exact number of daily agent page fetches is not publicly disclosed by any major AI company. But we can build a reasonable model from what is public.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://blog.cloudflare.com/radar-2025-year-in-review/" rel="noopener noreferrer"&gt;Cloudflare's 2025 Year in Review&lt;/a&gt; reports that approximately 30% of all web traffic is bot traffic. AI-specific crawlers (GPTBot, ClaudeBot, Meta-ExternalAgent, Amazonbot, and others) account for about 4.2% of all HTML request traffic, separate from Googlebot's 4.5%. And this is growing fast: from May 2024 to May 2025, &lt;a href="https://blog.cloudflare.com/from-googlebot-to-gptbot-whos-crawling-your-site-in-2025/" rel="noopener noreferrer"&gt;AI crawler traffic grew 18% overall&lt;/a&gt;, with GPTBot specifically growing 305% in that period.&lt;/p&gt;

&lt;p&gt;But raw crawler traffic (which is mostly training data collection) is different from what agents do when they browse on behalf of users. We need to separate the two.&lt;/p&gt;

&lt;p&gt;When you ask ChatGPT to look something up and it browses the web, that is a user-action page fetch. When GPTBot crawls Wikipedia at 3 AM to build a training dataset, that is training crawl traffic. The training traffic is much larger in volume, but the user-action traffic is what incurs per-request LLM inference costs, because the fetched page goes directly into a model's context window.&lt;/p&gt;

&lt;p&gt;Using public user count data (OpenAI has reported over 300 million weekly active users for ChatGPT, Perplexity has disclosed 20 million monthly users, Anthropic has not disclosed but is estimated at several million Claude users), we built a bottom-up model of daily page fetches. The estimate: roughly 400 million user-action page fetches per day, across all major AI agents combined.&lt;/p&gt;

&lt;p&gt;400 million pages. 24,900 wasted tokens each. At a weighted average API price of $0.75 per million input tokens (blending GPT-4o at $2.50, GPT-4o Mini at $0.15, Claude Sonnet at $3.00, Gemini at $1.25).&lt;/p&gt;

&lt;p&gt;The math: 400M pages/day × 24,900 waste tokens × $0.75/M tokens × 365 days = approximately $2.7 billion per year.&lt;/p&gt;

&lt;p&gt;A separate top-down model calibrated against Cloudflare's total traffic volume produces a higher estimate. Combining the two approaches, we bracket the annual industry-wide token waste at $1 billion to $5 billion per year.&lt;/p&gt;
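&lt;p&gt;The bottom-up arithmetic can be checked in a few lines. The inputs are the article's own estimates:&lt;/p&gt;

```python
# Back-of-envelope check of the bottom-up estimate. All three inputs are
# the article's stated figures: 400M user-action fetches/day, 24,900 waste
# tokens per page, and a $0.75/M blended input price.
pages_per_day = 400e6
waste_tokens_per_page = 24_900
blended_price_per_token = 0.75 / 1e6  # weighted average API input price

daily_waste_tokens = pages_per_day * waste_tokens_per_page  # ~10 trillion/day
annual_cost = daily_waste_tokens * blended_price_per_token * 365
print(f"~${annual_cost / 1e9:.1f}B per year")
```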



&lt;h2&gt;
  
  
  That is a lot of money. Is it real?
&lt;/h2&gt;

&lt;p&gt;It is worth being explicit about the uncertainties here. The exact number depends heavily on three variables: how many pages agents fetch per day (our biggest source of uncertainty), the effective LLM price per token (which is falling and varies by model), and how much preprocessing agents already do (some strip HTML to markdown, some truncate, some cache).&lt;/p&gt;

&lt;p&gt;We account for all of these. Our model assumes that 45% of agents already convert to markdown (which reduces waste by about 70%), 30% truncate HTML (reducing waste by about 30%), and 15% of fetches are cache hits. After these adjustments, the effective waste per page drops from 24,900 tokens to about 13,300 tokens. The billion-dollar figure already includes these reductions.&lt;/p&gt;
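&lt;p&gt;One plausible way to compose those adjustments is sketched below. The paper's exact weighting presumably differs in detail, since this naive composition lands near, but not exactly at, the stated ~13,300:&lt;/p&gt;

```python
# Sketch of the preprocessing adjustment. The shares and reduction factors
# are the article's; how they compose is my assumption, and this simple
# weighting lands in the same ballpark as the paper's ~13,300 figure.
raw_waste = 24_900

md_share, trunc_share = 0.45, 0.30   # agents converting to markdown / truncating
none_share = 1 - md_share - trunc_share  # agents passing raw HTML through

per_strategy = (
    md_share * raw_waste * (1 - 0.70)       # markdown cuts waste ~70%
    + trunc_share * raw_waste * (1 - 0.30)  # truncation cuts waste ~30%
    + none_share * raw_waste                # the rest pay full freight
)
effective_waste = per_strategy * (1 - 0.15)  # 15% of fetches are cache hits
print(round(effective_waste), "effective waste tokens per page")
```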

&lt;p&gt;You could argue the number is lower if agent usage grows slower than projected, or if LLM prices continue to drop. Both are plausible. But agent usage is growing much faster than prices are falling. GPTBot's traffic grew 305% in a single year; input prices have fallen nowhere near fast enough to offset that kind of volume growth. The total cost is going up, not down.&lt;/p&gt;

&lt;p&gt;You could also argue the number is higher, because our model only counts user-action fetches (where someone explicitly asks an agent to browse). It does not count autonomous agent workloads: monitoring services, price comparison engines, research pipelines, and other machine-to-machine browsing that runs continuously without human prompting. Those workloads are growing rapidly and process far more pages per instance than a human user would.&lt;/p&gt;

&lt;p&gt;The honest answer: we believe $1B to $5B is a reasonable bracket. The central estimate of $2.7B probably understates autonomous workloads and overstates the effectiveness of current preprocessing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where does the money actually go?
&lt;/h2&gt;

&lt;p&gt;This is the part that I find most frustrating. The wasted tokens do not simply vanish. They consume real resources at every stage of the inference pipeline.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GPU compute.&lt;/strong&gt; Every input token passes through the model's attention layers. The self-attention mechanism has quadratic complexity with respect to input length: doubling the input roughly quadruples the attention compute. When 75% of the input is presentation noise, the model spends the majority of its attention budget on tokens that carry no useful information. This is not just a billing abstraction. It is actual electricity consumed by actual GPUs running actual matrix multiplications on CSS class names.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Context window displacement.&lt;/strong&gt; Language models have finite context windows. A 128K-token window sounds generous until you realize that a single HTML page consumes 33K of it. An agent that needs to analyze five pages in a single pass can only fit one or two in HTML, but could fit all five in a structured format. The wasted tokens directly limit what the agent can reason about in a single inference call.&lt;/p&gt;
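&lt;p&gt;The context-budget arithmetic is easy to sketch. The 32K reserve for the prompt, conversation history, and the model's answer is my assumed figure, not one from the benchmark:&lt;/p&gt;

```python
# Context-budget arithmetic for the displacement argument above.
# Page sizes are the benchmark averages; the 32K reserve for prompt,
# history, and output is an assumed figure for illustration.
window = 128_000
html_tokens, som_tokens = 33_000, 8_300
reserve = 32_000

def pages_that_fit(per_page_tokens):
    """How many pages fit in the window after the reserve is set aside."""
    return (window - reserve) // per_page_tokens

print(pages_that_fit(html_tokens), "HTML pages vs",
      pages_that_fit(som_tokens), "SOM pages")
```

Under these assumptions the same window holds a couple of HTML pages or all five (and more) in SOM, which is the whole displacement argument in one division.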

&lt;p&gt;&lt;strong&gt;Latency.&lt;/strong&gt; More input tokens mean longer time-to-first-token. In our &lt;a href="https://dbhurley.com/papers" rel="noopener noreferrer"&gt;WebTaskBench evaluation&lt;/a&gt;, we measured the latency impact directly. On Claude Sonnet 4, the average task took 16.2 seconds with raw HTML input versus 8.5 seconds with SOM input. The agent was nearly twice as fast, simply because it had less noise to process. GPT-4o showed a similar pattern: 2.74 seconds with HTML versus 1.44 seconds with SOM.&lt;/p&gt;

&lt;p&gt;That latency difference is not just about user experience (though users do notice when their AI assistant takes 16 seconds instead of 8). It is about throughput. A serving cluster that can handle N requests per second with HTML input can handle roughly 2N requests per second with structured input. The infrastructure savings compound on top of the token cost savings.&lt;/p&gt;

&lt;h2&gt;
  
  
  What nobody talks about: the crawl-to-click gap
&lt;/h2&gt;

&lt;p&gt;There is a dimension to this problem that goes beyond token costs, and I think it is the more important one for the long-term health of the web.&lt;/p&gt;

&lt;p&gt;Cloudflare published a remarkable dataset in August 2025 titled &lt;a href="https://blog.cloudflare.com/crawlers-click-ai-bots-training/" rel="noopener noreferrer"&gt;"The crawl-to-click gap."&lt;/a&gt; The core finding: AI crawlers consume vastly more content than they send back as referral traffic. The ratios are staggering.&lt;/p&gt;

&lt;p&gt;In July 2025, Anthropic's crawlers fetched approximately 38,000 pages for every single page visit they referred back to a publisher. That is a 38,000:1 crawl-to-refer ratio. Earlier in the year it was 286,000:1. Perplexity's ratio actually got worse over 2025, with more crawling but fewer referrals, reaching 194:1 by July.&lt;/p&gt;

&lt;p&gt;Compare this to Google. For all the complaints about Google hoarding traffic (and &lt;a href="https://blog.cloudflare.com/crawlers-click-ai-bots-training/" rel="noopener noreferrer"&gt;the data does show&lt;/a&gt; Google referrals to news sites declining since February 2025, coinciding with the expansion of AI Overviews), Google's crawl-to-refer ratio is in the single digits. It crawls pages and sends users back.&lt;/p&gt;

&lt;p&gt;AI companies crawl pages and keep the value.&lt;/p&gt;

&lt;p&gt;This is the economic context that makes the token waste problem more than an efficiency issue. Publishers are paying to serve content to agents that extract the value and return nothing. The infrastructure costs of serving those requests (bandwidth, compute, CDN, origin rendering) come out of the publisher's budget. And the content served in those requests is 75% visual presentation that the agent throws away.&lt;/p&gt;

&lt;p&gt;The publisher pays to generate it. The agent pays to process it. And neither party gets value from it.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the ten biggest agent frameworks actually do
&lt;/h2&gt;

&lt;p&gt;Part of our research involved surveying the default web content handling in 10 major agent frameworks. The results were, frankly, depressing.&lt;/p&gt;

&lt;p&gt;LangChain, LlamaIndex, and CrewAI, the three most popular agent orchestration frameworks, all default to BeautifulSoup's &lt;code&gt;get_text()&lt;/code&gt; method. This is the most aggressive possible extraction: it strips every HTML tag and returns flat, unstructured text. The result is small (good for tokens) but has lost all structural information: element types, interactive affordances, page regions, everything that distinguishes a button from a heading from a link.&lt;/p&gt;
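&lt;p&gt;You can mimic what &lt;code&gt;get_text()&lt;/code&gt;-style extraction does with nothing but the standard library. The point is not the parser but what survives it:&lt;/p&gt;

```python
from html.parser import HTMLParser

# A stdlib mimic of BeautifulSoup-style get_text(): every tag is dropped
# and only character data survives. (The real frameworks call bs4; the
# point here is what the output loses, not which library produced it.)
class GetText(HTMLParser):
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        if data.strip():
            self.parts.append(data.strip())

snippet = (
    '<main><h1>Pro Plan</h1>'
    '<a href="/docs">Read the docs</a>'
    '<button type="submit">Subscribe</button></main>'
    '<aside>Sponsored: Subscribe to NewsletterCo</aside>'
)

p = GetText()
p.feed(snippet)
print(" | ".join(p.parts))
# The heading, the link, the button, and the sidebar ad all come out as
# undifferentiated text. The agent can no longer tell what is clickable,
# what is a heading, or which region any given string came from.
```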

&lt;p&gt;Dedicated scraping tools like Crawl4AI, Firecrawl, and Jina Reader use markdown extraction, which is more sophisticated. Markdown preserves headings, links, and basic formatting. But it still discards element types (a button looks the same as a link), interactive affordances (you cannot tell what you can click), and page regions (main content is indistinguishable from sidebar content).&lt;/p&gt;

&lt;p&gt;Browser Use and Stagehand use accessibility tree extraction, which is the closest to a structured representation. But accessibility trees are designed for screen readers, not AI agents. They include every element on the page (including the 200 decorative ARIA landmarks in a typical site footer) and produce output that is often as verbose as the original HTML.&lt;/p&gt;

&lt;p&gt;None of the ten frameworks we surveyed use a structured semantic representation by default. Zero out of ten.&lt;/p&gt;


&lt;p&gt;The entire ecosystem is either stripping web pages down to bare text (losing structure) or passing through raw HTML (paying for noise). There is no middle ground in production use today.&lt;/p&gt;



&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;LangChain, LlamaIndex, CrewAI:&lt;/strong&gt; BeautifulSoup &lt;code&gt;get_text()&lt;/code&gt;. Strips all HTML, returns flat text. Minimal tokens, zero structure.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Crawl4AI:&lt;/strong&gt; Custom HTML-to-Markdown. Preserves headings and links, loses element types and affordances.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Firecrawl:&lt;/strong&gt; Readability + Markdown. Good for article extraction, blind to interactive elements.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Jina Reader:&lt;/strong&gt; Custom extraction to Markdown. Similar tradeoffs to Firecrawl.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;AutoGPT:&lt;/strong&gt; Delegates to Jina/Firecrawl. Inherits their limitations.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Browser Use:&lt;/strong&gt; Accessibility tree + DOM. Closest to structured, but designed for screen readers, not agents.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Stagehand:&lt;/strong&gt; Accessibility tree. Same verbose output issue as Browser Use.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The number that keeps me up at night
&lt;/h2&gt;

&lt;p&gt;Here is the calculation that I keep returning to. Take the WebTaskBench data: SOM uses 8,301 tokens per page on average. Raw HTML uses 33,181. The difference is 24,880 tokens.&lt;/p&gt;

&lt;p&gt;Multiply by 400 million pages per day. That is 9.95 trillion wasted tokens per day. Over a year, approximately 3.6 quadrillion tokens. At $0.75 per million tokens, that is $2.7 billion.&lt;/p&gt;

&lt;p&gt;But here is what keeps me up: that 400 million daily page fetch number is from our conservative bottom-up model, which only counts explicit user-triggered browsing. The top-down model calibrated against Cloudflare's total AI bot traffic suggests the real number could be 3 to 4x higher when you include autonomous agent workloads.&lt;/p&gt;

&lt;p&gt;And agent traffic is growing at 18 to 30% year over year, while LLM prices are dropping maybe 30 to 50% per generation (roughly every 6 to 12 months). The volume growth is outpacing the price decline. The total cost curve is going up.&lt;/p&gt;

&lt;p&gt;If current trends hold, by 2027 the annual waste could exceed $10 billion. Not because anyone is being negligent, but because the fundamental mismatch between the format the web serves (visual HTML) and the format agents need (structured semantic content) will become more expensive with every page added to the web and every new agent deployed.&lt;/p&gt;

&lt;h2&gt;
  
  
  This is a solvable problem
&lt;/h2&gt;

&lt;p&gt;I want to be clear about something: this is not a doom-and-gloom piece. The waste is real, the numbers are large, but the problem is entirely solvable with existing technology.&lt;/p&gt;

&lt;p&gt;The web has solved this exact problem before, for other consumer classes. When search engines emerged as a new web consumer in the late 1990s, they struggled with HTML too. The web responded by inventing sitemaps, robots.txt, and structured data (Schema.org, JSON-LD, OpenGraph). These machine-readable layers sit alongside the human-readable HTML and provide crawlers with the structured information they need without requiring them to parse visual markup.&lt;/p&gt;

&lt;p&gt;When applications emerged as a web consumer in the mid-2000s, the web responded again: REST APIs, GraphQL, webhooks. Purpose-built interfaces for programmatic consumption.&lt;/p&gt;

&lt;p&gt;AI agents are the fourth consumer of the web. They need the same thing: a purpose-built representation designed for their consumption model. Not raw HTML (too noisy), not plain text (too lossy), but something in between that preserves what agents need and discards what they do not.&lt;/p&gt;

&lt;p&gt;That is what we are building with Plasmate and the Semantic Object Model. But honestly, the specific technology matters less than the recognition that the problem exists and is getting more expensive every day. If someone builds a better solution than SOM, great. The industry still saves billions.&lt;/p&gt;

&lt;p&gt;The full analysis, including the complete estimation methodology, sensitivity analysis across pricing scenarios, and the framework survey data, is published as a &lt;a href="https://dbhurley.com/papers" rel="noopener noreferrer"&gt;research paper&lt;/a&gt; on this site. If you work on agent infrastructure, pricing models, or web content delivery, I think you will find the data useful.&lt;/p&gt;

&lt;p&gt;What I really want this piece to leave you with is simpler than the math: the web charges every AI agent a tax, paid in tokens, for presentation it will never render. That tax adds up to billions. It does not have to.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;David Hurley is the founder of &lt;a href="https://plasmatelabs.com" rel="noopener noreferrer"&gt;Plasmate Labs&lt;/a&gt;. Previously, he founded Mautic, the world's first open source marketing automation platform. He writes at &lt;a href="https://dbhurley.com/blog" rel="noopener noreferrer"&gt;dbhurley.com/blog&lt;/a&gt; and publishes research at &lt;a href="https://dbhurley.com/papers" rel="noopener noreferrer"&gt;dbhurley.com/papers&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>opensource</category>
      <category>infrastructure</category>
    </item>
    <item>
      <title>What I Learned Building Mautic That Applies to the Agentic Web</title>
      <dc:creator>David Hurley</dc:creator>
      <pubDate>Mon, 30 Mar 2026 02:19:34 +0000</pubDate>
      <link>https://dev.to/dbhurley/what-i-learned-building-mautic-that-applies-to-the-agentic-web-2glj</link>
      <guid>https://dev.to/dbhurley/what-i-learned-building-mautic-that-applies-to-the-agentic-web-2glj</guid>
      <description>&lt;p&gt;In 2014, I started building Mautic. It became the world's first open source marketing automation platform. Acquia acquired it in 2019. Over those five years, I learned things about building infrastructure for a new class of consumer that I did not fully appreciate at the time.&lt;/p&gt;

&lt;p&gt;Now I am building Plasmate, an open source headless browser that compiles web pages into structured representations for AI agents. The domain is completely different. The technology is different. The users are different. But the structural problems, the adoption dynamics, and the strategic patterns are remarkably similar.&lt;/p&gt;

&lt;p&gt;This is what I learned building Mautic that applies directly to what I am building now.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lesson 1: Every new consumer class needs its own infrastructure
&lt;/h2&gt;

&lt;p&gt;When Mautic started, marketing automation existed. Marketo, HubSpot, Pardot, and Eloqua all served the enterprise. But they were closed, expensive, and inaccessible to the vast majority of organizations that needed marketing infrastructure.&lt;/p&gt;

&lt;p&gt;The insight was not that marketing automation was a new idea. The insight was that a large class of consumers (small and mid-market organizations, developers, agencies, nonprofits) had no infrastructure designed for their constraints. They needed marketing automation that was open, self-hostable, extensible, and free to start with.&lt;/p&gt;

&lt;p&gt;The same pattern is playing out with AI agents today. Web browsing infrastructure exists. Chrome, Playwright, Puppeteer, and Selenium all work. But they were designed for humans and human-oriented testing. AI agents are a different consumer class with different constraints: they need structured output (not pixels), token efficiency (not visual fidelity), semantic understanding (not DOM selectors), and speed at scale (not single-session debugging).&lt;/p&gt;

&lt;p&gt;The existing tools technically work for agents the same way enterprise marketing platforms technically worked for small businesses. But "technically works" and "designed for" are very different things. When I built Mautic, the opportunity was not inventing marketing automation. It was building marketing automation for the consumer class that existing tools underserved. With Plasmate, the opportunity is the same: building web browsing infrastructure for the consumer class that existing tools underserve.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lesson 2: Open source is a distribution strategy, not a business model
&lt;/h2&gt;

&lt;p&gt;One of the most important things I learned at Mautic is that open source is how you get adopted, not how you get paid. These are related but distinct.&lt;/p&gt;

&lt;p&gt;Mautic grew to millions of installations because it was free and open. Organizations could try it without a sales call, deploy it without a procurement process, and extend it without permission. That distribution would have been impossible as a closed-source product competing against well-funded incumbents.&lt;/p&gt;

&lt;p&gt;But the business model was never "sell open source software." It was: build an open core that earns trust and adoption, then offer commercial services (hosting, support, premium features, integrations) to organizations that want operational convenience on top of the open foundation.&lt;/p&gt;

&lt;p&gt;I am applying the same structure to Plasmate. The compiler is open source (Apache 2.0). The SOM specification is open. The MCP server, browser extension, LangChain integration, LlamaIndex integration, and all SDKs are open. This is the distribution layer. Every developer who installs Plasmate and every agent framework that integrates SOM expands the ecosystem without a dollar of marketing spend.&lt;/p&gt;

&lt;p&gt;The commercial layer sits on top: SOM Cache (a shared semantic CDN for agents), Fleet Orchestration (managed browser infrastructure at scale), and enterprise services. These are operational conveniences that organizations will pay for once the open source foundation has earned their trust.&lt;/p&gt;

&lt;p&gt;The lesson from Mautic is that this sequence matters. Open source first, commercial second. Trust first, revenue second. If you reverse the order, you get neither.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lesson 3: Standards and specifications matter more than implementations
&lt;/h2&gt;

&lt;p&gt;Mautic did not just build software. It contributed to the broader marketing technology ecosystem by establishing patterns that other tools adopted. The API structures, the campaign builder paradigm, the contact lifecycle model, the integration framework. These became conventions that outlasted any single implementation.&lt;/p&gt;

&lt;p&gt;I underestimated the importance of this at the time. I thought the code was the product. In retrospect, the specifications and patterns were at least as important. They shaped how an entire category of software was built, including by competitors.&lt;/p&gt;

&lt;p&gt;With Plasmate, I am investing in the specification layer from day one. The SOM Spec v1.0 is published as a formal document with a JSON Schema. The Agent Web Protocol (AWP) is a full protocol specification, not just an implementation. The robots.txt extension proposal follows RFC conventions. The &lt;code&gt;.well-known/som.json&lt;/code&gt; convention is designed as a web standard, not a Plasmate feature.&lt;/p&gt;
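&lt;p&gt;To give a feel for the discovery convention, here is an entirely hypothetical sketch of what a &lt;code&gt;.well-known/som.json&lt;/code&gt; document could look like. None of these field names come from the published spec; they are placeholders for the kind of metadata such a file would carry:&lt;/p&gt;

```python
import json

# Hypothetical sketch of a .well-known/som.json discovery document.
# Every field name here is an illustrative assumption, not the SOM spec;
# the real convention is defined in the published specification.
discovery = {
    "som_version": "1.0",
    "endpoints": {"page": "https://example.com/som/{path}"},
    "regions_exposed": ["main", "nav", "aside"],
    "contact": "mailto:webmaster@example.com",
}
print(json.dumps(discovery, indent=2))
```

The design intent described above is the part that matters: a site advertises, at a well-known URI, that a structured representation exists and where agents can fetch it.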

&lt;p&gt;This is deliberate. If SOM succeeds only as a Plasmate output format, it has failed. It needs to become a convention that any tool can produce and any agent can consume. That means the specification must be clear, open, and implementation-independent.&lt;/p&gt;

&lt;p&gt;I am also participating in the W3C Web Content for Browser and AI Community Group because I learned from Mautic that standards bodies matter even when they move slowly. Getting a seat at the table early means you can shape the conversation rather than react to it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lesson 4: The community decides whether you succeed
&lt;/h2&gt;

&lt;p&gt;Mautic's community was the product's greatest asset and its most demanding stakeholder. Contributors built integrations we never planned. Users found use cases we never imagined. And the community's expectations for openness, transparency, and quality pushed us to be better than we would have been alone.&lt;/p&gt;

&lt;p&gt;The same dynamics apply to Plasmate, but with an important difference. Mautic's community was primarily marketers and developers. Plasmate's community will be primarily AI agent developers and, eventually, web publishers. These are different audiences with different expectations.&lt;/p&gt;

&lt;p&gt;Agent developers care about reliability, speed, and token efficiency. They will adopt SOM if it measurably outperforms the alternatives on the metrics they track (cost per page, latency, task accuracy). The community will form around benchmark results and practical performance, not ideology.&lt;/p&gt;

&lt;p&gt;Publishers care about control, cost reduction, and future-proofing. They will adopt SOM-first serving if it reduces their infrastructure load from agent traffic and gives them control over how agents interpret their content. The community on this side will form around economic incentives.&lt;/p&gt;

&lt;p&gt;Building for both audiences simultaneously is harder than building for one. But the two-sided nature of the agentic web (agents consume, publishers serve) means that adoption on either side reinforces the other.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lesson 5: Timing is the variable you control least and matters most
&lt;/h2&gt;

&lt;p&gt;Mautic launched in 2014. Marketing automation had existed since the mid-2000s. We were not early to the category. We were early to the open source version of the category. That timing turned out to be exactly right: the market was mature enough that organizations understood the value of marketing automation, but the open source alternative did not yet exist.&lt;/p&gt;

&lt;p&gt;If we had launched in 2008, the market would not have been ready. If we had launched in 2018, someone else would have done it first.&lt;/p&gt;

&lt;p&gt;I think about timing constantly with Plasmate. AI agents are browsing the web right now. Cloudflare reports that AI user-action crawling increased by over 15x in 2025 alone. The problem (agents consuming raw HTML) is real and growing. But the solution (structured representations as a web standard) requires adoption by both agent frameworks and web publishers. That adoption takes time.&lt;/p&gt;

&lt;p&gt;The question is whether we are building too early (before the ecosystem is ready to adopt) or at the right moment (when the pain is acute enough to drive change). Based on the trajectory of agent traffic and the increasing frustration of both agent developers (token costs) and publishers (crawl load without referral traffic), I believe the timing is right.&lt;/p&gt;

&lt;p&gt;But I learned from Mautic that you cannot force timing. You can only be ready when the moment arrives. That means having the specification published, the tools working, the integrations built, and the community seeded before the inflection point. When a major agent framework decides to adopt structured web representations as a default, we need to be the obvious choice.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lesson 6: Acquisition is not the goal, but you should build as if it might happen
&lt;/h2&gt;

&lt;p&gt;Acquia acquired Mautic in 2019. That outcome was not the goal when we started. But the way we built Mautic (clean architecture, documented APIs, extensible framework, active community, clear licensing) made the acquisition possible and relatively smooth.&lt;/p&gt;

&lt;p&gt;I am building Plasmate with the same discipline. Clean separation between the open source compiler and the commercial services. Well-documented specifications. Clear licensing (Apache 2.0 for everything open). A codebase that another organization could adopt, extend, or integrate without depending on us.&lt;/p&gt;

&lt;p&gt;This is not because I expect or want an acquisition. It is because building with that level of rigor produces better software. If the code is clean enough for a stranger to understand, it is clean enough for your own team to maintain. If the specifications are clear enough for a competitor to implement, they are clear enough for your community to adopt.&lt;/p&gt;

&lt;h2&gt;
  
  
  The pattern
&lt;/h2&gt;

&lt;p&gt;Looking back at a decade of building, I see a pattern that repeats:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A new consumer class emerges that existing infrastructure underserves.&lt;/li&gt;
&lt;li&gt;Someone builds purpose-built infrastructure for that consumer class.&lt;/li&gt;
&lt;li&gt;Open source distribution earns trust and adoption faster than closed alternatives.&lt;/li&gt;
&lt;li&gt;Specifications and standards outlast implementations.&lt;/li&gt;
&lt;li&gt;Communities form around measurable performance, not ideology.&lt;/li&gt;
&lt;li&gt;Timing determines whether you are a pioneer (too early), a leader (right time), or a follower (too late).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;With Mautic, the new consumer was the small and mid-market organization that needed marketing automation. With Plasmate, the new consumer is the AI agent that needs structured web content.&lt;/p&gt;

&lt;p&gt;The domain is different. The pattern is identical. And the pattern works.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;David Hurley is the founder of &lt;a href="https://plasmatelabs.com" rel="noopener noreferrer"&gt;Plasmate Labs&lt;/a&gt; and the creator of Mautic, the world's first open source marketing automation platform. He writes about infrastructure, open source, and the agentic web at &lt;a href="https://dbhurley.com" rel="noopener noreferrer"&gt;dbhurley.com&lt;/a&gt;. Research papers are available at &lt;a href="https://dbhurley.com/papers" rel="noopener noreferrer"&gt;dbhurley.com/papers&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>startup</category>
      <category>webdev</category>
    </item>
    <item>
      <title>Why I'm Building a Browser No Human Will Ever Use</title>
      <dc:creator>David Hurley</dc:creator>
      <pubDate>Mon, 30 Mar 2026 02:14:05 +0000</pubDate>
      <link>https://dev.to/dbhurley/why-im-building-a-browser-no-human-will-ever-use-5add</link>
      <guid>https://dev.to/dbhurley/why-im-building-a-browser-no-human-will-ever-use-5add</guid>
      <description>&lt;p&gt;I am building a browser that will never render a single pixel.&lt;/p&gt;

&lt;p&gt;No address bar. No tabs. No bookmarks. No window at all. Nobody will ever "open" Plasmate the way they open Chrome or Firefox or Safari. It has no visual interface because its consumer has no eyes.&lt;/p&gt;

&lt;p&gt;This sounds absurd until you think about who is actually browsing the web in 2026.&lt;/p&gt;

&lt;h2&gt;
  
  
  The absurd premise that turned out to be obvious
&lt;/h2&gt;

&lt;p&gt;When I first described Plasmate to other developers, the reaction was usually some version of: "Why not just use Playwright? Or Puppeteer? Or headless Chrome?" These are reasonable questions. Those tools have headless modes. They can fetch pages and return HTML. The tools exist.&lt;/p&gt;

&lt;p&gt;But consider what headless Chrome actually does when you ask it to "browse" a page. It launches a full rendering engine. It constructs a layout tree. It calculates pixel positions for every element. It composites layers. It rasterizes text into bitmap glyphs. It computes box shadows, border radii, gradient interpolations, and subpixel antialiasing. Then, if you are using it for an AI agent, you throw all of that away and extract the text.&lt;/p&gt;

&lt;p&gt;This is like hiring a portrait painter to read you the newspaper. The painter's skills are extraordinary and completely irrelevant to the task.&lt;/p&gt;

&lt;p&gt;Chrome was designed to turn HTML into pixels for human eyes. Every architectural decision, every optimization, every feature in Chrome serves that purpose. When you repurpose it for AI agents, you are paying the full cost of pixel rendering for a consumer that will never see a pixel.&lt;/p&gt;

&lt;p&gt;The cost is not trivial. Chrome uses 200MB to 500MB of memory per page. It takes 1 to 3 seconds to render a complex page. At scale (an agent system monitoring hundreds of pages), this translates to gigabytes of RAM and minutes of compute spent on rendering that serves no purpose.&lt;/p&gt;

&lt;p&gt;The question is not "can Chrome work for agents?" The answer is obviously yes. The question is "should agents pay the cost of pixel rendering when they need text comprehension?" The answer is obviously no.&lt;/p&gt;

&lt;p&gt;That is why I am building a browser designed for a consumer that has no eyes.&lt;/p&gt;

&lt;h2&gt;
  
  
  What agents actually need from a browser
&lt;/h2&gt;

&lt;p&gt;When I sat down to design Plasmate, I started by listing what an AI agent actually needs from a web browsing tool. The list was surprisingly short:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fetch the page.&lt;/strong&gt; Make an HTTP request, follow redirects, handle TLS, manage cookies. This is table stakes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Execute JavaScript.&lt;/strong&gt; Modern web pages are applications. The HTML document that arrives over the network is often a shell that loads and executes JavaScript to produce the actual content. An agent browser must execute this JavaScript to see what a human would see.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Understand the structure.&lt;/strong&gt; This is where every existing tool falls short. The agent does not need a pixel grid. It needs to know: what regions exist on this page (navigation, main content, sidebar, footer)? What elements are in each region? What type is each element (heading, paragraph, link, button, form field)? What can the agent do with each element (click, type, select, toggle)?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Produce structured output.&lt;/strong&gt; The agent's downstream consumer is a language model. The output must be structured, typed, and token-efficient. Raw HTML fails all three criteria.&lt;/p&gt;

&lt;p&gt;Notice what is not on this list: rendering pixels, computing layout, displaying fonts, animating transitions, playing audio or video, painting gradients, or any of the hundreds of other things a visual browser does. These capabilities represent the majority of Chrome's complexity and the majority of its resource consumption.&lt;/p&gt;

&lt;h2&gt;
  
  
  The architecture: what Plasmate actually does
&lt;/h2&gt;

&lt;p&gt;Plasmate is written in Rust. This was a deliberate choice for performance and memory safety, but also because Rust's ecosystem has excellent HTML parsing (html5ever, the same parser Firefox uses) and a mature V8 binding for JavaScript execution.&lt;/p&gt;

&lt;p&gt;The pipeline has five stages:&lt;/p&gt;

&lt;h3&gt;
  
  
  Stage 1: Network fetch
&lt;/h3&gt;

&lt;p&gt;An HTTP client fetches the page with full TLS, redirect, and cookie support. This is straightforward and shared with every other browser. The difference is that Plasmate's HTTP client does not load CSS files, font files, or image files. It loads the HTML document and JavaScript files only. Everything visual is irrelevant.&lt;/p&gt;
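
&lt;p&gt;The filtering decision is simple enough to sketch. A minimal illustration (the function and resource-type names are mine, not Plasmate's internals):&lt;/p&gt;

```javascript
// Illustrative sketch of Stage 1's resource filtering, assuming a
// per-resource type tag. Only resources that shape the DOM are fetched;
// everything that exists purely to produce pixels is skipped.
function shouldFetch(resourceType) {
  // HTML documents and JavaScript change what the agent will read.
  const needed = new Set(['document', 'script']);
  // 'stylesheet', 'font', 'image', 'media' only affect rendering.
  return needed.has(resourceType);
}

console.log(shouldFetch('script'));     // true
console.log(shouldFetch('stylesheet')); // false
```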

&lt;h3&gt;
  
  
  Stage 2: JavaScript execution
&lt;/h3&gt;

&lt;p&gt;The HTML document is parsed with html5ever, and JavaScript is executed via V8. This is necessary because many pages generate their content dynamically. React, Vue, Angular, and Next.js applications produce an empty &lt;code&gt;&amp;lt;div id="root"&amp;gt;&amp;lt;/div&amp;gt;&lt;/code&gt; in the initial HTML and populate it entirely through JavaScript.&lt;/p&gt;

&lt;p&gt;JavaScript execution is where most of the complexity lives. V8 is a large, sophisticated engine. But even V8's execution is faster and lighter than Chrome's full pipeline because we skip the rendering, layout, and painting phases that normally follow DOM construction.&lt;/p&gt;

&lt;p&gt;In v0.5.0, we added ICU data loading for Intl API support and raised script fetch limits to handle large SPA bundles (up to 3MB per script, 10MB total). We also added graceful degradation: when JavaScript execution fails, Plasmate compiles the pre-JavaScript HTML and returns a partial SOM. Partial structured output is better than no output.&lt;/p&gt;
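
&lt;p&gt;The degradation logic amounts to a try/fall-back around script execution. A sketch with stand-in functions (&lt;code&gt;runScripts&lt;/code&gt; and &lt;code&gt;compileSom&lt;/code&gt; are placeholders, not Plasmate's real API):&lt;/p&gt;

```javascript
// Sketch of graceful degradation: if script execution fails, compile
// the pre-JavaScript HTML and flag the result as partial.
function compilePage(html, runScripts, compileSom) {
  try {
    // Happy path: hydrate the DOM via scripts, then compile.
    return { som: compileSom(runScripts(html)), partial: false };
  } catch (err) {
    // Script failure: partial structured output beats no output.
    return { som: compileSom(html), partial: true };
  }
}

const failingRunner = () => { throw new Error('script error'); };
const result = compilePage('raw page html', failingRunner, (h) => ({ source: h }));
console.log(result.partial); // true
```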

&lt;h3&gt;
  
  
  Stage 3: Region detection
&lt;/h3&gt;

&lt;p&gt;Once the DOM is constructed (with JavaScript applied), Plasmate identifies semantic regions on the page. The detection uses a precedence chain:&lt;/p&gt;

&lt;p&gt;First, ARIA roles. If an element has &lt;code&gt;role="navigation"&lt;/code&gt; or &lt;code&gt;role="main"&lt;/code&gt;, that is definitive.&lt;/p&gt;

&lt;p&gt;Second, HTML5 landmark elements. &lt;code&gt;&amp;lt;nav&amp;gt;&lt;/code&gt;, &lt;code&gt;&amp;lt;main&amp;gt;&lt;/code&gt;, &lt;code&gt;&amp;lt;aside&amp;gt;&lt;/code&gt;, &lt;code&gt;&amp;lt;header&amp;gt;&lt;/code&gt;, &lt;code&gt;&amp;lt;footer&amp;gt;&lt;/code&gt;, &lt;code&gt;&amp;lt;dialog&amp;gt;&lt;/code&gt;, and &lt;code&gt;&amp;lt;form&amp;gt;&lt;/code&gt; map directly to region roles.&lt;/p&gt;

&lt;p&gt;Third, class and ID heuristics. A &lt;code&gt;&amp;lt;div class="main-content"&amp;gt;&lt;/code&gt; is likely the main region. A &lt;code&gt;&amp;lt;div id="sidebar"&amp;gt;&lt;/code&gt; is likely an aside.&lt;/p&gt;

&lt;p&gt;Fourth, link density analysis. A container with many links and few other elements is likely navigation, even without explicit markup.&lt;/p&gt;

&lt;p&gt;Fifth, content heuristics. A container with copyright notices and privacy links is likely a footer.&lt;/p&gt;

&lt;p&gt;Sixth, fallback. Anything not assigned to a specific region goes into a generic "content" region.&lt;/p&gt;

&lt;p&gt;This detection produces a structured map of the page that no flat text extraction can replicate. An agent reading Plasmate output can go directly to the &lt;code&gt;main&lt;/code&gt; region without scanning the entire page.&lt;/p&gt;
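
&lt;p&gt;The precedence chain can be sketched as a single function. This is a toy illustration operating on plain objects rather than a real DOM, and the field names are assumptions:&lt;/p&gt;

```javascript
// Toy sketch of the six-step precedence chain described above.
// Each "node" is a plain object carrying the signals the chain inspects.
function detectRegion(node) {
  const landmarks = { nav: 'navigation', main: 'main', aside: 'aside',
                      header: 'header', footer: 'footer',
                      dialog: 'dialog', form: 'form' };
  // 1. ARIA roles are definitive.
  if (node.ariaRole) return node.ariaRole;
  // 2. HTML5 landmark elements map directly to region roles.
  if (landmarks[node.tag]) return landmarks[node.tag];
  // 3. Class and ID heuristics.
  if (/main|content/.test(node.className || '')) return 'main';
  if (/sidebar/.test(node.className || '')) return 'aside';
  // 4. Link density: a container that is mostly links is navigation.
  if ((node.linkDensity || 0) > 0.8) return 'navigation';
  // 5. Content heuristics: copyright and privacy text suggest a footer.
  if (/copyright|privacy/i.test(node.text || '')) return 'footer';
  // 6. Fallback: generic content region.
  return 'content';
}

console.log(detectRegion({ tag: 'div', className: 'main-content' })); // "main"
```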

&lt;h3&gt;
  
  
  Stage 4: Element classification
&lt;/h3&gt;

&lt;p&gt;Within each region, elements are classified by semantic role. Plasmate recognizes 15 element types: link, button, text_input, textarea, select, checkbox, radio, heading, image, list, table, paragraph, section, separator, and details (disclosure widgets).&lt;/p&gt;

&lt;p&gt;Each element receives:&lt;/p&gt;

&lt;p&gt;A stable identifier derived from SHA-256 hashing of the element's origin, role, accessible name, and DOM path. The same element on the same page always produces the same ID.&lt;/p&gt;

&lt;p&gt;An &lt;code&gt;html_id&lt;/code&gt; field preserving the original HTML &lt;code&gt;id&lt;/code&gt; attribute (when present), enabling agents to resolve back to the DOM for interaction.&lt;/p&gt;

&lt;p&gt;An &lt;code&gt;actions&lt;/code&gt; array declaring what the agent can do: click, type, clear, select, or toggle.&lt;/p&gt;

&lt;p&gt;An &lt;code&gt;attrs&lt;/code&gt; object with role-specific data: href for links, level for headings, options for selects, headers and rows for tables, open state and summary text for details widgets.&lt;/p&gt;

&lt;p&gt;An &lt;code&gt;aria&lt;/code&gt; sub-object capturing dynamic widget state: expanded, selected, checked, disabled, current, pressed, hidden.&lt;/p&gt;

&lt;p&gt;Semantic hints inferred from CSS class names: "primary," "danger," "disabled," "active." These are not visual styles but semantic signals that agents can use to understand element importance.&lt;/p&gt;

&lt;h3&gt;
  
  
  Stage 5: Serialization
&lt;/h3&gt;

&lt;p&gt;The classified regions and elements are serialized as JSON conforming to the SOM Spec v1.0. The output is deterministic: the same page always produces the same JSON (modulo dynamic content changes).&lt;/p&gt;

&lt;p&gt;The entire pipeline, from HTML string to SOM JSON, takes microseconds for the compilation step. The bottleneck is network fetch and JavaScript execution, not SOM compilation.&lt;/p&gt;
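
&lt;p&gt;To make the output concrete, here is a hand-written sketch of a serialized SOM document. The field layout follows the descriptions above but is illustrative, not a verbatim SOM Spec v1.0 document:&lt;/p&gt;

```javascript
// Hand-written SOM sketch for a page with a nav link and a heading.
// Field names follow the element description above; layout is illustrative.
const som = JSON.parse(`{
  "title": "Hello",
  "regions": [
    { "role": "navigation",
      "elements": [
        { "id": "a1b2c3", "role": "link", "name": "About",
          "actions": ["click"], "attrs": { "href": "/about" } }
      ] },
    { "role": "main",
      "elements": [
        { "id": "d4e5f6", "role": "heading", "name": "Hello",
          "actions": [], "attrs": { "level": 1 } }
      ] }
  ],
  "meta": { "element_count": 2 }
}`);

console.log(som.regions[0].elements[0].actions[0]); // "click"
```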

&lt;h2&gt;
  
  
  What this means in practice
&lt;/h2&gt;

&lt;p&gt;The practical impact of this architecture is measurable. Across 50 real websites in our WebTaskBench evaluation:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Token reduction.&lt;/strong&gt; SOM uses 8,301 tokens per page on average versus 33,181 for raw HTML. That is a 4x reduction. For navigation-heavy pages, the ratio reaches 5.4x. For adversarial pages (heavy ads, cookie banners, JavaScript noise), it reaches 6.0x.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Latency improvement.&lt;/strong&gt; On Claude Sonnet 4, SOM is the fastest representation at 8.5 seconds average, compared to 16.2 seconds for HTML and 25.2 seconds for Markdown. Structured input reduces model reasoning time even compared to smaller unstructured input.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Memory efficiency.&lt;/strong&gt; Plasmate uses approximately 30MB for 100 pages. Headless Chrome uses approximately 20GB for the same workload. The difference is the rendering pipeline that Plasmate skips entirely.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Speed.&lt;/strong&gt; With daemon mode (a persistent process that keeps the browser warm), subsequent fetches complete in 200 to 400 milliseconds. Cold start is 2 to 3 seconds. This is competitive with simple HTTP-fetch-plus-readability tools while providing dramatically richer output.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why not just use Markdown?
&lt;/h2&gt;

&lt;p&gt;I get this question frequently, and it deserves a thorough answer.&lt;/p&gt;

&lt;p&gt;Markdown extraction (via tools like Jina Reader, Firecrawl, or basic readability libraries) is the most common alternative to raw HTML for agent consumption. It works well for text extraction tasks. In our benchmark, Markdown uses 4,542 tokens per page, which is smaller than SOM's 8,301.&lt;/p&gt;

&lt;p&gt;But Markdown has a fundamental limitation: it cannot represent interactivity. A Markdown document cannot tell an agent which text is a button, which is a link, which is a form field, or what actions are available. For an agent that needs to read an article and summarize it, this does not matter. For an agent that needs to fill a form, navigate a multi-step workflow, or click through search results, Markdown is blind.&lt;/p&gt;
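
&lt;p&gt;A hand-written example makes the gap concrete. The same submit button in both representations (illustrative, not tool output):&lt;/p&gt;

```javascript
// Markdown keeps only the label; the fact that this is a clickable
// button with a primary styling hint is gone.
const markdown = 'Submit';

// SOM declares the interactivity explicitly (field names follow the
// element description in the architecture section; illustrative).
const somElement = {
  role: 'button',
  name: 'Submit',
  actions: ['click'],
  hints: ['primary']
};

console.log(somElement.actions.includes('click')); // true
```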

&lt;p&gt;The latency data reinforces this. On Claude, Markdown is slower than SOM despite being smaller. Our interpretation is that Claude spends additional reasoning time trying to reconstruct page structure from ambiguous text. When the task requires understanding what is interactive and what is not, the model has to guess from context rather than reading explicit declarations.&lt;/p&gt;

&lt;p&gt;SOM occupies the middle ground: smaller than HTML, structured unlike Markdown, and fast for models to process because the semantic work is done at compile time rather than inference time.&lt;/p&gt;

&lt;h2&gt;
  
  
  The decisions that shaped the architecture
&lt;/h2&gt;

&lt;p&gt;Several architectural decisions in Plasmate deserve explanation because they diverge from what most people would expect from a browser project.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Rust?
&lt;/h3&gt;

&lt;p&gt;The obvious choice for a browser project is C++ (what Chrome and Firefox use) or JavaScript/TypeScript (what most developer tools use). Rust is unusual.&lt;/p&gt;

&lt;p&gt;I chose Rust for three reasons. First, memory safety without garbage collection. A browser engine processes untrusted input (HTML, JavaScript, CSS) from arbitrary websites. Memory safety bugs in browser engines are the largest category of security vulnerabilities in Chrome and Firefox. Rust eliminates entire classes of these bugs at compile time.&lt;/p&gt;

&lt;p&gt;Second, performance. The SOM compilation pipeline processes every element in the DOM tree, computes SHA-256 hashes for stable IDs, runs heuristic analysis on class names and content patterns, and serializes the result as JSON. In Rust, this entire pipeline runs in microseconds per page. In a garbage-collected language, the memory allocation patterns would introduce pauses.&lt;/p&gt;

&lt;p&gt;Third, Rust's ecosystem has exactly the libraries needed. html5ever (the HTML parser from Mozilla's Servo project) and the V8 crate (Rust bindings for Google's JavaScript engine) provide production-quality foundations. The serde library provides zero-cost JSON serialization. These are not wrappers or bindings with impedance mismatches. They are native Rust libraries designed for high-performance text processing.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why not use an existing rendering engine?
&lt;/h3&gt;

&lt;p&gt;Blink (Chrome's rendering engine) and Gecko (Firefox's) are extraordinary pieces of engineering. They handle every edge case in CSS layout, every quirk in the HTML specification, and every performance optimization needed for smooth visual rendering.&lt;/p&gt;

&lt;p&gt;But they are designed around a fundamental assumption: the output is a pixel grid on a screen. Every data structure, every caching strategy, every parallelization decision in these engines optimizes for that output target. Repurposing them for structured text output means carrying all of that complexity while using none of it.&lt;/p&gt;

&lt;p&gt;Plasmate uses html5ever for DOM construction and V8 for JavaScript execution, but it builds its own pipeline for everything after that. Region detection, element classification, stable ID generation, ARIA state capture, and SOM serialization are all custom. This is not because existing engines cannot do these things. It is because doing them well requires different architectural assumptions than visual rendering demands.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why is the output JSON and not something else?
&lt;/h3&gt;

&lt;p&gt;SOM is serialized as JSON because JSON is the lingua franca of agent frameworks. Every programming language can parse JSON. Every LLM API accepts text that includes JSON. Every agent framework stores and transmits structured data as JSON.&lt;/p&gt;

&lt;p&gt;We considered alternatives. Protocol Buffers would be smaller on the wire but harder for agents to read inline. XML would be semantically richer but token-heavy. YAML would be human-readable but ambiguous. MessagePack would be compact but binary.&lt;/p&gt;

&lt;p&gt;JSON won because the primary consumer is a language model reading text. The JSON representation of a SOM document is directly readable by the model as context. No deserialization step is needed. The model sees the structure, the types, and the values in the same stream.&lt;/p&gt;

&lt;h2&gt;
  
  
  The contrarian bet
&lt;/h2&gt;

&lt;p&gt;Building Plasmate required a contrarian belief: that the web needs a new browser, not a new wrapper around Chrome.&lt;/p&gt;

&lt;p&gt;The wrapper approach is popular. Playwright, Puppeteer, Browserbase, Steel, and many others wrap Chrome and add convenience APIs on top. This works, and it is a reasonable strategy for tools that need pixel-perfect browser fidelity.&lt;/p&gt;

&lt;p&gt;But wrapping Chrome means accepting Chrome's architectural assumptions. You pay for pixel rendering even when you do not need pixels. You accept Chrome's memory model, Chrome's process architecture, Chrome's security sandbox, and Chrome's update cycle. These are excellent design decisions for a visual browser. They are unnecessary constraints for an agent browser.&lt;/p&gt;

&lt;p&gt;Plasmate does not wrap Chrome. It uses the same HTML parser (html5ever, from Mozilla) and the same JavaScript engine (V8, from Google), but it constructs its own pipeline around them. The pipeline is designed for a specific consumer (AI agents) with specific needs (structured output, token efficiency, semantic understanding) that Chrome's pipeline was never intended to serve.&lt;/p&gt;

&lt;p&gt;This is the same bet I made with Mautic. The marketing automation tools existed (Marketo, HubSpot, Pardot). But they were designed for a different consumer with different constraints. Building for the underserved consumer, rather than wrapping the existing tools, produced a fundamentally better product for that audience.&lt;/p&gt;

&lt;h2&gt;
  
  
  What comes next
&lt;/h2&gt;

&lt;p&gt;Plasmate today handles the compilation side: HTML in, SOM out. The next challenges are:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;JavaScript coverage.&lt;/strong&gt; Some heavily dynamic sites still fail during JavaScript execution: Khan Academy, certain React applications with complex hydration, and sites with aggressive anti-bot measures. Each failure is a reason for an agent to fall back to a simpler tool. Closing these gaps is engineering work, not architectural work, and it is ongoing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;WASM distribution.&lt;/strong&gt; We recently published the SOM compiler as a WebAssembly module (&lt;code&gt;npm install plasmate-wasm&lt;/code&gt;). This allows SOM compilation in any JavaScript runtime without a native binary. It is the first step toward making Plasmate's compilation available everywhere JavaScript runs: serverless functions, edge workers, browsers, CI pipelines.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Publisher adoption.&lt;/strong&gt; The &lt;code&gt;plasmate compile&lt;/code&gt; command accepts HTML from files or stdin without any network requests. Publishers can integrate SOM generation into their build pipelines and serve structured representations alongside HTML. Six properties already do this. Growing that number is as important as improving the compiler.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Standards.&lt;/strong&gt; The SOM Spec, the Agent Web Protocol, and the robots.txt extension proposal are all published openly. We are participating in the W3C Community Group for Web Content and Browser AI. If SOM becomes a web standard rather than a Plasmate feature, the entire ecosystem benefits.&lt;/p&gt;

&lt;h2&gt;
  
  
  The point
&lt;/h2&gt;

&lt;p&gt;I am building a browser no human will ever use because humans are no longer the only consumers of the web. AI agents browse billions of pages per day, and every one of those pages is served in a format designed for human eyes. The waste is staggering: we estimated in a recent paper that HTML presentation noise costs the agent ecosystem $1 billion to $5 billion per year in unnecessary token consumption.&lt;/p&gt;

&lt;p&gt;The solution is not to make agents better at reading HTML. The solution is to give agents a format designed for them, the same way the web gave search engines sitemaps and applications gave consumers APIs.&lt;/p&gt;

&lt;p&gt;Plasmate is that format's compiler. A browser built for a consumer that will never see a pixel, because it does not need to.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;David Hurley is the founder of &lt;a href="https://plasmatelabs.com" rel="noopener noreferrer"&gt;Plasmate Labs&lt;/a&gt;. Previously, he founded Mautic, the world's first open source marketing automation platform. He writes at &lt;a href="https://dbhurley.com/blog" rel="noopener noreferrer"&gt;dbhurley.com/blog&lt;/a&gt; and publishes research at &lt;a href="https://dbhurley.com/papers" rel="noopener noreferrer"&gt;dbhurley.com/papers&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>rust</category>
      <category>webdev</category>
      <category>opensource</category>
    </item>
    <item>
      <title>The Web's Fourth State</title>
      <dc:creator>David Hurley</dc:creator>
      <pubDate>Mon, 30 Mar 2026 02:13:56 +0000</pubDate>
      <link>https://dev.to/dbhurley/the-webs-fourth-state-38ok</link>
      <guid>https://dev.to/dbhurley/the-webs-fourth-state-38ok</guid>
      <description></description>
      <category>ai</category>
      <category>webdev</category>
      <category>opensource</category>
      <category>architecture</category>
    </item>
    <item>
      <title>Plasmate SOM Compiler Now Available as WebAssembly</title>
      <dc:creator>David Hurley</dc:creator>
      <pubDate>Sun, 29 Mar 2026 17:07:59 +0000</pubDate>
      <link>https://dev.to/dbhurley/plasmate-som-compiler-now-available-as-webassembly-4al7</link>
      <guid>https://dev.to/dbhurley/plasmate-som-compiler-now-available-as-webassembly-4al7</guid>
      <description>&lt;p&gt;One of the most common objections to adopting Plasmate has been binary distribution. The full Plasmate CLI is a compiled Rust binary that includes a browser engine for JavaScript execution and page rendering. It works well, but it requires platform-specific binaries and cannot run in environments that restrict native code execution (serverless functions, edge workers, browser contexts).&lt;/p&gt;

&lt;p&gt;The SOM compiler itself has no such limitation. It is pure computation: parse HTML with html5ever, walk the DOM tree, identify semantic regions, classify elements, generate stable IDs, serialize JSON. No system calls, no network, no file system. Just a function that takes a string and returns a string.&lt;/p&gt;

&lt;p&gt;Today we are publishing that function as WebAssembly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Install and use
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install &lt;/span&gt;plasmate-wasm
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;compile&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;plasmate-wasm&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;html&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;&amp;lt;html&amp;gt;&amp;lt;body&amp;gt;&amp;lt;nav&amp;gt;&amp;lt;a href="/about"&amp;gt;About&amp;lt;/a&amp;gt;&amp;lt;/nav&amp;gt;&amp;lt;main&amp;gt;&amp;lt;h1&amp;gt;Hello&amp;lt;/h1&amp;gt;&amp;lt;p&amp;gt;World&amp;lt;/p&amp;gt;&amp;lt;/main&amp;gt;&amp;lt;/body&amp;gt;&amp;lt;/html&amp;gt;&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;som&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;html&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;https://example.com&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;

&lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;som&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;title&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;              &lt;span class="c1"&gt;// "Hello"&lt;/span&gt;
&lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;som&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;regions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;      &lt;span class="c1"&gt;// 2 (navigation + main)&lt;/span&gt;
&lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;som&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;meta&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;element_count&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  &lt;span class="c1"&gt;// 5&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;compile&lt;/code&gt; function takes two arguments: an HTML string and a URL for stable ID generation. No network request is made. It returns a SOM JSON string.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where this runs
&lt;/h2&gt;

&lt;p&gt;The WASM module works in any JavaScript runtime that supports WebAssembly:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Node.js, Deno, Bun:&lt;/strong&gt; Import as a regular npm package. The WASM binary is loaded automatically.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Browsers:&lt;/strong&gt; Use the ESM build (available in the pkg-web directory). Useful for client-side SOM generation in developer tools or browser extensions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Serverless (AWS Lambda, Vercel Functions, Cloudflare Workers):&lt;/strong&gt; The 380KB gzipped package size fits within typical size limits. No native binary installation needed during deployment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Edge (Cloudflare Workers, Deno Deploy, Vercel Edge Runtime):&lt;/strong&gt; WASM is a first-class citizen in edge runtimes. The compiler initializes in milliseconds.&lt;/p&gt;

&lt;h2&gt;
  
  
  Size and performance
&lt;/h2&gt;

&lt;p&gt;The WASM binary is 864KB uncompressed, 380KB after gzip. For comparison, the full Plasmate binary is approximately 30MB.&lt;/p&gt;

&lt;p&gt;Compilation speed is comparable to the native binary for the compile step itself. The native binary is faster overall because it avoids WASM runtime overhead, but the difference is small (microseconds per page for the compile step). The bottleneck in the full pipeline has always been fetching and JavaScript execution, not SOM compilation.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to use WASM vs the full CLI
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;Use&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Fetching live pages with JS execution&lt;/td&gt;
&lt;td&gt;Full CLI (&lt;code&gt;plasmate fetch&lt;/code&gt;) or daemon mode&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Compiling HTML you already have&lt;/td&gt;
&lt;td&gt;WASM (&lt;code&gt;plasmate-wasm&lt;/code&gt;) or CLI (&lt;code&gt;plasmate compile&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Serverless or edge deployment&lt;/td&gt;
&lt;td&gt;WASM&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Browser extension or developer tool&lt;/td&gt;
&lt;td&gt;WASM&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CI/CD pipeline (HTML already rendered)&lt;/td&gt;
&lt;td&gt;Either (WASM avoids binary installation)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Publisher build pipeline&lt;/td&gt;
&lt;td&gt;Either (WASM is simpler to integrate)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The key distinction: if you need to fetch a live page and execute its JavaScript, you need the full CLI. If you already have the HTML (from your CMS, build pipeline, or another HTTP client), the WASM compiler does everything you need with zero native dependencies.&lt;/p&gt;

&lt;h2&gt;
  
  
  Publisher integration example
&lt;/h2&gt;

&lt;p&gt;A build step for a static site generator that produces SOM files alongside the rendered HTML:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;compile&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;plasmate-wasm&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;fs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;fs&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;path&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// After Hugo/Astro/Next.js has rendered HTML to disk&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;htmlDir&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;./public&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;somDir&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;./public/.well-known/som&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="nx"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mkdirSync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;somDir&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;recursive&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;file&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;readdirSync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;htmlDir&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;recursive&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt; &lt;span class="p"&gt;}))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;endsWith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;.html&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;continue&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;html&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;readFileSync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;htmlDir&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;file&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;utf-8&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;slug&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;/&lt;/span&gt;&lt;span class="se"&gt;\/&lt;/span&gt;&lt;span class="sr"&gt;index&lt;/span&gt;&lt;span class="se"&gt;\.&lt;/span&gt;&lt;span class="sr"&gt;html$/&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;''&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;/&lt;/span&gt;&lt;span class="se"&gt;\.&lt;/span&gt;&lt;span class="sr"&gt;html$/&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;''&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;`https://mysite.com/&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;slug&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;somJson&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;html&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;outPath&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;somDir&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;slug&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;index&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;.json`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="nx"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mkdirSync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dirname&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;outPath&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;recursive&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="nx"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;writeFileSync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;outPath&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;somJson&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This runs entirely at build time with no network requests and no native binary. The SOM files are deployed alongside the HTML as static assets.&lt;/p&gt;
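&lt;p&gt;On the consuming side, an agent that knows this layout can map any page URL straight to its SOM asset. A minimal sketch, assuming the &lt;code&gt;/.well-known/som&lt;/code&gt; layout from the build script above (that path is a convention of this example, not something mandated by the SOM spec):&lt;/p&gt;

```javascript
// Map a page URL to the static SOM asset produced by the build step above.
// The /.well-known/som/ layout mirrors the example build script; treat it
// as an assumed convention, not part of the SOM specification.
function somUrlFor(pageUrl) {
  const u = new URL(pageUrl);
  // Strip leading and trailing slashes to recover the slug
  const slug = u.pathname.replace(/^\/+|\/+$/g, '');
  return `${u.origin}/.well-known/som/${slug || 'index'}.json`;
}

console.log(somUrlFor('https://mysite.com/posts/hello/'));
// https://mysite.com/.well-known/som/posts/hello.json
console.log(somUrlFor('https://mysite.com/'));
// https://mysite.com/.well-known/som/index.json
```

&lt;p&gt;One plain GET against that URL replaces a full fetch-and-compile round trip for any page the build has already covered.&lt;/p&gt;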

&lt;h2&gt;
  
  
  Cloudflare Worker example
&lt;/h2&gt;

&lt;p&gt;An edge function that compiles HTML to SOM on the fly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;compile&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;plasmate-wasm&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="k"&gt;default&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;URL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;targetUrl&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;searchParams&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;url&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;targetUrl&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Missing url parameter&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;400&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;// Fetch the page&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;resp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;targetUrl&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;html&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;text&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

    &lt;span class="c1"&gt;// Compile to SOM&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;somJson&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;html&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;targetUrl&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;somJson&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Content-Type&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;application/json&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is a lightweight SOM proxy running at the edge. No Plasmate binary deployment needed. The WASM module handles everything.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is next
&lt;/h2&gt;

&lt;p&gt;We plan to publish the WASM compiler to additional package registries (Deno modules, JSR) and create framework-specific wrappers for popular static site generators. The long-term goal is to make SOM compilation available everywhere JavaScript runs.&lt;/p&gt;

&lt;p&gt;For sites that require JavaScript execution to render content (SPAs, dynamically loaded data), the full CLI and daemon mode remain necessary. We are actively improving JS coverage for those cases.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/plasmate-labs/plasmate-wasm" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; | &lt;a href="https://www.npmjs.com/package/plasmate-wasm" rel="noopener noreferrer"&gt;npm&lt;/a&gt; | &lt;a href="https://github.com/plasmate-labs/plasmate" rel="noopener noreferrer"&gt;Full CLI&lt;/a&gt; | &lt;a href="https://docs.plasmate.app" rel="noopener noreferrer"&gt;Documentation&lt;/a&gt;&lt;/p&gt;

</description>
      <category>webassembly</category>
      <category>javascript</category>
      <category>ai</category>
      <category>webdev</category>
    </item>
    <item>
      <title>HTML vs Markdown vs SOM: Which Format Should Your AI Agent Use?</title>
      <dc:creator>David Hurley</dc:creator>
      <pubDate>Sat, 28 Mar 2026 13:39:34 +0000</pubDate>
      <link>https://dev.to/dbhurley/html-vs-markdown-vs-som-which-format-should-your-ai-agent-use-2aad</link>
      <guid>https://dev.to/dbhurley/html-vs-markdown-vs-som-which-format-should-your-ai-agent-use-2aad</guid>
      <description>&lt;p&gt;Every AI agent that browses the web faces the same question: how do you represent a web page to a language model?&lt;/p&gt;

&lt;p&gt;The default answer, raw HTML, is expensive and slow. A typical page dumps 30,000+ tokens into your context window, most of it CSS classes and layout divs. But what are the actual alternatives? And do they work?&lt;/p&gt;

&lt;p&gt;We ran WebTaskBench, 100 tasks across GPT-4o and Claude Sonnet 4, to find out. The results surprised us.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Three Representations
&lt;/h2&gt;

&lt;p&gt;When an agent needs to understand a web page, there are three common approaches:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Raw HTML
&lt;/h3&gt;

&lt;p&gt;The DOM as-is. Every &lt;code&gt;&amp;lt;div&amp;gt;&lt;/code&gt;, every &lt;code&gt;class="sc-1234 flex items-center gap-2"&lt;/code&gt;, every inline script. This is what most agents send today.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;div&lt;/span&gt; &lt;span class="na"&gt;class=&lt;/span&gt;&lt;span class="s"&gt;"sc-1234 flex items-center gap-2 px-4 py-2"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;a&lt;/span&gt; &lt;span class="na"&gt;href=&lt;/span&gt;&lt;span class="s"&gt;"/about"&lt;/span&gt; &lt;span class="na"&gt;class=&lt;/span&gt;&lt;span class="s"&gt;"text-blue-500 hover:underline
     font-medium tracking-tight text-sm"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;About&lt;span class="nt"&gt;&amp;lt;/a&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;span&lt;/span&gt; &lt;span class="na"&gt;class=&lt;/span&gt;&lt;span class="s"&gt;"text-gray-400"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;|&lt;span class="nt"&gt;&amp;lt;/span&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;a&lt;/span&gt; &lt;span class="na"&gt;href=&lt;/span&gt;&lt;span class="s"&gt;"/pricing"&lt;/span&gt; &lt;span class="na"&gt;class=&lt;/span&gt;&lt;span class="s"&gt;"text-blue-500 hover:underline
     font-medium tracking-tight text-sm"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;Pricing&lt;span class="nt"&gt;&amp;lt;/a&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/div&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt; Complete fidelity to the DOM. No information lost.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt; 80-95% of tokens are noise (styling, scripts, tracking). Expensive. Slow.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Markdown
&lt;/h3&gt;

&lt;p&gt;Strip the HTML to readable text, preserving structure through Markdown conventions. This is what tools like Jina Reader and many scraping libraries produce.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;About&lt;/span&gt;&lt;span class="p"&gt;](&lt;/span&gt;&lt;span class="sx"&gt;/about&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; | &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;Pricing&lt;/span&gt;&lt;span class="p"&gt;](&lt;/span&gt;&lt;span class="sx"&gt;/pricing&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt; Dramatically fewer tokens. Human-readable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt; Loses interactive elements. No way to know what's clickable. Navigation tasks become guesswork.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. SOM (Semantic Object Model)
&lt;/h3&gt;

&lt;p&gt;A structured JSON representation that preserves meaning and interactivity while stripping presentation noise. Each element includes its semantic role and available actions.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"navigation"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"elements"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"link"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"text"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"About"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"e_a1b2c3"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"attrs"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"href"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"/about"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"actions"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"click"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"link"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"text"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Pricing"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"e_d4e5f6"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"attrs"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"href"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"/pricing"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"actions"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"click"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt; Minimal tokens. Preserves interactivity. Clear semantic roles.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt; Requires a SOM-aware fetcher (like &lt;a href="https://plasmate.app" rel="noopener noreferrer"&gt;Plasmate&lt;/a&gt;).&lt;/p&gt;




&lt;h2&gt;
  
  
  Token Cost Comparison
&lt;/h2&gt;

&lt;p&gt;We measured input tokens across 50 web pages (news sites, documentation, e-commerce, government sites, social platforms). The differences are stark:&lt;/p&gt;


&lt;ul&gt;
&lt;li&gt;HTML: 33,181 average input tokens (1.0x)&lt;/li&gt;
&lt;li&gt;SOM: 8,301 average input tokens (4.0x fewer)&lt;/li&gt;
&lt;li&gt;Markdown: 4,542 average input tokens (7.3x fewer)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Markdown wins on raw token count because it strips everything. But tokens aren't the whole story.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cost Per 1,000 Pages (at $3/M input tokens)
&lt;/h3&gt;


&lt;ul&gt;
&lt;li&gt;HTML: $99.54 (baseline)&lt;/li&gt;
&lt;li&gt;SOM: $24.90 (75% savings)&lt;/li&gt;
&lt;li&gt;Markdown: $13.63 (86% savings)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you're just extracting text, Markdown is cheaper. But if your agent needs to &lt;em&gt;interact&lt;/em&gt; with pages (click buttons, fill forms, navigate), Markdown falls apart.&lt;/p&gt;
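&lt;p&gt;The ratios and per-1,000-page costs above follow directly from the measured token averages. A quick sketch that reproduces them (the averages come from the benchmark; the arithmetic is all this adds):&lt;/p&gt;

```javascript
// Reproduce the cost-per-1,000-pages figures from the average input token
// counts measured above, at $3 per million input tokens.
const PRICE_PER_M_TOKENS = 3; // dollars
const avgTokens = { html: 33181, som: 8301, markdown: 4542 };

function costPer1000Pages(tokensPerPage) {
  return (tokensPerPage * 1000 / 1_000_000) * PRICE_PER_M_TOKENS;
}

const baseline = costPer1000Pages(avgTokens.html);
for (const [format, tokens] of Object.entries(avgTokens)) {
  const cost = costPer1000Pages(tokens);
  const ratio = (avgTokens.html / tokens).toFixed(1); // token ratio vs HTML
  const savings = ((1 - cost / baseline) * 100).toFixed(0);
  console.log(`${format}: $${cost.toFixed(2)} (ratio ${ratio}x, savings ${savings}%)`);
}
// html: $99.54 (ratio 1.0x, savings 0%)
// som: $24.90 (ratio 4.0x, savings 75%)
// markdown: $13.63 (ratio 7.3x, savings 86%)
```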




&lt;h2&gt;
  
  
  The Latency Surprise
&lt;/h2&gt;

&lt;p&gt;Here's where it gets interesting. We expected Markdown to be fastest (fewest tokens = fastest inference). On GPT-4o, Markdown does beat HTML, but it isn't the fastest format:&lt;/p&gt;

&lt;h3&gt;
  
  
  GPT-4o Latency (seconds)
&lt;/h3&gt;


&lt;ul&gt;
&lt;li&gt;HTML: 2.7s&lt;/li&gt;
&lt;li&gt;Markdown: 1.9s&lt;/li&gt;
&lt;li&gt;SOM: 1.4s (fastest)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;SOM beats both. Why? Two reasons:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Structured input parses faster.&lt;/strong&gt; JSON with clear roles lets the model skip the "what is this?" step.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Less ambiguity = shorter reasoning chains.&lt;/strong&gt; When a link is explicitly marked &lt;code&gt;"role": "link", "actions": ["click"]&lt;/code&gt;, the model doesn't need to infer interactivity from context.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Claude Sonnet 4 Latency (seconds)
&lt;/h3&gt;


&lt;ul&gt;
&lt;li&gt;HTML: 16.2s&lt;/li&gt;
&lt;li&gt;Markdown: 25.2s (slowest)&lt;/li&gt;
&lt;li&gt;SOM: 8.5s (fastest)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Wait, Markdown is &lt;em&gt;slower&lt;/em&gt; than HTML on Claude? Yes. And SOM is nearly 3x faster than Markdown.&lt;/p&gt;

&lt;p&gt;Claude appears to struggle with ambiguous Markdown when the task requires understanding page structure. The model spends more time reasoning about what elements are clickable, what actions are available, and how to express those actions. With SOM, that information is explicit.&lt;/p&gt;




&lt;h2&gt;
  
  
  Category Breakdown
&lt;/h2&gt;

&lt;p&gt;Not all tasks are equal. We tested extraction, comparison, navigation, summarization, and adversarial tasks (noisy pages with heavy chrome).&lt;/p&gt;

&lt;h3&gt;
  
  
  HTML/SOM Token Ratio by Category
&lt;/h3&gt;


&lt;ul&gt;
&lt;li&gt;Extraction: 2.2x (SOM wins, but margin is smaller)&lt;/li&gt;
&lt;li&gt;Comparison: 3.9x (Multi-item pages benefit from structure)&lt;/li&gt;
&lt;li&gt;Summarization: 3.9x (Similar to comparison)&lt;/li&gt;
&lt;li&gt;Navigation: 5.4x (Interactivity data is dense in SOM)&lt;/li&gt;
&lt;li&gt;Adversarial: 6.0x (Anti-bot clutter inflates HTML massively)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For adversarial pages (cookie banners, heavy JavaScript, ad-filled layouts), HTML explodes with noise while SOM stays lean. The 6x ratio means you're paying 6x more for HTML on the hardest pages.&lt;/p&gt;

&lt;h3&gt;
  
  
  Where Markdown Fails
&lt;/h3&gt;

&lt;p&gt;Markdown works great for "read this article and summarize it." It breaks down for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Form filling&lt;/strong&gt;: Markdown can't represent input fields, dropdowns, or submit buttons&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Navigation&lt;/strong&gt;: No reliable way to know which text is a clickable link vs decorative&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stateful interactions&lt;/strong&gt;: Multi-step flows (add to cart, checkout) require element references&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dynamic content&lt;/strong&gt;: JavaScript-rendered content often doesn't survive text conversion&lt;/li&gt;
&lt;/ul&gt;
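&lt;p&gt;The form-filling gap is the starkest. Markdown flattens a login form to its label text; SOM keeps the machinery. A sketch of how such a form could look in SOM, extrapolating the role/id/actions shape from the navigation example earlier (the field names, ids, and &lt;code&gt;form&lt;/code&gt;/&lt;code&gt;textbox&lt;/code&gt; roles here are illustrative, not quoted from the spec):&lt;/p&gt;

```json
{
  "role": "form",
  "id": "e_login01",
  "elements": [
    { "role": "textbox", "text": "Email", "id": "e_aa11", "attrs": { "type": "email", "name": "email" }, "actions": ["type"] },
    { "role": "textbox", "text": "Password", "id": "e_bb22", "attrs": { "type": "password", "name": "password" }, "actions": ["type"] },
    { "role": "button", "text": "Sign in", "id": "e_cc33", "actions": ["click"] }
  ],
  "actions": ["submit"]
}
```

&lt;p&gt;The Markdown rendering of the same region is typically just the words "Email", "Password", "Sign in", with no element references for an agent to type into or click.&lt;/p&gt;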




&lt;h2&gt;
  
  
  When to Use What
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Use Markdown when:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Pure text extraction (summarize this article)&lt;/li&gt;
&lt;li&gt;No interaction needed&lt;/li&gt;
&lt;li&gt;Budget is the only constraint&lt;/li&gt;
&lt;li&gt;You control the source (your own docs, known-good pages)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Use SOM when:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Agents need to click, type, or navigate&lt;/li&gt;
&lt;li&gt;Multi-step workflows&lt;/li&gt;
&lt;li&gt;Unknown or adversarial pages&lt;/li&gt;
&lt;li&gt;Latency matters (SOM is fastest on both models)&lt;/li&gt;
&lt;li&gt;You want consistent structure across diverse sites&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Use HTML when:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;You need pixel-perfect DOM fidelity&lt;/li&gt;
&lt;li&gt;Building a browser automation tool that maps directly to CSS selectors&lt;/li&gt;
&lt;li&gt;Debugging what the page actually contains&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The honest recommendation: &lt;strong&gt;default to SOM&lt;/strong&gt; unless you have a specific reason not to. It's faster, cheaper than HTML, and handles interactive tasks that Markdown can't.&lt;/p&gt;




&lt;h2&gt;
  
  
  Getting Started with Plasmate
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/plasmate-labs/plasmate" rel="noopener noreferrer"&gt;Plasmate&lt;/a&gt; is the reference implementation of SOM. Three ways to use it:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. CLI
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; plasmate
plasmate fetch https://example.com
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. MCP Server (Claude Desktop / Cursor)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"plasmate"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"npx"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"-y"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"plasmate"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"mcp"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. SOM Cache API
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://cache.plasmate.app/v1/som&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://news.ycombinator.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Authorization&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Bearer YOUR_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;som&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For authenticated browsing (sites that require login), see the &lt;a href="https://docs.plasmate.app/guide-authenticated-browsing" rel="noopener noreferrer"&gt;Authenticated Browsing Guide&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Data
&lt;/h2&gt;

&lt;p&gt;All numbers in this post come from &lt;a href="https://github.com/plasmate-labs/plasmate/tree/master/benchmarks/webtaskbench" rel="noopener noreferrer"&gt;WebTaskBench&lt;/a&gt;, an open benchmark of 100 web tasks across 50 real-world URLs. You can run it yourself and reproduce every number.&lt;/p&gt;




&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://docs.plasmate.app/som-spec" rel="noopener noreferrer"&gt;SOM Spec v1.0&lt;/a&gt;: the complete specification&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.plasmate.app/som-first-sites" rel="noopener noreferrer"&gt;SOM-first Websites&lt;/a&gt;: how publishers can serve SOM natively&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/plasmate-labs/langchain-plasmate" rel="noopener noreferrer"&gt;LangChain integration&lt;/a&gt;: use SOM in LangChain pipelines&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/plasmate-labs/plasmate" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;: star us if this was useful&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.npmjs.com/package/plasmate" rel="noopener noreferrer"&gt;npm&lt;/a&gt;: &lt;code&gt;npm install -g plasmate&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>python</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
