Every AI agent that browses the web faces the same question: how do you represent a web page to a language model?
The default answer, raw HTML, is expensive and slow. A typical page dumps 30,000+ tokens into your context window, most of it CSS classes and layout divs. But what are the actual alternatives? And do they work?
We ran WebTaskBench, 100 tasks across GPT-4o and Claude Sonnet 4, to find out. The results surprised us.
The Three Representations
When an agent needs to understand a web page, there are three common approaches:
1. Raw HTML
The DOM as-is. Every <div>, every class="sc-1234 flex items-center gap-2", every inline script. This is what most agents send today.
<div class="sc-1234 flex items-center gap-2 px-4 py-2">
  <a href="/about" class="text-blue-500 hover:underline font-medium tracking-tight text-sm">About</a>
  <span class="text-gray-400">|</span>
  <a href="/pricing" class="text-blue-500 hover:underline font-medium tracking-tight text-sm">Pricing</a>
</div>
Pros: Complete fidelity to the DOM. No information lost.
Cons: 80-95% of tokens are noise (styling, scripts, tracking). Expensive. Slow.
2. Markdown
Strip the HTML to readable text, preserving structure through Markdown conventions. This is what tools like Jina Reader and many scraping libraries produce.
[About](/about) | [Pricing](/pricing)
Pros: Dramatically fewer tokens. Human-readable.
Cons: Loses interactive elements. No way to know what's clickable. Navigation tasks become guesswork.
3. SOM (Semantic Object Model)
A structured JSON representation that preserves meaning and interactivity while stripping presentation noise. Each element includes its semantic role and available actions.
{
  "role": "navigation",
  "elements": [
    { "role": "link", "text": "About", "id": "e_a1b2c3", "attrs": {"href": "/about"}, "actions": ["click"] },
    { "role": "link", "text": "Pricing", "id": "e_d4e5f6", "attrs": {"href": "/pricing"}, "actions": ["click"] }
  ]
}
Pros: Minimal tokens. Preserves interactivity. Clear semantic roles.
Cons: Requires a SOM-aware fetcher (like Plasmate).
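To make the "available actions" idea concrete, here is a minimal sketch that walks the navigation fragment above and collects every element advertising a given action. The helper name `find_actionable` is hypothetical, and the recursion through a nested `elements` key is an assumption based on this one example, not something taken from the SOM spec.

```python
import json

# SOM fragment copied from the navigation example above.
som_fragment = json.loads("""
{
  "role": "navigation",
  "elements": [
    {"role": "link", "text": "About", "id": "e_a1b2c3",
     "attrs": {"href": "/about"}, "actions": ["click"]},
    {"role": "link", "text": "Pricing", "id": "e_d4e5f6",
     "attrs": {"href": "/pricing"}, "actions": ["click"]}
  ]
}
""")


def find_actionable(node, action):
    """Recursively collect nodes whose "actions" list contains `action`.

    Assumes children live under an "elements" key, as in the fragment above.
    """
    found = []
    if action in node.get("actions", []):
        found.append(node)
    for child in node.get("elements", []):
        found.extend(find_actionable(child, action))
    return found


clickable = find_actionable(som_fragment, "click")
for element in clickable:
    print(element["id"], element["text"], element["attrs"].get("href"))
```

An agent can hand the model this short list of IDs and labels, then map the model's chosen ID back to a concrete click, rather than asking it to infer clickability from markup.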
Token Cost Comparison
We measured input tokens across 50 web pages (news sites, documentation, e-commerce, government sites, social platforms). The differences are stark:
- HTML: 33,181 average input tokens (1.0x)
- SOM: 8,301 average input tokens (4.0x fewer)
- Markdown: 4,542 average input tokens (7.3x fewer)
Markdown wins on raw token count because it strips everything. But tokens aren't the whole story.
Cost Per 1,000 Pages (at $3/M input tokens)
- HTML: $99.54 (baseline)
- SOM: $24.90 (75% savings)
- Markdown: $13.63 (86% savings)
If you're just extracting text, Markdown is cheaper. But if your agent needs to interact with pages (click buttons, fill forms, navigate), Markdown falls apart.
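The cost figures follow directly from the average token counts. As a rough sketch, the arithmetic looks like this; the function name `cost_per_pages` is ours, and the $3 per million input tokens rate is the one assumed in this post, so substitute your own model's pricing.

```python
# USD per million input tokens: the rate assumed in this post.
PRICE_PER_MILLION_TOKENS = 3.00

# Average input tokens per page, from the token comparison above.
avg_input_tokens = {"HTML": 33_181, "SOM": 8_301, "Markdown": 4_542}


def cost_per_pages(avg_tokens, pages=1_000, price_per_million=PRICE_PER_MILLION_TOKENS):
    """Input-token cost in USD for fetching `pages` pages of average size."""
    return avg_tokens * pages * price_per_million / 1_000_000


baseline = cost_per_pages(avg_input_tokens["HTML"])
for name, tokens in avg_input_tokens.items():
    cost = cost_per_pages(tokens)
    savings = 1 - cost / baseline  # fraction saved relative to raw HTML
    print(f"{name}: ${cost:.2f} per 1,000 pages ({savings:.0%} savings)")
```

This only covers input tokens. Output tokens and retries are not included, so treat the result as a lower bound on real spend.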
The Latency Surprise
Here's where it gets interesting. We expected Markdown to be fastest (fewest tokens = fastest inference). That's not what we measured, even on GPT-4o:
GPT-4o Latency (seconds)
- HTML: 2.7s
- Markdown: 1.9s
- SOM: 1.4s (fastest)
SOM beats both. Why? Two reasons:
- Structured input parses faster. JSON with clear roles lets the model skip the "what is this?" step.
- Less ambiguity = shorter reasoning chains. When a link is explicitly marked "role": "link", "actions": ["click"], the model doesn't need to infer interactivity from context.
Claude Sonnet 4 Latency (seconds)
- HTML: 16.2s
- Markdown: 25.2s (slowest)
- SOM: 8.5s (fastest)
Wait, Markdown is slower than HTML on Claude? Yes. And SOM is nearly 3x faster than Markdown.
Claude appears to struggle with ambiguous Markdown when the task requires understanding page structure. The model spends more time reasoning about what elements are clickable, what actions are available, and how to express those actions. With SOM, that information is explicit.
Category Breakdown
Not all tasks are equal. We tested extraction, comparison, navigation, summarization, and adversarial tasks (noisy pages with heavy chrome).
HTML/SOM Token Ratio by Category
- Extraction: 2.2x (SOM wins, but margin is smaller)
- Comparison: 3.9x (Multi-item pages benefit from structure)
- Summarization: 3.9x (Similar to comparison)
- Navigation: 5.4x (Interactivity data is dense in SOM)
- Adversarial: 6.0x (Anti-bot clutter inflates HTML massively)
For adversarial pages (cookie banners, heavy JavaScript, ad-filled layouts), HTML explodes with noise while SOM stays lean. The 6x ratio means you're paying 6x more for HTML on the hardest pages.
Where Markdown Fails
Markdown works great for "read this article and summarize it." It breaks down for:
- Form filling: Markdown can't represent input fields, dropdowns, or submit buttons
- Navigation: No reliable way to know which text is a clickable link vs decorative
- Stateful interactions: Multi-step flows (add to cart, checkout) require element references
- Dynamic content: JavaScript-rendered content often doesn't survive text conversion
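The form-filling failure is easy to see with a toy converter. The sketch below uses Python's standard `html.parser` as a naive stand-in for an HTML-to-text step; it is not how Jina Reader or any particular Markdown library works, and real converters may keep more. The sample form is made up for illustration.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Naive stand-in for a text converter: keeps text nodes, drops all tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


# Hypothetical page fragment with an input, a dropdown, and a submit button.
page = """
<form action="/subscribe">
  <label>Email</label>
  <input type="email" name="email">
  <select name="plan"><option>Free</option><option>Pro</option></select>
  <button type="submit">Subscribe</button>
</form>
"""

parser = TextOnly()
parser.feed(page)
parser.close()
text = " ".join(parser.chunks)
print(text)
```

The words survive, but nothing in the output says that "Email" labels a text field, that "Free" and "Pro" are options of one dropdown, or that "Subscribe" submits anything.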
When to Use What
Use Markdown when:
- Pure text extraction (summarize this article)
- No interaction needed
- Budget is the only constraint
- You control the source (your own docs, known-good pages)
Use SOM when:
- Agents need to click, type, or navigate
- Multi-step workflows
- Unknown or adversarial pages
- Latency matters (SOM is fastest on both models)
- You want consistent structure across diverse sites
Use HTML when:
- You need pixel-perfect DOM fidelity
- Building a browser automation tool that maps directly to CSS selectors
- Debugging what the page actually contains
The honest recommendation: default to SOM unless you have a specific reason not to. It's faster, cheaper than HTML, and handles interactive tasks that Markdown can't.
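If you want that rule of thumb in code, a minimal sketch might look like the following. The function and parameter names are hypothetical, and the logic simply encodes the lists above: HTML only when exact DOM fidelity is required, Markdown only for read-only tasks on trusted sources, SOM otherwise.

```python
def choose_representation(needs_interaction, needs_exact_dom=False, trusted_source=False):
    """Pick a page representation using the rules of thumb from this post."""
    if needs_exact_dom:
        # CSS-selector automation or debugging what the page really contains.
        return "html"
    if not needs_interaction and trusted_source:
        # Pure text extraction from pages you control: cheapest option.
        return "markdown"
    # Default: interactive tasks, multi-step flows, unknown or adversarial pages.
    return "som"


print(choose_representation(needs_interaction=True))
print(choose_representation(needs_interaction=False, trusted_source=True))
print(choose_representation(needs_interaction=False, needs_exact_dom=True))
```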
Getting Started with Plasmate
Plasmate is the reference implementation of SOM. Three ways to use it:
1. CLI
npm install -g plasmate
plasmate fetch https://example.com
2. MCP Server (Claude Desktop / Cursor)
{
  "mcpServers": {
    "plasmate": {
      "command": "npx",
      "args": ["-y", "plasmate", "mcp"]
    }
  }
}
3. SOM Cache API
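import requests

# Fetch the SOM for a URL from the cache API.
response = requests.get(
    "https://cache.plasmate.app/v1/som",
    params={"url": "https://news.ycombinator.com"},
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    timeout=30,
)
response.raise_for_status()  # fail loudly on HTTP errors instead of parsing an error body
som = response.json()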
For authenticated browsing (sites that require login), see the Authenticated Browsing Guide.
The Data
All numbers in this post come from WebTaskBench, an open benchmark of 100 web tasks across 50 real-world URLs. You can run it yourself and reproduce every number.
Further Reading
- SOM Spec v1.0: The complete specification
- SOM-first Websites: How publishers can serve SOM natively
- LangChain integration: Use SOM in LangChain pipelines
- GitHub: Star us if this was useful
- npm: npm install -g plasmate