David Hurley

Originally published at blog.plasmate.app

HTML vs Markdown vs SOM: Which Format Should Your AI Agent Use?

Every AI agent that browses the web faces the same question: how do you represent a web page to a language model?

The default answer, raw HTML, is expensive and slow. A typical page dumps 30,000+ tokens into your context window, most of it CSS classes and layout divs. But what are the actual alternatives? And do they work?

We ran WebTaskBench, 100 tasks across GPT-4o and Claude Sonnet 4, to find out. The results surprised us.


The Three Representations

When an agent needs to understand a web page, there are three common approaches:

1. Raw HTML

The DOM as-is. Every <div>, every class="sc-1234 flex items-center gap-2", every inline script. This is what most agents send today.

<div class="sc-1234 flex items-center gap-2 px-4 py-2">
  <a href="/about" class="text-blue-500 hover:underline
     font-medium tracking-tight text-sm">About</a>
  <span class="text-gray-400">|</span>
  <a href="/pricing" class="text-blue-500 hover:underline
     font-medium tracking-tight text-sm">Pricing</a>
</div>

Pros: Complete fidelity to the DOM. No information lost.

Cons: 80-95% of tokens are noise (styling, scripts, tracking). Expensive. Slow.

2. Markdown

Strip the HTML to readable text, preserving structure through Markdown conventions. This is what tools like Jina Reader and many scraping libraries produce.

[About](/about) | [Pricing](/pricing)

Pros: Dramatically fewer tokens. Human-readable.

Cons: Loses interactive elements. No way to know what's clickable. Navigation tasks become guesswork.

3. SOM (Semantic Object Model)

A structured JSON representation that preserves meaning and interactivity while stripping presentation noise. Each element includes its semantic role and available actions.

{
  "role": "navigation",
  "elements": [
    { "role": "link", "text": "About", "id": "e_a1b2c3", "attrs": {"href": "/about"}, "actions": ["click"] },
    { "role": "link", "text": "Pricing", "id": "e_d4e5f6", "attrs": {"href": "/pricing"}, "actions": ["click"] }
  ]
}

Pros: Minimal tokens. Preserves interactivity. Clear semantic roles.

Cons: Requires a SOM-aware fetcher (like Plasmate).
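
As a rough illustration of how the representations relate, the sketch below turns the raw HTML snippet from section 1 into an object shaped like the SOM example. It uses only Python's standard library and is not how Plasmate builds SOM. The `LinkCollector` class, the `to_som_like` function, and the hash-based ID scheme are all assumptions made for this example.

```python
import hashlib
import json
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects <a href> elements as SOM-like link objects (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.elements = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = "".join(self._text).strip()
            # Hypothetical ID scheme: a short hash of href and text.
            digest = hashlib.sha1(f"{self._href}|{text}".encode()).hexdigest()[:6]
            self.elements.append({
                "role": "link",
                "text": text,
                "id": f"e_{digest}",
                "attrs": {"href": self._href},
                "actions": ["click"],
            })
            self._href = None


def to_som_like(html: str) -> dict:
    """Keep links and their actions; drop classes and decorative markup."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return {"role": "navigation", "elements": parser.elements}


HTML = """
<div class="sc-1234 flex items-center gap-2 px-4 py-2">
  <a href="/about" class="text-blue-500 hover:underline
     font-medium tracking-tight text-sm">About</a>
  <span class="text-gray-400">|</span>
  <a href="/pricing" class="text-blue-500 hover:underline
     font-medium tracking-tight text-sm">Pricing</a>
</div>
"""

som = to_som_like(HTML)
print(json.dumps(som, indent=2))
```

The styling classes and the decorative `|` separator disappear, while each link keeps its target and an explicit `click` action. A real SOM fetcher has to handle far more than links (forms, buttons, rendered JavaScript), which this sketch does not attempt.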


Token Cost Comparison

We measured input tokens across 50 web pages (news sites, documentation, e-commerce, government sites, social platforms). The differences are stark:

  • HTML: 33,181 average input tokens (1.0x)
  • SOM: 8,301 average input tokens (4.0x fewer)
  • Markdown: 4,542 average input tokens (7.3x fewer)

Markdown wins on raw token count because it strips everything. But tokens aren't the whole story.

Cost Per 1,000 Pages (at $3/M input tokens)

  • HTML: $99.54 (baseline)
  • SOM: $24.90 (75% savings)
  • Markdown: $13.63 (86% savings)

If you're just extracting text, Markdown is cheaper. But if your agent needs to interact with pages (click buttons, fill forms, navigate), Markdown falls apart.
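
The cost figures follow directly from the average token counts. Here is a minimal sketch of the arithmetic, assuming the same $3 per million input tokens used above; the names `AVG_TOKENS` and `cost_usd` are made up for this example.

```python
# Average input tokens per page, taken from the figures above.
AVG_TOKENS = {"HTML": 33_181, "SOM": 8_301, "Markdown": 4_542}
PRICE_PER_MILLION_USD = 3.00  # input-token price assumed in this section
PAGES = 1_000


def cost_usd(tokens_per_page: int, pages: int = PAGES) -> float:
    """Input-token cost in USD for sending `pages` pages to the model."""
    return tokens_per_page * pages * PRICE_PER_MILLION_USD / 1_000_000


baseline = AVG_TOKENS["HTML"]
for fmt, tokens in AVG_TOKENS.items():
    reduction = baseline / tokens      # how many times fewer tokens than HTML
    savings = 1 - tokens / baseline    # fraction of the HTML cost avoided
    print(f"{fmt:<9} ${cost_usd(tokens):>6.2f}  {reduction:.1f}x  {savings:.0%} savings")
```

Swap in your own model's input price and page volume to see where the break-even sits for your workload. Output tokens are not included here.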


The Latency Surprise

Here's where it gets interesting. We expected Markdown to be fastest (fewest tokens = fastest inference). On GPT-4o, Markdown does beat HTML, but it is not the fastest:

GPT-4o Latency (seconds)

  • HTML: 2.7s
  • Markdown: 1.9s
  • SOM: 1.4s (fastest)

SOM beats both. Why? Two likely reasons:

  1. Structured input parses faster. JSON with clear roles lets the model skip the "what is this?" step.
  2. Less ambiguity = shorter reasoning chains. When a link is explicitly marked "role": "link", "actions": ["click"], the model doesn't need to infer interactivity from context.

Claude Sonnet 4 Latency (seconds)

  • HTML: 16.2s
  • Markdown: 25.2s (slowest)
  • SOM: 8.5s (fastest)

Wait, Markdown is slower than HTML on Claude? Yes. And SOM is nearly 3x faster than Markdown.

Claude appears to struggle with ambiguous Markdown when the task requires understanding page structure. The model spends more time reasoning about what elements are clickable, what actions are available, and how to express those actions. With SOM, that information is explicit.
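
To put the latency gaps side by side, this small sketch computes how many times slower each format is than SOM on each model, using only the numbers listed above. The `LATENCY_S` table and the `slowdown_vs_som` helper are names invented for the example.

```python
# Latency in seconds per format, copied from the two lists above.
LATENCY_S = {
    "GPT-4o": {"HTML": 2.7, "Markdown": 1.9, "SOM": 1.4},
    "Claude Sonnet 4": {"HTML": 16.2, "Markdown": 25.2, "SOM": 8.5},
}


def slowdown_vs_som(latencies: dict) -> dict:
    """Return latency of each non-SOM format divided by SOM latency."""
    som = latencies["SOM"]
    return {
        fmt: round(seconds / som, 2)
        for fmt, seconds in latencies.items()
        if fmt != "SOM"
    }


for model, latencies in LATENCY_S.items():
    print(model, slowdown_vs_som(latencies))
```

On these figures HTML is a little under 2x slower than SOM on both models, while Markdown ranges from about 1.4x slower on GPT-4o to just under 3x slower on Claude Sonnet 4.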


Category Breakdown

Not all tasks are equal. We tested extraction, comparison, navigation, summarization, and adversarial tasks (noisy pages with heavy chrome).

HTML/SOM Token Ratio by Category

  • Extraction: 2.2x (SOM wins, but margin is smaller)
  • Comparison: 3.9x (Multi-item pages benefit from structure)
  • Summarization: 3.9x (Similar to comparison)
  • Navigation: 5.4x (Interactivity data is dense in SOM)
  • Adversarial: 6.0x (Anti-bot clutter inflates HTML massively)

For adversarial pages (cookie banners, heavy JavaScript, ad-filled layouts), HTML explodes with noise while SOM stays lean. The 6x ratio means you're paying 6x more for HTML on the hardest pages.

Where Markdown Fails

Markdown works great for "read this article and summarize it." It breaks down for:

  • Form filling: Markdown can't represent input fields, dropdowns, or submit buttons
  • Navigation: No reliable way to know which text is a clickable link vs decorative
  • Stateful interactions: Multi-step flows (add to cart, checkout) require element references
  • Dynamic content: JavaScript-rendered content often doesn't survive text conversion
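
The element-reference point can be made concrete. With the SOM example from earlier, an agent's decision such as "click Pricing" resolves to a stable element ID, which Markdown has no equivalent for. The sketch below assumes the field names shown in that example (`elements`, `text`, `id`, `actions`); `find_target` is a hypothetical helper, not part of Plasmate.

```python
from typing import Optional

# The navigation object from the SOM example earlier in this post.
SOM = {
    "role": "navigation",
    "elements": [
        {"role": "link", "text": "About", "id": "e_a1b2c3",
         "attrs": {"href": "/about"}, "actions": ["click"]},
        {"role": "link", "text": "Pricing", "id": "e_d4e5f6",
         "attrs": {"href": "/pricing"}, "actions": ["click"]},
    ],
}


def find_target(som: dict, text: str, action: str) -> Optional[str]:
    """Return the id of the first element whose text matches and which supports `action`."""
    for element in som.get("elements", []):
        same_text = element.get("text", "").lower() == text.lower()
        if same_text and action in element.get("actions", []):
            return element["id"]
    return None


print(find_target(SOM, "pricing", "click"))  # e_d4e5f6
print(find_target(SOM, "pricing", "type"))   # None: links do not accept typing
```

With the Markdown version, `[Pricing](/pricing)`, the agent can recover the URL but has nothing to hand to a browser driver as a click target, and no way to rule out unsupported actions.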

When to Use What

Use Markdown when:

  • Pure text extraction (summarize this article)
  • No interaction needed
  • Budget is the only constraint
  • You control the source (your own docs, known-good pages)

Use SOM when:

  • Agents need to click, type, or navigate
  • Multi-step workflows
  • Unknown or adversarial pages
  • Latency matters (SOM is fastest on both models)
  • You want consistent structure across diverse sites

Use HTML when:

  • You need pixel-perfect DOM fidelity
  • Building a browser automation tool that maps directly to CSS selectors
  • Debugging what the page actually contains

The honest recommendation: default to SOM unless you have a specific reason not to. It's faster, cheaper than HTML, and handles interactive tasks that Markdown can't.
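
One way to encode this guidance in an agent is a small selection function. This is a sketch of our reading of the lists above, not a rule from the benchmark: the function name, its parameters, and the choice to require both conditions for Markdown are assumptions.

```python
def choose_format(*, needs_dom_fidelity: bool = False,
                  text_only: bool = False,
                  cost_is_priority: bool = False) -> str:
    """Pick a page representation, defaulting to SOM as recommended above."""
    if needs_dom_fidelity:
        # Debugging, or automation that maps directly to CSS selectors.
        return "html"
    if text_only and cost_is_priority:
        # Pure extraction with no interaction, where budget dominates.
        return "markdown"
    # Interaction, multi-step flows, unknown pages, or latency-sensitive work.
    return "som"


print(choose_format())                                        # som
print(choose_format(text_only=True, cost_is_priority=True))   # markdown
print(choose_format(needs_dom_fidelity=True))                 # html
```

Keeping SOM as the fall-through branch means a new task type gets the representation that supports interaction unless someone opts out explicitly.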


Getting Started with Plasmate

Plasmate is the reference implementation of SOM. Three ways to use it:

1. CLI

npm install -g plasmate
plasmate fetch https://example.com

2. MCP Server (Claude Desktop / Cursor)

{
  "mcpServers": {
    "plasmate": {
      "command": "npx",
      "args": ["-y", "plasmate", "mcp"]
    }
  }
}

3. SOM Cache API

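import requests

# Fetch the SOM representation of a page from the SOM Cache API.
response = requests.get(
    "https://cache.plasmate.app/v1/som",
    params={"url": "https://news.ycombinator.com"},
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    timeout=30,  # requests has no default timeout; fail rather than hang
)
response.raise_for_status()  # surface HTTP errors (such as a bad API key) early
som = response.json()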
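
Once you have a SOM object, the next step is handing it to the model. The sketch below shows one plausible way to do that by serializing the object compactly into a prompt. It uses a hard-coded sample shaped like the SOM example earlier in this post, since the actual response schema of the cache API may differ; `build_prompt` and the prompt wording are assumptions.

```python
import json

# Sample shaped like the SOM example in this post; a real response may differ.
sample_som = {
    "role": "navigation",
    "elements": [
        {"role": "link", "text": "About", "id": "e_a1b2c3",
         "attrs": {"href": "/about"}, "actions": ["click"]},
        {"role": "link", "text": "Pricing", "id": "e_d4e5f6",
         "attrs": {"href": "/pricing"}, "actions": ["click"]},
    ],
}


def build_prompt(task: str, som: dict) -> str:
    """Embed the SOM as compact JSON so whitespace does not cost extra tokens."""
    page = json.dumps(som, separators=(",", ":"), ensure_ascii=False)
    return (
        f"Task: {task}\n\n"
        f"Page (SOM JSON):\n{page}\n\n"
        "Reply with the id of the element to act on and the action to take."
    )


prompt = build_prompt("Open the pricing page", sample_som)
print(prompt)
```

Asking the model to answer with an element `id` and an action keeps its reply easy to validate against the `actions` list before anything is executed.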

For authenticated browsing (sites that require login), see the Authenticated Browsing Guide.


The Data

All numbers in this post come from WebTaskBench, an open benchmark of 100 web tasks across 50 real-world URLs. You can run it yourself and reproduce every number.

