The fastest way to make an AI agent look stupid is to give it too much web page.
Not too little.
Too much.
I have seen this pattern over and over while building webclaw, a web extraction API, CLI, and MCP server for agents and LLM apps:
Fetch a URL.
Send the HTML to the model.
Ask for a summary, answer, extraction, or decision.
Wonder why the output is noisy.
It feels reasonable at first.
HTML is the source, right? More source means more context. More context means better answer.
Except that is usually not what happens.
Most raw HTML is not content. It is layout, navigation, tracking, hydration payloads, cookie banners, duplicated links, CSS class soup, script tags, modals, footer links, and invisible app state.
The model does not know which parts are expensive junk and which parts are the actual page.
You paid for all of it anyway.
The bad pipeline
This is the pipeline I see a lot:
URL -> fetch -> raw HTML -> LLM
It is simple. It demos well. It works on tiny pages.
Then you point it at real sites.
Suddenly your model is reading navigation, footers, scripts, cookie banners, duplicated links, hidden mobile markup, and a tiny slice of useful content buried somewhere in the middle.
If you are building a scraper, this is annoying.
If you are building an agent, it is worse.
The agent is not just parsing text. It is using that text to decide what to do next.
Bad context becomes bad behavior.
HTML is not neutral input
Raw HTML has a few failure modes that are easy to miss.
1. Token waste
The most obvious problem is cost.
If the useful page content is 900 words and the HTML payload is 120,000 characters, you are paying to process a lot of noise.
That noise can include:
navigation
footers
CSS class names
tracking snippets
JSON state blobs
cookie banners
related posts
ads
duplicated links
accessibility labels
hidden mobile markup
Large context windows made this worse in a funny way.
When context was small, everyone had to think about what to send.
Now it is tempting to throw the whole page into the prompt and call it engineering.
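To put rough numbers on it, using the common approximation of about four characters per token (a real tokenizer will give different exact counts):

```typescript
// Back-of-envelope only: assumes ~4 characters per token and ~6 characters
// per word, which are rough averages; real tokenizers will differ.
const approxTokens = (chars: number): number => Math.ceil(chars / 4);

const rawHtmlChars = 120_000; // the full HTML payload
const contentChars = 900 * 6; // ~900 words of actual content

console.log(approxTokens(rawHtmlChars)); // ~30,000 tokens
console.log(approxTokens(contentChars)); // ~1,350 tokens
```

Same page, roughly twenty times the tokens, and the extra tokens are the least useful part of it.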
2. The model sees structure you did not mean to give it
HTML carries structure, but not always the structure you care about.
A model might see the same navigation text on every page and treat it as important. It might mix footer links into extracted results. It might preserve irrelevant menu text because it appears before the article.
This is how you get summaries that start with:
This page discusses pricing, docs, login, careers...
No, it does not.
The navigation did.
3. Boilerplate poisons retrieval
For RAG, this gets nasty.
Imagine crawling 200 documentation pages and chunking raw or poorly cleaned text.
Every chunk gets some version of:
Home
Docs
API Reference
Pricing
Contact
Sign in
Now your vector database contains hundreds of chunks with the same boilerplate.
Search quality drops because the repeated text becomes part of the retrieval surface.
The model retrieves pages because they share layout text, not because they answer the question.
This is the part that feels invisible until the system gets just big enough to be frustrating.
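One blunt mitigation is to drop lines that repeat across most of the crawled pages before chunking. A minimal sketch, where the function name and the 50 percent threshold are illustrative and assume a reasonably large crawl:

```typescript
// Drop lines that appear on most pages (nav, footer, "Sign in", etc.)
// before chunking. The threshold is illustrative and assumes many pages.
function dropRepeatedLines(pages: string[], maxShare = 0.5): string[] {
  const counts = new Map<string, number>();
  for (const page of pages) {
    // Count each distinct non-empty line once per page.
    const lines = new Set(page.split("\n").map((l) => l.trim()).filter(Boolean));
    for (const line of lines) {
      counts.set(line, (counts.get(line) ?? 0) + 1);
    }
  }
  const cutoff = pages.length * maxShare;
  return pages.map((page) =>
    page
      .split("\n")
      .filter((line) => (counts.get(line.trim()) ?? 0) < cutoff)
      .join("\n")
  );
}
```

Good extraction makes this mostly unnecessary, but it is a useful backstop when the cleaned text still carries repeated chrome.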
4. The page can be successfully fetched and still be useless
This is the one that made me care about extraction quality more than status codes.
A fetch can return 200 OK and still give you:
an empty app shell
a bot challenge
a login wall
a consent screen
a region block
a page where the useful content lives in a hydration blob
From the outside, your code worked.
From the model's point of view, the context is garbage.
This is why I do not think the right question is:
Can I fetch this page?
The better question is:
Can I return useful context from this page?
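In practice that means running a few checks on what came back, not just reading the status code. A minimal sketch; the phrases and thresholds here are illustrative and need tuning per site:

```typescript
// "Fetched fine but useless as context" heuristics.
// Phrases and thresholds are illustrative, not a fixed recipe.
function looksUseless(status: number, html: string, extractedText: string): boolean {
  if (status >= 400) return true;

  const text = extractedText.trim();

  // Empty app shell: lots of markup, almost no readable text.
  if (html.length > 5_000 && text.length < 200) return true;

  // Short pages dominated by challenge, consent, or login walls.
  const wallPhrases = [
    "verify you are human",
    "enable javascript and cookies",
    "accept all cookies",
    "sign in to continue",
    "not available in your region",
  ];
  const lower = text.toLowerCase();
  return text.length < 2_000 && wallPhrases.some((p) => lower.includes(p));
}
```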
Markdown is usually a better interface
Markdown is not magic.
But for LLMs, clean markdown is often a much better intermediate format than HTML.
Good markdown keeps the parts models care about:
headings
paragraphs
lists
tables
links
code blocks
source URL
title
metadata
And removes the parts they usually do not:
layout wrappers
nav junk
tracking scripts
style tags
repeated footer text
hidden UI
duplicated link blocks
The goal is not to make the page pretty.
The goal is to make the page usable as context.
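You do not need any particular product to try this. One common open-source combination is Mozilla's Readability for main-content extraction plus Turndown for the HTML-to-markdown step. Roughly:

```typescript
import { JSDOM } from "jsdom";
import { Readability } from "@mozilla/readability";
import TurndownService from "turndown";

// Extract the main content from an HTML string and convert it to markdown.
// One common open-source combination; extraction quality varies by site.
function htmlToMarkdown(html: string, url: string): string | null {
  const dom = new JSDOM(html, { url });
  const article = new Readability(dom.window.document).parse();
  if (!article?.content) return null; // nothing recognizable as main content

  const turndown = new TurndownService({
    headingStyle: "atx",
    codeBlockStyle: "fenced",
  });
  return `# ${article.title}\n\n${turndown.turndown(article.content)}`;
}
```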
The better pipeline
For most agent and RAG workflows, I prefer this shape:
URL -> fetch -> detect bad responses -> extract main content -> markdown or JSON -> LLM
That gives the model something closer to what a human would copy into notes before asking for help.
Not the whole browser document.
The actual thing.
For example, if I am building an agent that needs to inspect a docs page, I do not want it to reason over the entire DOM.
I want something closer to this:
```markdown
# Authentication

Use a bearer token in the Authorization header.

Authorization: Bearer <token>

## Rate limits

Free accounts...
```
Not this:
```html
<!doctype html>
<html>
  <head>
    ...
  </head>
  <body>
    <div id="__next">
      ...
    </div>
    <script src="/_next/static/chunks/..."></script>
  </body>
</html>
```
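Wired together with the illustrative helpers sketched earlier, the whole shape is only a few lines:

```typescript
// URL -> fetch -> detect bad responses -> extract main content -> markdown
// Reuses the illustrative htmlToMarkdown and looksUseless sketches from above.
async function fetchContext(url: string): Promise<string> {
  const res = await fetch(url);
  const html = await res.text();

  const markdown = htmlToMarkdown(html, url);
  if (!markdown || looksUseless(res.status, html, markdown)) {
    throw new Error(`No useful context extracted from ${url}`);
  }
  return markdown; // this, not the raw HTML, is what the model sees
}
```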
Where Webclaw fits
This is one of the reasons I am building webclaw.
The point is not just to fetch a page.
The point is to give an agent or LLM app clean web context in a useful shape:
```bash
curl -X POST https://api.webclaw.io/v1/scrape \
  -H "Authorization: Bearer $WEBCLAW_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "formats": ["markdown"],
    "only_main_content": true
  }'
```
Or from TypeScript:
```typescript
import { Webclaw } from "@webclaw/sdk";

const client = new Webclaw({
  apiKey: process.env.WEBCLAW_API_KEY!,
});

const page = await client.scrape({
  url: "https://example.com",
  formats: ["markdown"],
  only_main_content: true,
});

console.log(page.markdown);
```
That output is easier to summarize, chunk, embed, cite, diff, and pass into an agent loop.
Webclaw also has an MCP server, so tools like Claude, Cursor, and other MCP-compatible clients can ask for web context directly instead of pasting random HTML into the conversation.
The interface I want is boring:
agent asks for page
tool returns clean context
agent keeps working
Boring is good here.
When raw HTML is still useful
There are cases where raw HTML is exactly what you want.
If you are debugging extraction, writing selectors, preserving layout, auditing scripts, or reverse engineering page structure, raw HTML matters.
But that is not the same as saying raw HTML is the best input for the model.
Most of the time, the model does not need the DOM.
It needs the meaning.
The rule I use now
I stopped treating raw HTML as the default context format.
My current rule is:
Fetch broadly.
Extract aggressively.
Preserve structure.
Send the model the smallest useful version of the page.
That one change makes agents cheaper, faster, and less confused.
It also makes failures easier to see. If the extractor returns an empty page, a challenge, or obvious boilerplate, you can handle that before the model hallucinates a useful answer from junk.
The bigger shift
Web scraping used to be mostly about getting data out of websites.
For LLM apps, it is becoming context infrastructure.
That means the extraction layer has to care about things that old scraper scripts could ignore:
main content detection
markdown quality
metadata
source links
tables
code blocks
bad response detection
chunkability
agent tool interfaces
If your app is doing this:
URL -> raw HTML -> LLM
You can probably get a better result with:
URL -> clean markdown or JSON -> LLM
Raw HTML feels like the source of truth.
For agents, it is often just noise with angle brackets.
I wrote more about the extraction side here:
And the previous post in this series is here:
I stopped using headless Chrome as the default scraper
Webclaw: https://webclaw.io