Massi
How to turn any webpage into structured data for your LLM

Your LLM can reason, write code, and hold long conversations. Ask it to read a webpage and it falls apart. Either it can't access the URL at all, or you feed it raw HTML and burn 50,000 tokens on navigation bars, cookie banners, and CSS class names.

I've been building webclaw to fix this. It's a web extraction engine written in Rust that turns any URL into clean, structured content. No headless browser. No Selenium. Just HTTP with browser-grade TLS fingerprinting.

My first post covered how the TLS bypass works. This one covers what happens after you get the HTML: turning it into something an LLM can actually use.

The token waste problem

A typical webpage is 50,000 to 200,000 tokens of raw HTML. The actual content (the article text, the product info, the documentation) is usually 500 to 2,000 tokens. The rest is structure, styling, and UI elements that your LLM processes, reasons over, and bills you for.

If you're building a RAG pipeline, those noisy tokens pollute your vector space. Your embeddings model creates vectors for "Home | About | Contact | Blog" that compete with the actual content. Retrieval quality drops.

If you're running an agent that reads pages in a conversation, every wasted token eats context window. By page three, your agent is losing track of the conversation because the context is full of footer links.

webclaw runs a 9-step optimization pipeline that strips this noise:

  • Navigation, footers, cookie banners, sidebars removed
  • Decorative images collapsed (logo clusters become one line)
  • Bold/italic markers stripped (visual weight, not semantic)
  • Links deduplicated and collected at the end
  • Stat blocks merged ("100M+" and "monthly requests" become one line)
  • CSS artifacts and leaked framework code cleaned out

The result: 67% fewer tokens on average. On marketing pages with hero sections and testimonial carousels, it's 85-90%.
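To see why stripping helps, here's a toy illustration (not webclaw's actual pipeline, which does far more) that drops nav, footer, and style subtrees with Python's stdlib HTML parser and keeps the rest of the text:

```python
from html.parser import HTMLParser

# Tags whose entire subtree is UI chrome, not content.
NOISE_TAGS = {"nav", "footer", "aside", "script", "style"}

class ContentExtractor(HTMLParser):
    """Collect text that sits outside any noise subtree."""
    def __init__(self):
        super().__init__()
        self.depth = 0   # how many noise tags we are currently inside
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.parts.append(data.strip())

def strip_noise(html: str) -> str:
    parser = ContentExtractor()
    parser.feed(html)
    return "\n".join(parser.parts)

html = """
<nav>Home | About | Contact | Blog</nav>
<article><h1>Real content</h1><p>The part worth embedding.</p></article>
<footer>© 2026 Example Corp</footer>
"""
print(strip_noise(html))
```

Even this naive version cuts the sample page to two lines; the real pipeline layers the other steps (image collapsing, link deduplication, stat merging) on top of the same idea.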

# Get LLM-optimized output from any URL
curl -X POST https://api.webclaw.io/v1/scrape \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com", "format": "llm"}'

Or with the CLI:

webclaw https://example.com -f llm

Read the full breakdown: HTML to Markdown for LLMs

Structured extraction: get fields, not text

Sometimes you don't need the full content. You need three fields from a product page: a price, a name, and whether it's in stock.

The traditional approach is CSS selectors: find the element, grab the text. That works until the site redesigns and your product-price class becomes pdp-price-container. Your pipeline breaks at 3 a.m.

webclaw's /v1/extract endpoint takes a different approach. You define a JSON schema of what you want. The engine fetches the page, cleans it, and uses an LLM to extract the matching fields.

curl -X POST https://api.webclaw.io/v1/extract \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://store.example.com/product/headphones",
    "schema": {
      "type": "object",
      "properties": {
        "product_name": {"type": "string"},
        "price": {"type": "number"},
        "currency": {"type": "string"},
        "in_stock": {"type": "boolean"},
        "rating": {"type": "number"}
      }
    }
  }'

Response:

{
  "data": {
    "product_name": "Sony WH-1000XM5",
    "price": 279.99,
    "currency": "USD",
    "in_stock": true,
    "rating": 4.7
  }
}

The same schema works on any product page regardless of frontend framework. The site can redesign completely and extraction still works, because you're extracting meaning, not DOM positions.
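Once the fields come back, you may still want a cheap local sanity check before trusting them downstream. A minimal type check against the schema's properties, a sketch rather than a full JSON Schema validator:

```python
# Map JSON Schema type names to Python runtime types.
JSON_TYPES = {
    "string": str,
    "number": (int, float),
    "integer": int,
    "boolean": bool,
}

def check_types(data: dict, schema: dict) -> list[str]:
    """Return a list of fields that are missing or mistyped.

    A minimal sketch: checks top-level property types only,
    not nested objects, formats, or required/optional rules."""
    problems = []
    for field, spec in schema.get("properties", {}).items():
        if field not in data:
            problems.append(f"{field}: missing")
            continue
        value = data[field]
        # bool is a subclass of int in Python, so reject it explicitly
        # for numeric fields.
        if spec.get("type") in ("number", "integer") and isinstance(value, bool):
            problems.append(f"{field}: expected {spec['type']}")
            continue
        expected = JSON_TYPES.get(spec.get("type"))
        if expected and not isinstance(value, expected):
            problems.append(f"{field}: expected {spec['type']}")
    return problems
```

An empty list means the extraction matched the shape you asked for; anything else is a field worth re-requesting or flagging.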

If you don't want to define a schema upfront, you can use a plain English prompt instead:

{
  "url": "https://company.com/about",
  "prompt": "Find the founding year, number of employees, and what the company does"
}

Read more: Extract structured data from any webpage

Building a RAG pipeline with live web data

Most RAG tutorials show you how to upload a PDF and ask questions. That's a demo, not a product. Real applications need live data. Documentation gets updated. Pricing changes. Blog posts get published.

A RAG pipeline with web data has four steps:

1. Fetch the page. Half the web is behind Cloudflare or JavaScript rendering. webclaw handles TLS fingerprinting and JS rendering automatically.

2. Extract the content. This is where most pipelines fail. Bad extraction means noisy embeddings. Noisy embeddings mean irrelevant retrieval. webclaw's LLM format gives you clean content with zero noise.

3. Chunk and embed. Since webclaw returns markdown, you can split on headings for semantically coherent chunks instead of arbitrary character counts.

import re

def split_by_headings(markdown, max_chunk=1500):
    """Split markdown into chunks, preferring h1-h3 heading boundaries."""
    # Split before every line that starts a level 1-3 heading.
    sections = re.split(r'\n(?=#{1,3} )', markdown)
    chunks = []
    for section in sections:
        if len(section) > max_chunk:
            # Oversized section: fall back to paragraph boundaries.
            paragraphs = section.split('\n\n')
            current = ""
            for p in paragraphs:
                if len(current) + len(p) > max_chunk and current:
                    chunks.append(current.strip())
                    current = p
                else:
                    current += "\n\n" + p
            if current.strip():
                chunks.append(current.strip())
        else:
            chunks.append(section.strip())
    # Drop fragments too short to be useful retrieval units.
    return [c for c in chunks if len(c) > 50]

4. Keep it fresh. webclaw's /v1/diff endpoint tracks content changes between snapshots. Crawl your sources on a schedule, diff against the last version, only re-embed pages that actually changed. No wasted compute.

For bulk ingestion, /v1/crawl discovers all pages on a site and /v1/batch extracts them in parallel.
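Before handing crawled URLs to batch extraction, it usually pays to normalize and dedupe them, since crawls surface the same page under several spellings. A sketch, assuming (hypothetically) that /v1/crawl hands you back a flat list of URL strings:

```python
from urllib.parse import urlsplit, urlunsplit

def normalize(url: str) -> str:
    """Drop fragments and trailing slashes so near-duplicates collapse."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme, parts.netloc.lower(), path, parts.query, ""))

def dedupe(urls: list[str]) -> list[str]:
    """Return URLs in first-seen order with duplicates removed."""
    seen, out = set(), []
    for u in urls:
        n = normalize(u)
        if n not in seen:
            seen.add(n)
            out.append(n)
    return out
```

Feeding the deduped list to the batch endpoint avoids paying twice to extract /docs, /docs/, and /docs#intro.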

Read the full guide: Build a RAG pipeline with live web data

MCP: give your AI agent web access

MCP (Model Context Protocol) is an open standard that lets AI models call external tools. Think of it like USB for AI. One protocol, any tool, any model.

webclaw ships an MCP server with 8 tools:

  • scrape — read any URL, get clean content
  • crawl — follow links across a site, extract everything
  • search — web search and scrape results
  • map — discover all URLs on a site via sitemap
  • extract — structured data with a JSON schema
  • summarize — condense a page to key points
  • diff — detect content changes
  • brand — extract colors, fonts, logos

Set it up in Claude Desktop:

{
  "mcpServers": {
    "webclaw": {
      "command": "webclaw-mcp"
    }
  }
}

Or auto-configure for Claude, Cursor, Windsurf, Codex:

npx create-webclaw

Now your AI agent can read any URL during a conversation. You ask "compare the pricing of these three SaaS tools" and the agent scrapes each pricing page, extracts the data, and builds a comparison table. No custom code.
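The comparison-table step is simple once extraction has done its job. A sketch, assuming each pricing page was extracted into a dict with hypothetical name and price_per_month fields (not a real webclaw response shape):

```python
def comparison_table(products: list[dict]) -> str:
    """Render extracted pricing dicts as a markdown comparison table."""
    header = "| Tool | Price / month |"
    sep = "| --- | --- |"
    rows = [f"| {p['name']} | ${p['price_per_month']:.2f} |" for p in products]
    return "\n".join([header, sep, *rows])
```

The point is that the agent's "build a comparison" step is pure formatting; all of the hard work happened in the extract call.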

The MCP SDK crossed 97 million monthly downloads. This is not experimental anymore. Claude Desktop, Claude Code, Cursor, Windsurf, and OpenAI all support it.

Read more: MCP and Web Scraping for AI Agents

Content monitoring and change detection

If you're tracking competitors, monitoring documentation, or keeping a knowledge base fresh, you need to know when pages change.

webclaw's /v1/diff endpoint compares a page against a previous snapshot and tells you exactly what changed. Combine this with /v1/crawl on a schedule and you have a content monitoring pipeline:

  1. Crawl your sources daily
  2. Diff each page against the last snapshot
  3. Re-embed only the pages that changed
  4. Alert on significant changes

This is how you keep a RAG pipeline fresh without re-embedding everything on every cycle.
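The decision step in that loop can be kept as a pure function. Assume, hypothetically, that each /v1/diff result is a dict with changed and change_ratio fields (adjust the names to the real API response); then triaging pages into "re-embed" and "alert" buckets is just:

```python
def triage(diffs: dict[str, dict], alert_ratio: float = 0.3):
    """Split diffed pages into those to re-embed and those worth an alert.

    `changed` and `change_ratio` are assumed field names for
    illustration, not the documented response shape."""
    reembed, alerts = [], []
    for url, d in diffs.items():
        if d.get("changed"):
            reembed.append(url)
            # Large relative change: probably worth a human look.
            if d.get("change_ratio", 0) >= alert_ratio:
                alerts.append(url)
    return reembed, alerts
```

Keeping this logic separate from the HTTP calls makes the monitoring pipeline easy to test without hitting the network.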

Web search built in

Sometimes your agent doesn't have a URL. It needs to find information first.

webclaw's /v1/search endpoint queries the web and returns results with snippets. Chain it with /v1/scrape and you go from a query to structured content in two calls.

curl -X POST https://api.webclaw.io/v1/search \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"query": "best rust web frameworks 2026", "num_results": 5}'

The agent searches, picks the most relevant results, scrapes them, and synthesizes an answer. All with live data, not training data from months ago.
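The chaining step is mostly glue. Assuming /v1/search returns a results array of objects with url and snippet fields (a hypothetical shape for illustration), picking which results to scrape might look like:

```python
def pick_urls(search_response: dict, query_terms: list[str], k: int = 3) -> list[str]:
    """Rank results by how many query terms appear in the snippet,
    then take the top k URLs to feed into /v1/scrape."""
    def score(result: dict) -> int:
        snippet = result.get("snippet", "").lower()
        return sum(term.lower() in snippet for term in query_terms)

    ranked = sorted(search_response.get("results", []), key=score, reverse=True)
    return [r["url"] for r in ranked[:k]]
```

A real agent would let the LLM do the relevance judgment, but a cheap lexical filter like this cuts down how many pages you pay to scrape.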

The full stack

webclaw is a Rust workspace with six crates. The core extraction engine has zero network dependencies and is WASM-safe. The CLI, REST API server, and MCP server are separate binaries built on the same engine.

Install the CLI:

cargo install webclaw

Or pull the Docker image:

docker run --rm ghcr.io/0xmassi/webclaw https://example.com

The cloud API at webclaw.io adds JavaScript rendering, anti-bot bypass, LLM extraction, and higher concurrency. Free tier: 500 pages/month, no credit card.

SDKs for Python, TypeScript, and Go are coming soon.

What's next

I'm working on deep research (multi-step web research with LLM synthesis), webhook notifications for content changes, and expanding the MCP toolset.

If you're building LLM applications that need web data, give it a try. The repo is at github.com/0xMassi/webclaw. Star it if it saves you time, open an issue if something breaks.

webclaw.io | Docs | Discord
