Massi

Posted on Mar 24

I built a web scraper in Rust that bypasses Cloudflare without a browser

#rust #ai #opensource #webscraping

Every AI agent has the same problem. You ask it to read a webpage and it comes back with a 403, or worse, 5000 tokens of navigation bars and cookie banners.

I spent the last few months building webclaw to fix this.

The problem

Try fetching any real website with a standard HTTP client and most of them will block you. Cloudflare, Akamai, and DataDome all look at your TLS fingerprint before the request even reaches the server.

The usual fix is to spin up headless Chrome. That works, but now you need 500 MB of browser, each page takes 2-3 seconds, and you still get all the HTML noise.

What webclaw does differently

Instead of launching a browser, webclaw impersonates one at the TLS level. The TLS handshake, cipher suites, extensions, everything looks like Chrome 142. Most anti-bot systems pass the request through because the fingerprint already matches a real browser.

Then the extraction engine scores every DOM node by text density, semantic tags, and link ratio. Navigation, ads, footers, and cookie banners get stripped. What comes out is clean markdown.
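To make two of those signals concrete, here is a rough sketch of how text density and link ratio could be computed for a single node. This is not webclaw's actual code: the `NodeStats` struct, its fields, and the exact definitions of both ratios are assumptions made for illustration.

```rust
/// Hypothetical per-node statistics. Field names are illustrative only.
struct NodeStats {
    /// Visible text characters inside the node.
    text_chars: usize,
    /// Of those, the characters that sit inside <a> elements.
    link_text_chars: usize,
    /// Length of the node's serialized HTML, tags included.
    markup_chars: usize,
}

impl NodeStats {
    /// Share of the node's HTML that is visible text (0.0 to 1.0).
    fn text_density(&self) -> f64 {
        if self.markup_chars == 0 {
            0.0
        } else {
            self.text_chars as f64 / self.markup_chars as f64
        }
    }

    /// Share of the visible text that is link text (0.0 to 1.0).
    fn link_ratio(&self) -> f64 {
        if self.text_chars == 0 {
            0.0
        } else {
            self.link_text_chars as f64 / self.text_chars as f64
        }
    }
}
```

Under these definitions a navigation bar tends to show a high link ratio and a low text density, while an article body shows the opposite, which is what lets a scorer tell them apart without site-specific rules.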

A real example: a news article that is 4,820 tokens as raw HTML becomes 1,590 tokens after webclaw processes it. Same content, 67% fewer tokens.
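The 67% figure is the relative difference between the two token counts. A tiny helper makes the arithmetic explicit; the function name is illustrative and not part of webclaw.

```rust
/// Percentage by which `after` is smaller than `before`.
/// Returns 0.0 when `before` is zero to avoid dividing by zero.
fn reduction_percent(before: u32, after: u32) -> f64 {
    if before == 0 {
        return 0.0;
    }
    (before as f64 - after as f64) / before as f64 * 100.0
}
```

With the numbers above, `reduction_percent(4820, 1590)` works out to (4820 - 1590) / 4820, or roughly 67%.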

Architecture

webclaw is a Rust workspace with 6 crates:

webclaw-core    pure extraction, zero network deps, WASM-safe
webclaw-fetch   HTTP + TLS fingerprinting via primp
webclaw-llm     LLM provider chain (Ollama > OpenAI > Anthropic)
webclaw-pdf     PDF text extraction
webclaw-cli     CLI binary
webclaw-mcp     MCP server for AI agents
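The provider chain in webclaw-llm can be pictured as a fallback loop, assuming the `>` in "Ollama > OpenAI > Anthropic" means "try in this order". The sketch below is not taken from the crate: the `Provider` trait, `Chain`, and the `Stub` stand-in are all hypothetical names, and a real provider would make an HTTP call instead of returning a canned string.

```rust
/// Hypothetical provider interface. Not webclaw's real trait.
trait Provider {
    fn name(&self) -> &str;
    fn complete(&self, prompt: &str) -> Result<String, String>;
}

/// Tries each provider in order and returns the first success.
struct Chain {
    providers: Vec<Box<dyn Provider>>,
}

impl Chain {
    /// On success returns (provider name, completion).
    /// On total failure returns one error message per provider tried.
    fn complete(&self, prompt: &str) -> Result<(String, String), Vec<String>> {
        let mut errors = Vec::new();
        for provider in &self.providers {
            match provider.complete(prompt) {
                Ok(text) => return Ok((provider.name().to_string(), text)),
                Err(e) => errors.push(format!("{}: {}", provider.name(), e)),
            }
        }
        Err(errors)
    }
}

/// Stand-in provider for demonstration purposes only.
struct Stub {
    name: &'static str,
    available: bool,
}

impl Provider for Stub {
    fn name(&self) -> &str {
        self.name
    }

    fn complete(&self, prompt: &str) -> Result<String, String> {
        if self.available {
            Ok(format!("[{}] {}", self.name, prompt))
        } else {
            Err("unavailable".to_string())
        }
    }
}
```

If the first stub is marked unavailable, the chain falls through to the next one and reports which provider answered. Collecting the errors along the way keeps the failure case debuggable when every provider is down.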

The split between core and fetch was intentional. webclaw-core takes a &str of HTML and returns structured output. No I/O, no network calls, no allocator tricks. It should compile to WASM without changes.

Extraction speed on the core alone (no network):

Page size	Time
10 KB	0.8ms
100 KB	3.2ms
500 KB	12.1ms

How to use it

CLI

# basic extraction
webclaw https://example.com

# different output formats
webclaw https://example.com -f json
webclaw https://example.com -f llm

# crawl a docs site
webclaw https://docs.example.com --crawl --depth 2

# extract structured data with LLM
webclaw https://example.com --extract-prompt "get all pricing tiers"

# track page changes
webclaw https://example.com -f json > snapshot.json
webclaw https://example.com --diff-with snapshot.json

MCP server (for Claude, Cursor, Windsurf, Codex)

npx create-webclaw

One command. It detects what AI tools you have installed and writes the config for each one. After restart you get 10 tools: scrape, crawl, search, extract, summarize, brand, diff, map, batch, research.

Docker

docker run --rm ghcr.io/0xmassi/webclaw https://example.com

128 MB image. Works on any machine that can run Docker.

Benchmarks

Tested on 50 real pages across news sites, documentation, e-commerce, SPAs, and blogs.

Metric	webclaw	readability	trafilatura	newspaper3k
Extraction accuracy	95.1%	83%	80%	66%
Noise removal	96.1%	79%	73%	61%

The biggest wins are on JavaScript-heavy sites. When the visible DOM is empty because the content lives in embedded JSON (Next.js, React SSR payloads), webclaw has a data island extractor that pulls content from __NEXT_DATA__, window.__data, and similar patterns. Most other tools return nothing.
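For a sense of what a data island looks like in practice, Next.js pages embed their props in a `<script id="__NEXT_DATA__" type="application/json">` tag. The sketch below pulls the raw JSON out of that tag using only the standard library. It is a simplified illustration, not webclaw's extractor: the function name is made up, it only handles the double-quoted `id` attribute, and it leaves JSON parsing to the caller.

```rust
/// Returns the raw JSON text inside `<script id="__NEXT_DATA__" ...>`,
/// or None if the page has no such tag.
fn next_data_island(html: &str) -> Option<&str> {
    let marker = "id=\"__NEXT_DATA__\"";
    let attr_pos = html.find(marker)?;

    // The attribute must belong to a <script ...> opening tag.
    let tag_start = html[..attr_pos].rfind('<')?;
    if !html[tag_start..].starts_with("<script") {
        return None;
    }

    // The JSON body starts after the '>' that closes the opening tag
    // and runs up to the next closing script tag.
    let body_start = attr_pos + html[attr_pos..].find('>')? + 1;
    let body_len = html[body_start..].find("</script>")?;
    Some(html[body_start..body_start + body_len].trim())
}
```

A real extractor would then parse the JSON and walk it for article-like fields, which is where most of the heuristics would live.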

What I learned building this

TLS fingerprinting is fragile. Chrome updates its cipher suites every few versions and you have to keep up. I am using primp, which maintains patched forks of rustls, hyper, and h2. It works well but it is a maintenance burden. If Chrome ships a new TLS extension tomorrow, requests start getting blocked until the forks are updated.

The extraction scoring took the most iteration. Early versions were too aggressive and would strip content that looked like navigation (short paragraphs with links). The fix was a semantic bonus system: nodes inside <article> or <main> tags get a score boost, and nodes with content-related class names get another. Combined with link density penalties, it handles most layouts without site-specific rules.
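Here is a minimal sketch of how bonuses and a link density penalty can combine into one score. The weights, the threshold, the class-name hints, and the `Node` struct are all invented for illustration; webclaw's real scoring is more involved and its numbers are not published in this post.

```rust
/// Hypothetical summary of one DOM node. Not webclaw's real type.
struct Node<'a> {
    class: &'a str,
    inside_article_or_main: bool,
    text_len: usize,
    link_text_len: usize,
}

// Illustrative values only.
const CONTENT_HINTS: [&str; 4] = ["content", "article", "post", "entry"];
const KEEP_THRESHOLD: f64 = 2.0;

fn score(node: &Node) -> f64 {
    // Fraction of the text that is link text, clamped to 1.0.
    let link_density = if node.text_len == 0 {
        1.0
    } else {
        (node.link_text_len as f64 / node.text_len as f64).min(1.0)
    };

    // Base score grows with text length, capped so long nodes do not dominate.
    let mut score = (node.text_len as f64 / 50.0).min(10.0);

    // Semantic bonus: the node sits inside <article> or <main>.
    if node.inside_article_or_main {
        score += 5.0;
    }
    // Class-name bonus: the class attribute hints at main content.
    if CONTENT_HINTS.iter().any(|hint| node.class.contains(*hint)) {
        score += 3.0;
    }

    // Link density penalty: link-heavy nodes lose most of their score.
    score * (1.0 - link_density)
}

fn keep(node: &Node) -> bool {
    score(node) >= KEEP_THRESHOLD
}
```

With these made-up weights, an 80-character paragraph that is half links scores 3.3 inside an `<article>` and is kept, while the same paragraph in a nav menu scores 0.8 and is dropped. That is the failure mode the bonus system was meant to fix.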

Try it

MIT licensed, fully open source.

GitHub: https://github.com/0xMassi/webclaw
Website: https://webclaw.io
Discord: https://discord.gg/KDfd48EpnW

If you run into a site that webclaw fails on, open an issue. Every edge case makes the extraction better.

Top comments (1)

Botánica Andina • Mar 29

The token reduction is seriously impressive – 67% less for a news article is huge. I'm curious about the extraction engine's node scoring; could you dive a bit deeper into how "semantic tags" and "link ratio" specifically help clean up the content?