Every AI agent has the same problem. You ask it to read a webpage and it comes back with a 403, or worse, 5,000 tokens of navigation bars and cookie banners.
I spent the last few months building webclaw to fix this.
The problem
Try fetching any real website with a standard HTTP client. Most of them will block you. Cloudflare, Akamai, and DataDome all inspect your TLS fingerprint before the request even reaches the origin server.
The usual fix is to spin up headless Chrome. That works, but now you need a 500 MB browser, each page takes 2-3 seconds, and you still get all the HTML noise.
What webclaw does differently
Instead of launching a browser, webclaw impersonates one at the TLS level. The TLS handshake, the cipher suites, the extensions: everything looks like Chrome 142. Most anti-bot systems pass the request through because the fingerprint matches a real browser.
Then the extraction engine scores every DOM node by text density, semantic tags, and link ratio. Navigation, ads, footers, cookie banners get stripped. What comes out is clean markdown.
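To make one of those signals concrete, here is a minimal Python sketch of a link-ratio measurement. It is not webclaw's code (webclaw is Rust); the class and function names are made up for illustration, and it only uses the standard library's `html.parser`.

```python
from html.parser import HTMLParser


class LinkRatioParser(HTMLParser):
    """Counts visible text characters and how many of them sit inside <a> tags."""

    def __init__(self):
        super().__init__()
        self.text_chars = 0
        self.link_chars = 0
        self._link_depth = 0  # > 0 while we are inside an <a> element

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._link_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self._link_depth > 0:
            self._link_depth -= 1

    def handle_data(self, data):
        n = len(data.strip())
        self.text_chars += n
        if self._link_depth > 0:
            self.link_chars += n


def link_ratio(fragment: str) -> float:
    """Share of text characters that are link text (0.0 for empty fragments)."""
    parser = LinkRatioParser()
    parser.feed(fragment)
    parser.close()
    if parser.text_chars == 0:
        return 0.0
    return parser.link_chars / parser.text_chars


nav = '<ul><li><a href="/">Home</a></li><li><a href="/pricing">Pricing</a></li></ul>'
body = '<p>The extraction engine keeps paragraphs like this one, even with <a href="/docs">one link</a>.</p>'

print(link_ratio(nav))   # 1.0: every character is link text
print(link_ratio(body))  # low: mostly plain prose
```

A menu is almost entirely link text, while an article paragraph is mostly prose, so even this crude ratio separates the two.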
A real example: a news article that is 4,820 tokens as raw HTML becomes 1,590 tokens after webclaw processes it. Same content, 67% fewer tokens.
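The percentage is just the relative drop in token count. A quick check of the arithmetic (the function name is only for illustration):

```python
def reduction_percent(before: int, after: int) -> float:
    """Relative reduction from `before` to `after`, as a percentage."""
    return (before - after) / before * 100


print(round(reduction_percent(4820, 1590)))  # 67
```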
Architecture
webclaw is a Rust workspace with 6 crates:
| Crate | Role |
|---|---|
| webclaw-core | pure extraction, zero network deps, WASM-safe |
| webclaw-fetch | HTTP + TLS fingerprinting via primp |
| webclaw-llm | LLM provider chain (Ollama > OpenAI > Anthropic) |
| webclaw-pdf | PDF text extraction |
| webclaw-cli | CLI binary |
| webclaw-mcp | MCP server for AI agents |
The split between core and fetch was intentional. webclaw-core takes a &str of HTML and returns structured output. No I/O, no network calls, no allocator tricks. It should compile to WASM without changes.
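The payoff of that split is easiest to see in miniature. The sketch below is a Python analogy, not webclaw-core's actual API: `Extracted` and `extract` are hypothetical names, and the extraction logic is deliberately trivial. The point is the shape: a string goes in, a structured value comes out, and nothing touches the network.

```python
from dataclasses import dataclass
from html.parser import HTMLParser


@dataclass
class Extracted:
    title: str
    text: str


class _Collector(HTMLParser):
    """Collects <title> text and body text, skipping <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.title = []
        self.text = []
        self._current = None  # "title", "script", "style", or None

    def handle_starttag(self, tag, attrs):
        if tag in ("title", "script", "style"):
            self._current = tag

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        chunk = data.strip()
        if not chunk:
            return
        if self._current == "title":
            self.title.append(chunk)
        elif self._current is None:
            self.text.append(chunk)


def extract(html: str) -> Extracted:
    """Pure function: HTML string in, structured result out. No I/O."""
    collector = _Collector()
    collector.feed(html)
    collector.close()
    return Extracted(title=" ".join(collector.title), text=" ".join(collector.text))


page = (
    "<html><head><title>Hello</title><style>p{}</style></head>"
    "<body><p>World</p><script>var x=1;</script></body></html>"
)
print(extract(page))  # Extracted(title='Hello', text='World')
```

Because the function never fetches anything, tests can feed it string literals, and the fetching layer can be swapped out without touching extraction.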
Extraction speed on the core alone (no network):
| Page size | Time |
|---|---|
| 10 KB | 0.8 ms |
| 100 KB | 3.2 ms |
| 500 KB | 12.1 ms |
How to use it
CLI
```bash
# basic extraction
webclaw https://example.com

# different output formats
webclaw https://example.com -f json
webclaw https://example.com -f llm

# crawl a docs site
webclaw https://docs.example.com --crawl --depth 2

# extract structured data with LLM
webclaw https://example.com --extract-prompt "get all pricing tiers"

# track page changes
webclaw https://example.com -f json > snapshot.json
webclaw https://example.com --diff-with snapshot.json
```
MCP server (for Claude, Cursor, Windsurf, Codex)
```bash
npx create-webclaw
```
One command. It detects which AI tools you have installed and writes the config for each one. After a restart you get 10 tools: scrape, crawl, search, extract, summarize, brand, diff, map, batch, research.
Docker
```bash
docker run --rm ghcr.io/0xmassi/webclaw https://example.com
```
128 MB image. Runs anywhere Docker does.
Benchmarks
Tested on 50 real pages across news sites, documentation, e-commerce, SPAs, and blogs.
| Metric | webclaw | readability | trafilatura | newspaper3k |
|---|---|---|---|---|
| Extraction accuracy | 95.1% | 83% | 80% | 66% |
| Noise removal | 96.1% | 79% | 73% | 61% |
The biggest wins are on JavaScript-heavy sites. When the visible DOM is empty because the content lives in embedded JSON (Next.js, React SSR payloads), webclaw has a data island extractor that pulls content from __NEXT_DATA__, window.__data, and similar patterns. Most other tools return nothing.
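As a rough illustration of the data island idea, here is a Python sketch that pulls the `__NEXT_DATA__` payload out of a page and collects the longer strings from it. It is not webclaw's extractor: the function names, the regex, and the 20-character threshold are assumptions made for this example.

```python
import json
import re

# Next.js embeds page data in <script id="__NEXT_DATA__" type="application/json">.
NEXT_DATA_RE = re.compile(
    r'<script[^>]*\bid="__NEXT_DATA__"[^>]*>(.*?)</script>',
    re.DOTALL | re.IGNORECASE,
)


def extract_next_data(html: str):
    """Return the parsed __NEXT_DATA__ payload, or None if absent or invalid."""
    match = NEXT_DATA_RE.search(html)
    if match is None:
        return None
    try:
        return json.loads(match.group(1))
    except json.JSONDecodeError:
        return None


def collect_strings(value, min_len=20):
    """Walk nested JSON and collect string values long enough to be prose."""
    if isinstance(value, str):
        return [value] if len(value) >= min_len else []
    if isinstance(value, dict):
        value = list(value.values())
    if isinstance(value, list):
        found = []
        for item in value:
            found.extend(collect_strings(item, min_len))
        return found
    return []


page = (
    '<html><body><div id="__next"></div>'
    '<script id="__NEXT_DATA__" type="application/json">'
    '{"props": {"pageProps": {"title": "Hi", '
    '"body": "This paragraph only exists inside the JSON payload."}}}'
    "</script></body></html>"
)

print(collect_strings(extract_next_data(page)))
# ['This paragraph only exists inside the JSON payload.']
```

A DOM-only extractor sees an empty `<div id="__next">` here, which is why it returns nothing.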
What I learned building this
TLS fingerprinting is fragile. Chrome changes its cipher suites every few versions and you have to keep up. I am using primp, which maintains patched forks of rustls, hyper, and h2. It works well but it is a maintenance burden. If Chrome ships a new TLS extension tomorrow, requests start getting blocked until the forks are updated.
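To show why a single new extension matters, here is a Python sketch using JA3, one publicly documented fingerprint format (an MD5 over five ClientHello fields). This is an illustration only: I am not claiming Cloudflare, Akamai, or DataDome compute exactly this, and the numeric values below are illustrative, not Chrome 142's real parameters.

```python
import hashlib


def ja3_string(version, ciphers, extensions, curves, point_formats):
    """Build the JA3 input: five comma-separated fields, values joined by '-'."""
    fields = [
        str(version),
        "-".join(map(str, ciphers)),
        "-".join(map(str, extensions)),
        "-".join(map(str, curves)),
        "-".join(map(str, point_formats)),
    ]
    return ",".join(fields)


def ja3_hash(version, ciphers, extensions, curves, point_formats):
    """MD5 hex digest of the JA3 string (used as an identifier, not for security)."""
    raw = ja3_string(version, ciphers, extensions, curves, point_formats)
    return hashlib.md5(raw.encode("ascii"), usedforsecurity=False).hexdigest()


# Illustrative values only (assumption): not taken from any real browser build.
old_client = (771, [4865, 4866, 4867], [0, 23, 65281, 10, 11], [29, 23, 24], [0])
# Same client after one hypothetical extra extension is added to the ClientHello.
new_client = (771, [4865, 4866, 4867], [0, 23, 65281, 10, 11, 35], [29, 23, 24], [0])

print(ja3_string(*old_client))  # 771,4865-4866-4867,0-23-65281-10-11,29-23-24,0
print(ja3_hash(*old_client) == ja3_hash(*new_client))  # False
```

One extra extension ID produces a completely different hash, so an impersonation layer that lags behind the real browser stops matching.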
The extraction scoring took the most iteration. Early versions were too aggressive and would strip content that looked like navigation (short paragraphs with links). The fix was a semantic bonus system: nodes inside <article> or <main> tags get a score boost, and nodes with content-related class names get another. Combined with link density penalties, it handles most layouts without site-specific rules.
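In sketch form, the idea looks something like the Python below. The weights (0.5 and 0.25), the class-name hints, and the `Node` shape are all assumptions picked for the example; webclaw's real scoring lives in Rust and uses its own values.

```python
from dataclasses import dataclass

SEMANTIC_TAGS = {"article", "main"}
# Hypothetical hints; the real list of content-related class names is not shown here.
CONTENT_HINTS = ("content", "article", "post", "body")


@dataclass
class Node:
    text_chars: int            # visible characters in the node
    link_chars: int            # how many of those are inside <a> tags
    ancestors: tuple = ()      # tag names of enclosing elements
    class_names: tuple = ()    # class attribute values on the node


def score(node: Node) -> float:
    """Text length, discounted by link density, boosted by semantic context."""
    if node.text_chars == 0:
        return 0.0
    link_density = node.link_chars / node.text_chars
    base = node.text_chars * (1.0 - link_density)  # link density penalty
    bonus = 1.0
    if SEMANTIC_TAGS & set(node.ancestors):
        bonus += 0.5   # inside <article> or <main>
    if any(hint in name.lower() for name in node.class_names for hint in CONTENT_HINTS):
        bonus += 0.25  # content-related class name
    return base * bonus


# The same short, link-heavy paragraph in two different places.
in_article = Node(80, 40, ("body", "main", "article"), ("post-body",))
in_nav = Node(80, 40, ("body", "nav"), ("menu",))

print(score(in_article))  # 70.0
print(score(in_nav))      # 40.0
```

The two nodes have identical text and link counts, so the link penalty alone cannot tell them apart. The semantic bonus is what keeps the paragraph inside the article.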
Try it
MIT licensed, fully open source.
GitHub: https://github.com/0xMassi/webclaw
Website: https://webclaw.io
Discord: https://discord.gg/KDfd48EpnW
If you run into a site that webclaw fails on, open an issue. Every edge case makes the extraction better.