Headless Chrome is useful.
It is also overused.
For years, the default answer to “this page is hard to scrape” has been some version of:

- Use Puppeteer.
- Use Playwright.
- Add stealth.
- Wait for the page.
- Extract the DOM.
That works often enough that it became muscle memory. But using a browser as the first step for every page is expensive, slow, operationally annoying, and frequently unnecessary.
I’m building webclaw, a web extraction API, CLI, and MCP server for AI agents. One of the biggest architecture decisions was this:
Do not make the browser the default path.
The browser is an escalation path. Not the baseline.
## Why Browser-First Scraping Became The Default
The web changed.
Static HTML became React, Next.js, SPAs, hydration payloads, infinite scroll, client-side routing, consent banners, and heavily instrumented frontend apps.
So scrapers adapted.
Instead of fetching HTML and parsing it, developers started launching a real browser:
```
URL -> Puppeteer/Playwright -> Chrome -> rendered DOM -> extraction
```
That made sense. A browser gives you:
- JavaScript execution
- a real DOM
- navigation behavior
- cookies and sessions
- screenshots
- interaction support
For some pages, you need that.
The mistake is treating those pages as the default case.
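For reference, the classic browser-first loop is only a few lines of Playwright (a minimal sketch; the URL is a placeholder):

```typescript
import { chromium } from "playwright";

// The classic browser-first loop: launch, navigate, wait, read the DOM.
const browser = await chromium.launch();
const page = await browser.newPage();
await page.goto("https://example.com", { waitUntil: "networkidle" });
const html = await page.content(); // the fully rendered DOM
await browser.close();
```

Every URL pays that launch-navigate-teardown cycle unless you pool instances, which is exactly where the problems below start.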
## Why Browser-First Breaks Down
Headless Chrome has a cost profile that looks fine in demos and painful in production.
### 1. Startup Cost
Launching a browser is not free.
Even if you reuse instances, you still pay for process management, page creation, memory, timeouts, crashes, and cleanup.
For a one-off scrape, maybe that’s fine.
For agents, RAG ingestion, batch scraping, or crawl jobs, it adds up fast.
### 2. Memory And Concurrency
Chrome is heavy.
If your scraper needs to handle a list of URLs, you eventually hit practical limits:
- how many pages can run at once?
- how many browser contexts can stay alive?
- how many failures are caused by your scraper, not the target site?
- how much infra are you burning just to read mostly static documents?
That matters when the output you wanted was just clean markdown.
### 3. CI And Deployment Pain
Browser stacks are fragile in boring ways.
You deal with:
- missing system libraries
- browser binary downloads
- sandbox flags
- font/rendering differences
- Docker image size
- platform-specific bugs
- random timeouts
None of this is intellectually interesting. It is just drag.
### 4. The Browser Does Not Automatically Solve Blocking
This is the part people learn the hard way.
Launching Chrome does not magically make traffic look trustworthy.
Modern bot protection systems look at many signals. Some are visible in the browser. Some happen before your JavaScript ever runs.
At a high level, systems may look at things like:
- network-level request behavior
- header shape
- client hints
- IP and network reputation
- request timing
- session history
- whether the page response is a real document or a challenge
That does not mean “never use a browser”.
It means “browser” and “trusted request” are not the same thing.
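To make that last signal concrete, here is a rough sketch of bad-response detection. The status codes, marker strings, and threshold are illustrative examples, not webclaw's actual rules:

```typescript
// Sketch of bad-response detection. Markers and thresholds below are
// illustrative examples, not an exhaustive or authoritative list.
function looksLikeBadResponse(status: number, html: string): boolean {
  // Hard signals first: common blocking / rate-limit statuses.
  if (status === 403 || status === 429 || status === 503) return true;

  // Challenge pages tend to be small documents with telltale phrases.
  const lower = html.toLowerCase();
  const challengeMarkers = [
    "verify you are human",
    "checking your browser",
    "enable javascript and cookies",
  ];
  if (challengeMarkers.some((m) => lower.includes(m))) return true;

  // "Empty shell" check: strip scripts and tags, see if real text remains.
  const text = lower
    .replace(/<script[\s\S]*?<\/script>/g, " ")
    .replace(/<[^>]+>/g, " ")
    .replace(/\s+/g, " ")
    .trim();
  return text.length < 200; // threshold is an arbitrary example
}
```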
## What Replaced It
The architecture I prefer is an escalation ladder.
Start with the cheapest path that can produce correct content.
Only move to heavier paths when the response proves you need them.
The rough shape:
| Step | Path | Why it exists |
|---|---|---|
| 1 | Browser-like fetch | Cheapest path for SSR pages, docs, blogs, metadata, and data islands. |
| 2 | Content extraction | Turn the useful parts into markdown, text, JSON, metadata, and links. |
| 3 | Bad-response detection | Catch empty shells, challenge pages, login walls, and blocked content. |
| 4 | JavaScript rendering | Use it only when useful content is missing from the fetched response. |
| 5 | Browser fallback | Last resort for pages that genuinely require browser behavior. |
The important part is not one magic trick.
The important part is not paying the browser tax for pages that never needed a browser.
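To show how the ladder composes, here is a hedged sketch, not webclaw's internals. `htmlToMarkdown` and `renderWithBrowser` are hypothetical stand-ins for steps 2 and 4-5, and `looksLikeBadResponse` is the heuristic sketched earlier:

```typescript
// Hypothetical stand-ins for steps 2 and 4-5 of the ladder.
declare function htmlToMarkdown(html: string): string;
declare function renderWithBrowser(url: string): Promise<string>;

// Escalation ladder sketch: each step runs only when the cheaper
// step before it failed to produce usable content.
async function extract(url: string): Promise<string> {
  // Step 1: browser-like fetch.
  const res = await fetch(url, { headers: { accept: "text/html" } });
  const html = await res.text();

  // Step 2: content extraction on the raw HTML.
  const markdown = htmlToMarkdown(html);

  // Step 3: bad-response detection (the heuristic sketched earlier).
  const bad =
    looksLikeBadResponse(res.status, html) || markdown.trim().length === 0;
  if (!bad) return markdown;

  // Steps 4-5: only now pay for rendering / a real browser.
  return renderWithBrowser(url);
}
```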
## The Fetch-First Path
Many pages already contain the useful content before frontend JavaScript runs.
It may be in:
- server-rendered HTML
- article body markup
- JSON-LD
- Open Graph metadata
- Next.js or React hydration payloads
- embedded CMS data
- documentation markup
If you can fetch the page correctly and extract the main content, you can often return useful markdown without launching Chrome.
The pipeline looks more like this:

```
URL -> browser-like fetch -> HTML/data islands -> extractor -> markdown/JSON
```

Compared to:

```
URL -> browser -> rendered DOM -> extractor -> markdown/JSON
```
| Browser-first | Fetch-first |
|---|---|
| URL | URL |
| Playwright or Puppeteer | Browser-like fetch |
| Chrome runtime | HTML plus data islands |
| Rendered DOM | Content extractor |
| Markdown or JSON | Markdown or JSON |
| Good when interaction is required | Good as the default path |
| Expensive when used for every page | Browser only when the page proves it |
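To make “data islands” concrete, here is a minimal, dependency-free sketch that pulls JSON-LD out of raw HTML. A real extractor would use a proper HTML parser instead of a regex:

```typescript
// Pull JSON-LD blocks out of raw HTML without rendering the page.
// The regex keeps this sketch dependency-free; production code
// should use a real HTML parser.
async function extractJsonLd(url: string): Promise<unknown[]> {
  const html = await (await fetch(url)).text();
  const blocks: unknown[] = [];
  const re =
    /<script[^>]*type=["']application\/ld\+json["'][^>]*>([\s\S]*?)<\/script>/gi;
  for (const match of html.matchAll(re)) {
    try {
      blocks.push(JSON.parse(match[1]));
    } catch {
      // Malformed JSON-LD is common in the wild; skip it.
    }
  }
  return blocks;
}
```

The same idea applies to Open Graph tags and Next.js hydration payloads (the `__NEXT_DATA__` script): the content is already in the response before any rendering happens.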
This matters for AI agents because they usually do not need the visual page.
They need the content.
## Why This Matters More For LLM Apps
Traditional scraping often wants a database row:

- name
- price
- rating
- availability
LLM apps want something different.
They want context.
For agents and RAG pipelines, bad extraction does not always look broken. It can look clean and still be wrong.
Examples:
- the page was a bot challenge, but the agent summarized it anyway
- the docs page loaded an empty shell
- the markdown included nav text repeated across every page
- the pricing table lost its structure
- the source URL or title disappeared
- a crawler pulled 100 low-value pages and missed the docs that mattered
That is why I care less about “can it fetch?” and more about:
Can it return useful, structured context?
For webclaw, the target shape is:

```
URL -> clean markdown / JSON / metadata -> agent or RAG pipeline
```
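One way to catch the silent failures listed above is to validate extracted context before a model ever sees it. A rough sketch, with illustrative fields and thresholds rather than webclaw's actual checks:

```typescript
// Illustrative sanity checks on extracted context before it reaches
// an agent or a RAG index. Fields and thresholds are example values.
interface ExtractedContext {
  url: string;
  title?: string;
  markdown: string;
}

function isUsableContext(ctx: ExtractedContext): boolean {
  if (!ctx.url || !ctx.title) return false; // provenance survived extraction
  if (ctx.markdown.trim().length < 300) return false; // empty shell or stub
  // A challenge page that slipped through extraction still reads like one.
  if (/verify you are human|checking your browser/i.test(ctx.markdown)) {
    return false;
  }
  return true;
}
```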
## A Small Example
Using the API:

```bash
curl -X POST https://api.webclaw.io/v1/scrape \
  -H "Authorization: Bearer $WEBCLAW_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "formats": ["markdown"],
    "only_main_content": true
  }'
```
Using TypeScript:

```typescript
import { Webclaw } from "@webclaw/sdk";

const client = new Webclaw({ apiKey: process.env.WEBCLAW_API_KEY! });

const result = await client.scrape({
  url: "https://example.com",
  formats: ["markdown"],
  only_main_content: true,
});

console.log(result.markdown);
```
For agent workflows, webclaw also ships an MCP server, so tools like Claude Code, Cursor, and other MCP-compatible clients can call scrape, crawl, map, batch, extract, summarize, diff, brand, search, and research.
That is the interface I wanted:

- agent asks for a URL
- tool returns clean context
- agent keeps working
## Honest Limits
This architecture does not remove the need for browsers.
Some pages require real browser sessions.
Some flows require login.
Some sites should not be scraped.
Some pages have interaction-dependent content that a fetch-first approach will never see.
The point is not “never use Chrome”.
The point is:
Do not launch Chrome until the page proves it needs Chrome.
That one rule changes cost, latency, concurrency, and reliability.
## The Bigger Lesson
Web scraping is moving from selector scripts to context infrastructure.
AI agents and RAG pipelines do not just need data.
They need clean, fresh, source-linked web context in a shape models can use.
That means the extraction layer has to care about:
- fetch quality
- challenge detection
- main content extraction
- metadata
- markdown quality
- structured JSON
- crawling boundaries
- cost
- latency
- agent tool interfaces
That is what I’m building into webclaw.
If your workflow is `URL -> clean markdown/JSON -> agent or RAG pipeline`, you might find it useful.
Website: https://webclaw.io
GitHub: https://github.com/0xMassi/webclaw