Massi
I stopped using headless Chrome as the default scraper

Headless Chrome is useful.

It is also overused.

For years, the default answer to “this page is hard to scrape” has been some version of:

Use Puppeteer.
Use Playwright.
Add stealth.
Wait for the page.
Extract the DOM.

That works often enough that it became muscle memory. But using a browser as the first step for every page is expensive, slow, operationally annoying, and frequently unnecessary.

I’m building webclaw, a web extraction API, CLI, and MCP server for AI agents. One of the biggest architecture decisions was this:

Do not make the browser the default path.

The browser is an escalation path. Not the baseline.

Why Browser-First Scraping Became The Default

The web changed.

Static HTML became React, Next.js, SPAs, hydration payloads, infinite scroll, client-side routing, consent banners, and heavily instrumented frontend apps.

So scrapers adapted.

Instead of fetching HTML and parsing it, developers started launching a real browser:

URL -> Puppeteer/Playwright -> Chrome -> rendered DOM -> extraction

That made sense. A browser gives you:

  • JavaScript execution
  • a real DOM
  • navigation behavior
  • cookies and sessions
  • screenshots
  • interaction support

For some pages, you need that.

The mistake is treating those pages as the default case.

Why Browser-First Breaks Down

Headless Chrome has a cost profile that looks fine in demos and painful in production.

1. Startup Cost

Launching a browser is not free.

Even if you reuse instances, you still pay for process management, page creation, memory, timeouts, crashes, and cleanup.

For a one-off scrape, maybe that’s fine.

For agents, RAG ingestion, batch scraping, or crawl jobs, it adds up fast.

2. Memory And Concurrency

Chrome is heavy.

If your scraper needs to handle a list of URLs, you eventually hit practical limits:

  • how many pages can run at once?
  • how many browser contexts can stay alive?
  • how many failures are caused by your scraper, not the target site?
  • how much infra are you burning just to read mostly static documents?

That matters when the output you wanted was just clean markdown.
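Whichever engine you use, the pragmatic mitigation is a hard cap on in-flight work. Here is a minimal, generic concurrency limiter sketch (not webclaw's API, just an illustration of the cap a browser-based scraper hits far sooner than a fetch-based one):

```typescript
// Run at most `limit` scrape tasks at once, preserving result order.
async function mapWithLimit<T, R>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;

  // Each worker repeatedly claims the next unclaimed index until none remain.
  async function worker(): Promise<void> {
    while (next < items.length) {
      const i = next++;
      results[i] = await fn(items[i]);
    }
  }

  const workers = Array.from(
    { length: Math.min(limit, items.length) },
    () => worker(),
  );
  await Promise.all(workers);
  return results;
}
```

With a lightweight fetch, `limit` can be in the hundreds; with headless Chrome, memory usually forces it to single digits per machine.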

3. CI And Deployment Pain

Browser stacks are fragile in boring ways.

You deal with:

  • missing system libraries
  • browser binary downloads
  • sandbox flags
  • font/rendering differences
  • Docker image size
  • platform-specific bugs
  • random timeouts

None of this is intellectually interesting. It is just drag.

4. The Browser Does Not Automatically Solve Blocking

This is the part people learn the hard way.

Launching Chrome does not magically make traffic look trustworthy.

Modern bot protection systems look at many signals. Some are visible in the browser. Some happen before your JavaScript ever runs.

At a high level, systems may look at things like:

  • network-level request behavior
  • header shape
  • client hints
  • IP and network reputation
  • request timing
  • session history
  • whether the page response is a real document or a challenge

That does not mean “never use a browser”.

It means “browser” and “trusted request” are not the same thing.

What Replaced It

The architecture I prefer is an escalation ladder.

Start with the cheapest path that can produce correct content.

Only move to heavier paths when the response proves you need them.

The rough shape:

1. Browser-like fetch: the cheapest path for SSR pages, docs, blogs, metadata, and data islands.
2. Content extraction: turn the useful parts into markdown, text, JSON, metadata, and links.
3. Bad-response detection: catch empty shells, challenge pages, login walls, and blocked content.
4. JavaScript rendering: use it only when useful content is missing from the fetched response.
5. Browser fallback: the last resort for pages that genuinely require browser behavior.

The important part is not one magic trick.

The important part is not paying the browser tax for pages that never needed a browser.
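Compressed to its two ends, the ladder can be sketched like this. Everything here is hypothetical (the fetcher arguments, the `looksBlocked` heuristic); it shows the shape of the decision, not webclaw's implementation:

```typescript
type FetchResult = { html: string; status: number };

// Heuristic bad-response detection: errors, empty shells, or challenge pages.
function looksBlocked(res: FetchResult): boolean {
  if (res.status === 403 || res.status === 429) return true;
  if (res.html.length < 500) return true; // likely an empty shell
  return /captcha|verify you are human/i.test(res.html);
}

// Step 1: cheap fetch. Only if the response proves bad, fall through to the
// browser (step 5). Intermediate steps are elided for brevity.
async function scrapeWithEscalation(
  url: string,
  cheapFetch: (url: string) => Promise<FetchResult>,
  browserFetch: (url: string) => Promise<FetchResult>,
): Promise<{ html: string; usedBrowser: boolean }> {
  const first = await cheapFetch(url);
  if (!looksBlocked(first)) {
    return { html: first.html, usedBrowser: false };
  }
  const fallback = await browserFetch(url);
  return { html: fallback.html, usedBrowser: true };
}
```

The fetchers are injected so the escalation logic stays independent of any particular HTTP client or browser driver.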

The Fetch-First Path

Many pages already contain the useful content before frontend JavaScript runs.

It may be in:

  • server-rendered HTML
  • article body markup
  • JSON-LD
  • Open Graph metadata
  • Next.js or React hydration payloads
  • embedded CMS data
  • documentation markup

If you can fetch the page correctly and extract the main content, you can often return useful markdown without launching Chrome.
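What a "browser-like fetch" means in practice is plain HTTP with headers shaped like a real browser's. A sketch, with illustrative header values that are assumptions, not a guarantee of passing any bot check:

```typescript
// Headers shaped like a mainstream browser's defaults (values illustrative).
function browserLikeHeaders(): Record<string, string> {
  return {
    "User-Agent":
      "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 " +
      "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    Accept:
      "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
  };
}

// Plain fetch, no Chrome process anywhere. Requires Node 18+ (global fetch).
async function browserLikeFetch(url: string): Promise<string> {
  const res = await fetch(url, { headers: browserLikeHeaders() });
  return res.text();
}
```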

The pipeline looks more like this:

URL -> browser-like fetch -> HTML/data islands -> extractor -> markdown/JSON

Compared to:

URL -> browser -> rendered DOM -> extractor -> markdown/JSON
Browser-first: URL -> Playwright or Puppeteer -> Chrome runtime -> rendered DOM -> markdown or JSON. Good when interaction is required, but expensive when used for every page.

Fetch-first: URL -> browser-like fetch -> HTML plus data islands -> content extractor -> markdown or JSON. Good as the default path, with the browser reserved for pages that prove they need it.

This matters for AI agents because they usually do not need the visual page.

They need the content.
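As a concrete instance of the fetch-first idea, here is a sketch that pulls JSON-LD data islands out of server-rendered HTML with no browser involved. It is a regex-based illustration; a production extractor would use a real HTML parser:

```typescript
// Extract every <script type="application/ld+json"> block from raw HTML.
function extractJsonLd(html: string): unknown[] {
  const blocks: unknown[] = [];
  const re =
    /<script[^>]*type=["']application\/ld\+json["'][^>]*>([\s\S]*?)<\/script>/gi;
  let m: RegExpExecArray | null;
  while ((m = re.exec(html)) !== null) {
    try {
      blocks.push(JSON.parse(m[1])); // parse each data island
    } catch {
      // ignore malformed JSON islands rather than failing the whole page
    }
  }
  return blocks;
}
```

On many article and product pages, this single pass already yields the title, author, date, and structured fields that an agent actually needs.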

Why This Matters More For LLM Apps

Traditional scraping often wants a database row:

name
price
rating
availability

LLM apps want something different.

They want context.

For agents and RAG pipelines, bad extraction does not always look broken. It can look clean and still be wrong.

Examples:

  • the page was a bot challenge, but the agent summarized it anyway
  • the docs page loaded an empty shell
  • the markdown included nav text repeated across every page
  • the pricing table lost its structure
  • the source URL or title disappeared
  • a crawler pulled 100 low-value pages and missed the docs that mattered
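The repeated-nav failure above is detectable without any site-specific rules: a line that appears on every page of a crawl is almost certainly chrome, not content. A hypothetical helper (not part of webclaw) sketching that check:

```typescript
// Drop lines that repeat across every scraped page (nav, footers, banners).
function stripRepeatedLines(pages: string[]): string[] {
  // Count, for each distinct line, how many pages contain it.
  const counts = new Map<string, number>();
  for (const page of pages) {
    for (const line of new Set(page.split("\n"))) {
      counts.set(line, (counts.get(line) ?? 0) + 1);
    }
  }
  // Keep blank lines and any line that does not appear on every page.
  return pages.map((page) =>
    page
      .split("\n")
      .filter(
        (line) => line.trim() === "" || counts.get(line)! < pages.length,
      )
      .join("\n"),
  );
}
```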

That is why I care less about “can it fetch?” and more about:

Can it return useful, structured context?

For webclaw, the target shape is:

URL -> clean markdown / JSON / metadata -> agent or RAG pipeline

A Small Example

Using the API:

curl -X POST https://api.webclaw.io/v1/scrape \
  -H "Authorization: Bearer $WEBCLAW_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "formats": ["markdown"],
    "only_main_content": true
  }'

Using TypeScript:

import { Webclaw } from "@webclaw/sdk";

const client = new Webclaw({ apiKey: process.env.WEBCLAW_API_KEY! });

const result = await client.scrape({
  url: "https://example.com",
  formats: ["markdown"],
  only_main_content: true,
});

console.log(result.markdown);

For agent workflows, webclaw also ships an MCP server, so tools like Claude Code, Cursor, and other MCP-compatible clients can call scrape, crawl, map, batch, extract, summarize, diff, brand, search, and research.

That is the interface I wanted:

agent asks for a URL
tool returns clean context
agent keeps working

Honest Limits

This architecture does not remove the need for browsers.

Some pages require real browser sessions.

Some flows require login.

Some sites should not be scraped.

Some pages have interaction-dependent content that a fetch-first approach will never see.

The point is not “never use Chrome”.

The point is:

Do not launch Chrome until the page proves it needs Chrome.

That one rule changes cost, latency, concurrency, and reliability.

The Bigger Lesson

Web scraping is moving from selector scripts to context infrastructure.

AI agents and RAG pipelines do not just need data.

They need clean, fresh, source-linked web context in a shape models can use.

That means the extraction layer has to care about:

  • fetch quality
  • challenge detection
  • main content extraction
  • metadata
  • markdown quality
  • structured JSON
  • crawling boundaries
  • cost
  • latency
  • agent tool interfaces

That is what I’m building into webclaw.

If your workflow is:

URL -> clean markdown/JSON -> agent or RAG pipeline

you might find it useful.

Website: https://webclaw.io

GitHub: https://github.com/0xMassi/webclaw
