I stopped treating headless Chrome like a scraping strategy

Headless Chrome is useful.

It is also where a lot of scraping systems go to become slow, expensive, and impossible to reason about.

I am building webclaw, a web extraction API, CLI, and MCP server for agents and LLM apps. One architecture decision keeps paying for itself:

The browser is a fallback.
Not the default.

That sounds obvious until a target page gets annoying.

Then the reflex kicks in:

Use Playwright.
Launch Chrome.
Wait for network idle.
Extract the DOM.
Ship it.

It works often enough that people confuse it with a strategy.

It is not a strategy.

It is an expensive hammer.

Browser-first scraping demos well

The browser-first pipeline is simple:

URL -> browser -> rendered DOM -> extraction

For demos, this is great.

You point Puppeteer or Playwright at a page. JavaScript runs. The DOM appears. You grab text. Everyone claps.
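
That reflex, written out. A minimal Playwright sketch, assuming only that playwright is installed; none of this is webclaw code:

import { chromium } from "playwright";

// The browser-first reflex: launch Chrome for the URL,
// wait for the network to settle, grab the rendered DOM.
const browser = await chromium.launch();
const page = await browser.newPage();

await page.goto("https://example.com", { waitUntil: "networkidle" });
const text = await page.innerText("body");

await browser.close();
console.log(text);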

For some pages, that is exactly what you need.

If the page is an app shell, if the content only appears after client-side requests, if interaction matters, if visual state matters, use a browser.

No argument there.

The problem starts when every page gets treated like that page.

Most URLs in real extraction workloads are not interactive apps. They are docs pages, articles, product pages, pricing pages, changelogs, support pages, marketplace listings, blog posts, or landing pages.

The useful content is often already in the initial response.

Starting Chrome for all of them is like paying rent on a mansion to open a mailbox.

The cost shows up late

Headless browser scraping looks fine at small volume.

At production volume, the tax gets very real:

startup time
memory per page
browser pool management
Docker image size
missing system dependencies
crashes
zombie processes
network idle timeouts
lower concurrency

None of this is interesting engineering.

It is just drag.

And if you are building an AI agent, it gets worse. The user is waiting. Five seconds of browser overhead is not hidden in a nightly job. It is part of the product experience.

If you are building a SaaS around extraction, it hits margins too.

Every unnecessary browser session is cost you could have avoided.

Anti-bot is not the same as rendering

This is the mistake that creates a lot of bad scraper architecture.

People mix these two questions:

Can I access the page?
Can I render the page?

They overlap, but they are not the same problem.

A target can block your default HTTP client before JavaScript matters.

A target can return a challenge page with 200 OK.

A target can load in a browser and still be useless because the session, request behavior, or response body is wrong.

On the other side, many pages do not need rendering at all. They need a better fetch path, response classification, and content extraction.

That is why "just use Playwright" is incomplete advice.

It solves one class of rendering problems.

It does not automatically solve trust, fake success, response quality, cost, or extraction output.

The failure that actually hurts

The most dangerous scrape failure is not:

403 Forbidden

That one is honest.

The dangerous failure looks like this:

HTTP 200
body downloaded
extractor ran
pipeline continued
data is wrong

The page did not fail.

It lied.

Maybe the body was a bot challenge.

Maybe it was a login wall.

Maybe it was a consent screen.

Maybe it was an empty JavaScript shell.

Maybe it was a region-specific version with the thing you needed missing.

Your code says success because the status code was green.

Your downstream system gets garbage.

For a normal scraper, that pollutes a database.

For an LLM app, it is worse.

The agent may summarize the challenge page. Your RAG index may embed navigation text. Your research workflow may cite a login wall. The model does not know the fetch layer handed it nonsense.

That is why I care more about response classification than "can it open the page?"
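
Classification does not have to be clever to pay off. A minimal sketch; the marker strings and the length threshold are illustrative stand-ins, not real detection rules:

type Verdict = "ok" | "blocked" | "challenge" | "login_wall" | "empty_shell";

// Heuristics only. Real rules are per-target and far more robust
// than substring matching.
function classifyResponse(status: number, html: string): Verdict {
  if (status === 403 || status === 429) return "blocked"; // the honest failures

  const body = html.toLowerCase();

  // A 200 whose body is a challenge is still a failure.
  if (body.includes("verify you are human")) return "challenge";
  if (body.includes("log in to continue")) return "login_wall";

  // An app shell: markup arrived, content did not.
  const visibleText = body
    .replace(/<script[\s\S]*?<\/script>/g, " ")
    .replace(/<[^>]+>/g, " ")
    .trim();
  if (visibleText.length < 200) return "empty_shell";

  return "ok";
}

The exact rules matter less than the fact that a verdict exists before any extraction runs.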

The architecture I trust now

The pipeline I prefer is:

URL
-> browser-like fetch
-> classify the response
-> extract useful content
-> verify the content exists
-> return markdown / JSON / metadata
-> escalate only if needed

The important part is the decision layer.

If the first response is clean, use it.

If it is a challenge, escalate.

If it is an empty app shell, render.

If it is a login wall, fail clearly.

If extraction confidence is low, surface that instead of pretending the scrape worked.
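
Wired together, the decision layer stays small. A sketch that reuses the classifyResponse heuristic above; renderWithBrowser and extractMainContent are hypothetical stand-ins for the browser fallback and the extraction layer, not webclaw internals:

// Hypothetical helpers standing in for the real layers.
declare function renderWithBrowser(url: string): Promise<string>;
declare function extractMainContent(
  html: string
): { markdown: string; confidence: number };

async function scrape(url: string) {
  // 1. Cheap browser-like fetch first.
  const res = await fetch(url);
  let html = await res.text();

  // 2. Classify before trusting the body.
  const verdict = classifyResponse(res.status, html);

  // 3. Escalate only when the page proves it needs it.
  if (verdict === "challenge" || verdict === "empty_shell") {
    html = await renderWithBrowser(url);
  } else if (verdict !== "ok") {
    // Login walls and blocks fail loudly instead of returning garbage.
    throw new Error(`fetch failed (${verdict}): ${url}`);
  }

  // 4. Extract, then verify the content actually exists.
  const { markdown, confidence } = extractMainContent(html);
  if (!markdown || confidence < 0.5) {
    throw new Error(`low extraction confidence: ${url}`);
  }

  return { url, markdown };
}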

The browser is still there.

It is just not religion.

Browser fallback beats browser-first

Use a browser when:

the content only appears after JavaScript
the page requires interaction
the initial HTML is empty
a challenge needs browser execution
the task depends on rendered state

Do not use a browser just because:

the site is modern
the site uses React
the first naive request failed
you do not want to classify responses
the tutorial used Puppeteer

That distinction changes everything.

It changes latency.

It changes concurrency.

It changes infra cost.

It changes how easy the system is to debug.

Where Webclaw fits

This is the shape I am building into webclaw.

The public interface should be boring:

send URL
get clean content
move on

For API use:

curl -X POST https://api.webclaw.io/v1/scrape \
  -H "Authorization: Bearer $WEBCLAW_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "formats": ["markdown"],
    "only_main_content": true
  }'

From TypeScript:

import { Webclaw } from "@webclaw/sdk";

const client = new Webclaw({
  apiKey: process.env.WEBCLAW_API_KEY!,
});

const page = await client.scrape({
  url: "https://example.com",
  formats: ["markdown"],
  only_main_content: true,
});

console.log(page.markdown);

The point is not that users should configure every layer themselves.

The point is that the API should return useful context or fail clearly.

Not "200 OK, here is a challenge page, good luck."

This matters more for agents

AI agents do not need raw HTML most of the time.

They need clean context:

title
main content
links
tables
metadata
source URL
structured fields when requested
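
As a type, roughly this. Field names are illustrative, not webclaw's contract:

// The clean-context shape an agent wants from the fetch layer.
interface PageContext {
  sourceUrl: string;
  title: string;
  mainContent: string; // markdown, not raw HTML
  links: { href: string; text: string }[];
  tables: { headers: string[]; rows: string[][] }[];
  metadata: Record<string, string>;
  structured?: Record<string, unknown>; // only when requested
}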

They also need the tool layer to be honest.

If a page could not be fetched, say that.

If the content is missing, say that.

If browser rendering is needed, escalate.

But do not dump whatever HTML came back into the model and hope the LLM figures it out.

That is not an agent strategy.

That is denial with tokens attached.

The rule

The rule I use now:

Fetch first.
Classify the response.
Extract useful content.
Escalate only when the page proves it needs it.

Headless Chrome is still useful.

It is just too expensive to be the first answer to every scraping problem.

I wrote the longer breakdown here:

Anti-bot scraping API: browser fallback beats browser-first

And the previous post in this series is here:

Raw HTML is where LLM context goes to die

Webclaw: https://webclaw.io
