"You Probably Don't Need a Headless Browser to Feed Your RAG Pipeline"

#ai #rag #showdev #webdev

Every RAG tutorial starts the same way: spin up a headless browser, crawl the site, dump the HTML into a text splitter. It works, but the browser is doing almost nothing for you on most of the sites people actually index (docs, blogs, help centers, marketing pages), and it makes the crawl 10 to 100x more expensive than it needs to be.

Here is the boring alternative that covers the 80% case.

Plain HTTP gets you further than you think

Docs sites and blogs are overwhelmingly server-rendered or statically generated. A fetch with a sensible User-Agent and Accept: text/html returns the full content. The pages that genuinely need JavaScript rendering (SPA dashboards, infinite-scroll feeds) are usually not the pages you want in a vector database anyway.
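
A minimal sketch of that fetch, assuming Node 18+ (for the global fetch). The fetchHtml name, the USER_AGENT string, and the injectable fetchImpl parameter are my own illustrations, not part of any library:

```javascript
// Hypothetical User-Agent; use one that names your crawler and a contact URL.
const USER_AGENT = 'example-rag-crawler/0.1 (+https://example.com/bot)';

// Fetch a page over plain HTTP and return its HTML, or null when the
// response is an error or is not HTML at all.
// fetchImpl is injectable so the function can be exercised without a network.
async function fetchHtml(url, fetchImpl = fetch) {
  const res = await fetchImpl(url, {
    headers: { 'User-Agent': USER_AGENT, Accept: 'text/html' },
    redirect: 'follow',
  });
  if (!res.ok) return null;

  // Servers sometimes answer an HTML-looking URL with a PDF or JSON body.
  const type = res.headers.get('content-type') || '';
  if (!type.includes('text/html')) return null;

  return res.text();
}
```

Returning null instead of throwing keeps the crawl loop simple: a page that fails or is not HTML is skipped, not retried.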

A breadth-first crawl over plain HTTP needs three things:

URL normalization before dedup: strip fragments, drop tracking params (utm_*, fbclid, gclid), kill trailing slashes. Otherwise you fetch the same page five ways.
A same-site boundary: compare hostnames with the www. stripped, allow subdomains, and skip anything that looks like a binary (.pdf, .zip, images, fonts) before you waste a request on it.
Depth and page caps, or one link-dense page will send you crawling the whole internet.
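
The three pieces above can be sketched with nothing but the built-in URL class. Treat this as one possible shape, not the actor's actual code: the function names, the binary extension list, the default caps, and the fetchLinks callback (which should return the hrefs found on a page) are all assumptions of mine.

```javascript
// Tracking parameters to drop before dedup.
const TRACKING = [/^utm_/i, /^fbclid$/i, /^gclid$/i];
// Illustrative list of binary extensions; extend it for your sites.
const BINARY_EXT = /\.(pdf|zip|png|jpe?g|gif|webp|svg|ico|woff2?|ttf|otf)$/i;

// Resolve against the page URL, strip the fragment, tracking params,
// and trailing slashes so equivalent URLs compare equal.
function normalizeUrl(raw, base) {
  const url = new URL(raw, base);
  url.hash = '';
  for (const key of [...url.searchParams.keys()]) {
    if (TRACKING.some((re) => re.test(key))) url.searchParams.delete(key);
  }
  if (url.pathname.length > 1) url.pathname = url.pathname.replace(/\/+$/, '');
  return url.toString();
}

// Same site = same hostname once "www." is stripped, or a subdomain of it.
function isSameSite(candidate, seed) {
  const strip = (host) => host.replace(/^www\./, '');
  const host = strip(new URL(candidate).hostname);
  const root = strip(new URL(seed).hostname);
  return host === root || host.endsWith('.' + root);
}

function looksBinary(url) {
  return BINARY_EXT.test(new URL(url).pathname);
}

// Breadth-first crawl with depth and page caps.
async function crawl(seed, fetchLinks, { maxDepth = 2, maxPages = 50 } = {}) {
  const start = normalizeUrl(seed);
  const seen = new Set([start]);
  const queue = [{ url: start, depth: 0 }];
  const visited = [];

  while (queue.length > 0 && visited.length < maxPages) {
    const { url, depth } = queue.shift();
    const links = await fetchLinks(url);
    visited.push(url);
    if (depth >= maxDepth) continue; // cap reached: do not expand this page

    for (const href of links) {
      let next;
      try {
        next = normalizeUrl(href, url);
      } catch {
        continue; // unparseable href
      }
      if (seen.has(next) || !isSameSite(next, start) || looksBinary(next)) continue;
      seen.add(next);
      queue.push({ url: next, depth: depth + 1 });
    }
  }
  return visited;
}
```

Links are normalized before the seen check, so /a/, /a#top, and /a?utm_source=x all collapse into a single request.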

The extraction order that works

Remove boilerplate before you extract content, not after:

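// Assumption: `html` holds the fetched page source and `$` comes from Cheerio.
const cheerio = require('cheerio');
const $ = cheerio.load(html);

// Scripts, embeds, and page chrome that never belong in the index.
const REMOVE = [
  'script', 'style', 'noscript', 'iframe', 'svg', 'form',
  'nav', 'header', 'footer', 'aside',
  '[role="navigation"]', '[role="banner"]', '[role="contentinfo"]',
  '[aria-hidden="true"]', '.cookie-banner', '.cookie-consent',
];
for (const sel of REMOVE) $(sel).remove();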

Then pick the content region in this order: main, then article, then [role="main"], and only fall back to body when the page declares none of them. Modern docs frameworks (Docusaurus, VitePress, Mintlify, GitBook) all mark their content region, so the fallback rarely fires.
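
That fallback order is short enough to write as a loop. A sketch, assuming $ is a Cheerio instance already loaded with the page (or anything jQuery-like that supports .first() and .length); pickContentRegion and CONTENT_SELECTORS are names I made up for illustration:

```javascript
// Preferred content regions, most specific first.
const CONTENT_SELECTORS = ['main', 'article', '[role="main"]'];

// Return the first declared content region, or <body> when the page
// declares none of them.
function pickContentRegion($) {
  for (const sel of CONTENT_SELECTORS) {
    const region = $(sel).first();
    if (region.length > 0) return region;
  }
  return $('body');
}
```

With Cheerio, pickContentRegion($).html() then gives you the inner HTML to hand to the markdown converter.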

Convert what is left with Turndown. Two settings matter for LLM consumption: headingStyle: 'atx' so headings survive as # markers your chunker can split on, and codeBlockStyle: 'fenced' so code examples stay intact.
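
For reference, a minimal Turndown setup with those two options. It assumes the turndown package is installed; the sample HTML string is made up for illustration:

```javascript
const TurndownService = require('turndown');

// headingStyle and codeBlockStyle are the two options named above;
// everything else stays at Turndown's defaults.
const turndown = new TurndownService({
  headingStyle: 'atx',       // "# Heading" instead of underlined headings
  codeBlockStyle: 'fenced',  // fenced code blocks instead of indented ones
});

// Made-up sample input: a heading followed by a code block.
const sampleHtml = '<h1>Install</h1><pre><code>npm install turndown</code></pre>';
console.log(turndown.turndown(sampleHtml));
```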

Why markdown and not text

Chunkers split on structure. Markdown keeps the structure (headings, lists, code fences) at a fraction of the token cost of HTML. In practice a docs page that is 40 KB of HTML becomes 2 to 4 KB of markdown with nothing a retriever cares about lost. That is roughly a 10 to 20x cut in the text you pay to embed, before you have tuned anything.

The numbers

I packaged this exact approach as an Apify actor this week. A 20-page crawl of a docs site over plain HTTP finishes in under 5 seconds and costs about $0.0005 in compute. The same crawl through a headless browser takes minutes and costs cents to dimes, mostly to render navigation chrome that gets stripped anyway.

If you want to try the packaged version: Website Content Scraper returns one row per page (markdown, text, or cleaned HTML, plus title, meta description, language, canonical, word count), with include/exclude URL filters, sitemap seeding, and depth caps. The first 2 pages of every run are free, and pages that fail to fetch are never charged.

And if your target really is a JavaScript app, use a browser crawler for that one site. Just stop paying the browser tax on the 80% of sites that never needed it.