DEV Community: Massi

Puppeteer networkidle is not a scraping strategy

Massi — Sat, 23 May 2026 08:42:11 +0000

The most suspicious line in a lot of Puppeteer and Playwright scrapers is not page.goto.

It is this:

await page.goto(url, { waitUntil: "networkidle" })

It looks responsible.

It feels like you are saying:

Wait until the page is done.
Then scrape it.

The problem is that modern web pages are rarely "done."

Analytics keeps firing.

Personalization keeps polling.

Reviews load after the product.

Inventory loads after the variant.

Ads never stop.

Chat widgets wake up late.

Some pages go idle before the content you need appears.

Some never go idle at all.

So your scraper either waits too little, waits too long, or times out for reasons unrelated to the data you wanted.

That is not a scraping strategy.

That is outsourcing correctness to background noise.

I am building Webclaw, a web extraction API, CLI, and MCP server for AI agents and LLM apps. The rule I trust more is:

Do not wait for silence.
Wait for evidence.

Network idle is a browser lifecycle hint

networkidle is useful for some browser automation tasks.

It is not useless.

The problem is treating it like proof that the page contains the thing you need.

It does not mean:

the article body exists
the product price exists
the JSON-LD exists
the review count exists
the SPA finished hydrating
the page is not blocked
the extracted markdown is good

It only means the browser saw a quiet enough network window according to that tool's definition.

That is a very different claim.

For scraping, the readiness condition should come from the target data, not from the browser feeling calm.

The four bad waits

I see these patterns a lot.

1. Wait for network idle

await page.goto(url, { waitUntil: "networkidle" })
const html = await page.content()

This can hang on noisy pages and still miss content on pages that load important data late.

2. Wait a fixed number of seconds

await page.waitForTimeout(5000)

This is honest but crude.

It passes locally, fails in production, and gets worse when latency changes.

3. Wait for any selector

await page.waitForSelector("main")

This is better, but still weak.

main can exist before the content exists.

#root can exist while the app is still empty.

.product can exist as a skeleton loader.

4. Wait for DOMContentLoaded

await page.goto(url, { waitUntil: "domcontentloaded" })

This only tells you the initial document was parsed.

For a client-rendered app, that might be exactly the moment before anything useful happens.

What to wait for instead

You want evidence that the target content exists.

For an article:

headline exists
body has enough text
author or date is present
article schema is present
navigation is not most of the output

For a product page:

title exists
price exists
availability exists
variant data exists
product JSON-LD exists
reviews or rating exists when expected

For docs:

main heading exists
section headings exist
code blocks exist
body text crosses a minimum token count

That readiness check can be a selector, but it should not be just any selector.

It should be tied to the data you came for.

Rendering should be a fallback

Before launching a browser, ask a cheaper question:

Does the initial HTML already contain the content?

If yes, extract it.

If no, classify why.

Useful categories:

blocked response
empty app shell
hydration payload with data
hydration payload without data
bad extraction rule
client-side data dependency

Only one of those clearly says:

render this page

That distinction matters.

If the HTML contains the article, you do not need Chrome.

If the JSON-LD contains the product data, you do not need Chrome.

If the page is blocked, Chrome may not be the first fix.

If the extractor missed the content, rendering gives you a more expensive failure.

This is the longer Webclaw version of the idea:

JavaScript Rendering API for Web Scraping: when browser fallback is actually needed

A better rendering loop

The flow I prefer looks like this:

fetch raw HTML
classify the response
extract available content
score extraction quality
render only if needed
wait for target evidence
extract again
return clean markdown or JSON

When rendering is required, do not wait for generic browser quiet.

Wait for specific evidence:

price text is present
article body crosses 500 words
JSON-LD includes Product or Article
expected API response arrived
skeleton loader disappeared
main content hash stopped changing

That is slower to design than networkidle.

It is also much closer to correctness.

The content-quality check is the missing piece

Even after rendering, you still need to check the output.

The browser can finish.

The DOM can exist.

The extractor can run.

The result can still be trash.

Examples:

80 percent navigation
cookie banner text
empty product fields
missing article body
duplicate footer links
loading skeleton text

This is why a web extraction system should score the cleaned output before returning it as success.

For LLM apps, this matters even more.

Bad context does not just sit in a database.

It becomes an answer, a tool call, a summary, or a RAG chunk.

Where anti-bot fits in

JavaScript rendering and anti-bot handling overlap, but they are not the same problem.

A page can be empty because it needs JavaScript.

A page can be empty because it is blocked.

A page can be empty because your extractor is wrong.

If you collapse all three into "use browser," you lose the ability to debug.

For the anti-bot side:

The API should hide the decision

From the outside, a scraping API should not make the caller guess:

should I pass render=true?
should I wait for network idle?
should I retry in Chrome?
should I parse JSON-LD instead?
should I call the XHR endpoint?

The API should classify the page and choose the cheapest correct path.

In Webclaw, the user-facing call stays boring:

curl -X POST https://api.webclaw.io/v1/scrape \
  -H "Authorization: Bearer $WEBCLAW_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "formats": ["markdown"],
    "only_main_content": true
  }'

If the initial response is enough, return clean markdown.

If the page needs rendering, render.

If the user wants typed fields, extract them with a schema.

Docs:

The rule

Do not wait for the page to be done.

Modern pages are not done.

Wait for the content you need.

Then verify the output you are about to trust.

That one change makes browser scraping less magical and a lot easier to reason about.

Project links:

Your scraper is not failing. It is being lied to.

Massi — Tue, 19 May 2026 07:43:29 +0000

The most dangerous scrape failure is not this:

403 Forbidden

That one is honest.

The dangerous one looks like this:

200 OK
body downloaded
extractor ran
pipeline continued
data is wrong

Your scraper did not fail.

It got lied to.

Maybe the body was a Cloudflare challenge.

Maybe it was a DataDome interstitial.

Maybe it was an empty React shell.

Maybe it was a consent wall.

Maybe it was a page that looked valid enough to pass your status-code check, but not valid enough to contain the data your product actually needed.

That failure is expensive because it does not stop the system.

It keeps moving.

If you are filling a database, you store garbage.

If you are building a RAG pipeline, you embed garbage.

If you are building an AI agent, the agent starts reasoning from garbage and looks confident while doing it.

This is the part of web scraping that most tutorials skip.

They teach you how to fetch.

They do not teach you how to know whether the thing you fetched is real.

I am building Webclaw, a web extraction API, CLI, and MCP server for LLM apps and agents. The rule that keeps paying for itself is simple:

Do not trust a successful request.
Classify the response.

Headless Chrome is not a correctness layer

When scraping gets annoying, the default answer is usually:

Use Playwright.
Launch Chrome.
Wait for network idle.
Extract the DOM.

Sometimes that is the right call.

If the content only exists after JavaScript runs, use a browser.

If the task depends on rendered state, use a browser.

If the page requires interaction, use a browser.

But a browser is not a magic correctness layer.

It does not automatically tell you whether the content is real.

It does not make bad sessions good.

It does not make fake success safe.

It also turns every request into the most expensive request.

That matters at scale.

Most extraction workloads are not fancy browser apps. They are product pages, docs, articles, changelogs, listings, pricing pages, support pages, and blog posts. A lot of them can be fetched, cleaned, and returned without spinning up a browser at all.

The browser should be a fallback.

Not the default.

The missing layer is response classification

A production scraper needs a step between fetch and extract:

URL
-> fetch
-> classify response
-> extract content
-> validate output
-> return markdown or JSON
-> escalate only when needed

That classification layer answers one question:

Did we get the target page, or did we get a defensive artifact pretending to be the page?

Status code is only one signal.

Modern anti-bot systems do not rely on one signal either. They stack reputation, TLS fingerprints, HTTP/2 behavior, browser fingerprints, cookies, JavaScript challenges, timing, and content flow.

Your scraper has to score across layers too.

Signal 1: status code plus challenge markers

Status codes still matter.

These are obvious candidates:

403
429
503

But a bare status code is not enough.

A 403 can mean permission denied.

A 429 can mean rate limit.

A 503 can mean the site is actually down.

The signal gets stronger when paired with challenge markers:

cf-mitigated
__cf_bm
ak_bmsc
/cdn-cgi/challenge-platform/
cf-turnstile
challenges.cloudflare.com

If you see those, do not hand the body to an LLM and call it content.

You did not scrape the page.

You scraped the bouncer.

Signal 2: transport fingerprint mismatch

Changing the User-Agent is not enough.

A request can say:

Mozilla/5.0 Chrome

while the TLS handshake says:

Python HTTP client

Modern anti-bot systems can score the connection before your HTML parser ever sees a byte of the page.

Useful signals include:

JA4 fingerprint
TLS extension order
cipher suite order
ALPN
HTTP/2 SETTINGS
header order
client hints

This is why a lot of older scraping advice aged badly.

Headers are one layer.

The connection is another.

The browser fingerprint is another.

If those layers contradict each other, the site does not need to read your JavaScript to know something is wrong.

I wrote a longer breakdown of this on the Webclaw blog here:

Anti-Bot Scraping API 2026: signals that force browser fallback

Related reads if you are debugging this layer:

Signal 3: tiny HTML that should not be tiny

This one catches a lot of fake success.

You request a product page.

You expect title, price, variants, reviews, JSON-LD, images, availability, breadcrumbs.

You get:

6 KB of HTML
no useful text
no product schema
no JSON-LD
no expected title

That is probably not a small product page.

It is probably an interstitial, shell, wall, or challenge.

The status code might still be 200.

Your extractor might still run.

Your pipeline might still continue.

This is exactly why status-code-only scraping is brittle. The page can look successful from the outside and still be useless inside.

If your tests are still running against toy URLs, I wrote about that trap here:

Stop testing scraping APIs on example.com

Signal 4: anti-bot cookies and headers

Headers and cookies are noisy.

They are still useful.

You can scan for families of defensive artifacts:

cf-ray
cf-mitigated
__cf_bm
ak_bmsc
px cookies
__ddg markers
challenge redirects
WAF body fingerprints

Do not treat one string as a guaranteed verdict.

Treat it as part of a score.

A suspicious cookie plus tiny content plus missing expected schema is much stronger than any of those signals alone.

Good anti-bot detection is not drama.

It is response classification.

Signal 5: JavaScript-only shells

Some pages are not blocked.

They are just empty until JavaScript runs.

That does not mean you need to launch Chrome immediately for every URL.

You can catch many cases cheaply:

empty app root
hydration-only shell
missing __NEXT_DATA__
missing window.__INITIAL_STATE__
Turnstile or reCAPTCHA scripts
data attributes that only hydrate client-side

If the raw response clearly cannot contain the target content, browser fallback is justified.

But the order matters.

Do the cheap check first.

Spend browser time when the page earns it.

Signal 6: extracted content quality

The final check happens after extraction.

Even if the raw HTML did not scream "blocked," the output can still tell you something is wrong.

Examples:

very low token count
missing title
missing expected schema
missing price or article body
too much navigation
too little main content
repeated cookie text

If the cleaned markdown or JSON is empty, thin, or obviously wrong, escalate or fail clearly.

Do not return garbage as success.

Bad data is worse than failed data.

Failed data gets retried.

Bad data gets trusted.

The browser fallback rule

The rule I like is:

Clean fetch first.
Classify the response.
Extract and validate.
Browser only when the site demands it.

This is how we approach it in Webclaw.

The API should be boring from the outside:

curl -X POST https://api.webclaw.io/v1/scrape \
  -H "Authorization: Bearer $WEBCLAW_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "formats": ["markdown"],
    "only_main_content": true
  }'

No manual "please use headless mode" ceremony.

No browser-first tax on every request.

The classifier decides.

If the page is clean, return markdown or structured JSON quickly.

If the page is challenged, empty, or JS-only, escalate.

If the content is not trustworthy, fail clearly.

That is the difference between a scraping API that scales and one that quietly becomes another bottleneck.

The endpoint docs are here if you want the boring API details:

Why this matters more for AI agents

Classic scrapers can survive some bad rows.

AI agents are less forgiving.

If you feed an agent a challenge page, it may summarize the challenge.

If you feed it an empty shell, it may reason from nothing.

If you feed it navigation text, cookie banners, and footer links, it may treat that as context.

The model is downstream.

The fetch layer has to protect it.

Agents do not need raw HTML.

They need trustworthy web context.

That means:

real content
clean markdown
structured JSON
source URL
metadata
typed errors
browser fallback only when needed

Useful related posts:

The boring part of scraping is now the product.

Fetch honestly.

Classify aggressively.

Return clean data.

Escalate only when the site actually forces your hand.

That is the whole game.

If you want to follow the project:

I stopped treating headless Chrome like a scraping strategy

Massi — Fri, 15 May 2026 10:11:46 +0000

Headless Chrome is useful.

It is also where a lot of scraping systems go to become slow, expensive, and impossible to reason about.

I am building webclaw, a web extraction API, CLI, and MCP server for agents and LLM apps. One architecture decision keeps paying for itself:

The browser is a fallback.
Not the default.

That sounds obvious until a target page gets annoying.

Then the reflex kicks in:

Use Playwright.
Launch Chrome.
Wait for network idle.
Extract the DOM.
Ship it.

It works often enough that people confuse it with a strategy.

It is not a strategy.

It is an expensive hammer.

Browser-first scraping demos well

The browser-first pipeline is simple:

URL -> browser -> rendered DOM -> extraction

For demos, this is great.

You point Puppeteer or Playwright at a page. JavaScript runs. The DOM appears. You grab text. Everyone claps.

For some pages, that is exactly what you need.

If the page is an app shell, if the content only appears after client-side requests, if interaction matters, if visual state matters, use a browser.

No argument there.

The problem starts when every page gets treated like that page.

Most URLs in real extraction workloads are not interactive apps. They are docs pages, articles, product pages, pricing pages, changelogs, support pages, marketplace listings, blog posts, or landing pages.

The useful content is often already in the initial response.

Starting Chrome for all of them is paying rent on a mansion to open a mailbox.

The cost shows up late

Headless browser scraping looks fine at small volume.

At production volume, the tax gets very real:

startup time
memory per page
browser pool management
Docker image size
missing system dependencies
crashes
zombie processes
network idle timeouts
lower concurrency

None of this is interesting engineering.

It is just drag.

And if you are building an AI agent, it gets worse. The user is waiting. Five seconds of browser overhead is not hidden in a nightly job. It is part of the product experience.

If you are building a SaaS around extraction, it hits margins too.

Every unnecessary browser session is cost you could have avoided.

Anti-bot is not the same as rendering

This is the mistake that creates a lot of bad scraper architecture.

People mix these two questions:

Can I access the page?
Can I render the page?

They overlap, but they are not the same problem.

A target can block your default HTTP client before JavaScript matters.

A target can return a challenge page with 200 OK.

A target can load in a browser and still be useless because the session, request behavior, or response body is wrong.

On the other side, many pages do not need rendering at all. They need a better fetch path, response classification, and content extraction.

That is why "just use Playwright" is incomplete advice.

It solves one class of rendering problems.

It does not automatically solve trust, fake success, response quality, cost, or extraction output.

The failure that actually hurts

The most dangerous scrape failure is not:

403 Forbidden

That one is honest.

The dangerous failure looks like this:

HTTP 200
body downloaded
extractor ran
pipeline continued
data is wrong

The page did not fail.

It lied.

Maybe the body was a bot challenge.

Maybe it was a login wall.

Maybe it was a consent screen.

Maybe it was an empty JavaScript shell.

Maybe it was a region-specific version with the thing you needed missing.

Your code says success because the status code was green.

Your downstream system gets garbage.

For a normal scraper, that pollutes a database.

For an LLM app, it is worse.

The agent may summarize the challenge page. Your RAG index may embed navigation text. Your research workflow may cite a login wall. The model does not know the fetch layer handed it nonsense.

That is why I care more about response classification than "can it open the page?"

The architecture I trust now

The pipeline I prefer is:

URL
-> browser-like fetch
-> classify the response
-> extract useful content
-> verify the content exists
-> return markdown / JSON / metadata
-> escalate only if needed

The important part is the decision layer.

If the first response is clean, use it.

If it is a challenge, escalate.

If it is an empty app shell, render.

If it is a login wall, fail clearly.

If extraction confidence is low, surface that instead of pretending the scrape worked.

The browser is still there.

It is just not religion.

Browser fallback beats browser-first

Use a browser when:

the content only appears after JavaScript
the page requires interaction
the initial HTML is empty
a challenge needs browser execution
the task depends on rendered state

Do not use a browser just because:

the site is modern
the site uses React
the first naive request failed
you do not want to classify responses
the tutorial used Puppeteer

That distinction changes everything.

It changes latency.

It changes concurrency.

It changes infra cost.

It changes how easy the system is to debug.

Where Webclaw fits

This is the shape I am building into webclaw.

The public interface should be boring:

send URL
get clean content
move on

For API use:

curl -X POST https://api.webclaw.io/v1/scrape \
  -H "Authorization: Bearer $WEBCLAW_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "formats": ["markdown"],
    "only_main_content": true
  }'

From TypeScript:

import { Webclaw } from "@webclaw/sdk";

const client = new Webclaw({
  apiKey: process.env.WEBCLAW_API_KEY!,
});

const page = await client.scrape({
  url: "https://example.com",
  formats: ["markdown"],
  only_main_content: true,
});

console.log(page.markdown);

The point is not that users should configure every layer themselves.

The point is that the API should return useful context or fail clearly.

Not "200 OK, here is a challenge page, good luck."

This matters more for agents

AI agents do not need raw HTML most of the time.

They need clean context:

title
main content
links
tables
metadata
source URL
structured fields when requested

They also need the tool layer to be honest.

If a page could not be fetched, say that.

If the content is missing, say that.

If browser rendering is needed, escalate.

But do not dump whatever HTML came back into the model and hope the LLM figures it out.

That is not an agent strategy.

That is denial with tokens attached.

The rule

The rule I use now:

Fetch first.
Classify the response.
Extract useful content.
Escalate only when the page proves it needs it.

Headless Chrome is still useful.

It is just too expensive to be the first answer to every scraping problem.

I wrote the longer breakdown here:

Anti-bot scraping API: browser fallback beats browser-first

And the previous post in this series is here:

Raw HTML is where LLM context goes to die

Webclaw: https://webclaw.io

Raw HTML is where LLM context goes to die

Massi — Wed, 13 May 2026 16:31:53 +0000

The fastest way to make an AI agent look stupid is to give it too much web page.

Not too little.

Too much.

I have seen this pattern over and over while building webclaw, a web extraction API, CLI, and MCP server for agents and LLM apps:

Fetch a URL.
Send the HTML to the model.
Ask for a summary, answer, extraction, or decision.
Wonder why the output is noisy.

It feels reasonable at first.

HTML is the source, right? More source means more context. More context means better answer.

Except that is usually not what happens.

Most raw HTML is not content. It is layout, navigation, tracking, hydration payloads, cookie banners, duplicated links, CSS class soup, script tags, modals, footer links, and invisible app state.

The model does not know which parts are expensive junk and which parts are the actual page.

You paid for all of it anyway.

The bad pipeline

This is the pipeline I see a lot:

URL -> fetch -> raw HTML -> LLM

It is simple. It demos well. It works on tiny pages.

Then you point it at real sites.

Suddenly your model is reading navigation, footers, scripts, cookie banners, duplicated links, hidden mobile markup, and a tiny slice of useful content buried somewhere in the middle.

If you are building a scraper, this is annoying.

If you are building an agent, it is worse.

The agent is not just parsing text. It is using that text to decide what to do next.

Bad context becomes bad behavior.

HTML is not neutral input

Raw HTML has a few failure modes that are easy to miss.

1. Token waste

The most obvious problem is cost.

If the useful page content is 900 words and the HTML payload is 120,000 characters, you are paying to process a lot of noise.

That noise can include:

navigation
footers
CSS class names
tracking snippets
JSON state blobs
cookie banners
related posts
ads
duplicated links
accessibility labels
hidden mobile markup

Large context windows made this worse in a funny way.

When context was small, everyone had to think about what to send.

Now it is tempting to throw the whole page into the prompt and call it engineering.

2. The model sees structure you did not mean to give it

HTML carries structure, but not always the structure you care about.

A model might see the same navigation text on every page and treat it as important. It might mix footer links into extracted results. It might preserve irrelevant menu text because it appears before the article.

This is how you get summaries that start with:

This page discusses pricing, docs, login, careers...

No, it does not.

The navigation did.

3. Boilerplate poisons retrieval

For RAG, this gets nasty.

Imagine crawling 200 documentation pages and chunking raw or poorly cleaned text.

Every chunk gets some version of:

Home
Docs
API Reference
Pricing
Contact
Sign in

Now your vector database contains hundreds of chunks with the same boilerplate.

Search quality drops because the repeated text becomes part of the retrieval surface.

The model retrieves pages because they share layout text, not because they answer the question.

This is the part that feels invisible until the system gets just big enough to be frustrating.

4. The page can be successfully fetched and still be useless

This is the one that made me care about extraction quality more than status codes.

A fetch can return 200 OK and still give you:

an empty app shell
a bot challenge
a login wall
a consent screen
a region block
a page where the useful content lives in a hydration blob

From the outside, your code worked.

From the model's point of view, the context is garbage.

This is why I do not think the right question is:

Can I fetch this page?

The better question is:

Can I return useful context from this page?

Markdown is usually a better interface

Markdown is not magic.

But for LLMs, clean markdown is often a much better intermediate format than HTML.

Good markdown keeps the parts models care about:

headings
paragraphs
lists
tables
links
code blocks
source URL
title
metadata

And removes the parts they usually do not:

layout wrappers
nav junk
tracking scripts
style tags
repeated footer text
hidden UI
duplicated link blocks

The goal is not to make the page pretty.

The goal is to make the page usable as context.

The better pipeline

For most agent and RAG workflows, I prefer this shape:

URL -> fetch -> detect bad responses -> extract main content -> markdown or JSON -> LLM

That gives the model something closer to what a human would copy into notes before asking for help.

Not the whole browser document.

The actual thing.

For example, if I am building an agent that needs to inspect a docs page, I do not want it to reason over the entire DOM.

I want something closer to this:

# Authentication

Use a bearer token in the Authorization header.

Authorization: Bearer <token>

## Rate limits

Free accounts...

Not this:

<!doctype html>
<html>
  <head>
    ...
  </head>
  <body>
    <div id="__next">
      ...
    </div>
    <script src="/_next/static/chunks/..."></script>
  </body>
</html>

Where Webclaw fits

This is one of the reasons I am building webclaw.

The point is not just to fetch a page.

The point is to give an agent or LLM app clean web context in a useful shape:

curl -X POST https://api.webclaw.io/v1/scrape \
  -H "Authorization: Bearer $WEBCLAW_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "formats": ["markdown"],
    "only_main_content": true
  }'

Or from TypeScript:

import { Webclaw } from "@webclaw/sdk";

const client = new Webclaw({
  apiKey: process.env.WEBCLAW_API_KEY!,
});

const page = await client.scrape({
  url: "https://example.com",
  formats: ["markdown"],
  only_main_content: true,
});

console.log(page.markdown);

That output is easier to summarize, chunk, embed, cite, diff, and pass into an agent loop.

Webclaw also has an MCP server, so tools like Claude, Cursor, and other MCP-compatible clients can ask for web context directly instead of pasting random HTML into the conversation.

The interface I want is boring:

agent asks for page
tool returns clean context
agent keeps working

Boring is good here.

When raw HTML is still useful

There are cases where raw HTML is exactly what you want.

If you are debugging extraction, writing selectors, preserving layout, auditing scripts, or reverse engineering page structure, raw HTML matters.

But that is not the same as saying raw HTML is the best input for the model.

Most of the time, the model does not need the DOM.

It needs the meaning.

The rule I use now

I stopped treating raw HTML as the default context format.

My current rule is:

Fetch broadly.
Extract aggressively.
Preserve structure.
Send the model the smallest useful version of the page.

That one change makes agents cheaper, faster, and less confused.

It also makes failures easier to see. If the extractor returns an empty page, a challenge, or obvious boilerplate, you can handle that before the model hallucinates a useful answer from junk.

The bigger shift

Web scraping used to be mostly about getting data out of websites.

For LLM apps, it is becoming context infrastructure.

That means the extraction layer has to care about things that old scraper scripts could ignore:

main content detection
markdown quality
metadata
source links
tables
code blocks
bad response detection
chunkability
agent tool interfaces

If your app is doing this:

URL -> raw HTML -> LLM

You can probably get a better result with:

URL -> clean markdown or JSON -> LLM

Raw HTML feels like the source of truth.

For agents, it is often just noise with angle brackets.

I wrote more about the extraction side here:

HTML to Markdown for LLMs

And the previous post in this series is here:

I stopped using headless Chrome as the default scraper

Webclaw: https://webclaw.io

I stopped using headless Chrome as the default scraper

Massi — Sat, 09 May 2026 12:06:58 +0000

Headless Chrome is useful.

It is also overused.

For years, the default answer to “this page is hard to scrape” has been some version of:

Use Puppeteer.
Use Playwright.
Add stealth.
Wait for the page.
Extract the DOM.

That works often enough that it became muscle memory. But using a browser as the first step for every page is expensive, slow, operationally annoying, and frequently unnecessary.

I’m building webclaw, a web extraction API, CLI, and MCP server for AI agents. One of the biggest architecture decisions was this:

Do not make the browser the default path.

The browser is an escalation path. Not the baseline.

Why Browser-First Scraping Became The Default

The web changed.

Static HTML became React, Next.js, SPAs, hydration payloads, infinite scroll, client-side routing, consent banners, and heavily instrumented frontend apps.

So scrapers adapted.

Instead of fetching HTML and parsing it, developers started launching a real browser:

URL -> Puppeteer/Playwright -> Chrome -> rendered DOM -> extraction

That made sense. A browser gives you:

JavaScript execution
a real DOM
navigation behavior
cookies and sessions
screenshots
interaction support

For some pages, you need that.

The mistake is treating those pages as the default case.

Why Browser-First Breaks Down

Headless Chrome has a cost profile that looks fine in demos and painful in production.

1. Startup Cost

Launching a browser is not free.

Even if you reuse instances, you still pay for process management, page creation, memory, timeouts, crashes, and cleanup.

For a one-off scrape, maybe that’s fine.

For agents, RAG ingestion, batch scraping, or crawl jobs, it adds up fast.

2. Memory And Concurrency

Chrome is heavy.

If your scraper needs to handle a list of URLs, you eventually hit practical limits:

how many pages can run at once?
how many browser contexts can stay alive?
how many failures are caused by your scraper, not the target site?
how much infra are you burning just to read mostly static documents?

That matters when the output you wanted was just clean markdown.

3. CI And Deployment Pain

Browser stacks are fragile in boring ways.

You deal with:

missing system libraries
browser binary downloads
sandbox flags
font/rendering differences
Docker image size
platform-specific bugs
random timeouts

None of this is intellectually interesting. It is just drag.

4. The Browser Does Not Automatically Solve Blocking

This is the part people learn the hard way.

Launching Chrome does not magically make traffic look trustworthy.

Modern bot protection systems look at many signals. Some are visible in the browser. Some happen before your JavaScript ever runs.

At a high level, systems may look at things like:

network-level request behavior
header shape
client hints
IP and network reputation
request timing
session history
whether the page response is a real document or a challenge

That does not mean “never use a browser”.

It means “browser” and “trusted request” are not the same thing.

What Replaced It

The architecture I prefer is an escalation ladder.

Start with the cheapest path that can produce correct content.

Only move to heavier paths when the response proves you need them.

The rough shape:

Step	Path	Why it exists
1	Browser-like fetch	Cheapest path for SSR pages, docs, blogs, metadata, and data islands.
2	Content extraction	Turn the useful parts into markdown, text, JSON, metadata, and links.
3	Bad-response detection	Catch empty shells, challenge pages, login walls, and blocked content.
4	JavaScript rendering	Use it only when useful content is missing from the fetched response.
5	Browser fallback	Last resort for pages that genuinely require browser behavior.

The important part is not one magic trick.

The important part is not paying the browser tax for pages that never needed a browser.

The Fetch-First Path

Many pages already contain the useful content before frontend JavaScript runs.

It may be in:

server-rendered HTML
article body markup
JSON-LD
Open Graph metadata
Next.js or React hydration payloads
embedded CMS data
documentation markup

If you can fetch the page correctly and extract the main content, you can often return useful markdown without launching Chrome.

The pipeline looks more like this:

URL -> browser-like fetch -> HTML/data islands -> extractor -> markdown/JSON

Compared to:

URL -> browser -> rendered DOM -> extractor -> markdown/JSON

Browser-first	Fetch-first
URL	URL
Playwright or Puppeteer	Browser-like fetch
Chrome runtime	HTML plus data islands
Rendered DOM	Content extractor
Markdown or JSON	Markdown or JSON
Good when interaction is required	Good as the default path
Expensive when used for every page	Browser only when the page proves it

This matters for AI agents because they usually do not need the visual page.

They need the content.

Why This Matters More For LLM Apps

Traditional scraping often wants a database row:

name
price
rating
availability

LLM apps want something different.

They want context.

For agents and RAG pipelines, bad extraction does not always look broken. It can look clean and still be wrong.

Examples:

the page was a bot challenge, but the agent summarized it anyway
the docs page loaded an empty shell
the markdown included nav text repeated across every page
the pricing table lost its structure
the source URL or title disappeared
a crawler pulled 100 low-value pages and missed the docs that mattered

That is why I care less about “can it fetch?” and more about:

Can it return useful, structured context?

For webclaw, the target shape is:

URL -> clean markdown / JSON / metadata -> agent or RAG pipeline

A Small Example

Using the API:

curl -X POST https://api.webclaw.io/v1/scrape \
  -H "Authorization: Bearer $WEBCLAW_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "formats": ["markdown"],
    "only_main_content": true
  }'

Using TypeScript:

import { Webclaw } from "@webclaw/sdk";

const client = new Webclaw({ apiKey: process.env.WEBCLAW_API_KEY! });

const result = await client.scrape({
  url: "https://example.com",
  formats: ["markdown"],
  only_main_content: true,
});

console.log(result.markdown);

For agent workflows, webclaw also ships an MCP server, so tools like Claude Code, Cursor, and other MCP-compatible clients can call scrape, crawl, map, batch, extract, summarize, diff, brand, search, and research.

That is the interface I wanted:

agent asks for a URL
tool returns clean context
agent keeps working

Honest Limits

This architecture does not remove the need for browsers.

Some pages require real browser sessions.

Some flows require login.

Some sites should not be scraped.

Some pages have interaction-dependent content that a fetch-first approach will never see.

The point is not “never use Chrome”.

The point is:

Do not launch Chrome until the page proves it needs Chrome.

That one rule changes cost, latency, concurrency, and reliability.

The Bigger Lesson

Web scraping is moving from selector scripts to context infrastructure.

AI agents and RAG pipelines do not just need data.

They need clean, fresh, source-linked web context in a shape models can use.

That means the extraction layer has to care about:

fetch quality
challenge detection
main content extraction
metadata
markdown quality
structured JSON
crawling boundaries
cost
latency
agent tool interfaces

That is what I’m building into webclaw.

If your workflow is:

URL -> clean markdown/JSON -> agent or RAG pipeline

you might find it useful.

Website: https://webclaw.io

GitHub: https://github.com/0xMassi/webclaw

MCP Web Scraping: Give Claude and Cursor Real Web Access

Massi — Thu, 16 Apr 2026 16:38:43 +0000

Your AI agent can write code, analyze documents, query databases, and hold long conversations. But ask it to check a competitor's pricing page, read the latest docs for a framework, or pull product specs from a supplier's website, and it hits a wall. It can't read the web.

This is the gap that MCP closes. And web scraping is the use case that makes it obvious.

What MCP actually is

MCP stands for Model Context Protocol. It's an open standard that lets AI models call external tools. Think of it like USB for AI. Before USB, every peripheral needed its own driver, its own connector, its own software. MCP does the same thing for AI tools: one protocol, any tool, any model.

The model describes what tools are available. The user (or the model itself) decides when to call one. The tool runs, returns data, and the model keeps going with the new context.

Claude Desktop, Claude Code, Cursor, Windsurf, and a growing list of other clients support MCP natively. You install an MCP server, it shows up as a set of tools your AI can call, and that's it. No API wiring, no middleware, no custom code.

The MCP SDK crossed 97 million monthly downloads. This is not experimental anymore.

Why web data is the killer MCP use case

Most MCP tools are wrappers around APIs. Connect to Slack, read a GitHub issue, query a database. Useful, but limited to services you already have access to.

Web scraping is different. It gives your AI access to the entire public web. Any URL, any page, any site. The agent decides what to read based on the conversation, not a predefined list.

This changes what agents can do.

An agent helping you evaluate SaaS tools can read their actual pricing pages instead of relying on its training data from months ago. An agent writing documentation can crawl the framework's latest docs. An agent doing competitive research can pull real numbers from public filings and product pages.

Without web access, agents are limited to what they already know. With web access, they can go find what they need. That's a fundamental capability shift.

Setting it up

webclaw ships an MCP server called webclaw-mcp with 8 tools. Install it once and your AI gets scraping, crawling, search, sitemap discovery, structured extraction, summarization, content diffing, and brand extraction.

Add this to your Claude Desktop config:

{
  "mcpServers": {
    "webclaw": {
      "command": "webclaw-mcp"
    }
  }
}

Restart Claude Desktop. The tools appear in the tool menu. Your AI can now call them during any conversation.

For Claude Code, same config in your project's .mcp.json. For Cursor, add it to the MCP settings panel.

No API key needed for the local server. It runs on your machine, uses its own HTTP client with TLS fingerprinting, and returns clean markdown. If you want to use the cloud API instead (for higher concurrency, JavaScript rendering, or anti-bot bypass), set the WEBCLAW_API_KEY environment variable and add --cloud to the command.

What the tools do

scrape reads a single URL and returns clean content. You control the format: markdown for full fidelity, llm for token-optimized output, text for plain text, json for structured metadata. The agent picks the format based on what it needs.

crawl follows links from a starting URL. It discovers pages across the site, extracts each one, and returns the full set. Useful for ingesting documentation sites, mapping a competitor's product catalog, or building a knowledge base from a company's blog.

search queries the web and returns results with snippets. When the agent needs to find information but doesn't have a specific URL, it searches first, then scrapes the most relevant results. This is how research workflows start.

map discovers all URLs on a site without scraping them. It reads the sitemap, follows internal links, and returns a clean list. The agent uses this to understand the structure of a site before deciding what to extract.

extract pulls structured data from a page using a JSON schema. The agent describes the shape of data it wants (product names and prices, contact information, event dates), and the extraction engine returns exactly that. No regex, no selectors, no brittle parsing.

summarize condenses a page into a short summary. When the agent needs the gist of an article but not the full content, this saves tokens and keeps the context window focused.

diff compares a page against a previous snapshot. The agent uses this to detect content changes: updated pricing, new product listings, modified documentation.

brand extracts visual identity from a page: colors, fonts, logos, favicons, OG images. Useful for design tools, competitive analysis, or generating brand-consistent content.

How agents actually use these

The tools are simple. What makes them powerful is how agents chain them together.

Research workflow. You ask: "Compare the pricing of webclaw, firecrawl, and scrapingbee." The agent calls search to find each pricing page. Calls scrape on each result. Extracts the relevant pricing data. Compares them in a table. All within one conversation, all with live data.

Documentation ingestion. You say: "Read the Next.js App Router docs and explain how middleware works." The agent calls map on nextjs.org/docs to find all doc pages. Calls crawl to extract the middleware-related pages. Reads the content and explains it with references to the actual documentation.

Content monitoring. You run a daily check: "Has the pricing changed on these three competitor pages?" The agent calls diff against stored snapshots. Reports what changed. Stores the new snapshots for next time.

Lead enrichment. You pass a list of company URLs. The agent calls extract on each with a schema for company name, tech stack, team size, and recent news. Returns a structured spreadsheet of enriched data.

None of this requires custom code. The agent figures out which tools to call and in what order. You describe the outcome you want in plain language.

What works well and what doesn't

MCP web scraping works best for focused, real-time extraction. Read a page, get the data, move on. The latency is low enough (100-300ms per page for static content) that it feels seamless in a conversation.

It works less well for massive scale. If you need to scrape 10,000 pages, doing it through MCP one conversation turn at a time is slow. For that, use the REST API directly with the batch or crawl endpoints, then bring the results into your agent's context.

JavaScript-heavy SPAs (React apps with client-side rendering only) sometimes return empty content through the local MCP server because it doesn't run a browser engine. The cloud API handles these through server-side JavaScript rendering, so if you're hitting SPAs, use --cloud.

Anti-bot protected sites (Cloudflare, DataDome) work fine with the TLS fingerprinting in most cases. For the hardest sites that require CAPTCHA solving, the cloud API has an antibot sidecar that handles it.

The MCP protocol itself has a limitation worth knowing: tool results are injected into the model's context window. A scrape that returns 5,000 tokens of content consumes 5,000 tokens of context. For long conversations or multi-page research, the context fills up. Using llm format instead of markdown helps because it returns 67% fewer tokens for the same content.

Beyond Claude

MCP is not Claude-specific. Any client that supports the Model Context Protocol can use webclaw-mcp. Cursor, Windsurf, Continue, and other coding tools already support MCP. OpenAI has announced MCP support. The ecosystem is converging on this standard.

This matters because the tool you install today works with every client that adopts MCP tomorrow. You're not locked into one vendor's tool ecosystem.

Getting started

Install webclaw:

npx create-webclaw

Or download a prebuilt binary from the releases page. The webclaw-mcp binary is included.

Add the config to your AI client. Start a conversation. Ask your agent to read a webpage. It will call scrape, get the content, and work with it like it was always there.

If you want the cloud API for JavaScript rendering, anti-bot bypass, and higher concurrency, sign up at webclaw.io and set your API key in the MCP config.

The MCP server is open source and AGPL-3.0 licensed. The cloud API has a free tier with 500 pages per month.

Check the MCP documentation for the full tool reference and advanced configuration.

How to turn any webpage into structured data for your LLM

Massi — Thu, 02 Apr 2026 11:37:07 +0000

Your LLM can reason, write code, and hold long conversations. Ask it to read a webpage and it falls apart. Either it can't access the URL at all, or you feed it raw HTML and burn 50,000 tokens on navigation bars, cookie banners, and CSS class names.

I've been building webclaw to fix this. It's a web extraction engine written in Rust that turns any URL into clean, structured content. No headless browser. No Selenium. Just HTTP with browser-grade TLS fingerprinting.

My first post covered how the TLS bypass works. This one covers what happens after you get the HTML: turning it into something an LLM can actually use.

The token waste problem

A typical webpage is 50,000 to 200,000 tokens of raw HTML. The actual content, the article text, the product info, the documentation, is usually 500 to 2,000 tokens. The rest is structure, styling, and UI elements that your LLM processes, reasons over, and bills you for.

If you're building a RAG pipeline, those noisy tokens pollute your vector space. Your embeddings model creates vectors for "Home | About | Contact | Blog" that compete with the actual content. Retrieval quality drops.

If you're running an agent that reads pages in a conversation, every wasted token eats context window. By page three, your agent is losing track of the conversation because the context is full of footer links.

webclaw runs a 9-step optimization pipeline that strips this noise:

Navigation, footers, cookie banners, sidebars removed
Decorative images collapsed (logo clusters become one line)
Bold/italic markers stripped (visual weight, not semantic)
Links deduplicated and collected at the end
Stat blocks merged ("100M+" and "monthly requests" become one line)
CSS artifacts and leaked framework code cleaned out

The result: 67% fewer tokens on average. On marketing pages with hero sections and testimonial carousels, it's 85-90%.

# Get LLM-optimized output from any URL
curl -X POST https://api.webclaw.io/v1/scrape \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com", "format": "llm"}'

Or with the CLI:

webclaw https://example.com -f llm

Read the full breakdown: HTML to Markdown for LLMs

Structured extraction: get fields, not text

Sometimes you don't need the full content. You need three fields from a product page. A price, a name, whether it's in stock.

The traditional approach is CSS selectors. Find the element, grab the text. Works until the site redesigns and your product-price class becomes pdp-price-container. Your pipeline breaks at 3am.

webclaw's /v1/extract endpoint takes a different approach. You define a JSON schema of what you want. The engine fetches the page, cleans it, and uses an LLM to extract the matching fields.

curl -X POST https://api.webclaw.io/v1/extract \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://store.example.com/product/headphones",
    "schema": {
      "type": "object",
      "properties": {
        "product_name": {"type": "string"},
        "price": {"type": "number"},
        "currency": {"type": "string"},
        "in_stock": {"type": "boolean"},
        "rating": {"type": "number"}
      }
    }
  }'

Response:

{
  "data": {
    "product_name": "Sony WH-1000XM5",
    "price": 279.99,
    "currency": "USD",
    "in_stock": true,
    "rating": 4.7
  }
}

Same schema works on any product page regardless of their frontend framework. The site can redesign completely and extraction still works because you're extracting meaning, not DOM positions.

If you don't want to define a schema upfront, you can use a plain English prompt instead:

{
  "url": "https://company.com/about",
  "prompt": "Find the founding year, number of employees, and what the company does"
}

Building a RAG pipeline with live web data

Most RAG tutorials show you how to upload a PDF and ask questions. That's a demo, not a product. Real applications need live data. Documentation gets updated. Pricing changes. Blog posts get published.

A RAG pipeline with web data has four steps:

1. Fetch the page. Half the web is behind Cloudflare or JavaScript rendering. webclaw handles TLS fingerprinting and JS rendering automatically.

2. Extract the content. This is where most pipelines fail. Bad extraction means noisy embeddings. Noisy embeddings mean irrelevant retrieval. webclaw's LLM format gives you clean content with zero noise.

3. Chunk and embed. Since webclaw returns markdown, you can split on headings for semantically coherent chunks instead of arbitrary character counts.

import re

def split_by_headings(markdown, max_chunk=1500):
    sections = re.split(r'\n(?=#{1,3} )', markdown)
    chunks = []
    for section in sections:
        if len(section) > max_chunk:
            paragraphs = section.split('\n\n')
            current = ""
            for p in paragraphs:
                if len(current) + len(p) > max_chunk and current:
                    chunks.append(current.strip())
                    current = p
                else:
                    current += "\n\n" + p
            if current.strip():
                chunks.append(current.strip())
        else:
            chunks.append(section.strip())
    return [c for c in chunks if len(c) > 50]

4. Keep it fresh. webclaw's /v1/diff endpoint tracks content changes between snapshots. Crawl your sources on a schedule, diff against the last version, only re-embed pages that actually changed. No wasted compute.

For bulk ingestion, /v1/crawl discovers all pages on a site and /v1/batch extracts them in parallel.

Read the full guide: Build a RAG pipeline with live web data

MCP: give your AI agent web access

MCP (Model Context Protocol) is an open standard that lets AI models call external tools. Think of it like USB for AI. One protocol, any tool, any model.

webclaw ships an MCP server with 8 tools:

scrape — read any URL, get clean content
crawl — follow links across a site, extract everything
search — web search and scrape results
map — discover all URLs on a site via sitemap
extract — structured data with a JSON schema
summarize — condense a page to key points
diff — detect content changes
brand — extract colors, fonts, logos

Set it up in Claude Desktop:

{
  "mcpServers": {
    "webclaw": {
      "command": "webclaw-mcp"
    }
  }
}

Or auto-configure for Claude, Cursor, Windsurf, Codex:

npx create-webclaw

Now your AI agent can read any URL during a conversation. You ask "compare the pricing of these three SaaS tools" and the agent scrapes each pricing page, extracts the data, and builds a comparison table. No custom code.

The MCP SDK crossed 97 million monthly downloads. This is not experimental anymore. Claude Desktop, Claude Code, Cursor, Windsurf, and OpenAI all support it.

Content monitoring and change detection

If you're tracking competitors, monitoring documentation, or keeping a knowledge base fresh, you need to know when pages change.

webclaw's /v1/diff endpoint compares a page against a previous snapshot and tells you exactly what changed. Combine this with /v1/crawl on a schedule and you have a content monitoring pipeline:

Crawl your sources daily
Diff each page against the last snapshot
Re-embed only the pages that changed
Alert on significant changes

This is how you keep a RAG pipeline fresh without re-embedding everything on every cycle.

Web search built in

Sometimes your agent doesn't have a URL. It needs to find information first.

webclaw's /v1/search endpoint queries the web and returns results with snippets. Chain it with /v1/scrape and you go from a query to structured content in two calls.

curl -X POST https://api.webclaw.io/v1/search \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"query": "best rust web frameworks 2026", "num_results": 5}'

The agent searches, picks the most relevant results, scrapes them, and synthesizes an answer. All with live data, not training data from months ago.

The full stack

webclaw is a Rust workspace with six crates. The core extraction engine has zero network dependencies and is WASM-safe. The CLI, REST API server, and MCP server are separate binaries built on the same engine.

Install the CLI:

cargo install webclaw

Or pull the Docker image:

docker run --rm ghcr.io/0xmassi/webclaw https://example.com

The cloud API at webclaw.io adds JavaScript rendering, anti-bot bypass, LLM extraction, and higher concurrency. Free tier: 500 pages/month, no credit card.

SDKs for Python, TypeScript, and Go are coming soon.

What's next

I'm working on deep research (multi-step web research with LLM synthesis), webhook notifications for content changes, and expanding the MCP toolset.

If you're building LLM applications that need web data, give it a try. The repo is at github.com/0xMassi/webclaw. Star it if it saves you time, open an issue if something breaks.

webclaw.io | Docs | Discord

I built a web scraper in Rust that bypasses Cloudflare without a browser

Massi — Tue, 24 Mar 2026 11:39:41 +0000

Every AI agent has the same problem. You ask it to read a webpage and it comes back with a 403, or worse, 5000 tokens of navigation bars and cookie banners.

I spent the last few months building webclaw to fix this.

The problem

Try fetching any real website with a standard HTTP client. Most of them will block you. Cloudflare, Akamai, DataDome, they all look at your TLS fingerprint before the request even reaches the server.

The usual fix is spinning up a headless Chrome. That works, but now you need 500MB of browser, it takes 2-3 seconds per page, and you still get all the HTML noise.

What webclaw does differently

Instead of launching a browser, webclaw impersonates one at the TLS level. The TCP handshake, cipher suites, extensions, everything looks like Chrome 142. Most anti-bot systems pass the request through because the fingerprint is already valid.

Then the extraction engine scores every DOM node by text density, semantic tags, and link ratio. Navigation, ads, footers, cookie banners get stripped. What comes out is clean markdown.

A real example: a news article that is 4,820 tokens as raw HTML becomes 1,590 tokens after webclaw processes it. Same content, 67% less tokens.

Architecture

webclaw is a Rust workspace with 6 crates:

webclaw-core    pure extraction, zero network deps, WASM-safe
webclaw-fetch   HTTP + TLS fingerprinting via primp
webclaw-llm     LLM provider chain (Ollama > OpenAI > Anthropic)
webclaw-pdf     PDF text extraction
webclaw-cli     CLI binary
webclaw-mcp     MCP server for AI agents

The split between core and fetch was intentional. webclaw-core takes a &str of HTML and returns structured output. No I/O, no network calls, no allocator tricks. It should compile to WASM without changes.

Extraction speed on the core alone (no network):

Page size	Time
10 KB	0.8ms
100 KB	3.2ms
500 KB	12.1ms

How to use it

CLI

# basic extraction
webclaw https://example.com

# different output formats
webclaw https://example.com -f json
webclaw https://example.com -f llm

# crawl a docs site
webclaw https://docs.example.com --crawl --depth 2

# extract structured data with LLM
webclaw https://example.com --extract-prompt "get all pricing tiers"

# track page changes
webclaw https://example.com -f json > snapshot.json
webclaw https://example.com --diff-with snapshot.json

MCP server (for Claude, Cursor, Windsurf, Codex)

npx create-webclaw

One command. It detects what AI tools you have installed and writes the config for each one. After restart you get 10 tools: scrape, crawl, search, extract, summarize, brand, diff, map, batch, research.

Docker

docker run --rm ghcr.io/0xmassi/webclaw https://example.com

128 MB image. Works on any machine.

Benchmarks

Tested on 50 real pages across news sites, documentation, e-commerce, SPAs, and blogs.

Metric	webclaw	readability	trafilatura	newspaper3k
Extraction accuracy	95.1%	83%	80%	66%
Noise removal	96.1%	79%	73%	61%

The biggest wins are on JavaScript heavy sites. When the visible DOM is empty because content is in embedded JSON (Next.js, React SSR payloads), webclaw has a data island extractor that pulls content from __NEXT_DATA__, window.__data, and similar patterns. Most other tools return nothing.

What I learned building this

TLS fingerprinting is fragile. Chrome updates their cipher suites every few versions and you have to keep up. I am using primp, which maintains patched forks of rustls, hyper, and h2. It works well but it is a maintenance burden. If Chrome ships a new TLS extension tomorrow, requests start getting blocked until the forks are updated.

The extraction scoring took the most iteration. Early versions were too aggressive and would strip content that looked like navigation (short paragraphs with links). The fix was a semantic bonus system: nodes inside <article> or <main> tags get a score boost, nodes with content-related class names get another boost. Combined with link density penalties, it handles most layouts without site-specific rules.

Try it

MIT licensed, fully open source.

GitHub: https://github.com/0xMassi/webclaw
Website: https://webclaw.io
Discord: https://discord.gg/KDfd48EpnW

If you run into a site that webclaw fails on, open an issue. Every edge case makes the extraction better.

Shipping a Production macOS App with Tauri 2.0: Code Signing, Notarization, and Homebrew

Massi — Mon, 09 Feb 2026 10:53:54 +0000

There are plenty of tutorials on building a Tauri app. Very few tell you what happens after npm run tauri build.

I recently shipped Stik, a note-capture app for macOS built with Tauri 2.0. The app itself took a few days to build. Getting it properly signed, notarized, distributed through Homebrew, and auto-updating took longer than I expected.

This post covers everything I learned. If you're building a Tauri app and plan to ship it to real users on macOS, this should save you a few days of pain.

The problem

You've built your Tauri app. It runs great in tauri dev. You run tauri build and get a .dmg. You send it to a friend. They open it and macOS says:

"App is damaged and can't be opened. You should move it to the Trash."

That's because your app isn't code signed or notarized. Apple requires both for any app distributed outside the App Store. Without them, macOS Gatekeeper blocks your app on every machine except yours.

This is where most Tauri tutorials stop and most developers get stuck.

What you actually need

Getting a Tauri app to users on macOS requires four things beyond building the binary:

Code signing: proves the app comes from a verified developer
Notarization: Apple scans the binary for malware and issues a ticket
Distribution: a way for users to install it (Homebrew, DMG, or both)
Auto-updates: so users don't get stuck on old versions forever

Let's go through each one.

Step 1: Apple Developer setup

You need an Apple Developer account ($99/year). There's no way around this for distribution outside the App Store.

Once enrolled, you need two things.

A Developer ID Application certificate. Go to Certificates, Identifiers & Profiles in your developer account. Create a "Developer ID Application" certificate. Download it and install it in your Keychain. This is what signs your app.

An app-specific password. Go to appleid.apple.com, sign in, and generate an app-specific password under Security. This is used by the notarization tool to authenticate with Apple's servers.

Export your signing certificate as a .p12 file from Keychain Access. You'll need it for CI.

Step 2: Configure Tauri for signing

In your tauri.conf.json, make sure the bundle identifier is set:

{
  "bundle": {
    "identifier": "ink.stik.app",
    "macOS": {
      "signingIdentity": "Developer ID Application: Your Name (TEAMID)",
      "entitlements": "./Entitlements.plist"
    }
  }
}

Create an Entitlements.plist in your src-tauri/ directory:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
  "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>com.apple.security.cs.allow-jit</key>
    <true/>
    <key>com.apple.security.cs.allow-unsigned-executable-memory</key>
    <true/>
    <key>com.apple.security.cs.allow-dyld-environment-variables</key>
    <true/>
</dict>
</plist>

These entitlements are needed because Tauri uses a WebView that requires JIT compilation. Without them, the app will crash on launch after notarization.

Step 3: The CI/CD pipeline

This is where it all comes together. One GitHub Actions workflow, triggered by a git tag, does everything:

Builds the Swift sidecar (if you have one) as a universal binary
Builds the Tauri app for both aarch64-apple-darwin and x86_64-apple-darwin
Signs the binary with your Developer ID
Submits it to Apple for notarization
Uploads the signed .dmg to GitHub Releases
Updates the Homebrew tap

Here's the structure:

name: Release

on:
  push:
    tags:
      - 'v*'

jobs:
  build-and-release:
    runs-on: macos-latest
    permissions:
      contents: write

    steps:
      - name: Checkout
        uses: actions/checkout@v4
        with:
          submodules: recursive

      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: npm

      - name: Setup Rust
        uses: dtolnay/rust-toolchain@stable
        with:
          targets: aarch64-apple-darwin,x86_64-apple-darwin

      - name: Rust cache
        uses: swatinem/rust-cache@v2
        with:
          workspaces: src-tauri

      - name: Import Apple signing certificate
        env:
          APPLE_CERTIFICATE: ${{ secrets.APPLE_CERTIFICATE }}
          APPLE_CERTIFICATE_PASSWORD: ${{ secrets.APPLE_CERTIFICATE_PASSWORD }}
        run: |
          CERT_PATH=$RUNNER_TEMP/certificate.p12
          KEYCHAIN_PATH=$RUNNER_TEMP/build.keychain

          echo "$APPLE_CERTIFICATE" | base64 --decode > "$CERT_PATH"

          security create-keychain -p "" "$KEYCHAIN_PATH"
          security set-keychain-settings -lut 21600 "$KEYCHAIN_PATH"
          security unlock-keychain -p "" "$KEYCHAIN_PATH"
          security import "$CERT_PATH" -P "$APPLE_CERTIFICATE_PASSWORD" -A -t cert -f pkcs12 -k "$KEYCHAIN_PATH"
          security set-key-partition-list -S apple-tool:,apple: -k "" "$KEYCHAIN_PATH"
          security list-keychains -d user -s "$KEYCHAIN_PATH" login.keychain

      - name: Build DarwinKit universal binary
        run: |
          cd src-tauri/darwinkit
          swift build -c release --arch arm64 --arch x86_64
          mkdir -p ../binaries
          BINARY=$(find .build -name darwinkit -type f -perm +111 | grep -i release | head -1)
          echo "Found binary at: $BINARY"
          cp "$BINARY" ../binaries/darwinkit-aarch64-apple-darwin
          cp "$BINARY" ../binaries/darwinkit-x86_64-apple-darwin

      - name: Install npm dependencies
        run: npm ci

      - name: Build and release (aarch64)
        uses: tauri-apps/tauri-action@v0
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          APPLE_CERTIFICATE: ${{ secrets.APPLE_CERTIFICATE }}
          APPLE_CERTIFICATE_PASSWORD: ${{ secrets.APPLE_CERTIFICATE_PASSWORD }}
          APPLE_SIGNING_IDENTITY: ${{ secrets.APPLE_SIGNING_IDENTITY }}
          APPLE_TEAM_ID: ${{ secrets.APPLE_TEAM_ID }}
          APPLE_ID: ${{ secrets.APPLE_ID }}
          APPLE_PASSWORD: ${{ secrets.APPLE_PASSWORD }}
          TAURI_SIGNING_PRIVATE_KEY: ${{ secrets.TAURI_SIGNING_PRIVATE_KEY }}
          TAURI_SIGNING_PRIVATE_KEY_PASSWORD: ${{ secrets.TAURI_SIGNING_PRIVATE_KEY_PASSWORD }}
        with:
          tagName: v__VERSION__
          releaseName: 'Stik v__VERSION__'
          releaseBody: 'See [CHANGELOG](https://github.com/0xMassi/stik_app/blob/main/CHANGELOG.md) for details.'
          releaseDraft: true
          prerelease: false
          args: --target aarch64-apple-darwin

      - name: Build and release (x86_64)
        uses: tauri-apps/tauri-action@v0
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          APPLE_CERTIFICATE: ${{ secrets.APPLE_CERTIFICATE }}
          APPLE_CERTIFICATE_PASSWORD: ${{ secrets.APPLE_CERTIFICATE_PASSWORD }}
          APPLE_SIGNING_IDENTITY: ${{ secrets.APPLE_SIGNING_IDENTITY }}
          APPLE_TEAM_ID: ${{ secrets.APPLE_TEAM_ID }}
          APPLE_ID: ${{ secrets.APPLE_ID }}
          APPLE_PASSWORD: ${{ secrets.APPLE_PASSWORD }}
          TAURI_SIGNING_PRIVATE_KEY: ${{ secrets.TAURI_SIGNING_PRIVATE_KEY }}
          TAURI_SIGNING_PRIVATE_KEY_PASSWORD: ${{ secrets.TAURI_SIGNING_PRIVATE_KEY_PASSWORD }}
        with:
          tagName: v__VERSION__
          releaseName: 'Stik v__VERSION__'
          releaseDraft: true
          prerelease: false
          args: --target x86_64-apple-darwin

      - name: Update Homebrew tap
        if: success()
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          HOMEBREW_TAP_TOKEN: ${{ secrets.HOMEBREW_TAP_TOKEN }}
        run: |
          VERSION="${GITHUB_REF_NAME#v}"

          # Download DMGs using repo-scoped GITHUB_TOKEN
          GH_TOKEN="$GITHUB_TOKEN" gh release download "$GITHUB_REF_NAME" --pattern "*.dmg" --dir "$RUNNER_TEMP"
          SHA_ARM=$(shasum -a 256 "$RUNNER_TEMP/Stik_${VERSION}_aarch64.dmg" | cut -d' ' -f1)
          SHA_INTEL=$(shasum -a 256 "$RUNNER_TEMP/Stik_${VERSION}_x64.dmg" | cut -d' ' -f1)

          # Generate updated cask formula
          cat > "$RUNNER_TEMP/stik.rb" <<CASKEOF
          cask "stik" do
            arch arm: "aarch64", intel: "x64"

            version "${VERSION}"
            sha256 arm:   "${SHA_ARM}",
                   intel: "${SHA_INTEL}"

            url "https://github.com/0xMassi/stik_app/releases/download/v#{version}/Stik_#{version}_#{arch}.dmg"
            name "Stik"
            desc "Instant thought capture - one shortcut, post-it appears, type, close"
            homepage "https://github.com/0xMassi/stik_app"

            depends_on macos: ">= :catalina"

            app "Stik.app"

            zap trash: [
              "~/Documents/Stik",
              "~/.stik",
              "~/Library/Caches/com.stik.app",
              "~/Library/WebKit/com.stik.app",
            ]
          end
          CASKEOF

          # Base64-encode the cask content
          CONTENT=$(base64 -i "$RUNNER_TEMP/stik.rb")

          # Get current file SHA from GitHub API (needed for update)
          FILE_SHA=$(GH_TOKEN="$HOMEBREW_TAP_TOKEN" gh api repos/0xMassi/homebrew-stik/contents/Casks/stik.rb --jq '.sha')

          # Push updated cask to tap repo
          GH_TOKEN="$HOMEBREW_TAP_TOKEN" gh api repos/0xMassi/homebrew-stik/contents/Casks/stik.rb \
            --method PUT \
            -f message="Update Stik to v${VERSION}" \
            -f sha="$FILE_SHA" \
            -f content="$CONTENT"

      - name: Trigger landing page rebuild
        if: success()
        env:
          VERCEL_DEPLOY_HOOK: ${{ secrets.VERCEL_DEPLOY_HOOK }}
        run: curl -s -X POST "$VERCEL_DEPLOY_HOOK"

The secrets you need

In your GitHub repo settings, add these secrets:

Secret	What it is
`APPLE_CERTIFICATE`	Your `.p12` certificate, base64 encoded
`APPLE_CERTIFICATE_PASSWORD`	The password you set when exporting the `.p12`
`APPLE_SIGNING_IDENTITY`	`Developer ID Application: Your Name (TEAMID)`
`APPLE_ID`	Your Apple ID email
`APPLE_PASSWORD`	The app-specific password from Step 1
`APPLE_TEAM_ID`	Your 10-character team ID

To base64 encode your certificate:

base64 -i Certificates.p12 | pbcopy

What tauri-action does for you

The tauri-apps/tauri-action GitHub Action handles most of the hard work. When you provide the Apple environment variables, it automatically:

Imports the certificate into a temporary keychain on the runner
Signs the app bundle with your Developer ID
Submits the app to Apple's notarization service
Staples the notarization ticket to the .dmg
Uploads the result to GitHub Releases

This saves you from writing hundreds of lines of codesign and xcrun notarytool commands yourself.

The sidecar naming problem

If you're using a Swift (or any other) sidecar binary, Tauri expects a very specific naming convention:

src-tauri/binaries/{name}-{target-triple}

For example:

src-tauri/binaries/darwinkit-aarch64-apple-darwin
src-tauri/binaries/darwinkit-x86_64-apple-darwin

If the name doesn't match exactly, Tauri won't bundle it and you'll get a runtime error when trying to spawn the sidecar. This cost me hours of debugging.

Step 4: Homebrew distribution

Homebrew is the standard way developers install tools on macOS. Getting your app into Homebrew makes installation a one-liner:

brew install --cask stik

Creating a Homebrew tap

A tap is a GitHub repository that contains your Homebrew formula. Create a repo named homebrew-{name} (for example, homebrew-stik).

Inside it, create Casks/stik.rb:

cask "stik" do
  arch arm: "aarch64", intel: "x64"

  version "0.4.0"
  sha256 arm:   "SHA256_ARM64_HERE",
         intel: "SHA256_X64_HERE"

  url "https://github.com/0xMassi/stik_app/releases/download/v#{version}/Stik_#{version}_#{arch}.dmg"
  name "Stik"
  desc "Instant thought capture for macOS"
  homepage "https://www.stik.ink"

  app "Stik.app"

  zap trash: [
    "~/Documents/Stik",
    "~/.stik",
  ]
end

Tap vs Homebrew Core

With a tap, users install with brew install 0xMassi/stik/stik. To get into Homebrew Core (just brew install --cask stik), you need to meet their inclusion criteria: the app needs to be notable, actively maintained, and have enough users. Start with a tap, submit to Core once you have traction.

Step 5: Auto-updates

If you ship v0.3.0 without an auto-updater, your early users are stuck there forever unless they manually check for updates. I learned this the hard way. I shipped the auto-updater in v0.3.3, which meant my first 100+ users needed a manual update to get it.

Tauri has a built-in updater plugin. Add it to your Cargo.toml:

[dependencies]
tauri-plugin-updater = "2"

Configure it in tauri.conf.json:

{
  "plugins": {
    "updater": {
      "endpoints": [
        "https://github.com/0xMassi/stik_app/releases/latest/download/latest.json"
      ],
      "pubkey": "YOUR_PUBLIC_KEY_HERE"
    }
  }
}

Generate the key pair with:

npx @tauri-apps/cli signer generate -w ~/.tauri/stik.key

Store the private key as a GitHub secret (TAURI_SIGNING_PRIVATE_KEY) and add the public key to tauri.conf.json. The CI pipeline will automatically sign the update bundle during build.

The latest.json file is generated by tauri-action and uploaded to your GitHub Release. It contains the download URL and signature for each platform.

On the Rust side, check for updates on app launch:

use tauri_plugin_updater::UpdaterExt;

fn main() {
    tauri::Builder::default()
        .plugin(tauri_plugin_updater::Builder::new().build())
        .setup(|app| {
            let handle = app.handle().clone();
            tauri::async_runtime::spawn(async move {
                if let Ok(Some(update)) = handle.updater().check().await {
                    let _ = update.download_and_install(|_, _| {}, || {}).await;
                }
            });
            Ok(())
        })
        .run(tauri::generate_context!())
        .expect("error running app");
}

The update downloads in the background and applies on the next restart. No user interaction needed.

The full release flow

Here's what happens when I'm ready to ship a new version:

# 1. Update version in package.json and Cargo.toml
# 2. Update CHANGELOG.md
# 3. Commit

git tag v0.4.0
git push origin v0.4.0

That's it. From here, everything is automated:

GitHub Actions detects the tag
Builds the Swift sidecar as a universal binary (arm64 + x86_64)
Builds the Tauri app for both architectures
Signs both builds with my Developer ID certificate
Submits both to Apple for notarization (takes 2-5 minutes)
Staples the notarization tickets
Uploads the .dmg files and latest.json to a new GitHub Release
Updates the Homebrew tap with new version and SHA256 hashes

Total time: about 15 minutes. Manual steps: one git tag.

Things I wish I knew earlier

Notarization can be slow. Apple's notarization service usually takes 2-5 minutes but can sometimes take 15-20 minutes. Your CI workflow needs to handle this. tauri-action polls automatically, but set a reasonable timeout.

The certificate expires. Developer ID certificates are valid for 5 years. Set a calendar reminder. If it expires, your CI pipeline breaks silently and you ship unsigned builds.

Universal binaries for sidecars. If you have a Swift sidecar, you need to build it as a universal binary (--arch arm64 --arch x86_64) so it works on both Intel and Apple Silicon Macs. Tauri won't do this for you. It only handles the Rust binary.

Test the signed build locally first. Before setting up CI, do one manual signing and notarization run on your machine. It's much easier to debug when you can see the output directly:

# Sign
codesign --deep --force --verify --verbose \
  --sign "Developer ID Application: Your Name (TEAMID)" \
  --options runtime \
  --entitlements Entitlements.plist \
  target/release/bundle/macos/YourApp.app

# Notarize
xcrun notarytool submit target/release/bundle/macos/YourApp.dmg \
  --apple-id you@email.com \
  --password xxxx-xxxx-xxxx-xxxx \
  --team-id XXXXXXXXXX \
  --wait

# Staple
xcrun stapler staple target/release/bundle/macos/YourApp.dmg

Ship the auto-updater from day one. Every user who downloads your app before the updater exists becomes a user you can't update automatically. Don't make my mistake.

Entitlements matter. If your app crashes right after notarization but works fine unsigned, it's almost certainly an entitlements issue. Tauri's WebView needs JIT and unsigned executable memory permissions. Check the entitlements section above.

Was it worth it?

Setting up this pipeline took about two days of trial and error. But since then, every release is a single command. I've shipped 4 versions in a week with zero friction.

If you're building a Tauri app and planning to distribute it to real users, invest in this infrastructure early. The time you spend on CI/CD pays for itself after the second release.

The full source is available at github.com/0xMassi/stik_app, including the complete GitHub Actions workflow. MIT licensed.

If you have questions about any of this, drop a comment or open an issue on the repo. Happy to help.