<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Massi</title>
    <description>The latest articles on DEV Community by Massi (@0xmassi).</description>
    <link>https://dev.to/0xmassi</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1625049%2F63826551-db5e-4182-92d5-6a234d359f6b.jpeg</url>
      <title>DEV Community: Massi</title>
      <link>https://dev.to/0xmassi</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/0xmassi"/>
    <language>en</language>
    <item>
      <title>Raw HTML is where LLM context goes to die</title>
      <dc:creator>Massi</dc:creator>
      <pubDate>Wed, 13 May 2026 16:31:53 +0000</pubDate>
      <link>https://dev.to/0xmassi/raw-html-is-where-llm-context-goes-to-die-1elc</link>
      <guid>https://dev.to/0xmassi/raw-html-is-where-llm-context-goes-to-die-1elc</guid>
      <description>&lt;p&gt;The fastest way to make an AI agent look stupid is to give it too much web page.&lt;/p&gt;

&lt;p&gt;Not too little.&lt;/p&gt;

&lt;p&gt;Too much.&lt;/p&gt;

&lt;p&gt;I have seen this pattern over and over while building &lt;a href="https://webclaw.io" rel="noopener noreferrer"&gt;webclaw&lt;/a&gt;, a web extraction API, CLI, and MCP server for agents and LLM apps:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Fetch a URL.
Send the HTML to the model.
Ask for a summary, answer, extraction, or decision.
Wonder why the output is noisy.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It feels reasonable at first.&lt;/p&gt;

&lt;p&gt;HTML is the source, right? More source means more context. More context means better answer.&lt;/p&gt;

&lt;p&gt;Except that is usually not what happens.&lt;/p&gt;

&lt;p&gt;Most raw HTML is not content. It is layout, navigation, tracking, hydration payloads, cookie banners, duplicated links, CSS class soup, script tags, modals, footer links, and invisible app state.&lt;/p&gt;

&lt;p&gt;The model does not know which parts are expensive junk and which parts are the actual page.&lt;/p&gt;

&lt;p&gt;You paid for all of it anyway.&lt;/p&gt;

&lt;h2&gt;The bad pipeline&lt;/h2&gt;

&lt;p&gt;This is the pipeline I see a lot:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;URL -&amp;gt; fetch -&amp;gt; raw HTML -&amp;gt; LLM
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It is simple. It demos well. It works on tiny pages.&lt;/p&gt;

&lt;p&gt;Then you point it at real sites.&lt;/p&gt;

&lt;p&gt;Suddenly your model is reading navigation, footers, scripts, cookie banners, duplicated links, hidden mobile markup, and a tiny slice of useful content buried somewhere in the middle.&lt;/p&gt;

&lt;p&gt;If you are building a scraper, this is annoying.&lt;/p&gt;

&lt;p&gt;If you are building an agent, it is worse.&lt;/p&gt;

&lt;p&gt;The agent is not just parsing text. It is using that text to decide what to do next.&lt;/p&gt;

&lt;p&gt;Bad context becomes bad behavior.&lt;/p&gt;

&lt;h2&gt;HTML is not neutral input&lt;/h2&gt;

&lt;p&gt;Raw HTML has a few failure modes that are easy to miss.&lt;/p&gt;

&lt;h3&gt;1. Token waste&lt;/h3&gt;

&lt;p&gt;The most obvious problem is cost.&lt;/p&gt;

&lt;p&gt;If the useful page content is 900 words and the HTML payload is 120,000 characters, you are paying to process a lot of noise.&lt;/p&gt;

&lt;p&gt;That noise can include:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;navigation
footers
CSS class names
tracking snippets
JSON state blobs
cookie banners
related posts
ads
duplicated links
accessibility labels
hidden mobile markup
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Large context windows made this worse in a funny way.&lt;/p&gt;

&lt;p&gt;When context was small, everyone had to think about what to send.&lt;/p&gt;

&lt;p&gt;Now it is tempting to throw the whole page into the prompt and call it engineering.&lt;/p&gt;
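&lt;p&gt;A quick back-of-the-envelope makes the waste concrete. This sketch assumes the rough heuristic of about 4 characters per token and about 6 characters per English word; both are approximations, and the page sizes are the hypothetical ones from above.&lt;/p&gt;

```typescript
// Rough token estimate using the common ~4 characters per token heuristic.
// The sizes are the hypothetical page from above: 900 words of content
// buried in a 120,000-character HTML payload.
function estimateTokens(chars: number): number {
  return Math.ceil(chars / 4);
}

const htmlChars = 120_000;
const contentChars = 900 * 6; // ~6 chars per English word incl. spaces

const htmlTokens = estimateTokens(htmlChars);       // 30,000 tokens
const contentTokens = estimateTokens(contentChars); // 1,350 tokens

console.log("raw HTML: ~" + htmlTokens + " tokens");
console.log("extracted content: ~" + contentTokens + " tokens");
console.log("waste ratio: ~" + Math.round(htmlTokens / contentTokens) + "x");
```

&lt;p&gt;Roughly a 22x difference, paid on every single page the agent touches.&lt;/p&gt;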

&lt;h3&gt;2. The model sees structure you did not mean to give it&lt;/h3&gt;

&lt;p&gt;HTML carries structure, but not always the structure you care about.&lt;/p&gt;

&lt;p&gt;A model might see the same navigation text on every page and treat it as important. It might mix footer links into extracted results. It might preserve irrelevant menu text because it appears before the article.&lt;/p&gt;

&lt;p&gt;This is how you get summaries that start with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;This page discusses pricing, docs, login, careers...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No, it does not.&lt;/p&gt;

&lt;p&gt;The navigation did.&lt;/p&gt;

&lt;h3&gt;3. Boilerplate poisons retrieval&lt;/h3&gt;

&lt;p&gt;For RAG, this gets nasty.&lt;/p&gt;

&lt;p&gt;Imagine crawling 200 documentation pages and chunking raw or poorly cleaned text.&lt;/p&gt;

&lt;p&gt;Every chunk gets some version of:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Home
Docs
API Reference
Pricing
Contact
Sign in
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now your vector database contains hundreds of chunks with the same boilerplate.&lt;/p&gt;

&lt;p&gt;Search quality drops because the repeated text becomes part of the retrieval surface.&lt;/p&gt;

&lt;p&gt;The retriever surfaces pages because they share layout text, not because they answer the question.&lt;/p&gt;

&lt;p&gt;This is the part that feels invisible until the system gets just big enough to be frustrating.&lt;/p&gt;
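&lt;p&gt;One cheap defense is to strip lines that repeat across chunks before anything gets embedded. This is a sketch under a made-up threshold (a line that shows up in over half the chunks is treated as boilerplate); real pipelines use smarter detection, but the shape is the same.&lt;/p&gt;

```typescript
// Sketch: drop lines that repeat across many chunks before embedding.
// The "over half the chunks" cutoff is an assumption; tune it per corpus.
function stripSharedBoilerplate(chunks: string[]): string[] {
  const lineCounts = new Map();
  for (const chunk of chunks) {
    // Count each line once per chunk so repeats inside one chunk don't inflate it.
    for (const line of new Set(chunk.split("\n"))) {
      const key = line.trim();
      lineCounts.set(key, (lineCounts.get(key) ?? 0) + 1);
    }
  }
  const cutoff = chunks.length * 0.5;
  return chunks.map((chunk) =>
    chunk
      .split("\n")
      .filter((line) => {
        if (line.trim() === "") return true; // keep paragraph breaks
        const count = lineCounts.get(line.trim()) ?? 0;
        return !(count > cutoff); // drop shared layout text
      })
      .join("\n")
  );
}
```

&lt;p&gt;Applied to the menu example above, each chunk keeps its unique content and loses the shared navigation lines.&lt;/p&gt;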

&lt;h3&gt;4. The page can be successfully fetched and still be useless&lt;/h3&gt;

&lt;p&gt;This is the one that made me care about extraction quality more than status codes.&lt;/p&gt;

&lt;p&gt;A fetch can return &lt;code&gt;200 OK&lt;/code&gt; and still give you:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;an empty app shell
a bot challenge
a login wall
a consent screen
a region block
a page where the useful content lives in a hydration blob
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;From the outside, your code worked.&lt;/p&gt;

&lt;p&gt;From the model's point of view, the context is garbage.&lt;/p&gt;

&lt;p&gt;This is why I do not think the right question is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Can I fetch this page?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The better question is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Can I return useful context from this page?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
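&lt;p&gt;In practice that question can be a cheap check before the model ever sees the page. The marker list and the 200-character floor below are assumptions, not any real product's heuristics; the point is that the check runs on the extracted text, not the status code.&lt;/p&gt;

```typescript
// Sketch of a "did we actually get content?" check.
// Both the phrase list and the length floor are tunable assumptions.
const CHALLENGE_MARKERS = [
  "verify you are human",
  "enable javascript and cookies",
  "access denied",
  "please sign in to continue",
];

function looksUseful(extractedText: string): boolean {
  const text = extractedText.trim().toLowerCase();
  if (!(text.length > 200)) return false; // empty shell or near-empty page
  for (const marker of CHALLENGE_MARKERS) {
    if (text.includes(marker)) return false; // challenge or wall, not content
  }
  return true;
}
```

&lt;p&gt;A fetch that returns &lt;code&gt;200 OK&lt;/code&gt; but fails this kind of check should be retried, escalated, or surfaced as an error, not summarized.&lt;/p&gt;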



&lt;h2&gt;Markdown is usually a better interface&lt;/h2&gt;

&lt;p&gt;Markdown is not magic.&lt;/p&gt;

&lt;p&gt;But for LLMs, clean markdown is often a much better intermediate format than HTML.&lt;/p&gt;

&lt;p&gt;Good markdown keeps the parts models care about:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;headings
paragraphs
lists
tables
links
code blocks
source URL
title
metadata
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And removes the parts they usually do not:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;layout wrappers
nav junk
tracking scripts
style tags
repeated footer text
hidden UI
duplicated link blocks
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The goal is not to make the page pretty.&lt;/p&gt;

&lt;p&gt;The goal is to make the page usable as context.&lt;/p&gt;

&lt;h2&gt;The better pipeline&lt;/h2&gt;

&lt;p&gt;For most agent and RAG workflows, I prefer this shape:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;URL -&amp;gt; fetch -&amp;gt; detect bad responses -&amp;gt; extract main content -&amp;gt; markdown or JSON -&amp;gt; LLM
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That gives the model something closer to what a human would copy into notes before asking for help.&lt;/p&gt;

&lt;p&gt;Not the whole browser document.&lt;/p&gt;

&lt;p&gt;The actual thing.&lt;/p&gt;

&lt;p&gt;For example, if I am building an agent that needs to inspect a docs page, I do not want it to reason over the entire DOM.&lt;/p&gt;

&lt;p&gt;I want something closer to this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Authentication

Use a bearer token in the Authorization header.

Authorization: Bearer &amp;lt;token&amp;gt;

## Rate limits

Free accounts...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Not this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="cp"&gt;&amp;lt;!doctype html&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;html&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;head&amp;gt;&lt;/span&gt;
    ...
  &lt;span class="nt"&gt;&amp;lt;/head&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;body&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;div&lt;/span&gt; &lt;span class="na"&gt;id=&lt;/span&gt;&lt;span class="s"&gt;"__next"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
      ...
    &lt;span class="nt"&gt;&amp;lt;/div&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;script &lt;/span&gt;&lt;span class="na"&gt;src=&lt;/span&gt;&lt;span class="s"&gt;"/_next/static/chunks/..."&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&amp;lt;/script&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;/body&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/html&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;Where Webclaw fits&lt;/h2&gt;

&lt;p&gt;This is one of the reasons I am building &lt;a href="https://webclaw.io" rel="noopener noreferrer"&gt;webclaw&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The point is not just to fetch a page.&lt;/p&gt;

&lt;p&gt;The point is to give an agent or LLM app clean web context in a useful shape:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST https://api.webclaw.io/v1/scrape &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Bearer &lt;/span&gt;&lt;span class="nv"&gt;$WEBCLAW_API_KEY&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "url": "https://example.com",
    "formats": ["markdown"],
    "only_main_content": true
  }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or from TypeScript:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;Webclaw&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;@webclaw/sdk&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Webclaw&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;apiKey&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;WEBCLAW_API_KEY&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;scrape&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;https://example.com&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;formats&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;markdown&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="na"&gt;only_main_content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;markdown&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That output is easier to summarize, chunk, embed, cite, diff, and pass into an agent loop.&lt;/p&gt;

&lt;p&gt;Webclaw also has an MCP server, so tools like Claude Code, Cursor, and other MCP-compatible clients can ask for web context directly instead of pasting random HTML into the conversation.&lt;/p&gt;

&lt;p&gt;The interface I want is boring:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;agent asks for page
tool returns clean context
agent keeps working
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Boring is good here.&lt;/p&gt;

&lt;h2&gt;When raw HTML is still useful&lt;/h2&gt;

&lt;p&gt;There are cases where raw HTML is exactly what you want.&lt;/p&gt;

&lt;p&gt;If you are debugging extraction, writing selectors, preserving layout, auditing scripts, or reverse engineering page structure, raw HTML matters.&lt;/p&gt;

&lt;p&gt;But that is not the same as saying raw HTML is the best input for the model.&lt;/p&gt;

&lt;p&gt;Most of the time, the model does not need the DOM.&lt;/p&gt;

&lt;p&gt;It needs the meaning.&lt;/p&gt;

&lt;h2&gt;The rule I use now&lt;/h2&gt;

&lt;p&gt;I stopped treating raw HTML as the default context format.&lt;/p&gt;

&lt;p&gt;My current rule is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Fetch broadly.
Extract aggressively.
Preserve structure.
Send the model the smallest useful version of the page.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That one change makes agents cheaper, faster, and less confused.&lt;/p&gt;

&lt;p&gt;It also makes failures easier to see. If the extractor returns an empty page, a challenge, or obvious boilerplate, you can handle that before the model hallucinates a plausible-looking answer from junk.&lt;/p&gt;

&lt;h2&gt;The bigger shift&lt;/h2&gt;

&lt;p&gt;Web scraping used to be mostly about getting data out of websites.&lt;/p&gt;

&lt;p&gt;For LLM apps, it is becoming context infrastructure.&lt;/p&gt;

&lt;p&gt;That means the extraction layer has to care about things that old scraper scripts could ignore:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;main content detection
markdown quality
metadata
source links
tables
code blocks
bad response detection
chunkability
agent tool interfaces
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If your app is doing this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;URL -&amp;gt; raw HTML -&amp;gt; LLM
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can probably get a better result with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;URL -&amp;gt; clean markdown or JSON -&amp;gt; LLM
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Raw HTML feels like the source of truth.&lt;/p&gt;

&lt;p&gt;For agents, it is often just noise with angle brackets.&lt;/p&gt;

&lt;p&gt;I wrote more about the extraction side here:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://webclaw.io/blog/html-to-markdown-for-llms" rel="noopener noreferrer"&gt;HTML to Markdown for LLMs&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And the previous post in this series is here:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/0xmassi/i-stopped-using-headless-chrome-as-the-default-scraper-mm"&gt;I stopped using headless Chrome as the default scraper&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Webclaw: &lt;a href="https://webclaw.io" rel="noopener noreferrer"&gt;https://webclaw.io&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>llm</category>
      <category>webscraping</category>
    </item>
    <item>
      <title>I stopped using headless Chrome as the default scraper</title>
      <dc:creator>Massi</dc:creator>
      <pubDate>Sat, 09 May 2026 12:06:58 +0000</pubDate>
      <link>https://dev.to/0xmassi/i-stopped-using-headless-chrome-as-the-default-scraper-mm</link>
      <guid>https://dev.to/0xmassi/i-stopped-using-headless-chrome-as-the-default-scraper-mm</guid>
      <description>&lt;p&gt;Headless Chrome is useful.&lt;/p&gt;

&lt;p&gt;It is also overused.&lt;/p&gt;

&lt;p&gt;For years, the default answer to “this page is hard to scrape” has been some version of:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Use Puppeteer.
Use Playwright.
Add stealth.
Wait for the page.
Extract the DOM.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That works often enough that it became muscle memory. But using a browser as the first step for every page is expensive, slow, operationally annoying, and frequently unnecessary.&lt;/p&gt;

&lt;p&gt;I’m building &lt;a href="https://webclaw.io" rel="noopener noreferrer"&gt;webclaw&lt;/a&gt;, a web extraction API, CLI, and MCP server for AI agents. One of the biggest architecture decisions was this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Do not make the browser the default path.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The browser is an escalation path. Not the baseline.&lt;/p&gt;

&lt;h2&gt;Why Browser-First Scraping Became The Default&lt;/h2&gt;

&lt;p&gt;The web changed.&lt;/p&gt;

&lt;p&gt;Static HTML became React, Next.js, SPAs, hydration payloads, infinite scroll, client-side routing, consent banners, and heavily instrumented frontend apps.&lt;/p&gt;

&lt;p&gt;So scrapers adapted.&lt;/p&gt;

&lt;p&gt;Instead of fetching HTML and parsing it, developers started launching a real browser:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;URL -&amp;gt; Puppeteer/Playwright -&amp;gt; Chrome -&amp;gt; rendered DOM -&amp;gt; extraction
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That made sense. A browser gives you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;JavaScript execution&lt;/li&gt;
&lt;li&gt;a real DOM&lt;/li&gt;
&lt;li&gt;navigation behavior&lt;/li&gt;
&lt;li&gt;cookies and sessions&lt;/li&gt;
&lt;li&gt;screenshots&lt;/li&gt;
&lt;li&gt;interaction support&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For some pages, you need that.&lt;/p&gt;

&lt;p&gt;The mistake is treating those pages as the default case.&lt;/p&gt;

&lt;h2&gt;Why Browser-First Breaks Down&lt;/h2&gt;

&lt;p&gt;Headless Chrome has a cost profile that looks fine in demos and painful in production.&lt;/p&gt;

&lt;h3&gt;1. Startup Cost&lt;/h3&gt;

&lt;p&gt;Launching a browser is not free.&lt;/p&gt;

&lt;p&gt;Even if you reuse instances, you still pay for process management, page creation, memory, timeouts, crashes, and cleanup.&lt;/p&gt;

&lt;p&gt;For a one-off scrape, maybe that’s fine.&lt;/p&gt;

&lt;p&gt;For agents, RAG ingestion, batch scraping, or crawl jobs, it adds up fast.&lt;/p&gt;

&lt;h3&gt;2. Memory And Concurrency&lt;/h3&gt;

&lt;p&gt;Chrome is heavy.&lt;/p&gt;

&lt;p&gt;If your scraper needs to handle a list of URLs, you eventually hit practical limits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;how many pages can run at once?&lt;/li&gt;
&lt;li&gt;how many browser contexts can stay alive?&lt;/li&gt;
&lt;li&gt;how many failures are caused by your scraper, not the target site?&lt;/li&gt;
&lt;li&gt;how much infra are you burning just to read mostly static documents?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That matters when the output you wanted was just clean markdown.&lt;/p&gt;

&lt;h3&gt;3. CI And Deployment Pain&lt;/h3&gt;

&lt;p&gt;Browser stacks are fragile in boring ways.&lt;/p&gt;

&lt;p&gt;You deal with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;missing system libraries&lt;/li&gt;
&lt;li&gt;browser binary downloads&lt;/li&gt;
&lt;li&gt;sandbox flags&lt;/li&gt;
&lt;li&gt;font/rendering differences&lt;/li&gt;
&lt;li&gt;Docker image size&lt;/li&gt;
&lt;li&gt;platform-specific bugs&lt;/li&gt;
&lt;li&gt;random timeouts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;None of this is intellectually interesting. It is just drag.&lt;/p&gt;

&lt;h3&gt;4. The Browser Does Not Automatically Solve Blocking&lt;/h3&gt;

&lt;p&gt;This is the part people learn the hard way.&lt;/p&gt;

&lt;p&gt;Launching Chrome does not magically make traffic look trustworthy.&lt;/p&gt;

&lt;p&gt;Modern bot protection systems look at many signals. Some are visible in the browser. Some happen before your JavaScript ever runs.&lt;/p&gt;

&lt;p&gt;At a high level, systems may look at things like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;network-level request behavior&lt;/li&gt;
&lt;li&gt;header shape&lt;/li&gt;
&lt;li&gt;client hints&lt;/li&gt;
&lt;li&gt;IP and network reputation&lt;/li&gt;
&lt;li&gt;request timing&lt;/li&gt;
&lt;li&gt;session history&lt;/li&gt;
&lt;li&gt;whether the page response is a real document or a challenge&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That does not mean “never use a browser”.&lt;/p&gt;

&lt;p&gt;It means “browser” and “trusted request” are not the same thing.&lt;/p&gt;

&lt;h2&gt;What Replaced It&lt;/h2&gt;

&lt;p&gt;The architecture I prefer is an escalation ladder.&lt;/p&gt;

&lt;p&gt;Start with the cheapest path that can produce correct content.&lt;/p&gt;

&lt;p&gt;Only move to heavier paths when the response proves you need them.&lt;/p&gt;

&lt;p&gt;The rough shape:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Step&lt;/th&gt;
&lt;th&gt;Path&lt;/th&gt;
&lt;th&gt;Why it exists&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Browser-like fetch&lt;/td&gt;
&lt;td&gt;Cheapest path for SSR pages, docs, blogs, metadata, and data islands.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Content extraction&lt;/td&gt;
&lt;td&gt;Turn the useful parts into markdown, text, JSON, metadata, and links.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Bad-response detection&lt;/td&gt;
&lt;td&gt;Catch empty shells, challenge pages, login walls, and blocked content.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;JavaScript rendering&lt;/td&gt;
&lt;td&gt;Use it only when useful content is missing from the fetched response.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;Browser fallback&lt;/td&gt;
&lt;td&gt;Last resort for pages that genuinely require browser behavior.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The important part is not one magic trick.&lt;/p&gt;

&lt;p&gt;The important part is not paying the browser tax for pages that never needed a browser.&lt;/p&gt;
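&lt;p&gt;The ladder reduces to a small decision function. The field names and checks here are hypothetical, not a real implementation; the point is that each fetch result decides whether the next, heavier step is worth paying for.&lt;/p&gt;

```typescript
// Sketch of the escalation ladder as a decision function.
// FetchResult and its fields are invented for illustration.
type FetchResult = {
  status: number;
  extractedText: string;
  looksLikeChallenge: boolean;
};

function nextStep(result: FetchResult): "done" | "render_js" | "browser" {
  if (result.looksLikeChallenge) return "browser"; // step 5: real browser fallback
  if (result.status >= 400) return "browser";
  if (!(result.extractedText.trim().length > 0)) {
    return "render_js"; // step 4: content missing, render JavaScript
  }
  return "done"; // steps 1-3 were enough: fetch, extract, verify
}
```

&lt;p&gt;Most pages exit at &lt;code&gt;done&lt;/code&gt;, which is exactly the point: the browser only runs when a cheaper step has already failed.&lt;/p&gt;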

&lt;h2&gt;The Fetch-First Path&lt;/h2&gt;

&lt;p&gt;Many pages already contain the useful content before frontend JavaScript runs.&lt;/p&gt;

&lt;p&gt;It may be in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;server-rendered HTML&lt;/li&gt;
&lt;li&gt;article body markup&lt;/li&gt;
&lt;li&gt;JSON-LD&lt;/li&gt;
&lt;li&gt;Open Graph metadata&lt;/li&gt;
&lt;li&gt;Next.js or React hydration payloads&lt;/li&gt;
&lt;li&gt;embedded CMS data&lt;/li&gt;
&lt;li&gt;documentation markup&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you can fetch the page correctly and extract the main content, you can often return useful markdown without launching Chrome.&lt;/p&gt;
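&lt;p&gt;The data-island case is worth underlining: once the contents of a JSON-LD script tag have been pulled out of the fetched HTML, reading them is plain JSON. The payload here is a made-up example of what many article pages ship server-side.&lt;/p&gt;

```typescript
// Sketch: a JSON-LD data island, already extracted from the fetched HTML.
// The payload is invented for illustration.
const rawIsland =
  '{"@type": "Article", "headline": "Authentication", ' +
  '"articleBody": "Use a bearer token in the Authorization header."}';

const island = JSON.parse(rawIsland);

console.log(island["@type"]);    // "Article"
console.log(island.headline);    // the title, with no browser involved
console.log(island.articleBody); // the content, with no browser involved
```

&lt;p&gt;No rendering, no DOM, no Chrome. Just a fetch and a parse.&lt;/p&gt;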

&lt;p&gt;The pipeline looks more like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;URL -&amp;gt; browser-like fetch -&amp;gt; HTML/data islands -&amp;gt; extractor -&amp;gt; markdown/JSON
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Compared to:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;URL -&amp;gt; browser -&amp;gt; rendered DOM -&amp;gt; extractor -&amp;gt; markdown/JSON
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Browser-first&lt;/th&gt;
&lt;th&gt;Fetch-first&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;URL&lt;/td&gt;
&lt;td&gt;URL&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Playwright or Puppeteer&lt;/td&gt;
&lt;td&gt;Browser-like fetch&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Chrome runtime&lt;/td&gt;
&lt;td&gt;HTML plus data islands&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Rendered DOM&lt;/td&gt;
&lt;td&gt;Content extractor&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Markdown or JSON&lt;/td&gt;
&lt;td&gt;Markdown or JSON&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Good when interaction is required&lt;/td&gt;
&lt;td&gt;Good as the default path&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Expensive when used for every page&lt;/td&gt;
&lt;td&gt;Browser only when the page proves it&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This matters for AI agents because they usually do not need the visual page.&lt;/p&gt;

&lt;p&gt;They need the content.&lt;/p&gt;

&lt;h2&gt;Why This Matters More For LLM Apps&lt;/h2&gt;

&lt;p&gt;Traditional scraping often wants a database row:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;name
price
rating
availability
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;LLM apps want something different.&lt;/p&gt;

&lt;p&gt;They want context.&lt;/p&gt;

&lt;p&gt;For agents and RAG pipelines, bad extraction does not always look broken. It can look clean and still be wrong.&lt;/p&gt;

&lt;p&gt;Examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the page was a bot challenge, but the agent summarized it anyway&lt;/li&gt;
&lt;li&gt;the docs page loaded an empty shell&lt;/li&gt;
&lt;li&gt;the markdown included nav text repeated across every page&lt;/li&gt;
&lt;li&gt;the pricing table lost its structure&lt;/li&gt;
&lt;li&gt;the source URL or title disappeared&lt;/li&gt;
&lt;li&gt;a crawler pulled 100 low-value pages and missed the docs that mattered&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is why I care less about “can it fetch?” and more about:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Can it return useful, structured context?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For webclaw, the target shape is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;URL -&amp;gt; clean markdown / JSON / metadata -&amp;gt; agent or RAG pipeline
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;A Small Example&lt;/h2&gt;

&lt;p&gt;Using the API:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST https://api.webclaw.io/v1/scrape &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Bearer &lt;/span&gt;&lt;span class="nv"&gt;$WEBCLAW_API_KEY&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "url": "https://example.com",
    "formats": ["markdown"],
    "only_main_content": true
  }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Using TypeScript:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;Webclaw&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;@webclaw/sdk&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Webclaw&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;apiKey&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;WEBCLAW_API_KEY&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;scrape&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;https://example.com&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;formats&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;markdown&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="na"&gt;only_main_content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;markdown&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For agent workflows, webclaw also ships an MCP server, so tools like Claude Code, Cursor, and other MCP-compatible clients can call &lt;code&gt;scrape&lt;/code&gt;, &lt;code&gt;crawl&lt;/code&gt;, &lt;code&gt;map&lt;/code&gt;, &lt;code&gt;batch&lt;/code&gt;, &lt;code&gt;extract&lt;/code&gt;, &lt;code&gt;summarize&lt;/code&gt;, &lt;code&gt;diff&lt;/code&gt;, &lt;code&gt;brand&lt;/code&gt;, &lt;code&gt;search&lt;/code&gt;, and &lt;code&gt;research&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;That is the interface I wanted:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;agent asks for a URL
tool returns clean context
agent keeps working
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Honest Limits
&lt;/h2&gt;

&lt;p&gt;This architecture does not remove the need for browsers.&lt;/p&gt;

&lt;p&gt;Some pages require real browser sessions.&lt;/p&gt;

&lt;p&gt;Some flows require login.&lt;/p&gt;

&lt;p&gt;Some sites should not be scraped.&lt;/p&gt;

&lt;p&gt;Some pages have interaction-dependent content that a fetch-first approach will never see.&lt;/p&gt;

&lt;p&gt;The point is not “never use Chrome”.&lt;/p&gt;

&lt;p&gt;The point is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Do not launch Chrome until the page proves it needs Chrome.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That one rule changes cost, latency, concurrency, and reliability.&lt;/p&gt;
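&lt;p&gt;A minimal sketch of that rule. The challenge markers and the bare &lt;code&gt;fetch_static&lt;/code&gt; / &lt;code&gt;render_with_browser&lt;/code&gt; callables here are hypothetical stand-ins for illustration, not webclaw internals:&lt;/p&gt;

```python
import re

# Markers that suggest the static fetch returned a challenge page
# instead of content. These heuristics are an assumption; a real
# detector is far more thorough.
CHALLENGE_MARKERS = ("cf-challenge", "just a moment", "enable javascript")

def needs_browser(html: str) -> bool:
    """Decide whether a fetched page actually needs a browser session."""
    lowered = html.lower()
    if any(marker in lowered for marker in CHALLENGE_MARKERS):
        return True  # bot challenge page, not the content
    # Strip tags; if almost no text survives, the page is likely a
    # client-side-rendered shell the static fetch could not see into.
    text = re.sub(r"<[^>]+>", " ", html)
    return len(text.split()) < 20

def get_page(url: str, fetch_static, render_with_browser) -> str:
    html = fetch_static(url)             # cheap path first
    if needs_browser(html):
        html = render_with_browser(url)  # escalate only when forced
    return html
```

&lt;p&gt;Every page that passes the check skips the browser entirely, which is where the cost, latency, and concurrency wins come from.&lt;/p&gt;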

&lt;h2&gt;
  
  
  The Bigger Lesson
&lt;/h2&gt;

&lt;p&gt;Web scraping is moving from selector scripts to context infrastructure.&lt;/p&gt;

&lt;p&gt;AI agents and RAG pipelines do not just need data.&lt;/p&gt;

&lt;p&gt;They need clean, fresh, source-linked web context in a shape models can use.&lt;/p&gt;

&lt;p&gt;That means the extraction layer has to care about:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;fetch quality&lt;/li&gt;
&lt;li&gt;challenge detection&lt;/li&gt;
&lt;li&gt;main content extraction&lt;/li&gt;
&lt;li&gt;metadata&lt;/li&gt;
&lt;li&gt;markdown quality&lt;/li&gt;
&lt;li&gt;structured JSON&lt;/li&gt;
&lt;li&gt;crawling boundaries&lt;/li&gt;
&lt;li&gt;cost&lt;/li&gt;
&lt;li&gt;latency&lt;/li&gt;
&lt;li&gt;agent tool interfaces&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is what I’m building into webclaw.&lt;/p&gt;

&lt;p&gt;If your workflow is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;URL -&amp;gt; clean markdown/JSON -&amp;gt; agent or RAG pipeline
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;you might find it useful.&lt;/p&gt;

&lt;p&gt;Website: &lt;a href="https://webclaw.io" rel="noopener noreferrer"&gt;https://webclaw.io&lt;/a&gt;&lt;br&gt;
GitHub: &lt;a href="https://github.com/0xMassi/webclaw" rel="noopener noreferrer"&gt;https://github.com/0xMassi/webclaw&lt;/a&gt;&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>scraping</category>
      <category>rust</category>
      <category>ai</category>
    </item>
    <item>
      <title>MCP Web Scraping: Give Claude and Cursor Real Web Access</title>
      <dc:creator>Massi</dc:creator>
      <pubDate>Thu, 16 Apr 2026 16:38:43 +0000</pubDate>
      <link>https://dev.to/0xmassi/mcp-web-scraping-give-claude-and-cursor-real-web-access-m39</link>
      <guid>https://dev.to/0xmassi/mcp-web-scraping-give-claude-and-cursor-real-web-access-m39</guid>
      <description>&lt;p&gt;Your AI agent can write code, analyze documents, query databases, and hold long conversations. But ask it to check a competitor's pricing page, read the latest docs for a framework, or pull product specs from a supplier's website, and it hits a wall. It can't read the web.&lt;/p&gt;

&lt;p&gt;This is the gap that MCP closes. And web scraping is the use case that makes it obvious.&lt;/p&gt;

&lt;h2&gt;
  
  
  What MCP actually is
&lt;/h2&gt;

&lt;p&gt;MCP stands for Model Context Protocol. It's an open standard that lets AI models call external tools. Think of it like USB for AI. Before USB, every peripheral needed its own driver, its own connector, its own software. MCP does the same thing for AI tools: one protocol, any tool, any model.&lt;/p&gt;

&lt;p&gt;The model describes what tools are available. The user (or the model itself) decides when to call one. The tool runs, returns data, and the model keeps going with the new context.&lt;/p&gt;
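&lt;p&gt;Under the hood this is JSON-RPC. A tool call for a &lt;code&gt;scrape&lt;/code&gt; tool looks roughly like this (the method and field names follow the MCP spec; the argument shape is this example's assumption):&lt;/p&gt;

```json
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "scrape",
    "arguments": { "url": "https://example.com", "format": "markdown" }
  }
}
```

&lt;p&gt;The server replies with a &lt;code&gt;result.content&lt;/code&gt; array, typically text items, which the client injects into the model's context.&lt;/p&gt;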

&lt;p&gt;Claude Desktop, Claude Code, Cursor, Windsurf, and a growing list of other clients support MCP natively. You install an MCP server, it shows up as a set of tools your AI can call, and that's it. No API wiring, no middleware, no custom code.&lt;/p&gt;

&lt;p&gt;The MCP SDK crossed 97 million monthly downloads. This is not experimental anymore.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why web data is the killer MCP use case
&lt;/h2&gt;

&lt;p&gt;Most MCP tools are wrappers around APIs. Connect to Slack, read a GitHub issue, query a database. Useful, but limited to services you already have access to.&lt;/p&gt;

&lt;p&gt;Web scraping is different. It gives your AI access to the entire public web. Any URL, any page, any site. The agent decides what to read based on the conversation, not a predefined list.&lt;/p&gt;

&lt;p&gt;This changes what agents can do.&lt;/p&gt;

&lt;p&gt;An agent helping you evaluate SaaS tools can read their actual pricing pages instead of relying on its training data from months ago. An agent writing documentation can crawl the framework's latest docs. An agent doing competitive research can pull real numbers from public filings and product pages.&lt;/p&gt;

&lt;p&gt;Without web access, agents are limited to what they already know. With web access, they can go find what they need. That's a fundamental capability shift.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setting it up
&lt;/h2&gt;

&lt;p&gt;webclaw ships an MCP server called &lt;code&gt;webclaw-mcp&lt;/code&gt; with 8 tools. Install it once and your AI gets scraping, crawling, search, sitemap discovery, structured extraction, summarization, content diffing, and brand extraction.&lt;/p&gt;

&lt;p&gt;Add this to your Claude Desktop config:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"webclaw"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"webclaw-mcp"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Restart Claude Desktop. The tools appear in the tool menu. Your AI can now call them during any conversation.&lt;/p&gt;

&lt;p&gt;For Claude Code, same config in your project's &lt;code&gt;.mcp.json&lt;/code&gt;. For Cursor, add it to the MCP settings panel.&lt;/p&gt;

&lt;p&gt;No API key needed for the local server. It runs on your machine, uses its own HTTP client with TLS fingerprinting, and returns clean markdown. If you want to use the cloud API instead (for higher concurrency, JavaScript rendering, or anti-bot bypass), set the &lt;code&gt;WEBCLAW_API_KEY&lt;/code&gt; environment variable and add &lt;code&gt;--cloud&lt;/code&gt; to the command.&lt;/p&gt;
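&lt;p&gt;For cloud mode, the same config gains a flag and an environment variable. The &lt;code&gt;args&lt;/code&gt; and &lt;code&gt;env&lt;/code&gt; fields are standard in MCP client configs; treat this as a sketch of the shape rather than a verbatim reference:&lt;/p&gt;

```json
{
  "mcpServers": {
    "webclaw": {
      "command": "webclaw-mcp",
      "args": ["--cloud"],
      "env": { "WEBCLAW_API_KEY": "your-key-here" }
    }
  }
}
```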

&lt;h2&gt;
  
  
  What the tools do
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;scrape&lt;/strong&gt; reads a single URL and returns clean content. You control the format: &lt;code&gt;markdown&lt;/code&gt; for full fidelity, &lt;code&gt;llm&lt;/code&gt; for token-optimized output, &lt;code&gt;text&lt;/code&gt; for plain text, &lt;code&gt;json&lt;/code&gt; for structured metadata. The agent picks the format based on what it needs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;crawl&lt;/strong&gt; follows links from a starting URL. It discovers pages across the site, extracts each one, and returns the full set. Useful for ingesting documentation sites, mapping a competitor's product catalog, or building a knowledge base from a company's blog.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;search&lt;/strong&gt; queries the web and returns results with snippets. When the agent needs to find information but doesn't have a specific URL, it searches first, then scrapes the most relevant results. This is how research workflows start.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;map&lt;/strong&gt; discovers all URLs on a site without scraping them. It reads the sitemap, follows internal links, and returns a clean list. The agent uses this to understand the structure of a site before deciding what to extract.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;extract&lt;/strong&gt; pulls structured data from a page using a JSON schema. The agent describes the shape of data it wants (product names and prices, contact information, event dates), and the extraction engine returns exactly that. No regex, no selectors, no brittle parsing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;summarize&lt;/strong&gt; condenses a page into a short summary. When the agent needs the gist of an article but not the full content, this saves tokens and keeps the context window focused.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;diff&lt;/strong&gt; compares a page against a previous snapshot. The agent uses this to detect content changes: updated pricing, new product listings, modified documentation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;brand&lt;/strong&gt; extracts visual identity from a page: colors, fonts, logos, favicons, OG images. Useful for design tools, competitive analysis, or generating brand-consistent content.&lt;/p&gt;

&lt;h2&gt;
  
  
  How agents actually use these
&lt;/h2&gt;

&lt;p&gt;The tools are simple. What makes them powerful is how agents chain them together.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Research workflow.&lt;/strong&gt; You ask: "Compare the pricing of webclaw, firecrawl, and scrapingbee." The agent calls &lt;code&gt;search&lt;/code&gt; to find each pricing page. Calls &lt;code&gt;scrape&lt;/code&gt; on each result. Extracts the relevant pricing data. Compares them in a table. All within one conversation, all with live data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Documentation ingestion.&lt;/strong&gt; You say: "Read the Next.js App Router docs and explain how middleware works." The agent calls &lt;code&gt;map&lt;/code&gt; on &lt;code&gt;nextjs.org/docs&lt;/code&gt; to find all doc pages. Calls &lt;code&gt;crawl&lt;/code&gt; to extract the middleware-related pages. Reads the content and explains it with references to the actual documentation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Content monitoring.&lt;/strong&gt; You run a daily check: "Has the pricing changed on these three competitor pages?" The agent calls &lt;code&gt;diff&lt;/code&gt; against stored snapshots. Reports what changed. Stores the new snapshots for next time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lead enrichment.&lt;/strong&gt; You pass a list of company URLs. The agent calls &lt;code&gt;extract&lt;/code&gt; on each with a schema for company name, tech stack, team size, and recent news. Returns a structured spreadsheet of enriched data.&lt;/p&gt;

&lt;p&gt;None of this requires custom code. The agent figures out which tools to call and in what order. You describe the outcome you want in plain language.&lt;/p&gt;

&lt;h2&gt;
  
  
  What works well and what doesn't
&lt;/h2&gt;

&lt;p&gt;MCP web scraping works best for focused, real-time extraction. Read a page, get the data, move on. The latency is low enough (100-300ms per page for static content) that it feels seamless in a conversation.&lt;/p&gt;

&lt;p&gt;It works less well for massive scale. If you need to scrape 10,000 pages, doing it through MCP one conversation turn at a time is slow. For that, use the REST API directly with the batch or crawl endpoints, then bring the results into your agent's context.&lt;/p&gt;

&lt;p&gt;JavaScript-heavy SPAs (React apps with client-side rendering only) sometimes return empty content through the local MCP server because it doesn't run a browser engine. The cloud API handles these through server-side JavaScript rendering, so if you're hitting SPAs, use &lt;code&gt;--cloud&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Anti-bot protected sites (Cloudflare, DataDome) work fine with the TLS fingerprinting in most cases. For the hardest sites that require CAPTCHA solving, the cloud API has an antibot sidecar that handles it.&lt;/p&gt;

&lt;p&gt;The MCP protocol itself has a limitation worth knowing: tool results are injected into the model's context window. A scrape that returns 5,000 tokens of content consumes 5,000 tokens of context. For long conversations or multi-page research, the context fills up. Using &lt;code&gt;llm&lt;/code&gt; format instead of &lt;code&gt;markdown&lt;/code&gt; helps because it returns 67% fewer tokens for the same content.&lt;/p&gt;
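&lt;p&gt;Beyond picking a leaner format, you can also cap how much of any single tool result reaches the context. A rough sketch, using the common four-characters-per-token estimate (an approximation, not a real tokenizer):&lt;/p&gt;

```python
def trim_to_budget(text: str, max_tokens: int, chars_per_token: int = 4) -> str:
    """Truncate a tool result to an approximate token budget."""
    limit = max_tokens * chars_per_token
    if len(text) <= limit:
        return text
    # Prefer cutting at a paragraph boundary so the model does not
    # see a sentence sliced mid-word.
    cut = text.rfind("\n\n", 0, limit)
    if cut == -1:
        cut = limit
    return text[:cut] + "\n\n[truncated]"
```

&lt;p&gt;A hard cap like this is crude, but it keeps one oversized scrape from crowding everything else out of a long conversation.&lt;/p&gt;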

&lt;h2&gt;
  
  
  Beyond Claude
&lt;/h2&gt;

&lt;p&gt;MCP is not Claude-specific. Any client that supports the Model Context Protocol can use webclaw-mcp. Cursor, Windsurf, Continue, and other coding tools already support MCP. OpenAI has announced MCP support. The ecosystem is converging on this standard.&lt;/p&gt;

&lt;p&gt;This matters because the tool you install today works with every client that adopts MCP tomorrow. You're not locked into one vendor's tool ecosystem.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting started
&lt;/h2&gt;

&lt;p&gt;Install webclaw:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx create-webclaw
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or download a prebuilt binary from the &lt;a href="https://github.com/0xMassi/webclaw/releases" rel="noopener noreferrer"&gt;releases page&lt;/a&gt;. The &lt;code&gt;webclaw-mcp&lt;/code&gt; binary is included.&lt;/p&gt;

&lt;p&gt;Add the config to your AI client. Start a conversation. Ask your agent to read a webpage. It will call &lt;code&gt;scrape&lt;/code&gt;, get the content, and work with it like it was always there.&lt;/p&gt;

&lt;p&gt;If you want the cloud API for JavaScript rendering, anti-bot bypass, and higher concurrency, sign up at &lt;a href="https://webclaw.io" rel="noopener noreferrer"&gt;webclaw.io&lt;/a&gt; and set your API key in the MCP config.&lt;/p&gt;

&lt;p&gt;The MCP server is open source and AGPL-3.0 licensed. The cloud API has a free tier with 500 pages per month.&lt;/p&gt;

&lt;p&gt;Check the &lt;a href="https://dev.to/docs/mcp"&gt;MCP documentation&lt;/a&gt; for the full tool reference and advanced configuration.&lt;/p&gt;

</description>
      <category>mcp</category>
      <category>ai</category>
      <category>claude</category>
      <category>claudecode</category>
    </item>
    <item>
      <title>How to turn any webpage into structured data for your LLM</title>
      <dc:creator>Massi</dc:creator>
      <pubDate>Thu, 02 Apr 2026 11:37:07 +0000</pubDate>
      <link>https://dev.to/0xmassi/how-to-turn-any-webpage-into-structured-data-for-your-llm-31o2</link>
      <guid>https://dev.to/0xmassi/how-to-turn-any-webpage-into-structured-data-for-your-llm-31o2</guid>
      <description>&lt;p&gt;Your LLM can reason, write code, and hold long conversations. Ask it to read a webpage and it falls apart. Either it can't access the URL at all, or you feed it raw HTML and burn 50,000 tokens on navigation bars, cookie banners, and CSS class names.&lt;/p&gt;

&lt;p&gt;I've been building &lt;a href="https://github.com/0xMassi/webclaw" rel="noopener noreferrer"&gt;webclaw&lt;/a&gt; to fix this. It's a web extraction engine written in Rust that turns any URL into clean, structured content. No headless browser. No Selenium. Just HTTP with browser-grade TLS fingerprinting.&lt;/p&gt;

&lt;p&gt;My &lt;a href="https://dev.to/0xmassi/i-built-a-web-scraper-in-rust-that-bypasses-cloudflare-without-a-browser-3c1o"&gt;first post&lt;/a&gt; covered how the TLS bypass works. This one covers what happens after you get the HTML: turning it into something an LLM can actually use.&lt;/p&gt;

&lt;h2&gt;
  
  
  The token waste problem
&lt;/h2&gt;

&lt;p&gt;A typical webpage is 50,000 to 200,000 tokens of raw HTML. The actual content (the article text, the product info, the documentation) is usually 500 to 2,000 tokens. The rest is structure, styling, and UI elements that your LLM processes, reasons over, and bills you for.&lt;/p&gt;

&lt;p&gt;If you're building a RAG pipeline, those noisy tokens pollute your vector space. Your embeddings model creates vectors for "Home | About | Contact | Blog" that compete with the actual content. Retrieval quality drops.&lt;/p&gt;

&lt;p&gt;If you're running an agent that reads pages in a conversation, every wasted token eats context window. By page three, your agent is losing track of the conversation because the context is full of footer links.&lt;/p&gt;

&lt;p&gt;webclaw runs a 9-step optimization pipeline that strips this noise:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Navigation, footers, cookie banners, sidebars removed&lt;/li&gt;
&lt;li&gt;Decorative images collapsed (logo clusters become one line)&lt;/li&gt;
&lt;li&gt;Bold/italic markers stripped (visual weight, not semantic)&lt;/li&gt;
&lt;li&gt;Links deduplicated and collected at the end&lt;/li&gt;
&lt;li&gt;Stat blocks merged ("100M+" and "monthly requests" become one line)&lt;/li&gt;
&lt;li&gt;CSS artifacts and leaked framework code cleaned out&lt;/li&gt;
&lt;/ul&gt;
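&lt;p&gt;To make the first of those steps concrete, here is a toy chrome-removal pass using stdlib regexes. webclaw's real pipeline is DOM-based and far more careful, so treat this purely as an illustration of the idea:&lt;/p&gt;

```python
import re

# Tags whose entire subtree is page chrome, not content.
NOISE_TAGS = ("nav", "footer", "aside", "header")

def strip_chrome(html: str) -> str:
    """Remove obvious non-content blocks from an HTML string."""
    for tag in NOISE_TAGS:
        # Non-greedy match so sibling blocks are removed separately.
        # (A regex pass like this breaks on nested same-name tags,
        # which is one reason the real pipeline walks the DOM.)
        html = re.sub(
            rf"<{tag}\b.*?</{tag}>", "", html,
            flags=re.DOTALL | re.IGNORECASE,
        )
    return html

page = "<nav>Home | About</nav><article>The real content.</article><footer>2026</footer>"
print(strip_chrome(page))  # → <article>The real content.</article>
```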

&lt;p&gt;The result: 67% fewer tokens on average. On marketing pages with hero sections and testimonial carousels, it's 85-90%.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Get LLM-optimized output from any URL&lt;/span&gt;
curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST https://api.webclaw.io/v1/scrape &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Bearer YOUR_API_KEY"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"url": "https://example.com", "format": "llm"}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or with the CLI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;webclaw https://example.com &lt;span class="nt"&gt;-f&lt;/span&gt; llm
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Read the full breakdown: &lt;a href="https://webclaw.io/blog/html-to-markdown-for-llms" rel="noopener noreferrer"&gt;HTML to Markdown for LLMs&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Structured extraction: get fields, not text
&lt;/h2&gt;

&lt;p&gt;Sometimes you don't need the full content. You need three fields from a product page: a price, a name, whether it's in stock.&lt;/p&gt;

&lt;p&gt;The traditional approach is CSS selectors. Find the element, grab the text. Works until the site redesigns and your &lt;code&gt;product-price&lt;/code&gt; class becomes &lt;code&gt;pdp-price-container&lt;/code&gt;. Your pipeline breaks at 3am.&lt;/p&gt;

&lt;p&gt;webclaw's &lt;code&gt;/v1/extract&lt;/code&gt; endpoint takes a different approach. You define a JSON schema of what you want. The engine fetches the page, cleans it, and uses an LLM to extract the matching fields.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST https://api.webclaw.io/v1/extract &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Bearer YOUR_API_KEY"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "url": "https://store.example.com/product/headphones",
    "schema": {
      "type": "object",
      "properties": {
        "product_name": {"type": "string"},
        "price": {"type": "number"},
        "currency": {"type": "string"},
        "in_stock": {"type": "boolean"},
        "rating": {"type": "number"}
      }
    }
  }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Response:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"data"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"product_name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Sony WH-1000XM5"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"price"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;279.99&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"currency"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"USD"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"in_stock"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"rating"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;4.7&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same schema works on any product page regardless of their frontend framework. The site can redesign completely and extraction still works because you're extracting meaning, not DOM positions.&lt;/p&gt;

&lt;p&gt;If you don't want to define a schema upfront, you can use a plain English prompt instead:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://company.com/about"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"prompt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Find the founding year, number of employees, and what the company does"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Read more: &lt;a href="https://webclaw.io/blog/extract-structured-data-from-any-webpage" rel="noopener noreferrer"&gt;Extract structured data from any webpage&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Building a RAG pipeline with live web data
&lt;/h2&gt;

&lt;p&gt;Most RAG tutorials show you how to upload a PDF and ask questions. That's a demo, not a product. Real applications need live data. Documentation gets updated. Pricing changes. Blog posts get published.&lt;/p&gt;

&lt;p&gt;A RAG pipeline with web data has four steps:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Fetch the page.&lt;/strong&gt; Half the web is behind Cloudflare or JavaScript rendering. webclaw handles TLS fingerprinting and JS rendering automatically.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Extract the content.&lt;/strong&gt; This is where most pipelines fail. Bad extraction means noisy embeddings. Noisy embeddings mean irrelevant retrieval. webclaw's LLM format gives you clean content with zero noise.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Chunk and embed.&lt;/strong&gt; Since webclaw returns markdown, you can split on headings for semantically coherent chunks instead of arbitrary character counts.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;split_by_headings&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;markdown&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_chunk&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1500&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;sections&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;\n(?=#{1,3} )&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;markdown&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;section&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;sections&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;section&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;max_chunk&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;paragraphs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;section&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;current&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;paragraphs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;current&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;max_chunk&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;current&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;current&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
                    &lt;span class="n"&gt;current&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;
                &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="n"&gt;current&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;current&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
                &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;current&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;section&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;4. Keep it fresh.&lt;/strong&gt; webclaw's &lt;code&gt;/v1/diff&lt;/code&gt; endpoint tracks content changes between snapshots. Crawl your sources on a schedule, diff against the last version, only re-embed pages that actually changed. No wasted compute.&lt;/p&gt;
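&lt;p&gt;The re-embed gate can be as simple as a content hash per URL. This sketch assumes hypothetical &lt;code&gt;fetch_markdown&lt;/code&gt; and &lt;code&gt;embed&lt;/code&gt; callables and keeps state in a plain dict:&lt;/p&gt;

```python
import hashlib

def refresh(urls, fetch_markdown, embed, seen_hashes):
    """Re-embed only pages whose content actually changed."""
    updated = []
    for url in urls:
        content = fetch_markdown(url)
        digest = hashlib.sha256(content.encode()).hexdigest()
        if seen_hashes.get(url) == digest:
            continue  # unchanged since last crawl, skip the embedding cost
        seen_hashes[url] = digest
        embed(url, content)
        updated.append(url)
    return updated
```

&lt;p&gt;The &lt;code&gt;/v1/diff&lt;/code&gt; endpoint moves this comparison server-side, so unchanged content never has to be downloaded at all.&lt;/p&gt;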

&lt;p&gt;For bulk ingestion, &lt;code&gt;/v1/crawl&lt;/code&gt; discovers all pages on a site and &lt;code&gt;/v1/batch&lt;/code&gt; extracts them in parallel.&lt;/p&gt;

&lt;p&gt;Read the full guide: &lt;a href="https://webclaw.io/blog/rag-pipeline-web-data" rel="noopener noreferrer"&gt;Build a RAG pipeline with live web data&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  MCP: give your AI agent web access
&lt;/h2&gt;

&lt;p&gt;MCP (Model Context Protocol) is an open standard that lets AI models call external tools. Think of it like USB for AI. One protocol, any tool, any model.&lt;/p&gt;

&lt;p&gt;webclaw ships an MCP server with 8 tools:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;scrape&lt;/strong&gt; — read any URL, get clean content&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;crawl&lt;/strong&gt; — follow links across a site, extract everything&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;search&lt;/strong&gt; — web search and scrape results&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;map&lt;/strong&gt; — discover all URLs on a site via sitemap&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;extract&lt;/strong&gt; — structured data with a JSON schema&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;summarize&lt;/strong&gt; — condense a page to key points&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;diff&lt;/strong&gt; — detect content changes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;brand&lt;/strong&gt; — extract colors, fonts, logos&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Set it up in Claude Desktop:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"webclaw"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"webclaw-mcp"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or auto-configure for Claude, Cursor, Windsurf, Codex:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx create-webclaw
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now your AI agent can read any URL during a conversation. You ask "compare the pricing of these three SaaS tools" and the agent scrapes each pricing page, extracts the data, and builds a comparison table. No custom code.&lt;/p&gt;

&lt;p&gt;The MCP SDK crossed 97 million monthly downloads. This is not experimental anymore. Claude Desktop, Claude Code, Cursor, Windsurf, and OpenAI all support it.&lt;/p&gt;

&lt;p&gt;Read more: &lt;a href="https://webclaw.io/blog/mcp-and-web-scraping" rel="noopener noreferrer"&gt;MCP and Web Scraping for AI Agents&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Content monitoring and change detection
&lt;/h2&gt;

&lt;p&gt;If you're tracking competitors, monitoring documentation, or keeping a knowledge base fresh, you need to know when pages change.&lt;/p&gt;

&lt;p&gt;webclaw's &lt;code&gt;/v1/diff&lt;/code&gt; endpoint compares a page against a previous snapshot and tells you exactly what changed. Combine this with &lt;code&gt;/v1/crawl&lt;/code&gt; on a schedule and you have a content monitoring pipeline:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Crawl your sources daily&lt;/li&gt;
&lt;li&gt;Diff each page against the last snapshot&lt;/li&gt;
&lt;li&gt;Re-embed only the pages that changed&lt;/li&gt;
&lt;li&gt;Alert on significant changes&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is how you keep a RAG pipeline fresh without re-embedding everything on every cycle.&lt;/p&gt;
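&lt;p&gt;Steps 2 and 3 can be sketched locally in a few lines. A minimal sketch, assuming snapshots are stored as URL-to-markdown maps (the hashing scheme here is illustrative, not what webclaw's &lt;code&gt;/v1/diff&lt;/code&gt; does internally):&lt;/p&gt;

```python
import hashlib

def pages_to_reembed(previous: dict, current: dict) -> list:
    """Return URLs that are new or whose content changed since the
    last snapshot. Snapshots map URL -> extracted markdown; comparing
    SHA-256 fingerprints means you only keep hashes between cycles."""
    def fingerprint(text):
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    old = {url: fingerprint(text) for url, text in previous.items()}
    return [url for url, text in current.items()
            if old.get(url) != fingerprint(text)]
```

&lt;p&gt;Feed the returned URLs to your embedding step and alerting logic; everything else stays untouched.&lt;/p&gt;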

&lt;h2&gt;
  
  
  Web search built in
&lt;/h2&gt;

&lt;p&gt;Sometimes your agent doesn't have a URL. It needs to find information first.&lt;/p&gt;

&lt;p&gt;webclaw's &lt;code&gt;/v1/search&lt;/code&gt; endpoint queries the web and returns results with snippets. Chain it with &lt;code&gt;/v1/scrape&lt;/code&gt; and you go from a query to structured content in two calls.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST https://api.webclaw.io/v1/search &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Bearer YOUR_API_KEY"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"query": "best rust web frameworks 2026", "num_results": 5}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The agent searches, picks the most relevant results, scrapes them, and synthesizes an answer. All with live data, not training data from months ago.&lt;/p&gt;
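&lt;p&gt;The search-then-scrape chain is easy to wire up. A minimal sketch with the two calls injected as plain callables; the result shape (a list of dicts with a &lt;code&gt;url&lt;/code&gt; key) is an assumption here, not taken from the API docs:&lt;/p&gt;

```python
def answer_from_web(query, search, scrape, top_k=3):
    """Chain search -> scrape: take the top results for a query and
    return each URL mapped to its scraped content.

    `search` and `scrape` stand in for calls to /v1/search and
    /v1/scrape; swap in real HTTP calls in production."""
    results = search(query)[:top_k]
    return {r["url"]: scrape(r["url"]) for r in results}
```

&lt;p&gt;The returned mapping is what you hand the model for synthesis.&lt;/p&gt;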

&lt;h2&gt;
  
  
  The full stack
&lt;/h2&gt;

&lt;p&gt;webclaw is a Rust workspace with six crates. The core extraction engine has zero network dependencies and is WASM-safe. The CLI, REST API server, and MCP server are separate binaries built on the same engine.&lt;/p&gt;

&lt;p&gt;Install the CLI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;cargo &lt;span class="nb"&gt;install &lt;/span&gt;webclaw
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or pull the Docker image:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;--rm&lt;/span&gt; ghcr.io/0xmassi/webclaw https://example.com
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The cloud API at &lt;a href="https://webclaw.io" rel="noopener noreferrer"&gt;webclaw.io&lt;/a&gt; adds JavaScript rendering, anti-bot bypass, LLM extraction, and higher concurrency. Free tier: 500 pages/month, no credit card.&lt;/p&gt;

&lt;p&gt;SDKs for &lt;a href="https://webclaw.io/docs/sdks/python" rel="noopener noreferrer"&gt;Python&lt;/a&gt;, &lt;a href="https://webclaw.io/docs/sdks/typescript" rel="noopener noreferrer"&gt;TypeScript&lt;/a&gt;, and &lt;a href="https://webclaw.io/docs/sdks/go" rel="noopener noreferrer"&gt;Go&lt;/a&gt; are coming soon.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's next
&lt;/h2&gt;

&lt;p&gt;I'm working on deep research (multi-step web research with LLM synthesis), webhook notifications for content changes, and expanding the MCP toolset.&lt;/p&gt;

&lt;p&gt;If you're building LLM applications that need web data, give it a try. The repo is at &lt;a href="https://github.com/0xMassi/webclaw" rel="noopener noreferrer"&gt;github.com/0xMassi/webclaw&lt;/a&gt;. Star it if it saves you time, open an issue if something breaks.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://webclaw.io" rel="noopener noreferrer"&gt;webclaw.io&lt;/a&gt; | &lt;a href="https://webclaw.io/docs" rel="noopener noreferrer"&gt;Docs&lt;/a&gt; | &lt;a href="https://discord.gg/KDfd48EpnW" rel="noopener noreferrer"&gt;Discord&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>python</category>
      <category>opensource</category>
    </item>
    <item>
      <title>I built a web scraper in Rust that bypasses Cloudflare without a browser</title>
      <dc:creator>Massi</dc:creator>
      <pubDate>Tue, 24 Mar 2026 11:39:41 +0000</pubDate>
      <link>https://dev.to/0xmassi/i-built-a-web-scraper-in-rust-that-bypasses-cloudflare-without-a-browser-3c1o</link>
      <guid>https://dev.to/0xmassi/i-built-a-web-scraper-in-rust-that-bypasses-cloudflare-without-a-browser-3c1o</guid>
      <description>&lt;p&gt;Every AI agent has the same problem. You ask it to read a webpage and it comes back with a 403, or worse, 5000 tokens of navigation bars and cookie banners.&lt;/p&gt;

&lt;p&gt;I spent the last few months building webclaw to fix this.&lt;/p&gt;

&lt;h2&gt;
  
  
  The problem
&lt;/h2&gt;

&lt;p&gt;Try fetching any real website with a standard HTTP client. Most of them will block you. Cloudflare, Akamai, and DataDome all inspect your TLS fingerprint before the request even reaches the origin server.&lt;/p&gt;

&lt;p&gt;The usual fix is spinning up headless Chrome. That works, but now you need a ~500 MB browser install, each page takes 2-3 seconds, and you still get all the HTML noise.&lt;/p&gt;

&lt;h2&gt;
  
  
  What webclaw does differently
&lt;/h2&gt;

&lt;p&gt;Instead of launching a browser, webclaw impersonates one at the TLS level. The TLS handshake, cipher suites, and extensions all look like Chrome 142. Most anti-bot systems let the request through because the fingerprint is already valid.&lt;/p&gt;

&lt;p&gt;Then the extraction engine scores every DOM node by text density, semantic tags, and link ratio. Navigation, ads, footers, cookie banners get stripped. What comes out is clean markdown.&lt;/p&gt;

&lt;p&gt;A real example: a news article that is 4,820 tokens as raw HTML becomes 1,590 tokens after webclaw processes it. Same content, 67% fewer tokens.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;webclaw is a Rust workspace with 6 crates:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;webclaw-core    pure extraction, zero network deps, WASM-safe
webclaw-fetch   HTTP + TLS fingerprinting via primp
webclaw-llm     LLM provider chain (Ollama &amp;gt; OpenAI &amp;gt; Anthropic)
webclaw-pdf     PDF text extraction
webclaw-cli     CLI binary
webclaw-mcp     MCP server for AI agents
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The split between core and fetch was intentional. webclaw-core takes a &lt;code&gt;&amp;amp;str&lt;/code&gt; of HTML and returns structured output. No I/O, no network calls, no allocator tricks. It should compile to WASM without changes.&lt;/p&gt;

&lt;p&gt;Extraction speed on the core alone (no network):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Page size&lt;/th&gt;
&lt;th&gt;Time&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;10 KB&lt;/td&gt;
&lt;td&gt;0.8ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;100 KB&lt;/td&gt;
&lt;td&gt;3.2ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;500 KB&lt;/td&gt;
&lt;td&gt;12.1ms&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  How to use it
&lt;/h2&gt;

&lt;h3&gt;
  
  
  CLI
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# basic extraction&lt;/span&gt;
webclaw https://example.com

&lt;span class="c"&gt;# different output formats&lt;/span&gt;
webclaw https://example.com &lt;span class="nt"&gt;-f&lt;/span&gt; json
webclaw https://example.com &lt;span class="nt"&gt;-f&lt;/span&gt; llm

&lt;span class="c"&gt;# crawl a docs site&lt;/span&gt;
webclaw https://docs.example.com &lt;span class="nt"&gt;--crawl&lt;/span&gt; &lt;span class="nt"&gt;--depth&lt;/span&gt; 2

&lt;span class="c"&gt;# extract structured data with LLM&lt;/span&gt;
webclaw https://example.com &lt;span class="nt"&gt;--extract-prompt&lt;/span&gt; &lt;span class="s2"&gt;"get all pricing tiers"&lt;/span&gt;

&lt;span class="c"&gt;# track page changes&lt;/span&gt;
webclaw https://example.com &lt;span class="nt"&gt;-f&lt;/span&gt; json &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; snapshot.json
webclaw https://example.com &lt;span class="nt"&gt;--diff-with&lt;/span&gt; snapshot.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  MCP server (for Claude, Cursor, Windsurf, Codex)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx create-webclaw
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One command. It detects which AI tools you have installed and writes the config for each one. After a restart you get 10 tools: scrape, crawl, search, extract, summarize, brand, diff, map, batch, research.&lt;/p&gt;

&lt;h3&gt;
  
  
  Docker
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;--rm&lt;/span&gt; ghcr.io/0xmassi/webclaw https://example.com
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;128 MB image. Works on any machine.&lt;/p&gt;

&lt;h2&gt;
  
  
  Benchmarks
&lt;/h2&gt;

&lt;p&gt;Tested on 50 real pages across news sites, documentation, e-commerce, SPAs, and blogs.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;webclaw&lt;/th&gt;
&lt;th&gt;readability&lt;/th&gt;
&lt;th&gt;trafilatura&lt;/th&gt;
&lt;th&gt;newspaper3k&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Extraction accuracy&lt;/td&gt;
&lt;td&gt;95.1%&lt;/td&gt;
&lt;td&gt;83%&lt;/td&gt;
&lt;td&gt;80%&lt;/td&gt;
&lt;td&gt;66%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Noise removal&lt;/td&gt;
&lt;td&gt;96.1%&lt;/td&gt;
&lt;td&gt;79%&lt;/td&gt;
&lt;td&gt;73%&lt;/td&gt;
&lt;td&gt;61%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The biggest wins are on JavaScript-heavy sites. When the visible DOM is empty because content is in embedded JSON (Next.js, React SSR payloads), webclaw has a data island extractor that pulls content from &lt;code&gt;__NEXT_DATA__&lt;/code&gt;, &lt;code&gt;window.__data&lt;/code&gt;, and similar patterns. Most other tools return nothing.&lt;/p&gt;
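&lt;p&gt;The idea behind the data island extractor fits in a few lines. A simplified sketch for the canonical Next.js case (webclaw's implementation handles more frameworks and messier markup than this single regex):&lt;/p&gt;

```python
import json
import re

def extract_next_data(html: str):
    """Pull the embedded JSON payload from a Next.js page.

    SSR frameworks ship page content in a script tag like
    <script id="__NEXT_DATA__" type="application/json">...</script>
    even when the visible DOM is nearly empty."""
    match = re.search(
        r'<script[^>]*id="__NEXT_DATA__"[^>]*>(.*?)</script>',
        html,
        re.DOTALL,
    )
    if not match:
        return None
    return json.loads(match.group(1))
```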

&lt;h2&gt;
  
  
  What I learned building this
&lt;/h2&gt;

&lt;p&gt;TLS fingerprinting is fragile. Chrome updates its cipher suites every few versions and you have to keep up. I am using primp, which maintains patched forks of rustls, hyper, and h2. It works well but it is a maintenance burden. If Chrome ships a new TLS extension tomorrow, requests start getting blocked until the forks are updated.&lt;/p&gt;

&lt;p&gt;The extraction scoring took the most iteration. Early versions were too aggressive and would strip content that looked like navigation (short paragraphs with links). The fix was a semantic bonus system: nodes inside &lt;code&gt;&amp;lt;article&amp;gt;&lt;/code&gt; or &lt;code&gt;&amp;lt;main&amp;gt;&lt;/code&gt; tags get a score boost, nodes with content-related class names get another boost. Combined with link density penalties, it handles most layouts without site-specific rules.&lt;/p&gt;
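&lt;p&gt;The scoring boils down to a handful of signals per node. A toy version in Python (the weights here are illustrative, not webclaw's actual numbers):&lt;/p&gt;

```python
def score_node(text: str, link_text_len: int, tag: str, class_name: str = "") -> float:
    """Heuristic content score for a DOM node: reward text density and
    semantic containers, penalize link-heavy and boilerplate nodes."""
    if not text:
        return 0.0
    score = len(text) / 100.0                      # text density proxy
    link_ratio = link_text_len / max(len(text), 1)
    score *= (1.0 - link_ratio)                    # link density penalty
    if tag in ("article", "main"):                 # semantic bonus
        score += 2.0
    if any(k in class_name.lower() for k in ("content", "article", "post")):
        score += 1.0                               # class-name bonus
    if tag in ("nav", "footer", "aside"):          # boilerplate penalty
        score -= 3.0
    return score
```

&lt;p&gt;A long paragraph inside &lt;code&gt;&amp;lt;article&amp;gt;&lt;/code&gt; scores high; the same amount of text made of links inside &lt;code&gt;&amp;lt;nav&amp;gt;&lt;/code&gt; scores below zero and gets stripped.&lt;/p&gt;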

&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;

&lt;p&gt;MIT licensed, fully open source.&lt;/p&gt;

&lt;p&gt;GitHub: &lt;a href="https://github.com/0xMassi/webclaw" rel="noopener noreferrer"&gt;https://github.com/0xMassi/webclaw&lt;/a&gt;&lt;br&gt;
Website: &lt;a href="https://webclaw.io" rel="noopener noreferrer"&gt;https://webclaw.io&lt;/a&gt;&lt;br&gt;
Discord: &lt;a href="https://discord.gg/KDfd48EpnW" rel="noopener noreferrer"&gt;https://discord.gg/KDfd48EpnW&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you run into a site that webclaw fails on, open an issue. Every edge case makes the extraction better.&lt;/p&gt;

</description>
      <category>rust</category>
      <category>ai</category>
      <category>opensource</category>
      <category>webscraping</category>
    </item>
    <item>
      <title>Shipping a Production macOS App with Tauri 2.0: Code Signing, Notarization, and Homebrew</title>
      <dc:creator>Massi</dc:creator>
      <pubDate>Mon, 09 Feb 2026 10:53:54 +0000</pubDate>
      <link>https://dev.to/0xmassi/shipping-a-production-macos-app-with-tauri-20-code-signing-notarization-and-homebrew-mc3</link>
      <guid>https://dev.to/0xmassi/shipping-a-production-macos-app-with-tauri-20-code-signing-notarization-and-homebrew-mc3</guid>
      <description>&lt;p&gt;There are plenty of tutorials on building a Tauri app. Very few tell you what happens after &lt;code&gt;npm run tauri build&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;I recently shipped &lt;a href="https://github.com/0xMassi/stik_app" rel="noopener noreferrer"&gt;Stik&lt;/a&gt;, a note-capture app for macOS built with Tauri 2.0. The app itself took a few days to build. Getting it properly signed, notarized, distributed through Homebrew, and set up with auto-updates took longer than I expected.&lt;/p&gt;

&lt;p&gt;This post covers everything I learned. If you're building a Tauri app and plan to ship it to real users on macOS, this should save you a few days of pain.&lt;/p&gt;

&lt;h2&gt;
  
  
  The problem
&lt;/h2&gt;

&lt;p&gt;You've built your Tauri app. It runs great in &lt;code&gt;tauri dev&lt;/code&gt;. You run &lt;code&gt;tauri build&lt;/code&gt; and get a &lt;code&gt;.dmg&lt;/code&gt;. You send it to a friend. They open it and macOS says:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"App is damaged and can't be opened. You should move it to the Trash."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That's because your app isn't code signed or notarized. Apple requires both for any app distributed outside the App Store. Without them, macOS Gatekeeper blocks your app on every machine except yours.&lt;/p&gt;

&lt;p&gt;This is where most Tauri tutorials stop and most developers get stuck.&lt;/p&gt;

&lt;h2&gt;
  
  
  What you actually need
&lt;/h2&gt;

&lt;p&gt;Getting a Tauri app to users on macOS requires four things beyond building the binary:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Code signing&lt;/strong&gt;: proves the app comes from a verified developer&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Notarization&lt;/strong&gt;: Apple scans the binary for malware and issues a ticket&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Distribution&lt;/strong&gt;: a way for users to install it (Homebrew, DMG, or both)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auto-updates&lt;/strong&gt;: so users don't get stuck on old versions forever&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Let's go through each one.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1: Apple Developer setup
&lt;/h2&gt;

&lt;p&gt;You need an &lt;a href="https://developer.apple.com/" rel="noopener noreferrer"&gt;Apple Developer account&lt;/a&gt; ($99/year). There's no way around this for distribution outside the App Store.&lt;/p&gt;

&lt;p&gt;Once enrolled, you need two things.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A Developer ID Application certificate.&lt;/strong&gt; Go to Certificates, Identifiers &amp;amp; Profiles in your developer account. Create a "Developer ID Application" certificate. Download it and install it in your Keychain. This is what signs your app.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;An app-specific password.&lt;/strong&gt; Go to &lt;a href="https://appleid.apple.com" rel="noopener noreferrer"&gt;appleid.apple.com&lt;/a&gt;, sign in, and generate an app-specific password under Security. This is used by the notarization tool to authenticate with Apple's servers.&lt;/p&gt;

&lt;p&gt;Export your signing certificate as a &lt;code&gt;.p12&lt;/code&gt; file from Keychain Access. You'll need it for CI.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 2: Configure Tauri for signing
&lt;/h2&gt;

&lt;p&gt;In your &lt;code&gt;tauri.conf.json&lt;/code&gt;, make sure the bundle identifier is set:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"bundle"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"identifier"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ink.stik.app"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"macOS"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"signingIdentity"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Developer ID Application: Your Name (TEAMID)"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"entitlements"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"./Entitlements.plist"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create an &lt;code&gt;Entitlements.plist&lt;/code&gt; in your &lt;code&gt;src-tauri/&lt;/code&gt; directory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;&lt;span class="cp"&gt;&amp;lt;?xml version="1.0" encoding="UTF-8"?&amp;gt;&lt;/span&gt;
&lt;span class="cp"&gt;&amp;lt;!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
  "http://www.apple.com/DTDs/PropertyList-1.0.dtd"&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;plist&lt;/span&gt; &lt;span class="na"&gt;version=&lt;/span&gt;&lt;span class="s"&gt;"1.0"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;dict&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;key&amp;gt;&lt;/span&gt;com.apple.security.cs.allow-jit&lt;span class="nt"&gt;&amp;lt;/key&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;true/&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;key&amp;gt;&lt;/span&gt;com.apple.security.cs.allow-unsigned-executable-memory&lt;span class="nt"&gt;&amp;lt;/key&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;true/&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;key&amp;gt;&lt;/span&gt;com.apple.security.cs.allow-dyld-environment-variables&lt;span class="nt"&gt;&amp;lt;/key&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;true/&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/dict&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/plist&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These entitlements are needed because Tauri uses a WebView that requires JIT compilation. Without them, the app will crash on launch after notarization.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 3: The CI/CD pipeline
&lt;/h2&gt;

&lt;p&gt;This is where it all comes together. One GitHub Actions workflow, triggered by a git tag, does everything:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Builds the Swift sidecar (if you have one) as a universal binary&lt;/li&gt;
&lt;li&gt;Builds the Tauri app for both &lt;code&gt;aarch64-apple-darwin&lt;/code&gt; and &lt;code&gt;x86_64-apple-darwin&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Signs the binary with your Developer ID&lt;/li&gt;
&lt;li&gt;Submits it to Apple for notarization&lt;/li&gt;
&lt;li&gt;Uploads the signed &lt;code&gt;.dmg&lt;/code&gt; to GitHub Releases&lt;/li&gt;
&lt;li&gt;Updates the Homebrew tap&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Here's the structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Release&lt;/span&gt;

&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;push&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;tags&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;v*'&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;build-and-release&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;macos-latest&lt;/span&gt;
    &lt;span class="na"&gt;permissions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;contents&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;write&lt;/span&gt;

    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Checkout&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;submodules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;recursive&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Setup Node.js&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/setup-node@v4&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;node-version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;20&lt;/span&gt;
          &lt;span class="na"&gt;cache&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;npm&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Setup Rust&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dtolnay/rust-toolchain@stable&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;targets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aarch64-apple-darwin,x86_64-apple-darwin&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Rust cache&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;swatinem/rust-cache@v2&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;workspaces&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;src-tauri&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Import Apple signing certificate&lt;/span&gt;
        &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;APPLE_CERTIFICATE&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.APPLE_CERTIFICATE }}&lt;/span&gt;
          &lt;span class="na"&gt;APPLE_CERTIFICATE_PASSWORD&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.APPLE_CERTIFICATE_PASSWORD }}&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;CERT_PATH=$RUNNER_TEMP/certificate.p12&lt;/span&gt;
          &lt;span class="s"&gt;KEYCHAIN_PATH=$RUNNER_TEMP/build.keychain&lt;/span&gt;

          &lt;span class="s"&gt;echo "$APPLE_CERTIFICATE" | base64 --decode &amp;gt; "$CERT_PATH"&lt;/span&gt;

          &lt;span class="s"&gt;security create-keychain -p "" "$KEYCHAIN_PATH"&lt;/span&gt;
          &lt;span class="s"&gt;security set-keychain-settings -lut 21600 "$KEYCHAIN_PATH"&lt;/span&gt;
          &lt;span class="s"&gt;security unlock-keychain -p "" "$KEYCHAIN_PATH"&lt;/span&gt;
          &lt;span class="s"&gt;security import "$CERT_PATH" -P "$APPLE_CERTIFICATE_PASSWORD" -A -t cert -f pkcs12 -k "$KEYCHAIN_PATH"&lt;/span&gt;
          &lt;span class="s"&gt;security set-key-partition-list -S apple-tool:,apple: -k "" "$KEYCHAIN_PATH"&lt;/span&gt;
          &lt;span class="s"&gt;security list-keychains -d user -s "$KEYCHAIN_PATH" login.keychain&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Build DarwinKit universal binary&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;cd src-tauri/darwinkit&lt;/span&gt;
          &lt;span class="s"&gt;swift build -c release --arch arm64 --arch x86_64&lt;/span&gt;
          &lt;span class="s"&gt;mkdir -p ../binaries&lt;/span&gt;
          &lt;span class="s"&gt;BINARY=$(find .build -name darwinkit -type f -perm +111 | grep -i release | head -1)&lt;/span&gt;
          &lt;span class="s"&gt;echo "Found binary at: $BINARY"&lt;/span&gt;
          &lt;span class="s"&gt;cp "$BINARY" ../binaries/darwinkit-aarch64-apple-darwin&lt;/span&gt;
          &lt;span class="s"&gt;cp "$BINARY" ../binaries/darwinkit-x86_64-apple-darwin&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Install npm dependencies&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;npm ci&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Build and release (aarch64)&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;tauri-apps/tauri-action@v0&lt;/span&gt;
        &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;GITHUB_TOKEN&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.GITHUB_TOKEN }}&lt;/span&gt;
          &lt;span class="na"&gt;APPLE_CERTIFICATE&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.APPLE_CERTIFICATE }}&lt;/span&gt;
          &lt;span class="na"&gt;APPLE_CERTIFICATE_PASSWORD&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.APPLE_CERTIFICATE_PASSWORD }}&lt;/span&gt;
          &lt;span class="na"&gt;APPLE_SIGNING_IDENTITY&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.APPLE_SIGNING_IDENTITY }}&lt;/span&gt;
          &lt;span class="na"&gt;APPLE_TEAM_ID&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.APPLE_TEAM_ID }}&lt;/span&gt;
          &lt;span class="na"&gt;APPLE_ID&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.APPLE_ID }}&lt;/span&gt;
          &lt;span class="na"&gt;APPLE_PASSWORD&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.APPLE_PASSWORD }}&lt;/span&gt;
          &lt;span class="na"&gt;TAURI_SIGNING_PRIVATE_KEY&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.TAURI_SIGNING_PRIVATE_KEY }}&lt;/span&gt;
          &lt;span class="na"&gt;TAURI_SIGNING_PRIVATE_KEY_PASSWORD&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.TAURI_SIGNING_PRIVATE_KEY_PASSWORD }}&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;tagName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v__VERSION__&lt;/span&gt;
          &lt;span class="na"&gt;releaseName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Stik&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;v__VERSION__'&lt;/span&gt;
          &lt;span class="na"&gt;releaseBody&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;See&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;[CHANGELOG](https://github.com/0xMassi/stik_app/blob/main/CHANGELOG.md)&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;for&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;details.'&lt;/span&gt;
          &lt;span class="na"&gt;releaseDraft&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
          &lt;span class="na"&gt;prerelease&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
          &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;--target aarch64-apple-darwin&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Build and release (x86_64)&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;tauri-apps/tauri-action@v0&lt;/span&gt;
        &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;GITHUB_TOKEN&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.GITHUB_TOKEN }}&lt;/span&gt;
          &lt;span class="na"&gt;APPLE_CERTIFICATE&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.APPLE_CERTIFICATE }}&lt;/span&gt;
          &lt;span class="na"&gt;APPLE_CERTIFICATE_PASSWORD&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.APPLE_CERTIFICATE_PASSWORD }}&lt;/span&gt;
          &lt;span class="na"&gt;APPLE_SIGNING_IDENTITY&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.APPLE_SIGNING_IDENTITY }}&lt;/span&gt;
          &lt;span class="na"&gt;APPLE_TEAM_ID&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.APPLE_TEAM_ID }}&lt;/span&gt;
          &lt;span class="na"&gt;APPLE_ID&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.APPLE_ID }}&lt;/span&gt;
          &lt;span class="na"&gt;APPLE_PASSWORD&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.APPLE_PASSWORD }}&lt;/span&gt;
          &lt;span class="na"&gt;TAURI_SIGNING_PRIVATE_KEY&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.TAURI_SIGNING_PRIVATE_KEY }}&lt;/span&gt;
          &lt;span class="na"&gt;TAURI_SIGNING_PRIVATE_KEY_PASSWORD&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.TAURI_SIGNING_PRIVATE_KEY_PASSWORD }}&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;tagName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v__VERSION__&lt;/span&gt;
          &lt;span class="na"&gt;releaseName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Stik&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;v__VERSION__'&lt;/span&gt;
          &lt;span class="na"&gt;releaseDraft&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
          &lt;span class="na"&gt;prerelease&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
          &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;--target x86_64-apple-darwin&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Update Homebrew tap&lt;/span&gt;
        &lt;span class="na"&gt;if&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;success()&lt;/span&gt;
        &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;GITHUB_TOKEN&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.GITHUB_TOKEN }}&lt;/span&gt;
          &lt;span class="na"&gt;HOMEBREW_TAP_TOKEN&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.HOMEBREW_TAP_TOKEN }}&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;VERSION="${GITHUB_REF_NAME#v}"&lt;/span&gt;

          &lt;span class="s"&gt;# Download DMGs using repo-scoped GITHUB_TOKEN&lt;/span&gt;
          &lt;span class="s"&gt;GH_TOKEN="$GITHUB_TOKEN" gh release download "$GITHUB_REF_NAME" --pattern "*.dmg" --dir "$RUNNER_TEMP"&lt;/span&gt;
          &lt;span class="s"&gt;SHA_ARM=$(shasum -a 256 "$RUNNER_TEMP/Stik_${VERSION}_aarch64.dmg" | cut -d' ' -f1)&lt;/span&gt;
          &lt;span class="s"&gt;SHA_INTEL=$(shasum -a 256 "$RUNNER_TEMP/Stik_${VERSION}_x64.dmg" | cut -d' ' -f1)&lt;/span&gt;

          &lt;span class="s"&gt;# Generate updated cask formula&lt;/span&gt;
          &lt;span class="s"&gt;cat &amp;gt; "$RUNNER_TEMP/stik.rb" &amp;lt;&amp;lt;CASKEOF&lt;/span&gt;
          &lt;span class="s"&gt;cask "stik" do&lt;/span&gt;
            &lt;span class="s"&gt;arch arm: "aarch64", intel: "x64"&lt;/span&gt;

            &lt;span class="s"&gt;version "${VERSION}"&lt;/span&gt;
            &lt;span class="s"&gt;sha256 arm:   "${SHA_ARM}",&lt;/span&gt;
                   &lt;span class="s"&gt;intel: "${SHA_INTEL}"&lt;/span&gt;

            &lt;span class="s"&gt;url "https://github.com/0xMassi/stik_app/releases/download/v#{version}/Stik_#{version}_#{arch}.dmg"&lt;/span&gt;
            &lt;span class="s"&gt;name "Stik"&lt;/span&gt;
            &lt;span class="s"&gt;desc "Instant thought capture - one shortcut, post-it appears, type, close"&lt;/span&gt;
            &lt;span class="s"&gt;homepage "https://github.com/0xMassi/stik_app"&lt;/span&gt;

            &lt;span class="s"&gt;depends_on macos: "&amp;gt;= :catalina"&lt;/span&gt;

            &lt;span class="s"&gt;app "Stik.app"&lt;/span&gt;

            &lt;span class="s"&gt;zap trash: [&lt;/span&gt;
              &lt;span class="s"&gt;"~/Documents/Stik",&lt;/span&gt;
              &lt;span class="s"&gt;"~/.stik",&lt;/span&gt;
              &lt;span class="s"&gt;"~/Library/Caches/com.stik.app",&lt;/span&gt;
              &lt;span class="s"&gt;"~/Library/WebKit/com.stik.app",&lt;/span&gt;
            &lt;span class="s"&gt;]&lt;/span&gt;
          &lt;span class="s"&gt;end&lt;/span&gt;
          &lt;span class="s"&gt;CASKEOF&lt;/span&gt;

          &lt;span class="s"&gt;# Base64-encode the cask content&lt;/span&gt;
          &lt;span class="s"&gt;CONTENT=$(base64 -i "$RUNNER_TEMP/stik.rb")&lt;/span&gt;

          &lt;span class="s"&gt;# Get current file SHA from GitHub API (needed for update)&lt;/span&gt;
          &lt;span class="s"&gt;FILE_SHA=$(GH_TOKEN="$HOMEBREW_TAP_TOKEN" gh api repos/0xMassi/homebrew-stik/contents/Casks/stik.rb --jq '.sha')&lt;/span&gt;

          &lt;span class="s"&gt;# Push updated cask to tap repo&lt;/span&gt;
          &lt;span class="s"&gt;GH_TOKEN="$HOMEBREW_TAP_TOKEN" gh api repos/0xMassi/homebrew-stik/contents/Casks/stik.rb \&lt;/span&gt;
            &lt;span class="s"&gt;--method PUT \&lt;/span&gt;
            &lt;span class="s"&gt;-f message="Update Stik to v${VERSION}" \&lt;/span&gt;
            &lt;span class="s"&gt;-f sha="$FILE_SHA" \&lt;/span&gt;
            &lt;span class="s"&gt;-f content="$CONTENT"&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Trigger landing page rebuild&lt;/span&gt;
        &lt;span class="na"&gt;if&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;success()&lt;/span&gt;
        &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;VERCEL_DEPLOY_HOOK&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.VERCEL_DEPLOY_HOOK }}&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;curl -s -X POST "$VERCEL_DEPLOY_HOOK"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  The secrets you need
&lt;/h3&gt;

&lt;p&gt;In your GitHub repo settings, add these secrets:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Secret&lt;/th&gt;
&lt;th&gt;What it is&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;APPLE_CERTIFICATE&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Your &lt;code&gt;.p12&lt;/code&gt; certificate, base64 encoded&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;APPLE_CERTIFICATE_PASSWORD&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;The password you set when exporting the &lt;code&gt;.p12&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;APPLE_SIGNING_IDENTITY&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Developer ID Application: Your Name (TEAMID)&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;APPLE_ID&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Your Apple ID email&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;APPLE_PASSWORD&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;The app-specific password from Step 1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;APPLE_TEAM_ID&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Your 10-character team ID&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;TAURI_SIGNING_PRIVATE_KEY&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;The updater private key from Step 5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;TAURI_SIGNING_PRIVATE_KEY_PASSWORD&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;The password you set when generating that key&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;HOMEBREW_TAP_TOKEN&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;A personal access token with write access to your tap repo&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;VERCEL_DEPLOY_HOOK&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Your deploy hook URL (only if you keep the landing page step)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;To base64 encode your certificate:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;base64&lt;/span&gt; &lt;span class="nt"&gt;-i&lt;/span&gt; Certificates.p12 | pbcopy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
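If you prefer the terminal, the same secrets can be set with the GitHub CLI instead of the web UI. A sketch, assuming `gh` is authenticated and run from the repo directory; all values here are placeholders:

```shell
# Add each secret from the terminal (substitute your own values)
gh secret set APPLE_CERTIFICATE --body "$(base64 -i Certificates.p12)"
gh secret set APPLE_CERTIFICATE_PASSWORD --body "your-p12-password"
gh secret set APPLE_SIGNING_IDENTITY --body "Developer ID Application: Your Name (TEAMID)"
gh secret set APPLE_ID --body "you@email.com"
gh secret set APPLE_PASSWORD --body "xxxx-xxxx-xxxx-xxxx"
gh secret set APPLE_TEAM_ID --body "XXXXXXXXXX"
```

This also makes it easy to script secret rotation later, for example when the certificate is renewed.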



&lt;h3&gt;
  
  
  What tauri-action does for you
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;tauri-apps/tauri-action&lt;/code&gt; GitHub Action handles most of the hard work. When you provide the Apple environment variables, it automatically:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Imports the certificate into a temporary keychain on the runner&lt;/li&gt;
&lt;li&gt;Signs the app bundle with your Developer ID&lt;/li&gt;
&lt;li&gt;Submits the app to Apple's notarization service&lt;/li&gt;
&lt;li&gt;Staples the notarization ticket to the &lt;code&gt;.dmg&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Uploads the result to GitHub Releases&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This saves you from writing hundreds of lines of &lt;code&gt;codesign&lt;/code&gt; and &lt;code&gt;xcrun notarytool&lt;/code&gt; commands yourself.&lt;/p&gt;
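You can sanity-check what the action produced by downloading a release build and asking macOS directly. These are standard Apple tools; the paths are just examples:

```shell
# Confirm the signature is intact all the way down the bundle
codesign --verify --deep --strict --verbose=2 /Applications/Stik.app

# Ask Gatekeeper whether it would accept the app
spctl -a -vv /Applications/Stik.app

# Confirm the notarization ticket is stapled to the DMG
xcrun stapler validate Stik_0.4.0_aarch64.dmg
```

If `spctl` reports "accepted" with source "Notarized Developer ID", the whole chain worked.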

&lt;h3&gt;
  
  
  The sidecar naming problem
&lt;/h3&gt;

&lt;p&gt;If you're using a Swift (or any other) sidecar binary, Tauri expects a very specific naming convention:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;src-tauri/binaries/{name}-{target-triple}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;src-tauri/binaries/darwinkit-aarch64-apple-darwin
src-tauri/binaries/darwinkit-x86_64-apple-darwin
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the name doesn't match exactly, Tauri won't bundle it and you'll get a runtime error when trying to spawn the sidecar. This cost me hours of debugging.&lt;/p&gt;
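For reference, the sidecar also has to be declared in `tauri.conf.json` under `bundle.externalBin`, without the target-triple suffix; Tauri appends the triple itself at build time. A minimal sketch (`darwinkit` is my binary's name, substitute your own):

```json
{
  "bundle": {
    "externalBin": [
      "binaries/darwinkit"
    ]
  }
}
```

If you're unsure which triple your machine uses, `rustc -Vv` prints it on the `host` line.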

&lt;h2&gt;
  
  
  Step 4: Homebrew distribution
&lt;/h2&gt;

&lt;p&gt;Homebrew is the standard way developers install tools on macOS. Getting your app into Homebrew makes installation a one-liner:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;brew &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--cask&lt;/span&gt; stik
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Creating a Homebrew tap
&lt;/h3&gt;

&lt;p&gt;A tap is a GitHub repository that contains your Homebrew formula. Create a repo named &lt;code&gt;homebrew-{name}&lt;/code&gt; (for example, &lt;code&gt;homebrew-stik&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;Inside it, create &lt;code&gt;Casks/stik.rb&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="n"&gt;cask&lt;/span&gt; &lt;span class="s2"&gt;"stik"&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
  &lt;span class="n"&gt;arch&lt;/span&gt; &lt;span class="ss"&gt;arm: &lt;/span&gt;&lt;span class="s2"&gt;"aarch64"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;intel: &lt;/span&gt;&lt;span class="s2"&gt;"x64"&lt;/span&gt;

  &lt;span class="n"&gt;version&lt;/span&gt; &lt;span class="s2"&gt;"0.4.0"&lt;/span&gt;
  &lt;span class="n"&gt;sha256&lt;/span&gt; &lt;span class="ss"&gt;arm:   &lt;/span&gt;&lt;span class="s2"&gt;"SHA256_ARM64_HERE"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
         &lt;span class="ss"&gt;intel: &lt;/span&gt;&lt;span class="s2"&gt;"SHA256_X64_HERE"&lt;/span&gt;

  &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="s2"&gt;"https://github.com/0xMassi/stik_app/releases/download/v&lt;/span&gt;&lt;span class="si"&gt;#{&lt;/span&gt;&lt;span class="n"&gt;version&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/Stik_&lt;/span&gt;&lt;span class="si"&gt;#{&lt;/span&gt;&lt;span class="n"&gt;version&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;_&lt;/span&gt;&lt;span class="si"&gt;#{&lt;/span&gt;&lt;span class="n"&gt;arch&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;.dmg"&lt;/span&gt;
  &lt;span class="nb"&gt;name&lt;/span&gt; &lt;span class="s2"&gt;"Stik"&lt;/span&gt;
  &lt;span class="n"&gt;desc&lt;/span&gt; &lt;span class="s2"&gt;"Instant thought capture for macOS"&lt;/span&gt;
  &lt;span class="n"&gt;homepage&lt;/span&gt; &lt;span class="s2"&gt;"https://www.stik.ink"&lt;/span&gt;

  &lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="s2"&gt;"Stik.app"&lt;/span&gt;

  &lt;span class="n"&gt;zap&lt;/span&gt; &lt;span class="ss"&gt;trash: &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="s2"&gt;"~/Documents/Stik"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s2"&gt;"~/.stik"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Tap vs Homebrew Core
&lt;/h3&gt;

&lt;p&gt;With a tap, users install with &lt;code&gt;brew install --cask 0xMassi/stik/stik&lt;/code&gt;, or run &lt;code&gt;brew tap 0xMassi/stik&lt;/code&gt; once and then use the short name. To get into Homebrew Core (just &lt;code&gt;brew install --cask stik&lt;/code&gt;, no tap needed), you need to meet the &lt;a href="https://docs.brew.sh/Acceptable-Casks" rel="noopener noreferrer"&gt;inclusion criteria&lt;/a&gt;: the app needs to be notable, actively maintained, and popular enough. Start with a tap, then submit to Core once you have traction.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 5: Auto-updates
&lt;/h2&gt;

&lt;p&gt;If you ship v0.3.0 without an auto-updater, your early users are stuck there forever unless they manually check for updates. I learned this the hard way. I shipped the auto-updater in v0.3.3, which meant my first 100+ users needed a manual update to get it.&lt;/p&gt;

&lt;p&gt;Tauri has a built-in updater plugin. Add it to your &lt;code&gt;Cargo.toml&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="nn"&gt;[dependencies]&lt;/span&gt;
&lt;span class="py"&gt;tauri-plugin-updater&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"2"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Configure it in &lt;code&gt;tauri.conf.json&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"plugins"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"updater"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"endpoints"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"https://github.com/0xMassi/stik_app/releases/latest/download/latest.json"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"pubkey"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"YOUR_PUBLIC_KEY_HERE"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Generate the key pair with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx @tauri-apps/cli signer generate &lt;span class="nt"&gt;-w&lt;/span&gt; ~/.tauri/stik.key
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Store the private key as a GitHub secret (&lt;code&gt;TAURI_SIGNING_PRIVATE_KEY&lt;/code&gt;) and add the public key to &lt;code&gt;tauri.conf.json&lt;/code&gt;. The CI pipeline will automatically sign the update bundle during build.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;latest.json&lt;/code&gt; file is generated by &lt;code&gt;tauri-action&lt;/code&gt; and uploaded to your GitHub Release. It contains the download URL and signature for each platform.&lt;/p&gt;
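If you're curious what that manifest looks like, here's a rough sketch following the shape Tauri's updater expects; the dates, archive names, and signatures are illustrative, and the real signatures are long base64 strings:

```json
{
  "version": "0.4.0",
  "notes": "See CHANGELOG for details.",
  "pub_date": "2026-05-13T16:00:00Z",
  "platforms": {
    "darwin-aarch64": {
      "signature": "dW50cnVzdGVkIGNvbW1lbnQ...",
      "url": "https://github.com/0xMassi/stik_app/releases/download/v0.4.0/Stik_aarch64.app.tar.gz"
    },
    "darwin-x86_64": {
      "signature": "dW50cnVzdGVkIGNvbW1lbnQ...",
      "url": "https://github.com/0xMassi/stik_app/releases/download/v0.4.0/Stik_x64.app.tar.gz"
    }
  }
}
```

The updater verifies each download against the `signature` field using the public key you embedded in `tauri.conf.json`, which is why the private key has to live in CI.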

&lt;p&gt;On the Rust side, check for updates on app launch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;tauri_plugin_updater&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;UpdaterExt&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nn"&gt;tauri&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;Builder&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;default&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="nf"&gt;.plugin&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nn"&gt;tauri_plugin_updater&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;Builder&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="nf"&gt;.build&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
        &lt;span class="nf"&gt;.setup&lt;/span&gt;&lt;span class="p"&gt;(|&lt;/span&gt;&lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;handle&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="nf"&gt;.handle&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="nf"&gt;.clone&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
            &lt;span class="nn"&gt;tauri&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;async_runtime&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;spawn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;move&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nf"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;Some&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;update&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;handle&lt;/span&gt;&lt;span class="nf"&gt;.updater&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="nf"&gt;.check&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="k"&gt;.await&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;update&lt;/span&gt;&lt;span class="nf"&gt;.download_and_install&lt;/span&gt;&lt;span class="p"&gt;(|&lt;/span&gt;&lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="p"&gt;{},&lt;/span&gt; &lt;span class="p"&gt;||&lt;/span&gt; &lt;span class="p"&gt;{})&lt;/span&gt;&lt;span class="k"&gt;.await&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
                &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="p"&gt;});&lt;/span&gt;
            &lt;span class="nf"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;(())&lt;/span&gt;
        &lt;span class="p"&gt;})&lt;/span&gt;
        &lt;span class="nf"&gt;.run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nn"&gt;tauri&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nd"&gt;generate_context!&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
        &lt;span class="nf"&gt;.expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"error running app"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The update downloads in the background and applies on the next restart. No user interaction needed.&lt;/p&gt;

&lt;h2&gt;
  
  
  The full release flow
&lt;/h2&gt;

&lt;p&gt;Here's what happens when I'm ready to ship a new version:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Update version in package.json and Cargo.toml&lt;/span&gt;
&lt;span class="c"&gt;# 2. Update CHANGELOG.md&lt;/span&gt;
&lt;span class="c"&gt;# 3. Commit&lt;/span&gt;

git tag v0.4.0
git push origin v0.4.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. From here, everything is automated:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;GitHub Actions detects the tag&lt;/li&gt;
&lt;li&gt;Builds the Swift sidecar as a universal binary (arm64 + x86_64)&lt;/li&gt;
&lt;li&gt;Builds the Tauri app for both architectures&lt;/li&gt;
&lt;li&gt;Signs both builds with my Developer ID certificate&lt;/li&gt;
&lt;li&gt;Submits both to Apple for notarization (takes 2-5 minutes)&lt;/li&gt;
&lt;li&gt;Staples the notarization tickets&lt;/li&gt;
&lt;li&gt;Uploads the &lt;code&gt;.dmg&lt;/code&gt; files and &lt;code&gt;latest.json&lt;/code&gt; to a new GitHub Release&lt;/li&gt;
&lt;li&gt;Updates the Homebrew tap with new version and SHA256 hashes&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Total time: about 15 minutes. Manual steps: one git tag.&lt;/p&gt;

&lt;h2&gt;
  
  
  Things I wish I knew earlier
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Notarization can be slow.&lt;/strong&gt; Apple's notarization service usually takes 2-5 minutes but can sometimes take 15-20 minutes. Your CI workflow needs to handle this. &lt;code&gt;tauri-action&lt;/code&gt; polls automatically, but set a reasonable timeout.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The certificate expires.&lt;/strong&gt; Developer ID certificates are valid for 5 years. Set a calendar reminder. If it expires, your CI pipeline breaks silently and you ship unsigned builds.&lt;/p&gt;
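You can check how long your certificate has left straight from the keychain, using standard macOS and OpenSSL tooling:

```shell
# Print the expiry date of the Developer ID certificate in your keychain
security find-certificate -c "Developer ID Application" -p | \
  openssl x509 -noout -enddate
```

Worth running once a year, or wiring into a scheduled CI job that fails loudly when the date gets close.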

&lt;p&gt;&lt;strong&gt;Universal binaries for sidecars.&lt;/strong&gt; If you have a Swift sidecar, you need to build it as a universal binary (&lt;code&gt;--arch arm64 --arch x86_64&lt;/code&gt;) so it works on both Intel and Apple Silicon Macs. Tauri won't do this for you. It only handles the Rust binary.&lt;/p&gt;
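For a SwiftPM sidecar, the universal build plus the renaming Tauri expects looks roughly like this; the `darwinkit` name and output path are from my setup, so adjust to your project layout:

```shell
# Build the Swift sidecar with both slices in one binary
swift build -c release --arch arm64 --arch x86_64

# Copy the universal binary under both target-triple names so
# Tauri bundles it whichever target it builds for
cp .build/apple/Products/Release/darwinkit \
   src-tauri/binaries/darwinkit-aarch64-apple-darwin
cp .build/apple/Products/Release/darwinkit \
   src-tauri/binaries/darwinkit-x86_64-apple-darwin

# Confirm both architectures are actually present
lipo -info src-tauri/binaries/darwinkit-aarch64-apple-darwin
```

Copying the same universal binary twice is fine: each copy contains both slices, and the naming is only there to satisfy Tauri's lookup.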

&lt;p&gt;&lt;strong&gt;Test the signed build locally first.&lt;/strong&gt; Before setting up CI, do one manual signing and notarization run on your machine. It's much easier to debug when you can see the output directly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Sign&lt;/span&gt;
codesign &lt;span class="nt"&gt;--deep&lt;/span&gt; &lt;span class="nt"&gt;--force&lt;/span&gt; &lt;span class="nt"&gt;--verify&lt;/span&gt; &lt;span class="nt"&gt;--verbose&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--sign&lt;/span&gt; &lt;span class="s2"&gt;"Developer ID Application: Your Name (TEAMID)"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--options&lt;/span&gt; runtime &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--entitlements&lt;/span&gt; Entitlements.plist &lt;span class="se"&gt;\&lt;/span&gt;
  target/release/bundle/macos/YourApp.app

&lt;span class="c"&gt;# Notarize&lt;/span&gt;
xcrun notarytool submit target/release/bundle/macos/YourApp.dmg &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--apple-id&lt;/span&gt; you@email.com &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--password&lt;/span&gt; xxxx-xxxx-xxxx-xxxx &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--team-id&lt;/span&gt; XXXXXXXXXX &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--wait&lt;/span&gt;

&lt;span class="c"&gt;# Staple&lt;/span&gt;
xcrun stapler staple target/release/bundle/macos/YourApp.dmg
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
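When a submission is rejected, the `submit` output includes a submission id you can feed back to `notarytool` for the detailed report, which names the exact files Apple objected to. `SUBMISSION_ID` here is a placeholder from that output:

```shell
# Fetch the detailed report for a submission
xcrun notarytool log SUBMISSION_ID \
  --apple-id you@email.com \
  --password xxxx-xxxx-xxxx-xxxx \
  --team-id XXXXXXXXXX \
  notarization-log.json
```

The JSON log lists each offending binary with a reason like "binary is not signed with a valid Developer ID certificate", which is far more useful than the bare "Invalid" status.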



&lt;p&gt;&lt;strong&gt;Ship the auto-updater from day one.&lt;/strong&gt; Every user who downloads your app before the updater exists becomes a user you can't update automatically. Don't make my mistake.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Entitlements matter.&lt;/strong&gt; If your app crashes right after notarization but works fine unsigned, it's almost certainly an entitlements issue. Tauri's WebView needs JIT and unsigned executable memory permissions. Check the entitlements section above.&lt;/p&gt;

&lt;h2&gt;
  
  
  Was it worth it?
&lt;/h2&gt;

&lt;p&gt;Setting up this pipeline took about two days of trial and error. But since then, every release is a single command. I've shipped 4 versions in a week with zero friction.&lt;/p&gt;

&lt;p&gt;If you're building a Tauri app and planning to distribute it to real users, invest in this infrastructure early. The time you spend on CI/CD pays for itself after the second release.&lt;/p&gt;

&lt;p&gt;The full source is available at &lt;a href="https://github.com/0xMassi/stik_app" rel="noopener noreferrer"&gt;github.com/0xMassi/stik_app&lt;/a&gt;, including the complete GitHub Actions workflow. MIT licensed.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;If you have questions about any of this, drop a comment or &lt;a href="https://github.com/0xMassi/stik_app/issues" rel="noopener noreferrer"&gt;open an issue&lt;/a&gt; on the repo. Happy to help.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>rust</category>
      <category>tauri</category>
      <category>github</category>
      <category>macos</category>
    </item>
  </channel>
</rss>
