<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Zee</title>
    <description>The latest articles on DEV Community by Zee (@zee_builds).</description>
    <link>https://dev.to/zee_builds</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3875644%2F87032ac1-a71e-4e82-9eb9-db42127d93d2.png</url>
      <title>DEV Community: Zee</title>
      <link>https://dev.to/zee_builds</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/zee_builds"/>
    <language>en</language>
    <item>
      <title>Stop pretending your scraper worked: honest JSON for AI agents</title>
      <dc:creator>Zee</dc:creator>
      <pubDate>Mon, 01 Jun 2026 18:39:57 +0000</pubDate>
      <link>https://dev.to/zee_builds/stop-pretending-your-scraper-worked-honest-json-for-ai-agents-1bm3</link>
      <guid>https://dev.to/zee_builds/stop-pretending-your-scraper-worked-honest-json-for-ai-agents-1bm3</guid>
      <description>&lt;p&gt;Most scraper demos lie by accident.&lt;/p&gt;

&lt;p&gt;They show the happy path: one URL, one clean page, one neat JSON object. Then the first real user tries a marketplace search page, a login wall, a JavaScript shell, a rate-limited product page, or a site that serves different HTML to every fetch path.&lt;/p&gt;

&lt;p&gt;The response still comes back as JSON, so everyone relaxes.&lt;/p&gt;

&lt;p&gt;That is the trap. A JSON response is not the same thing as a useful extraction.&lt;/p&gt;

&lt;h2&gt;
  
  
  The failure mode agents hate
&lt;/h2&gt;

&lt;p&gt;AI agents do not just need scraped text. They need to know what happened.&lt;/p&gt;

&lt;p&gt;Bad extraction output looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Example product"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"price"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"$29.99"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"availability"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"in stock"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That looks fine until you inspect the source and discover the page was a login prompt, a bot challenge, or a thin JavaScript shell. The extractor filled the schema because the schema was requested. Helpful. Like a smoke alarm that hums a little song while the kitchen burns.&lt;/p&gt;

&lt;p&gt;Better extraction output separates the data from the confidence and the failure class:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"failed"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"failure_type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"login_required"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"confidence"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.94&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"extracted"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"evidence"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"final_url_type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"restricted_page"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"visible_content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"login prompt"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"structured_data_found"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"next_step"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Use an authorised source, public item URL, feed, API, or sample HTML."&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is less flashy. It is also much more useful.&lt;/p&gt;
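&lt;p&gt;On the calling side, an agent can gate on that shape before it trusts anything. A minimal sketch, assuming the field names from the example above. The &lt;code&gt;is_usable&lt;/code&gt; and &lt;code&gt;next_action&lt;/code&gt; helpers, the &lt;code&gt;"ok"&lt;/code&gt; status value, and the 0.8 threshold are illustrative assumptions, not part of any published API:&lt;/p&gt;

```python
from typing import Any

# Hypothetical threshold; tune it for your own workflow.
MIN_CONFIDENCE = 0.8


def is_usable(result: dict[str, Any], min_confidence: float = MIN_CONFIDENCE) -> bool:
    """Return True only when the result carries grounded, extracted data."""
    if result.get("status") != "ok":  # "ok" is an assumed success value
        return False
    if result.get("extracted") is None:
        return False
    confidence = result.get("confidence")
    # Treat a missing or non-numeric confidence as untrusted.
    if not isinstance(confidence, (int, float)):
        return False
    return confidence >= min_confidence


def next_action(result: dict[str, Any]) -> str:
    """Use the data when it is safe, else fall back to the suggested next step."""
    if is_usable(result):
        return "use_extracted_data"
    return result.get("next_step") or "mark_unresolved"
```

&lt;p&gt;Note that the failed example above has a confidence of 0.94 and still fails the gate, because the status is checked first. High confidence in a failure is not permission to use the data.&lt;/p&gt;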

&lt;h2&gt;
  
  
  The useful contract is not “scrape anything”
&lt;/h2&gt;

&lt;p&gt;“Scrape anything” is usually a warning label wearing lipstick.&lt;/p&gt;

&lt;p&gt;For agent workflows, the better contract is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Return structured data when the page provides enough evidence.&lt;/li&gt;
&lt;li&gt;Return a specific, honest failure when it does not.&lt;/li&gt;
&lt;li&gt;Preserve enough metadata for the caller to decide what to do next.&lt;/li&gt;
&lt;li&gt;Never invent fields just because a prompt asked nicely.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This matters for ecommerce, lead enrichment, price monitoring, competitor tracking, procurement, and internal research agents. If the agent cannot tell the difference between “product unavailable”, “page blocked”, “login required”, and “the parser guessed”, it will make bad decisions with a straight face.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I mean by honest failure
&lt;/h2&gt;

&lt;p&gt;An honest extraction system should classify common failures explicitly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;login_required&lt;/code&gt;: public fetch reached a sign-in wall.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;captcha_required&lt;/code&gt;: the target presented a challenge.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;access_denied&lt;/code&gt;: the target refused access.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;thin_public_content&lt;/code&gt;: the visible public page does not contain enough useful data.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;not_found&lt;/code&gt;: the page genuinely appears missing.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;timeout&lt;/code&gt;: the target or render path did not finish within budget.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;unsupported_source&lt;/code&gt;: the input is outside the allowed fetch policy.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That last one matters. Some sources need permission, an account, a feed, a partnership, or a customer-provided export. Pretending otherwise is how “automation” turns into reputation damage.&lt;/p&gt;
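&lt;p&gt;The classes above fit naturally into an enum plus a small classifier. This is a rough sketch rather than a production rule set: the status-code mapping, the marker phrases, and the 200-character floor are all assumptions, and &lt;code&gt;timeout&lt;/code&gt; and &lt;code&gt;unsupported_source&lt;/code&gt; are left to the fetcher and the fetch policy, which know about them before any page text exists:&lt;/p&gt;

```python
from enum import Enum
from typing import Optional


class FailureType(str, Enum):
    LOGIN_REQUIRED = "login_required"
    CAPTCHA_REQUIRED = "captcha_required"
    ACCESS_DENIED = "access_denied"
    THIN_PUBLIC_CONTENT = "thin_public_content"
    NOT_FOUND = "not_found"
    TIMEOUT = "timeout"                        # set by the fetcher, not here
    UNSUPPORTED_SOURCE = "unsupported_source"  # set by the fetch policy, not here


# Illustrative marker phrases only; real pages need broader, tested rules.
_TEXT_MARKERS = [
    (FailureType.CAPTCHA_REQUIRED, ("verify you are human", "captcha")),
    (FailureType.LOGIN_REQUIRED, ("sign in to continue", "log in to continue")),
    (FailureType.ACCESS_DENIED, ("access denied",)),
]


def classify_failure(status_code: int, text: str, min_chars: int = 200) -> Optional[FailureType]:
    """Return a failure class for a fetched page, or None if it looks usable."""
    if status_code == 404:
        return FailureType.NOT_FOUND
    if status_code == 401:
        return FailureType.LOGIN_REQUIRED
    if status_code == 403:
        return FailureType.ACCESS_DENIED
    lowered = text.lower()
    for failure, markers in _TEXT_MARKERS:
        if any(marker in lowered for marker in markers):
            return failure
    # Too little visible text to support any extraction.
    if min_chars > len(text.strip()):
        return FailureType.THIN_PUBLIC_CONTENT
    return None
```

&lt;p&gt;Because the enum subclasses &lt;code&gt;str&lt;/code&gt;, each member serialises to the same string the caller sees in the JSON.&lt;/p&gt;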

&lt;h2&gt;
  
  
  Why this is an MCP problem too
&lt;/h2&gt;

&lt;p&gt;MCP makes it easier for agents to call tools. Good.&lt;/p&gt;

&lt;p&gt;It also makes it easier for agents to call bad tools confidently. Less good.&lt;/p&gt;

&lt;p&gt;If an MCP tool says “extract product data from this page”, the caller needs more than a blob of text. It needs a result shape that tells the agent whether the answer is safe to use.&lt;/p&gt;

&lt;p&gt;A decent MCP extraction response should expose:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;mode used, such as static HTML, browser render, deterministic parser, or LLM-assisted extraction,&lt;/li&gt;
&lt;li&gt;confidence,&lt;/li&gt;
&lt;li&gt;item counts,&lt;/li&gt;
&lt;li&gt;failure type when blocked,&lt;/li&gt;
&lt;li&gt;whether the result came from visible page evidence,&lt;/li&gt;
&lt;li&gt;a bounded next step.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That gives the agent a decision boundary. Without it, the agent just spreads the lie downstream, but faster.&lt;/p&gt;
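&lt;p&gt;One way to hold a tool to that list is to make the response a typed object that refuses to exist in a dishonest state. The sketch below is an assumption about shape, not the MCP specification or any real server's schema; the class and field names are hypothetical:&lt;/p&gt;

```python
import json
from dataclasses import asdict, dataclass
from typing import Any, Optional


@dataclass
class ExtractionResult:
    mode: str                    # e.g. "static_html", "browser_render"
    confidence: float
    item_count: int
    from_visible_evidence: bool
    failure_type: Optional[str] = None
    next_step: Optional[str] = None
    extracted: Optional[Any] = None

    def __post_init__(self) -> None:
        # A blocked result must not carry data, and must say why it failed.
        if self.failure_type is not None and self.extracted is not None:
            raise ValueError("a failed result must not include extracted data")
        if self.failure_type is None and self.extracted is None:
            raise ValueError("a result without data needs a failure_type")

    def to_json(self) -> str:
        """Serialise every field, including the nulls, so the shape is stable."""
        return json.dumps(asdict(self))
```

&lt;p&gt;The two checks in &lt;code&gt;__post_init__&lt;/code&gt; are the decision boundary in code form: data and a failure class never travel together, and an empty result always explains itself.&lt;/p&gt;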

&lt;h2&gt;
  
  
  A practical pattern
&lt;/h2&gt;

&lt;p&gt;For production extraction, I like this rough flow:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Fetch the page through the cheapest safe path.&lt;/li&gt;
&lt;li&gt;Classify obvious blocks before asking an LLM anything.&lt;/li&gt;
&lt;li&gt;Try deterministic parsing for known structures: JSON-LD, tables, product cards, metadata, feeds.&lt;/li&gt;
&lt;li&gt;Use browser rendering only when the page actually needs it.&lt;/li&gt;
&lt;li&gt;Ask an LLM to structure evidence, not to hallucinate missing evidence.&lt;/li&gt;
&lt;li&gt;Attach confidence and failure metadata.&lt;/li&gt;
&lt;li&gt;Make quota and billing count successful, useful work, not every provider attempt made along the way.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The boring bits are the product.&lt;/p&gt;

&lt;p&gt;Anyone can make a demo that extracts one page. The hard part is making the system fail in a way the caller can trust.&lt;/p&gt;
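&lt;p&gt;Step 3 of the flow above, deterministic parsing, needs nothing beyond the standard library for the JSON-LD case. A minimal sketch: the function names and the &lt;code&gt;mode&lt;/code&gt; values are my own placeholders, and a real pipeline would also cover tables, product cards, and feeds:&lt;/p&gt;

```python
import json
from html.parser import HTMLParser
from typing import Any


class _JsonLdCollector(HTMLParser):
    """Collect the raw text of every application/ld+json script tag."""

    def __init__(self) -> None:
        super().__init__()
        self._in_json_ld = False
        self._buffer: list[str] = []
        self.blocks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_json_ld = True
            self._buffer = []

    def handle_data(self, data):
        # Script bodies can arrive in several chunks, so buffer them.
        if self._in_json_ld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self.blocks.append("".join(self._buffer))
            self._in_json_ld = False


def extract_json_ld(html: str) -> list[Any]:
    """Return parsed JSON-LD objects; skip blocks that are not valid JSON."""
    collector = _JsonLdCollector()
    collector.feed(html)
    collector.close()
    parsed = []
    for block in collector.blocks:
        try:
            parsed.append(json.loads(block))
        except json.JSONDecodeError:
            continue  # malformed block: ignore it rather than guess
    return parsed


def deterministic_first(html: str) -> dict[str, Any]:
    """Use structured data when the page provides it, else defer to a fallback."""
    items = extract_json_ld(html)
    if items:
        return {"mode": "deterministic_parser", "items": items}
    return {"mode": "needs_fallback", "items": []}
```

&lt;p&gt;Nothing here calls a model. If the page already publishes its own structured data, the cheapest honest answer is to read it.&lt;/p&gt;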

&lt;h2&gt;
  
  
  Example: agent-readable discovery
&lt;/h2&gt;

&lt;p&gt;This also affects how tools get discovered.&lt;/p&gt;

&lt;p&gt;If you run an agent-readable directory or service registry, a vague card that says “web scraper API” is not enough. Agents need to know:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;what input the service accepts,&lt;/li&gt;
&lt;li&gt;what output shape it returns,&lt;/li&gt;
&lt;li&gt;what it refuses to do,&lt;/li&gt;
&lt;li&gt;what authentication is required,&lt;/li&gt;
&lt;li&gt;what a first safe test call looks like,&lt;/li&gt;
&lt;li&gt;what failure classes mean.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is why I like service cards over logo walls. A human can infer a lot from branding. Agents need contracts.&lt;/p&gt;
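&lt;p&gt;As a sketch of what such a contract could look like, here is a service card as plain data with a completeness check. Every field name and value below is hypothetical, chosen to mirror the six questions above; it is not the OpenInvoke format or any registry's real schema:&lt;/p&gt;

```python
from typing import Any

# Hypothetical field names, one per question in the list above.
REQUIRED_CARD_FIELDS = (
    "input",
    "output_shape",
    "refusals",
    "authentication",
    "first_safe_call",
    "failure_classes",
)


def missing_card_fields(card: dict[str, Any]) -> list[str]:
    """List required fields that are absent or empty in a service card."""
    return [name for name in REQUIRED_CARD_FIELDS if not card.get(name)]


# Illustrative card only; the values are made up for the example.
example_card = {
    "input": {"url": "string", "prompt": "string"},
    "output_shape": {"status": "string", "extracted": "object or null"},
    "refusals": ["login-gated pages", "captcha challenges"],
    "authentication": "API key header",
    "first_safe_call": {"url": "https://example.com", "prompt": "Extract the page title"},
    "failure_classes": {"login_required": "public fetch reached a sign-in wall"},
}
```

&lt;p&gt;A directory could run &lt;code&gt;missing_card_fields&lt;/code&gt; at submission time and reject cards that leave any of the six questions unanswered.&lt;/p&gt;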

&lt;h2&gt;
  
  
  A small demo path
&lt;/h2&gt;

&lt;p&gt;I am building this into Haunt API, a web extraction API with an MCP surface, and listing it through OpenInvoke, an agent-readable service directory.&lt;/p&gt;

&lt;p&gt;Useful starting points:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Haunt MCP web extraction use case: &lt;a href="https://hauntapi.com/use-cases/mcp-server-for-web-scraping?utm_source=devto&amp;amp;utm_medium=article&amp;amp;utm_campaign=seven_day_push_2026_06_day1_devto&amp;amp;utm_content=honest_json_agents" rel="noopener noreferrer"&gt;https://hauntapi.com/use-cases/mcp-server-for-web-scraping?utm_source=devto&amp;amp;utm_medium=article&amp;amp;utm_campaign=seven_day_push_2026_06_day1_devto&amp;amp;utm_content=honest_json_agents&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Haunt docs and demo path: &lt;a href="https://hauntapi.com/docs?utm_source=devto&amp;amp;utm_medium=article&amp;amp;utm_campaign=seven_day_push_2026_06_day1_devto&amp;amp;utm_content=docs_demo" rel="noopener noreferrer"&gt;https://hauntapi.com/docs?utm_source=devto&amp;amp;utm_medium=article&amp;amp;utm_campaign=seven_day_push_2026_06_day1_devto&amp;amp;utm_content=docs_demo&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;OpenInvoke agent-readable directory idea: &lt;a href="https://openinvoke.com/agent-readable-api-directory/?utm_source=devto&amp;amp;utm_medium=article&amp;amp;utm_campaign=seven_day_push_2026_06_day1_devto&amp;amp;utm_content=service_cards" rel="noopener noreferrer"&gt;https://openinvoke.com/agent-readable-api-directory/?utm_source=devto&amp;amp;utm_medium=article&amp;amp;utm_campaign=seven_day_push_2026_06_day1_devto&amp;amp;utm_content=service_cards&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The pitch is deliberately not “bypass every website”. That is not the game.&lt;/p&gt;

&lt;p&gt;The better game is: extract what is legitimately available, say when it is not, and give agents a result they can reason about without pretending blocked pages are products.&lt;/p&gt;

&lt;p&gt;That is less magical. It is also the version that does not poison your workflow.&lt;/p&gt;

</description>
      <category>webscraping</category>
      <category>mcp</category>
      <category>ai</category>
      <category>api</category>
    </item>
    <item>
      <title>When your web extraction tool should fail loudly instead of returning pretty lies</title>
      <dc:creator>Zee</dc:creator>
      <pubDate>Tue, 12 May 2026 19:25:03 +0000</pubDate>
      <link>https://dev.to/zee_builds/when-your-web-extraction-tool-should-fail-loudly-instead-of-returning-pretty-lies-2j0e</link>
      <guid>https://dev.to/zee_builds/when-your-web-extraction-tool-should-fail-loudly-instead-of-returning-pretty-lies-2j0e</guid>
      <description>&lt;p&gt;A web extraction API has one job that sounds boring until it fails:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;return the data that exists, or admit that it could not get it.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That second half matters more than most people want to admit.&lt;/p&gt;

&lt;p&gt;When you put an LLM at the end of a scraping pipeline, you get a nasty failure mode. The fetch fails, the page is blocked, the PDF text is empty, or the site returns a CAPTCHA page, and the model still tries to be helpful. Helpful, in this case, means inventing plausible JSON.&lt;/p&gt;

&lt;p&gt;That is worse than a 500.&lt;/p&gt;

&lt;p&gt;A 500 tells your pipeline to retry, route, alert, or skip. Fabricated JSON quietly poisons whatever comes next.&lt;/p&gt;

&lt;h2&gt;
  
  
  The pattern we ended up using
&lt;/h2&gt;

&lt;p&gt;For Haunt API, the extraction path is deliberately boring before it is clever:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;fetch the page directly&lt;/li&gt;
&lt;li&gt;fall back through stronger fetch/render paths when needed&lt;/li&gt;
&lt;li&gt;inspect what actually came back&lt;/li&gt;
&lt;li&gt;only ask the model to extract when there is real page content&lt;/li&gt;
&lt;li&gt;return a structured failure when the page is inaccessible or clearly a verification wall&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The key part is step 3. Do not treat “HTTP 200” as “we got the page”. A lot of sites return a successful status code for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;login walls&lt;/li&gt;
&lt;li&gt;consent walls&lt;/li&gt;
&lt;li&gt;CAPTCHA pages&lt;/li&gt;
&lt;li&gt;JavaScript shells with no meaningful content&lt;/li&gt;
&lt;li&gt;PDF wrappers with empty text&lt;/li&gt;
&lt;li&gt;soft-block pages that look like normal HTML to a naive parser&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you pass that straight to an LLM and ask for product names, prices, company details, or whatever else, you are inviting fiction.&lt;/p&gt;

&lt;h2&gt;
  
  
  What a good failure looks like
&lt;/h2&gt;

&lt;p&gt;A good extraction failure should be boring and machine-readable:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"success"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"error_type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"captcha_required"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"message"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"The target page requires human verification before extraction."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://example.com/product"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"success"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"error_type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"empty_content"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"message"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"The page was reachable, but no extractable content was found."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://example.com/report.pdf"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That gives the caller something useful:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;retry later&lt;/li&gt;
&lt;li&gt;ask for credentials&lt;/li&gt;
&lt;li&gt;use a different source&lt;/li&gt;
&lt;li&gt;mark the record unresolved&lt;/li&gt;
&lt;li&gt;escalate to a human&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What it should not do is return:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"company_name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Example Holdings Ltd"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"revenue"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"$12.4M"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"employees"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;87&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;when none of that appeared on the page.&lt;/p&gt;

&lt;p&gt;Tiny haunted spreadsheet, now with investor-grade hallucinations. Lovely.&lt;/p&gt;
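&lt;p&gt;With structured errors like the two shown earlier, the caller's options become a lookup instead of a guess. A small sketch, assuming the &lt;code&gt;success&lt;/code&gt; and &lt;code&gt;error_type&lt;/code&gt; fields from those examples. The &lt;code&gt;timeout&lt;/code&gt; and &lt;code&gt;login_required&lt;/code&gt; keys and the whole mapping are assumptions; which error leads to which action is a policy choice:&lt;/p&gt;

```python
from typing import Any

# Assumed mapping from error type to follow-up action.
ACTIONS_BY_ERROR = {
    "timeout": "retry_later",
    "login_required": "ask_for_credentials",
    "captcha_required": "use_different_source",
    "empty_content": "mark_unresolved",
}


def choose_action(response: dict[str, Any]) -> str:
    """Map a structured extraction response to the caller's next action."""
    if response.get("success"):
        return "use_data"
    error_type = response.get("error_type")
    # Unknown or missing error types go to a person instead of being guessed at.
    return ACTIONS_BY_ERROR.get(error_type, "escalate_to_human")
```

&lt;p&gt;The default matters most. Anything the mapping does not recognise is escalated, so a new failure class cannot silently turn into a retry loop.&lt;/p&gt;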

&lt;h2&gt;
  
  
  Simple guardrails you can add
&lt;/h2&gt;

&lt;p&gt;If you are building this yourself, add checks before extraction:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;has_meaningful_content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;

    &lt;span class="n"&gt;bad_markers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;verify you are human&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;checking your browser&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;captcha&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;enable javascript&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;access denied&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;login required&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="n"&gt;lowered&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="nf"&gt;any&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;marker&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;lowered&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;marker&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;bad_markers&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is not enough on its own, but it catches a surprising amount of garbage before the model gets a chance to decorate it.&lt;/p&gt;

&lt;p&gt;Also make the model answer from evidence, not vibes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Extract only fields that are explicitly present in the provided page content.
If a field is missing, return null.
If the content is not the requested page, return an extraction_error object.
Do not infer, guess, or fill gaps from general knowledge.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then validate the output. If every field is mysteriously perfect after a weak fetch, be suspicious. The machine is smiling too much.&lt;/p&gt;
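&lt;p&gt;One cheap validation is a grounding check: every non-null value the model returned should appear somewhere in the text it was given. The helper below is a hypothetical sketch using plain substring matching after case and whitespace normalisation, so it will miss reformatted values (a page that says "1,200" against an extracted 1200). Treat its output as a flag for review, not a verdict:&lt;/p&gt;

```python
from typing import Any


def ungrounded_fields(extracted: dict[str, Any], page_text: str) -> list[str]:
    """Return names of fields whose values do not appear in the page text."""
    # Lowercase and collapse whitespace so layout differences do not matter.
    haystack = " ".join(page_text.lower().split())
    suspicious = []
    for name, value in extracted.items():
        if value is None:
            continue  # an honest null needs no evidence
        needle = " ".join(str(value).lower().split())
        if needle not in haystack:
            suspicious.append(name)
    return suspicious
```

&lt;p&gt;Run against the fabricated company record from earlier and a page that only mentions the company name, this flags &lt;code&gt;revenue&lt;/code&gt; and &lt;code&gt;employees&lt;/code&gt;, which is exactly the pair the model made up.&lt;/p&gt;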

&lt;h2&gt;
  
  
  Where MCP makes this sharper
&lt;/h2&gt;

&lt;p&gt;Agent workflows make this problem worse because the output is not always going straight to a human. Claude, Cursor, or another agent may call a tool, receive JSON, and continue planning from it.&lt;/p&gt;

&lt;p&gt;Bad extraction becomes bad reasoning.&lt;/p&gt;

&lt;p&gt;So for MCP tools, I think the contract should be stricter:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;return structured JSON when extraction is grounded&lt;/li&gt;
&lt;li&gt;return a structured error when it is not&lt;/li&gt;
&lt;li&gt;expose the failure reason clearly enough for the agent to choose the next step&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is what we are building toward with Haunt API: known URL in, natural-language prompt in, structured JSON out, but no pretending when the page cannot actually be read.&lt;/p&gt;

&lt;p&gt;If you are building agents that depend on web data, this is the boring line that matters:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;no data is better than fake data.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Haunt API: &lt;a href="https://hauntapi.com" rel="noopener noreferrer"&gt;https://hauntapi.com&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Python SDK:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;hauntapi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;MCP server:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx &lt;span class="nt"&gt;-y&lt;/span&gt; @hauntapi/mcp-server
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>python</category>
      <category>webscraping</category>
      <category>ai</category>
      <category>mcp</category>
    </item>
    <item>
      <title>The One Lesson I Learned Building a Web Extraction API in 2026</title>
      <dc:creator>Zee</dc:creator>
      <pubDate>Fri, 08 May 2026 04:16:13 +0000</pubDate>
      <link>https://dev.to/zee_builds/the-one-lesson-i-learned-building-a-web-extraction-api-in-2026-44f5</link>
      <guid>https://dev.to/zee_builds/the-one-lesson-i-learned-building-a-web-extraction-api-in-2026-44f5</guid>
      <description>&lt;p&gt;I spent the last few months building a web extraction API. Here's what surprised me most: &lt;strong&gt;developers don't need another scraper. They need extraction that stops breaking.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Every web scraping thread I read has the same arc:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Write a BeautifulSoup/Scrapy scraper&lt;/li&gt;
&lt;li&gt;It works for two weeks&lt;/li&gt;
&lt;li&gt;The target site changes one div&lt;/li&gt;
&lt;li&gt;Scraper breaks at 2am&lt;/li&gt;
&lt;li&gt;Dev swears, rewrites selectors&lt;/li&gt;
&lt;li&gt;Repeat&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The alternative everyone reaches for next: "I'll use Playwright. No, I'll use Puppeteer. No, a headless browser with proxy rotation. No..."&lt;/p&gt;

&lt;p&gt;But here's the thing most people miss: &lt;strong&gt;the problem isn't fetching. It's parsing.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The extraction-first approach
&lt;/h3&gt;

&lt;p&gt;At Haunt API (which I built), we flipped the model. Instead of fetch-then-parse, the user describes what they want in plain English: "Extract product name, price, and stock status from this page."&lt;/p&gt;

&lt;p&gt;The AI reads the page the way a human would: it works from context, not CSS selectors. When the site changes layout next week, the extraction should keep working because the prompt targets meaning, not markup.&lt;/p&gt;

&lt;h3&gt;
  
  
  What matters in 2026
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cloudflare bypass&lt;/strong&gt; is table stakes now. If your extraction service can't handle Cloudflare-protected sites, it's a hobby project.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Structured JSON output&lt;/strong&gt; matters more than markdown. LLMs consume JSON; humans debug with it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Failed extractions shouldn't cost anything.&lt;/strong&gt; You shouldn't pay for "the page loaded but I couldn't find what you asked for."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Natural language prompts &amp;gt; CSS selectors.&lt;/strong&gt; Site maintainers change divs. They don't change meaning.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  A practical example
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;

&lt;span class="n"&gt;resp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://hauntapi.com/v1/extract&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;X-API-Key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your_key_here&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://books.toscrape.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Extract all book titles and their prices&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
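    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# do not hang forever on a slow target
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;raise_for_status&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# fail loudly on HTTP errors instead of parsing an error body
&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;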
&lt;span class="c1"&gt;# =&amp;gt; [{"title": "A Light in the Attic", "price": "£51.77"}, ...]
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's one request. No selectors. No Playwright. No parsing.&lt;/p&gt;

&lt;h3&gt;
  
  
  The real lesson
&lt;/h3&gt;

&lt;p&gt;Building the tool taught me that the web extraction market in 2026 is consolidating around two poles: &lt;strong&gt;platforms&lt;/strong&gt; (Apify, with thousands of pre-built scrapers and scheduling) and &lt;strong&gt;extraction APIs&lt;/strong&gt; (tools that focus on making one extraction call reliable).&lt;/p&gt;

&lt;p&gt;If you're building a product that needs web data, pick the pole that matches the job. If you need scheduling and a catalogue of pre-built scrapers, a platform fits. If you need one-off, reliable extraction of specific data points, an extraction-first API will save you more time than another headless browser setup.&lt;/p&gt;

&lt;p&gt;Disclosure: I built Haunt API. Free tier is 100 requests/month if you want to try it: &lt;a href="https://hauntapi.com" rel="noopener noreferrer"&gt;https://hauntapi.com&lt;/a&gt;&lt;/p&gt;

</description>
      <category>webscraping</category>
      <category>python</category>
      <category>api</category>
    </item>
    <item>
      <title>I’m looking for ugly URLs that break normal scrapers</title>
      <dc:creator>Zee</dc:creator>
      <pubDate>Fri, 01 May 2026 21:24:30 +0000</pubDate>
      <link>https://dev.to/zee_builds/im-looking-for-ugly-urls-that-break-normal-scrapers-19o4</link>
      <guid>https://dev.to/zee_builds/im-looking-for-ugly-urls-that-break-normal-scrapers-19o4</guid>
      <description>&lt;p&gt;Most scraper demos use friendly pages.&lt;/p&gt;

&lt;p&gt;A blog post.&lt;br&gt;
A docs page.&lt;br&gt;
A fake ecommerce product.&lt;br&gt;
Something clean enough that BeautifulSoup could probably manage it after a coffee.&lt;/p&gt;

&lt;p&gt;That is not where web extraction gets annoying.&lt;/p&gt;

&lt;p&gt;The annoying cases are the ugly ones:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;JavaScript-rendered pages&lt;/li&gt;
&lt;li&gt;pages with no stable CSS selectors&lt;/li&gt;
&lt;li&gt;pages where the useful data is mixed into layout sludge&lt;/li&gt;
&lt;li&gt;Cloudflare / bot-wall weirdness&lt;/li&gt;
&lt;li&gt;vendor pages where the table changes every week&lt;/li&gt;
&lt;li&gt;docs pages where the answer is spread across several sections&lt;/li&gt;
&lt;li&gt;pages that look simple in a browser but return nonsense to curl&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those are the URLs I actually care about.&lt;/p&gt;
&lt;h2&gt;
  
  
  The useful test
&lt;/h2&gt;

&lt;p&gt;The test is not:&lt;/p&gt;

&lt;p&gt;“Can this tool scrape example.com?”&lt;/p&gt;

&lt;p&gt;The test is:&lt;/p&gt;

&lt;p&gt;“Can I send it a real page and ask for the specific thing I need, without writing a custom parser?”&lt;/p&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST https://hauntapi.com/v1/extract &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"X-API-Key: YOUR_KEY"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "url": "https://example.com/some-awful-page",
    "prompt": "Extract product names, prices, availability, and the source URL as JSON"
  }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is the shape I built Haunt API around:&lt;/p&gt;

&lt;p&gt;URL in.&lt;br&gt;
Natural-language extraction prompt in.&lt;br&gt;
Structured JSON out.&lt;/p&gt;

&lt;p&gt;No selector map.&lt;br&gt;
No one-off parser.&lt;br&gt;
No “the site changed one div class and now everything is dead” ritual sacrifice.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I want to test next
&lt;/h2&gt;

&lt;p&gt;I’m collecting awkward public URLs that normal scrapers struggle with.&lt;/p&gt;

&lt;p&gt;Not private data.&lt;br&gt;
Not login-only pages.&lt;br&gt;
Not anything illegal or creepy.&lt;/p&gt;

&lt;p&gt;Just the normal developer pain pile:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;public product pages&lt;/li&gt;
&lt;li&gt;public directories&lt;/li&gt;
&lt;li&gt;public docs&lt;/li&gt;
&lt;li&gt;public event listings&lt;/li&gt;
&lt;li&gt;public price pages&lt;/li&gt;
&lt;li&gt;public content pages with messy markup&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you have one of those “this page should be easy but somehow isn’t” URLs, send it over.&lt;/p&gt;

&lt;p&gt;I’ll try to turn it into clean JSON or Markdown and share what worked / what failed.&lt;/p&gt;

&lt;p&gt;The live docs are here:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://hauntapi.com/docs" rel="noopener noreferrer"&gt;https://hauntapi.com/docs&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And the hard-URL proof flow is here:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://hauntapi.com/services" rel="noopener noreferrer"&gt;https://hauntapi.com/services&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I’m mainly interested in the failures. Friendly demos are cheap. Broken real pages are where the bodies are buried.&lt;/p&gt;

</description>
      <category>webscraping</category>
      <category>api</category>
      <category>llm</category>
      <category>automation</category>
    </item>
    <item>
      <title>Your SaaS cancellation page is where retention goes to die</title>
      <dc:creator>Zee</dc:creator>
      <pubDate>Fri, 01 May 2026 21:21:43 +0000</pubDate>
      <link>https://dev.to/zee_builds/your-saas-cancellation-page-is-where-retention-goes-to-die-3k95</link>
      <guid>https://dev.to/zee_builds/your-saas-cancellation-page-is-where-retention-goes-to-die-3k95</guid>
      <description>&lt;p&gt;Most SaaS teams treat churn like a dashboard problem.&lt;/p&gt;

&lt;p&gt;They connect Stripe, stare at monthly churn, maybe add a chart, then wonder why nothing changes.&lt;/p&gt;

&lt;p&gt;That is post-mortem work.&lt;/p&gt;

&lt;p&gt;The customer has already left. The money is already gone. The dashboard is just reading the gravestone.&lt;/p&gt;

&lt;p&gt;The useful moment is earlier: the cancellation page.&lt;/p&gt;

&lt;p&gt;That is the one place where the customer is still present, still logged in, still telling you they are about to leave, and still possibly recoverable.&lt;/p&gt;

&lt;p&gt;Here is the simple teardown I use when looking at a SaaS cancellation flow.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Do you know why they are leaving?
&lt;/h2&gt;

&lt;p&gt;If the page only has a red "cancel subscription" button, you are throwing away the most useful data in the business.&lt;/p&gt;

&lt;p&gt;At minimum, ask for one reason:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;too expensive&lt;/li&gt;
&lt;li&gt;missing feature&lt;/li&gt;
&lt;li&gt;not using it enough&lt;/li&gt;
&lt;li&gt;switched to another tool&lt;/li&gt;
&lt;li&gt;temporary pause&lt;/li&gt;
&lt;li&gt;support/product issue&lt;/li&gt;
&lt;li&gt;other&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Do not make it a 20-field survey. That is not research, that is punishment.&lt;/p&gt;

&lt;p&gt;One click is enough.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Does the save offer match the reason?
&lt;/h2&gt;

&lt;p&gt;This is where most flows go stupid.&lt;/p&gt;

&lt;p&gt;If someone says "too expensive", offer a discount or downgrade.&lt;/p&gt;

&lt;p&gt;If someone says "not using it enough", offer a pause or reminder.&lt;/p&gt;

&lt;p&gt;If someone says "missing feature", show the closest workaround or ask if they want to be told when it ships.&lt;/p&gt;

&lt;p&gt;If someone says "temporary pause", do not beg. Give them a clean pause option.&lt;/p&gt;

&lt;p&gt;A generic "20% off if you stay" offer is better than nothing, but it is still lazy.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Are you saving the subscription or just annoying them?
&lt;/h2&gt;

&lt;p&gt;Dark-pattern cancellation flows might reduce churn for five minutes and increase hatred forever.&lt;/p&gt;

&lt;p&gt;Do not hide the cancel button.&lt;br&gt;
Do not add five fake confirmation screens.&lt;br&gt;
Do not make them email support.&lt;br&gt;
Do not trap them.&lt;/p&gt;

&lt;p&gt;A good save flow is clear:&lt;/p&gt;

&lt;p&gt;"You can cancel now, but here is the one relevant option that might fit better."&lt;/p&gt;

&lt;p&gt;That is retention. Not hostage-taking.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Are failed payments mixed up with voluntary churn?
&lt;/h2&gt;

&lt;p&gt;These are different problems.&lt;/p&gt;

&lt;p&gt;A failed card is not the same as someone choosing to leave.&lt;/p&gt;

&lt;p&gt;Failed payment recovery needs dunning, retries, backup payment methods, and clear billing emails.&lt;/p&gt;

&lt;p&gt;Voluntary churn needs reason capture, matching offers, and product feedback loops.&lt;/p&gt;

&lt;p&gt;If your churn dashboard lumps them together, your action plan will be mud.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Can you see what happens after the save attempt?
&lt;/h2&gt;

&lt;p&gt;Track the basics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;cancellation started&lt;/li&gt;
&lt;li&gt;reason selected&lt;/li&gt;
&lt;li&gt;offer shown&lt;/li&gt;
&lt;li&gt;offer accepted&lt;/li&gt;
&lt;li&gt;cancellation completed&lt;/li&gt;
&lt;li&gt;saved revenue&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you cannot see these steps, you cannot improve the flow. You are guessing in expensive darkness.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tiny useful audit
&lt;/h2&gt;

&lt;p&gt;Look at your cancellation page and ask:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;What reason would a customer give here?&lt;/li&gt;
&lt;li&gt;What offer would they see next?&lt;/li&gt;
&lt;li&gt;Would that offer actually match the reason?&lt;/li&gt;
&lt;li&gt;Would I personally find this flow fair?&lt;/li&gt;
&lt;li&gt;Can I measure whether it saved anything?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If the answers are mostly "no" or "I don't know", the fix is probably not another dashboard.&lt;/p&gt;

&lt;p&gt;It is a better cancellation moment.&lt;/p&gt;

&lt;p&gt;I built SaveMyChurn around this exact idea: catch the customer while they are still in the cancellation flow, ask why they are leaving, and show the right recovery offer instead of just reporting churn after the fact.&lt;/p&gt;

&lt;p&gt;If you want to sanity-check your own flow, the low-friction page is here:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://savemychurn.com/cancellation-audit" rel="noopener noreferrer"&gt;https://savemychurn.com/cancellation-audit&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;No Stripe key needed for the first look. Just use it as a teardown lens before you start handing tools access to billing data.&lt;/p&gt;

&lt;p&gt;And if you do nothing else, add the one-question reason step. Boring, cheap, and annoyingly effective.&lt;/p&gt;

</description>
      <category>product</category>
      <category>saas</category>
      <category>startup</category>
      <category>marketing</category>
    </item>
    <item>
      <title>Most SaaS churn dashboards are post-mortems</title>
      <dc:creator>Zee</dc:creator>
      <pubDate>Fri, 01 May 2026 06:51:24 +0000</pubDate>
      <link>https://dev.to/zee_builds/most-saas-churn-dashboards-are-post-mortems-5f9k</link>
      <guid>https://dev.to/zee_builds/most-saas-churn-dashboards-are-post-mortems-5f9k</guid>
      <description>&lt;p&gt;If your churn dashboard only tells you that someone left, it is not a recovery system. It is a gravestone with charts.&lt;/p&gt;

&lt;p&gt;The useful question is not just “what is our churn rate?”&lt;/p&gt;

&lt;p&gt;It is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;who is likely to cancel?&lt;/li&gt;
&lt;li&gt;why are they cancelling?&lt;/li&gt;
&lt;li&gt;what save path should they see before the hard exit?&lt;/li&gt;
&lt;li&gt;which failed payments are quietly sitting in Stripe?&lt;/li&gt;
&lt;li&gt;what is a 5% retention improvement worth in actual MRR?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A lot of small SaaS teams already have the raw ingredients:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Stripe subscriptions&lt;/li&gt;
&lt;li&gt;cancellation reasons, if they ask for them&lt;/li&gt;
&lt;li&gt;plan and price data&lt;/li&gt;
&lt;li&gt;retry events&lt;/li&gt;
&lt;li&gt;customer usage signals&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But the cancellation flow is usually written like a legal form:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Are you sure you want to cancel?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That is not retention. That is a trapdoor.&lt;/p&gt;

&lt;p&gt;A better cancellation flow should branch.&lt;/p&gt;

&lt;p&gt;If the reason is price, offer a downgrade or pause.&lt;/p&gt;

&lt;p&gt;If the reason is temporary budget, offer a timed pause.&lt;/p&gt;

&lt;p&gt;If the reason is missing functionality, capture the feature gap and trigger follow-up.&lt;/p&gt;

&lt;p&gt;If the problem is failed payment, do not treat it like voluntary churn.&lt;/p&gt;

&lt;p&gt;None of this requires a giant customer success department. It needs a simple loop:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;capture the reason&lt;/li&gt;
&lt;li&gt;match the reason to a recovery path&lt;/li&gt;
&lt;li&gt;measure recovered revenue, not vanity clicks&lt;/li&gt;
&lt;/ol&gt;
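
&lt;p&gt;As a rough sketch of that loop, assuming the reason arrives as a short string and each cancellation attempt is logged as a small record: the reason keys, path names, and event fields below are made up for illustration, and only the reason-to-path pairings come from the branches described above.&lt;/p&gt;

```python
# Hypothetical reason keys mapped to the recovery paths described in this post.
RECOVERY_PATHS = {
    "too_expensive": "offer_downgrade_or_pause",
    "temporary_budget": "offer_timed_pause",
    "missing_feature": "capture_feature_gap_and_follow_up",
}

# Assumed fallback: no matching offer, so just let the customer cancel cleanly.
DEFAULT_PATH = "plain_cancel_confirmation"


def recovery_path(reason: str) -> str:
    """Step 2 of the loop: match the captured reason to a recovery path."""
    return RECOVERY_PATHS.get(reason, DEFAULT_PATH)


def recovered_mrr(events: list) -> float:
    """Step 3 of the loop: sum MRR only for attempts where the offer was accepted."""
    return sum(e["mrr"] for e in events if e["offer_accepted"])
```

&lt;p&gt;Failed payments are deliberately absent from the mapping, since they belong in a separate recovery track rather than the voluntary-churn branch.&lt;/p&gt;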

&lt;p&gt;I built SaveMyChurn around that idea: connect Stripe, detect churn and failed-payment leaks, and trigger personalised retention offers.&lt;/p&gt;

&lt;p&gt;There is a free cancellation audit here if you want to see the rough shape before connecting anything:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://savemychurn.com/cancellation-audit" rel="noopener noreferrer"&gt;https://savemychurn.com/cancellation-audit&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And a churn calculator if you just want to see what a few retention points are worth:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://savemychurn.com/churn-rate-calculator" rel="noopener noreferrer"&gt;https://savemychurn.com/churn-rate-calculator&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The short version: if churn is only a number on your dashboard, you are already too late.&lt;/p&gt;

</description>
      <category>stripe</category>
    </item>
    <item>
      <title>We Built a Custom Playwright Rendering Pipeline for Our MCP Server</title>
      <dc:creator>Zee</dc:creator>
      <pubDate>Fri, 24 Apr 2026 07:10:26 +0000</pubDate>
      <link>https://dev.to/zee_builds/we-built-a-custom-playwright-rendering-pipeline-for-our-mcp-server-5bdo</link>
      <guid>https://dev.to/zee_builds/we-built-a-custom-playwright-rendering-pipeline-for-our-mcp-server-5bdo</guid>
      <description>&lt;h1&gt;
  
  
  We Built a Custom Playwright Rendering Pipeline for Our MCP Server — Here's What We Learned
&lt;/h1&gt;

&lt;p&gt;At Haunt API, we build web extraction tools for AI agents. Our MCP server lets Claude and other AI assistants extract structured data from any URL. Simple enough on paper — fetch a page, parse the HTML, return JSON.&lt;/p&gt;

&lt;p&gt;The problem? Half the internet doesn't want to be fetched.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem With "Just Use Playwright"
&lt;/h2&gt;

&lt;p&gt;Most web scraping tutorials go something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;playwright.async_api&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;async_playwright&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;async_playwright&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;browser&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chromium&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;launch&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;new_page&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;goto&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;html&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;content&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And that works! For a demo. For a product that real users depend on, it falls apart fast:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Sites detect headless browsers&lt;/strong&gt; and serve captchas or empty pages&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SPA pages need time to render&lt;/strong&gt; — how long do you wait? 2 seconds? 5? 10?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You're burning resources&lt;/strong&gt; loading images, fonts, and CSS when you only need text&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Every render costs the same&lt;/strong&gt; — no caching, no intelligence&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We went through all of these. Here's how we solved each one.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lesson 1: Don't Use One Tool For Everything
&lt;/h2&gt;

&lt;p&gt;Our pipeline has three tiers, and most requests never hit Playwright:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Direct HTTP&lt;/strong&gt; — Works for ~80% of the web. Fast, cheap, no browser needed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FlareSolverr&lt;/strong&gt; — Handles Cloudflare challenges and basic JS rendering.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Playwright&lt;/strong&gt; — Full browser rendering for JS-heavy SPAs that return empty skeletons.&lt;/li&gt;
&lt;/ol&gt;
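
&lt;p&gt;One way to picture the three tiers is as a fallback chain: try the cheapest fetcher first and move on only when the result is missing or unusable. This is a minimal sketch, not the production pipeline. The function name, the &lt;code&gt;(name, fetcher)&lt;/code&gt; pairs, and the &lt;code&gt;is_usable&lt;/code&gt; check are hypothetical stand-ins for whatever each tier really does.&lt;/p&gt;

```python
from typing import Callable, Iterable, Optional, Tuple

# A fetcher takes a URL and returns HTML, or None if it could not get any.
Fetcher = Callable[[str], Optional[str]]


def fetch_with_fallback(
    url: str,
    tiers: Iterable[Tuple[str, Fetcher]],
    is_usable: Callable[[str], bool],
) -> Tuple[Optional[str], Optional[str]]:
    """Try each tier in order; return (tier_name, html) for the first usable result."""
    for name, fetch in tiers:
        try:
            html = fetch(url)
        except Exception:
            # A failing tier should not end the request; fall through to the next one.
            html = None
        if html is not None and is_usable(html):
            return name, html
    # Every tier failed or returned something unusable: report that plainly.
    return None, None
```

&lt;p&gt;Returning the tier name alongside the HTML keeps a record of which path actually produced the page, and &lt;code&gt;(None, None)&lt;/code&gt; makes a total failure visible instead of passing an empty page downstream.&lt;/p&gt;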

&lt;p&gt;The key insight: we detect &lt;em&gt;skeleton pages&lt;/em&gt; — HTML that has a &lt;code&gt;&amp;lt;div id="root"&amp;gt;&amp;lt;/div&amp;gt;&lt;/code&gt; but no actual content — and only spin up the browser when we need to. Most pages don't need it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;is_skeleton_html&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;html&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Detect if HTML is an unrendered JS skeleton.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;html&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;

    &lt;span class="c1"&gt;# Strip scripts/styles and check for visible text
&lt;/span&gt;    &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;strip_tags&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;html&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;

    &lt;span class="c1"&gt;# Common SPA markers
&lt;/span&gt;    &lt;span class="n"&gt;skeleton_markers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;div id=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;root&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;gt;&amp;lt;/div&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;div id=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__next&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;gt;&amp;lt;/div&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;You need to enable JavaScript&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;any&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;marker&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;html&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;marker&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;skeleton_markers&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
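
&lt;p&gt;The &lt;code&gt;strip_tags&lt;/code&gt; helper is not shown above. A minimal stand-in using only the standard library might look like this, assuming it should return visible text with script, style, and noscript content dropped. The class name and the choice to skip noscript are assumptions made for the sketch.&lt;/p&gt;

```python
from html.parser import HTMLParser


class _VisibleText(HTMLParser):
    """Collects text nodes, skipping anything inside script, style, or noscript."""

    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only keep text that is not nested inside a skipped element.
        if self._skip_depth == 0:
            self.parts.append(data)


def strip_tags(html: str) -> str:
    """Return the visible text of an HTML string with whitespace collapsed."""
    parser = _VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())
```

&lt;p&gt;A real HTML library would handle malformed markup more gracefully, but for a rough "is there any text here at all" check the standard-library parser is usually enough.&lt;/p&gt;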



&lt;h2&gt;
  
  
  Lesson 2: Smart Wait Strategies Beat Fixed Timers
&lt;/h2&gt;

&lt;p&gt;The worst thing about browser automation is the waiting. &lt;code&gt;time.sleep(5)&lt;/code&gt; is either too short (page hasn't loaded) or too long (wasting time on pages that loaded instantly).&lt;/p&gt;

&lt;p&gt;We built three concurrent wait strategies. First one to trigger wins:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Content Stability&lt;/strong&gt; — Poll the page's visible text every 200ms. If it hasn't changed for 1 second, the content has loaded.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Network Idle&lt;/strong&gt; — Wait for no new network requests for 500ms. Good for pages that make API calls after initial load.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Meaningful Content&lt;/strong&gt; — Wait until the page has at least 500 characters of visible text. Catches pages that load something but aren't done yet.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;wait_for_content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Smart wait: detect when content has actually loaded.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="c1"&gt;# asyncio.wait() needs Task objects; bare coroutines raise TypeError on Python 3.11+
&lt;/span&gt;    &lt;span class="n"&gt;tasks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;wait_for_content_stability&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
        &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;wait_for_network_idle&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
        &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;wait_for_meaningful_content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;done&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pending&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;wait&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;tasks&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;return_when&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;FIRST_COMPLETED&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# Cancel the strategies that lost the race so they stop polling the page
&lt;/span&gt;    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;pending&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cancel&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;done&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;pop&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;result&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;done&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;strategy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;timeout&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
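
&lt;p&gt;For a sense of what one of those strategies could look like, here is a minimal sketch of the content-stability check using the 200ms poll and 1 second window described above. It assumes a Playwright async &lt;code&gt;page&lt;/code&gt; and reads visible text with &lt;code&gt;page.inner_text("body")&lt;/code&gt;. The parameter names and the shape of the returned dict are assumptions, loosely modelled on the &lt;code&gt;{"strategy": "timeout"}&lt;/code&gt; fallback.&lt;/p&gt;

```python
import asyncio
import time


async def wait_for_content_stability(page, poll_interval=0.2, stable_for=1.0):
    """Return once the page's visible text has stopped changing for stable_for seconds."""
    last_text = None
    last_change = time.monotonic()
    while True:
        # Visible text of the document body, as the browser would render it.
        text = await page.inner_text("body")
        now = time.monotonic()
        if text != last_text:
            # Content is still changing: remember it and restart the stability window.
            last_text = text
            last_change = now
        elif now - last_change >= stable_for:
            return {"strategy": "content_stability", "chars": len(text)}
        await asyncio.sleep(poll_interval)
```

&lt;p&gt;The loop has no timeout of its own. It relies on the caller's overall timeout and on being cancelled when another strategy finishes first.&lt;/p&gt;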



&lt;p&gt;This cut our average render time from 6 seconds to under 3.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lesson 3: Fingerprint Rotation Matters
&lt;/h2&gt;

&lt;p&gt;Headless Chromium has tells. Sites check for them. If every request comes from the same user agent with the same viewport on the same timezone, you get blocked.&lt;/p&gt;

&lt;p&gt;We rotate fingerprints per-URL — same site sees a consistent browser (so cookies and sessions work), but different sites see different browsers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;

&lt;span class="n"&gt;FINGERPRINTS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ua&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Chrome/120.0 Windows&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;viewport&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1920&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1080&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;locale&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;en-US&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ua&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Chrome/119.0 macOS&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;viewport&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1440&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;900&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;locale&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;en-GB&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ua&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Chrome/120.0 Linux&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;viewport&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1366&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;768&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;locale&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;en-US&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="c1"&gt;# ... 10 total variants
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_fingerprint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Deterministic per-URL fingerprint selection.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;idx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hashlib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;md5&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;()).&lt;/span&gt;&lt;span class="nf"&gt;hexdigest&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;FINGERPRINTS&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;FINGERPRINTS&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
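
&lt;p&gt;One caveat: hashing the full URL means two pages on the same site can land on different fingerprints. If the goal is for a whole site to see one consistent browser, hashing the hostname instead is one option. This is a sketch of that variant, with a hypothetical function name, not a description of the code in production.&lt;/p&gt;

```python
import hashlib
from urllib.parse import urlsplit


def fingerprint_index_for_site(url: str, pool_size: int) -> int:
    """Pick a stable index into a fingerprint pool based on the URL's hostname."""
    # Fall back to the raw string if the URL has no parseable hostname.
    host = (urlsplit(url).hostname or url).lower()
    # MD5 is used only as a cheap, stable hash here, not for security.
    digest = hashlib.md5(host.encode()).hexdigest()
    return int(digest, 16) % pool_size
```

&lt;p&gt;With the list above, that would be used as &lt;code&gt;FINGERPRINTS[fingerprint_index_for_site(url, len(FINGERPRINTS))]&lt;/code&gt;, so every path and query string under one host maps to the same entry.&lt;/p&gt;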



&lt;h2&gt;
  
  
  Lesson 4: Block What You Don't Need
&lt;/h2&gt;

&lt;p&gt;When you're extracting text data, images and fonts are dead weight. We block them at the network level:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;BLOCKED_RESOURCES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;font&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;media&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;texttrack&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;object&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;beacon&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;csp_report&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;eventsource&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;BLOCKED_DOMAINS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;google-analytics.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;facebook.net&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;doubleclick.net&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hotjar.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mixpanel.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;segment.io&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="c1"&gt;# ... 20+ tracking domains
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;route_handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;route&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;route&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;resource_type&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;BLOCKED_RESOURCES&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;route&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;abort&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="nf"&gt;any&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;route&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;BLOCKED_DOMAINS&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;route&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;abort&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;route&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;continue_&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This cuts page payload by 40-60% on most pages, which means faster renders and less RAM.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lesson 5: Cache Renders, Not Requests
&lt;/h2&gt;

&lt;p&gt;If two users extract data from the same URL within 5 minutes, the page probably hasn't changed. We cache the rendered HTML with a TTL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;RenderCache&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default_ttl&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cache&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OrderedDict&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;max_size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;max_size&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;default_ttl&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;default_ttl&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;entry&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;entry&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cached_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;entry&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ttl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;entry&lt;/span&gt;
            &lt;span class="k"&gt;del&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Cache hits skip the render entirely and return almost instantly. For an API that charges per request, this saves users money &lt;em&gt;and&lt;/em&gt; makes responses faster.&lt;/p&gt;
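
&lt;p&gt;The snippet above only shows the read path. As a rough sketch of what the write side might look like, the version below adds a &lt;code&gt;put&lt;/code&gt; method with LRU eviction. The class name &lt;code&gt;RenderCacheSketch&lt;/code&gt;, the &lt;code&gt;put&lt;/code&gt; method, and the entry layout are assumptions made for illustration, not the production code:&lt;/p&gt;

```python
import time
from collections import OrderedDict


class RenderCacheSketch:
    """Illustrative LRU + TTL cache; names beyond the original snippet are assumptions."""

    def __init__(self, max_size=50, default_ttl=300):
        self.cache = OrderedDict()
        self.max_size = max_size
        self.default_ttl = default_ttl

    def put(self, url, html, ttl=None):
        # Re-inserting an existing URL refreshes its position and timestamp.
        self.cache.pop(url, None)
        while self.cache and len(self.cache) >= self.max_size:
            self.cache.popitem(last=False)  # evict the least recently used entry
        self.cache[url] = {
            "html": html,
            "cached_at": time.time(),
            "ttl": self.default_ttl if ttl is None else ttl,
        }

    def get(self, url):
        entry = self.cache.get(url)
        if entry is None:
            return None
        if time.time() - entry["cached_at"] >= entry["ttl"]:
            del self.cache[url]  # expired, so treat it as a miss
            return None
        self.cache.move_to_end(url)  # mark as most recently used
        return entry
```

&lt;p&gt;With &lt;code&gt;max_size=50&lt;/code&gt;, the 51st distinct URL pushes out whichever entry was read or written least recently, so hot pages stay cached while one-off URLs age out.&lt;/p&gt;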

&lt;h2&gt;
  
  
  The Architecture
&lt;/h2&gt;

&lt;p&gt;Final structure — 6 modules, each with a single job:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;playwright-service/
├── server.py          # FastAPI orchestration, browser lifecycle
├── fingerprint.py     # UA/viewport/locale rotation
├── smart_wait.py      # Content stability + network idle detection
├── site_detect.py     # Static vs SPA classification
├── cache.py           # LRU render cache with TTL
└── stealth.py         # Resource blocking + headless detection evasion
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each module is ~100 lines. Easy to test, easy to modify, easy to explain to new contributors.&lt;/p&gt;

&lt;h2&gt;
  
  
  What We Learned
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Don't reach for the browser first.&lt;/strong&gt; Most pages are server-rendered. Direct HTTP is 10x faster and 100x cheaper.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Wait smarter, not longer.&lt;/strong&gt; Detecting when content has actually loaded saves seconds per request.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Be a moving target.&lt;/strong&gt; Rotating fingerprints and blocking trackers keeps you under the radar.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cache aggressively.&lt;/strong&gt; Web pages don't change every second. A 5-minute render cache saves users money and makes your API feel fast.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Build modules, not monoliths.&lt;/strong&gt; Each piece of the pipeline has its own concerns. Keep them separate.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The Playwright browser engine is the oven. Everything around it — the routing, the waiting, the caching, the stealth — is the recipe. That's where the actual engineering lives.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;We're &lt;a href="https://hauntapi.com" rel="noopener noreferrer"&gt;Haunt API&lt;/a&gt; — web extraction built for AI agents. If you're building with Claude, Cursor, or any AI assistant, our &lt;a href="https://hauntapi.com#signup" rel="noopener noreferrer"&gt;MCP server&lt;/a&gt; gives your agent the ability to extract data from any URL in one line.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>mcp</category>
      <category>showdev</category>
      <category>webscraping</category>
    </item>
    <item>
      <title>We Built a Custom Playwright Rendering Pipeline for Our MCP Server — Here is What We Learned</title>
      <dc:creator>Zee</dc:creator>
      <pubDate>Mon, 20 Apr 2026 19:23:07 +0000</pubDate>
      <link>https://dev.to/zee_builds/we-built-a-custom-playwright-rendering-pipeline-for-our-mcp-server-here-is-what-we-learned-38d9</link>
      <guid>https://dev.to/zee_builds/we-built-a-custom-playwright-rendering-pipeline-for-our-mcp-server-here-is-what-we-learned-38d9</guid>
      <description>&lt;h1&gt;
  
  
  We Built a Custom Playwright Rendering Pipeline for Our MCP Server: Here's What We Learned
&lt;/h1&gt;

&lt;p&gt;At &lt;a href="https://hauntapi.com" rel="noopener noreferrer"&gt;Haunt API&lt;/a&gt;, we build web extraction tools for AI agents. Our MCP server lets Claude and other AI assistants extract structured data from any URL. Simple enough on paper — fetch a page, parse the HTML, return JSON.&lt;/p&gt;

&lt;p&gt;The problem? Half the internet doesn't want to be fetched.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem With "Just Use Playwright"
&lt;/h2&gt;

&lt;p&gt;Most web scraping tutorials go something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;playwright.async_api&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;async_playwright&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;async_playwright&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;browser&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chromium&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;launch&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;new_page&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;goto&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;html&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;content&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And that works! For a demo. For a product that real users depend on, it falls apart fast:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Sites detect headless browsers&lt;/strong&gt; and serve captchas or empty pages&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SPA pages need time to render&lt;/strong&gt; — how long do you wait? 2 seconds? 5? 10?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You are burning resources&lt;/strong&gt; loading images, fonts, and CSS when you only need text&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Every render costs the same&lt;/strong&gt; — no caching, no intelligence&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We went through all of these. Here is how we solved each one.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lesson 1: Do Not Use One Tool For Everything
&lt;/h2&gt;

&lt;p&gt;Our pipeline has three tiers, and most requests never hit Playwright:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Direct HTTP&lt;/strong&gt; — Works for approximately 80% of the web. Fast, cheap, no browser needed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FlareSolverr&lt;/strong&gt; — Handles Cloudflare challenges and basic JS rendering.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Playwright&lt;/strong&gt; — Full browser rendering for JS-heavy SPAs that return empty skeletons.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The key insight: we detect skeleton pages — HTML that has an empty root div but no actual content — and only spin up the browser when we need to.&lt;/p&gt;
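
&lt;p&gt;As a rough illustration of skeleton detection, here is a sketch that uses only the standard library. The root ids, the 200-character threshold, and the name &lt;code&gt;is_skeleton&lt;/code&gt; are assumptions chosen for the example, not values taken from the production pipeline:&lt;/p&gt;

```python
from html.parser import HTMLParser

# Assumed heuristics for illustration; tune both for real traffic.
SPA_ROOT_IDS = {"root", "app", "__next"}
MIN_VISIBLE_CHARS = 200
HIDDEN_TAGS = ("script", "style", "noscript")


class _PageProbe(HTMLParser):
    """Counts visible text and notes whether an SPA-style root element exists."""

    def __init__(self):
        super().__init__()
        self.visible_chars = 0
        self.has_spa_root = False
        self._hidden_depth = 0  # greater than zero while inside a hidden tag

    def handle_starttag(self, tag, attrs):
        if tag in HIDDEN_TAGS:
            self._hidden_depth += 1
        if dict(attrs).get("id") in SPA_ROOT_IDS:
            self.has_spa_root = True

    def handle_endtag(self, tag):
        if tag in HIDDEN_TAGS and self._hidden_depth:
            self._hidden_depth -= 1

    def handle_data(self, data):
        if not self._hidden_depth:
            self.visible_chars += len(data.strip())


def is_skeleton(html):
    """True when the page has an SPA root element but almost no visible text."""
    probe = _PageProbe()
    probe.feed(html)
    return probe.has_spa_root and MIN_VISIBLE_CHARS > probe.visible_chars
```

&lt;p&gt;A check like this runs on the direct HTTP response, so the browser only starts when &lt;code&gt;is_skeleton&lt;/code&gt; returns true.&lt;/p&gt;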

&lt;h2&gt;
  
  
  Lesson 2: Smart Wait Strategies Beat Fixed Timers
&lt;/h2&gt;

&lt;p&gt;The worst thing about browser automation is the waiting. A fixed sleep is either too short or too long. We built three concurrent wait strategies — first one to trigger wins:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Content Stability&lt;/strong&gt; — Poll visible text every 200ms. If unchanged for 1 second, done.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Network Idle&lt;/strong&gt; — Wait for no new requests for 500ms.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Meaningful Content&lt;/strong&gt; — Wait until 500+ chars of visible text exist.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This cut our average render time from 6 seconds to under 3.&lt;/p&gt;
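
&lt;p&gt;A minimal sketch of the racing idea, assuming a &lt;code&gt;get_text&lt;/code&gt; coroutine function that returns the page's visible text (with Playwright that could be something like &lt;code&gt;lambda: page.inner_text("body")&lt;/code&gt;). Network idle is left out to keep the example browser-free, and all function names here are illustrative:&lt;/p&gt;

```python
import asyncio


async def wait_for_stable_text(get_text, poll_interval=0.2, stable_for=1.0):
    """Return once get_text() has produced the same value for stable_for seconds."""
    needed_polls = max(1, round(stable_for / poll_interval))
    stable_polls = 0
    last = await get_text()
    while needed_polls > stable_polls:
        await asyncio.sleep(poll_interval)
        current = await get_text()
        if current == last:
            stable_polls += 1
        else:
            last, stable_polls = current, 0  # content changed, start over
    return "content_stable"


async def wait_for_min_text(get_text, min_chars=500, poll_interval=0.2):
    """Return once at least min_chars of text are visible."""
    while min_chars > len(await get_text()):
        await asyncio.sleep(poll_interval)
    return "meaningful_content"


async def smart_wait(get_text, poll_interval=0.2, stable_for=1.0,
                     min_chars=500, timeout=10.0):
    """Race the strategies; the first to finish wins and the rest are cancelled."""
    tasks = [
        asyncio.create_task(wait_for_stable_text(get_text, poll_interval, stable_for)),
        asyncio.create_task(wait_for_min_text(get_text, min_chars, poll_interval)),
    ]
    done, pending = await asyncio.wait(
        tasks, timeout=timeout, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()
    return done.pop().result() if done else "timeout"
```

&lt;p&gt;Returning the name of the winning strategy makes it easy to log which signal ended the wait. A network idle strategy could join the race as a third task, for example by wrapping Playwright's &lt;code&gt;page.wait_for_load_state("networkidle")&lt;/code&gt;.&lt;/p&gt;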

&lt;h2&gt;
  
  
  Lesson 3: Fingerprint Rotation Matters
&lt;/h2&gt;

&lt;p&gt;Headless Chromium has tells. We rotate fingerprints per site: the same site always sees a consistent browser, while different sites see different browsers. The pool covers 10 viewport variants across Windows, macOS, and Linux user agents.&lt;/p&gt;
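
&lt;p&gt;One way to get that consistency is to derive the fingerprint from a hash of the hostname. The sketch below assumes that approach. The three profiles, the name &lt;code&gt;fingerprint_for&lt;/code&gt;, and the choice to key on the host are illustrative assumptions, and the real pool is larger:&lt;/p&gt;

```python
import hashlib
from urllib.parse import urlsplit

# Example profiles only; user agent strings and viewports are placeholders.
FINGERPRINTS = [
    {
        "user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                      "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
        "viewport": {"width": 1920, "height": 1080},
        "locale": "en-US",
    },
    {
        "user_agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
                      "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
        "viewport": {"width": 1440, "height": 900},
        "locale": "en-GB",
    },
    {
        "user_agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                      "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
        "viewport": {"width": 1366, "height": 768},
        "locale": "en-US",
    },
]


def fingerprint_for(url):
    """Pick a profile deterministically from the host, so one site always sees the same browser."""
    host = (urlsplit(url).hostname or "").lower()
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    index = int.from_bytes(digest[:4], "big") % len(FINGERPRINTS)
    return FINGERPRINTS[index]
```

&lt;p&gt;The chosen profile can then be passed to Playwright when creating a context, for example &lt;code&gt;browser.new_context(user_agent=..., viewport=..., locale=...)&lt;/code&gt;.&lt;/p&gt;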

&lt;h2&gt;
  
  
  Lesson 4: Block What You Do Not Need
&lt;/h2&gt;

&lt;p&gt;When extracting text data, images and fonts are dead weight. We block them at the network level, along with 20+ tracking domains. This cuts page payload by 40-60%.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lesson 5: Cache Renders, Not Requests
&lt;/h2&gt;

&lt;p&gt;If two users extract data from the same URL within 5 minutes, the page probably has not changed, so we cache the rendered HTML with a TTL. Cache hits skip the render entirely and return almost instantly.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Architecture
&lt;/h2&gt;

&lt;p&gt;Six modules, each with a single job:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;server.py&lt;/strong&gt; — FastAPI orchestration, browser lifecycle&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;fingerprint.py&lt;/strong&gt; — UA/viewport/locale rotation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;smart_wait.py&lt;/strong&gt; — Content stability + network idle detection&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;site_detect.py&lt;/strong&gt; — Static vs SPA classification&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;cache.py&lt;/strong&gt; — LRU render cache with TTL&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;stealth.py&lt;/strong&gt; — Resource blocking + headless detection evasion&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each module is approximately 100 lines. Easy to test, easy to modify.&lt;/p&gt;

&lt;h2&gt;
  
  
  What We Learned
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Do not reach for the browser first. Most pages are server-rendered.&lt;/li&gt;
&lt;li&gt;Wait smarter, not longer.&lt;/li&gt;
&lt;li&gt;Be a moving target with fingerprint rotation.&lt;/li&gt;
&lt;li&gt;Cache aggressively.&lt;/li&gt;
&lt;li&gt;Build modules, not monoliths.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The Playwright browser engine is the oven. Everything around it — the routing, the waiting, the caching, the stealth — is the recipe. That is where the actual engineering lives.&lt;/p&gt;




&lt;p&gt;We are &lt;a href="https://hauntapi.com" rel="noopener noreferrer"&gt;Haunt API&lt;/a&gt; — web extraction built for AI agents. If you are building with Claude, Cursor, or any AI assistant, our &lt;a href="https://hauntapi.com" rel="noopener noreferrer"&gt;MCP server&lt;/a&gt; gives your agent the ability to extract data from any URL.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>ai</category>
      <category>scraping</category>
      <category>playwright</category>
    </item>
    <item>
      <title>I Built an AI That Talks People Out of Cancelling Their Subscriptions</title>
      <dc:creator>Zee</dc:creator>
      <pubDate>Mon, 20 Apr 2026 15:35:05 +0000</pubDate>
      <link>https://dev.to/zee_builds/i-built-an-ai-that-talks-people-out-of-cancelling-their-subscriptions-2bm8</link>
      <guid>https://dev.to/zee_builds/i-built-an-ai-that-talks-people-out-of-cancelling-their-subscriptions-2bm8</guid>
      <description>&lt;p&gt;Here's the thing about churn: by the time someone clicks "Cancel Subscription", they've already decided. Your generic "Would you like 20% off?" popup is too late and too weak.&lt;/p&gt;

&lt;p&gt;I spent the last month building &lt;a href="https://savemychurn.com" rel="noopener noreferrer"&gt;SaveMyChurn&lt;/a&gt; — an AI-powered churn recovery tool for Stripe SaaS founders. This is how it works, what I learned building it, and why I think most cancellation flows are doing it wrong.&lt;/p&gt;

&lt;h2&gt;
  
  
  The problem
&lt;/h2&gt;

&lt;p&gt;I was looking at my own Stripe dashboard one day and noticed something: the cancellation flow was the most ignored piece of the entire subscription experience. People pour weeks into onboarding, feature development, marketing — and then the cancel button just... ends things. No conversation. No understanding of why.&lt;/p&gt;

&lt;p&gt;For bootstrapped SaaS founders running £5K-50K MRR, every subscription matters. Losing 5% of your customers a month isn't a statistic — it's the difference between growing and dying.&lt;/p&gt;

&lt;p&gt;The existing tools didn't fit. Churnkey starts at $250/month — that's a significant chunk of revenue when you're small. The cheaper options are just form builders with a discount code at the end. Nobody was actually &lt;em&gt;talking&lt;/em&gt; to the customer.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I built
&lt;/h2&gt;

&lt;p&gt;SaveMyChurn does three things:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Listens to Stripe in real time&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When a customer hits cancel, Stripe fires a &lt;code&gt;customer.subscription.deleted&lt;/code&gt; webhook. SaveMyChurn catches it instantly, pulls the subscription metadata, payment history, and plan details, and builds a profile of who's leaving and why.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# The webhook handler — this is where it starts
&lt;/span&gt;&lt;span class="nd"&gt;@router.post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/webhooks/stripe&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;stripe_webhook&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;body&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;stripe&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Webhook&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;construct_event&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stripe-signature&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;webhook_secret&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;customer.subscription.deleted&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;subscription&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;object&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="c1"&gt;# Build subscriber profile from Stripe data
&lt;/span&gt;        &lt;span class="n"&gt;profile&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;build_subscriber_profile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;subscription&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="c1"&gt;# Generate AI retention strategy
&lt;/span&gt;        &lt;span class="n"&gt;strategy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;generate_retention_strategy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;profile&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="c1"&gt;# Send personalised recovery email
&lt;/span&gt;        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;send_retention_email&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;profile&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;strategy&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2. Generates a unique retention strategy per subscriber&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is the part I'm most proud of. Instead of a static "here's 20% off" flow, an AI strategist analyses the subscriber's behaviour — how long they've been a customer, what plan they're on, their payment history, any support tickets — and creates a genuinely personalised retention offer.&lt;/p&gt;

&lt;p&gt;Someone cancelling after 2 months gets a different approach than someone who's been around for a year. Someone on a basic plan gets a different offer than someone on enterprise. The AI adjusts tone, offer type, discount level, and follow-up timing based on the full context.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Follows up automatically&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;One email rarely saves a cancellation. SaveMyChurn runs a multi-step sequence — initial offer, follow-up with adjusted terms, final value reminder — spaced over a few days. Each step is informed by whether they opened the previous email, clicked anything, or went silent.&lt;/p&gt;

&lt;h2&gt;
  
  
  The tech stack
&lt;/h2&gt;

&lt;p&gt;Keeping it simple and cheap:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;FastAPI&lt;/strong&gt; backend — async Python, handles webhooks fast&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MongoDB&lt;/strong&gt; for subscriber profiles and strategy storage&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Redis&lt;/strong&gt; for caching and rate limiting&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LLM via API&lt;/strong&gt; for strategy generation — the AI strategist&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resend&lt;/strong&gt; for transactional emails&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Docker&lt;/strong&gt; on a single VPS — the whole thing runs on one machine&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The LLM cost per strategy generation is under a penny. When your competitor charges $250/month, that's a ridiculous margin.&lt;/p&gt;

&lt;h2&gt;
  
  
  The pricing model (and why it matters)
&lt;/h2&gt;

&lt;p&gt;I went with a commission model: a monthly fee plus a percentage of recovered revenue. The idea is simple. If I don't save you money, I don't make money.&lt;/p&gt;

&lt;p&gt;This was a deliberate choice. Flat-fee tools have an incentive to get you signed up and keep you paying, regardless of results. Commission pricing means I'm motivated to actually recover subscriptions, not just ship a dashboard.&lt;/p&gt;

&lt;p&gt;For founders at the £5K-50K MRR stage, this aligns incentives in a way that $250/month flat fees don't.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I learned
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Webhook reliability is everything.&lt;/strong&gt; If you miss a &lt;code&gt;customer.subscription.deleted&lt;/code&gt; event, you miss the entire recovery window. I ended up implementing retry queues and idempotency keys before anything else.&lt;/p&gt;
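
&lt;p&gt;Stripe can deliver the same event more than once, so the handler has to be safe to run twice. Here is a minimal sketch of deduplicating by event id. It assumes an in-memory store standing in for Redis or a database, and &lt;code&gt;ProcessedEvents&lt;/code&gt; and &lt;code&gt;handle_event&lt;/code&gt; are hypothetical names rather than the actual implementation:&lt;/p&gt;

```python
import time


class ProcessedEvents:
    """In-memory stand-in for a persistent store such as Redis or a database."""

    def __init__(self, ttl_seconds=7 * 24 * 3600):
        self._seen = {}
        self._ttl = ttl_seconds

    def claim(self, event_id, now=None):
        """Return True the first time an event id is seen, False for duplicates."""
        now = time.time() if now is None else now
        # Forget ids older than the TTL so the store does not grow forever.
        self._seen = {k: t for k, t in self._seen.items() if self._ttl > now - t}
        if event_id in self._seen:
            return False
        self._seen[event_id] = now
        return True

    def release(self, event_id):
        """Undo a claim so a later delivery of the same event can retry."""
        self._seen.pop(event_id, None)


def handle_event(event, store, process):
    """Run process(event) at most once per event id, even if delivery is retried."""
    if not store.claim(event["id"]):
        return "duplicate"
    try:
        process(event)
    except Exception:
        store.release(event["id"])  # failed work must stay retryable
        raise
    return "processed"
```

&lt;p&gt;Releasing the claim on failure matters: without it, one crashed attempt would mark the event as done and the recovery email would never go out.&lt;/p&gt;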

&lt;p&gt;&lt;strong&gt;AI strategy &amp;gt; rules engine.&lt;/strong&gt; I initially built a simple rule-based system (if cancel reason = "price" → offer discount). It was okay. The AI strategist that replaced it generates strategies I wouldn't have thought of — bundling features differently, offering plan downgrades instead of discounts, timing follow-ups based on engagement patterns.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;One email is never enough.&lt;/strong&gt; The first recovery email has maybe a 15-20% open rate. The follow-up catches another chunk. The third one gets the people who were "going to get around to it." Multi-step sequences doubled recovery rates compared to single emails.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where it's at
&lt;/h2&gt;

&lt;p&gt;SaveMyChurn is live and in production. It works end-to-end: Stripe webhook → AI strategy → personalised email sequence → dashboard showing what was saved.&lt;/p&gt;

&lt;p&gt;If you're a bootstrapped SaaS founder on Stripe watching subscriptions slip away, &lt;a href="https://savemychurn.com" rel="noopener noreferrer"&gt;give it a look&lt;/a&gt;. There's a free trial — no credit card required.&lt;/p&gt;

</description>
      <category>saas</category>
      <category>stripe</category>
      <category>ai</category>
      <category>retention</category>
    </item>
    <item>
      <title>Your AI Agent Can't Scrape That Page. Here's How to Fix It.</title>
      <dc:creator>Zee</dc:creator>
      <pubDate>Mon, 20 Apr 2026 15:16:08 +0000</pubDate>
      <link>https://dev.to/zee_builds/your-ai-agent-cant-scrape-that-page-heres-how-to-fix-it-2om7</link>
      <guid>https://dev.to/zee_builds/your-ai-agent-cant-scrape-that-page-heres-how-to-fix-it-2om7</guid>
      <description>&lt;h1&gt;
  
  
  Your AI Agent Can't Scrape That Page. Here's How to Fix It.
&lt;/h1&gt;

&lt;p&gt;You built an AI agent that needs real-time web data. Product prices, news articles, competitor info — whatever it is, you need clean HTML or JSON from a URL.&lt;/p&gt;

&lt;p&gt;So you fire off a &lt;code&gt;requests.get()&lt;/code&gt; and... &lt;strong&gt;403 Forbidden&lt;/strong&gt;. Cloudflare says no.&lt;/p&gt;

&lt;p&gt;Or you get a page, but it's empty — the content loads via JavaScript after the page renders, and your HTTP client never sees it.&lt;/p&gt;

&lt;p&gt;Sound familiar? Let's break down what's happening and how to actually solve it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Your Scraping Fails
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. JavaScript Rendering
&lt;/h3&gt;

&lt;p&gt;Many modern sites are SPAs. The HTML you get from a raw HTTP request is a shell, and the actual content is loaded by JavaScript after the page mounts. None of &lt;code&gt;requests&lt;/code&gt;, &lt;code&gt;axios&lt;/code&gt;, or &lt;code&gt;fetch&lt;/code&gt; execute that JS.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Cloudflare and Bot Detection
&lt;/h3&gt;

&lt;p&gt;Cloudflare fingerprints your connection:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;TLS fingerprint (does your HTTP client look like a browser?)&lt;/li&gt;
&lt;li&gt;HTTP/2 fingerprint&lt;/li&gt;
&lt;li&gt;Browser behavior (mouse movements, JS execution patterns)&lt;/li&gt;
&lt;li&gt;IP reputation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Regular HTTP clients fail most of these checks by default.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Complex Layouts
&lt;/h3&gt;

&lt;p&gt;Even when you get the HTML, extracting structured data from it is painful. You write brittle CSS selectors that break on every layout change.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Solutions (From Worst to Best)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Selenium/Playwright Headless Browsers
&lt;/h3&gt;

&lt;p&gt;They work... sometimes. But Cloudflare detects headless Chrome. You'll spend more time maintaining anti-detection patches than building your actual product.&lt;/p&gt;

&lt;h3&gt;
  
  
  Rotating Proxies + Custom Headers
&lt;/h3&gt;

&lt;p&gt;Expensive, slow, and fragile. You're playing whack-a-mole with detection rules.&lt;/p&gt;

&lt;h3&gt;
  
  
  Use an API That Handles Everything
&lt;/h3&gt;

&lt;p&gt;This is where tools like &lt;a href="https://hauntapi.com" rel="noopener noreferrer"&gt;Haunt API&lt;/a&gt; come in. It's a web extraction API built specifically for AI agents:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;

&lt;span class="n"&gt;resp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://hauntapi.com/v1/extract&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;x-api-key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your-key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://example.com/product/123&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Get the product name, price, and availability&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
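    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="c1"&gt;# requests applies no timeout by default, so set one explicitly&lt;/span&gt;
    &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Raise on 4xx/5xx instead of parsing an error body as data&lt;/span&gt;
&lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;raise_for_status&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
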
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="c1"&gt;# {
#   "product_name": "Wireless Headphones Pro",
#   "price": "$79.99",
#   "availability": "In Stock"
# }
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
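
&lt;p&gt;The example prints &lt;code&gt;data&lt;/code&gt; directly. A JSON body alone does not prove the extraction worked, so a caller may want to classify the response first. The following is a minimal sketch that assumes only the &lt;code&gt;data&lt;/code&gt; key shown above; the function name &lt;code&gt;read_extraction&lt;/code&gt; and the &lt;code&gt;ok&lt;/code&gt; and &lt;code&gt;reason&lt;/code&gt; keys it returns are illustrative names, not part of Haunt's API:&lt;/p&gt;

```python
from typing import Any


def read_extraction(status_code: int, payload: Any) -> dict:
    """Classify a response instead of trusting that JSON means success.

    Only the "data" key comes from the example above; the returned
    "ok" and "reason" keys are hypothetical names for this sketch.
    """
    if status_code // 100 != 2:
        # Any non-2xx status is reported as a failure with its code.
        return {"ok": False, "reason": f"http_{status_code}", "data": None}
    if not isinstance(payload, dict) or "data" not in payload:
        return {"ok": False, "reason": "missing_data", "data": None}
    data = payload["data"]
    if not data:
        # Empty dict, list, string, or null: nothing was extracted.
        return {"ok": False, "reason": "empty_data", "data": None}
    return {"ok": True, "reason": None, "data": data}


if __name__ == "__main__":
    sample = {"data": {"product_name": "Wireless Headphones Pro", "price": "$79.99"}}
    print(read_extraction(200, sample)["ok"])
    print(read_extraction(429, {})["reason"])
```

&lt;p&gt;With &lt;code&gt;requests&lt;/code&gt;, this would be called as &lt;code&gt;read_extraction(resp.status_code, resp.json())&lt;/code&gt;. An agent can then branch on &lt;code&gt;reason&lt;/code&gt; rather than treating every parsed body as a result.&lt;/p&gt;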



&lt;p&gt;That's it. One API call. Cloudflare bypassed, JavaScript rendered, structured data extracted.&lt;/p&gt;

&lt;h3&gt;
  
  
  How It Works Under the Hood
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Smart fetching&lt;/strong&gt;: tries a direct HTTP request first, then falls back to a headless browser with anti-fingerprinting for Cloudflare-protected sites&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;JavaScript execution&lt;/strong&gt;: the page's scripts run, so SPA content is present in the rendered page&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI extraction&lt;/strong&gt;: the AI pulls out the data you described in your natural-language prompt&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Clean JSON&lt;/strong&gt;: the result is returned to your application&lt;/li&gt;
&lt;/ol&gt;
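
&lt;p&gt;The first two steps can be sketched as a fetch-with-fallback function. This is not Haunt's implementation: the &lt;code&gt;looks_blocked&lt;/code&gt; heuristic, its marker strings, the length threshold, and the injectable &lt;code&gt;fetch_direct&lt;/code&gt; and &lt;code&gt;fetch_rendered&lt;/code&gt; callables are all assumptions made for illustration.&lt;/p&gt;

```python
from typing import Callable

# Hypothetical markers of a challenge page or an empty JavaScript shell.
BLOCK_MARKERS = ("just a moment", "enable javascript", "checking your browser")


def looks_blocked(html: str, min_length: int = 500) -> bool:
    """Guess whether a direct fetch returned a challenge page or JS shell."""
    lowered = html.lower()
    if any(marker in lowered for marker in BLOCK_MARKERS):
        return True
    # Very short documents are often shells waiting for client-side rendering.
    return min_length > len(html.strip())


def smart_fetch(
    url: str,
    fetch_direct: Callable[[str], str],
    fetch_rendered: Callable[[str], str],
) -> dict:
    """Try the cheap path first and record which path produced the HTML."""
    html = fetch_direct(url)
    if not looks_blocked(html):
        return {"html": html, "fetch_path": "direct"}
    # Fall back to the expensive path (for example, a headless browser).
    return {"html": fetch_rendered(url), "fetch_path": "rendered"}


if __name__ == "__main__":
    # Stub fetchers stand in for an HTTP client and a headless browser.
    result = smart_fetch(
        "https://example.com/product/123",
        fetch_direct=lambda url: "Just a moment...",
        fetch_rendered=lambda url: "Product page content. " * 40,
    )
    print(result["fetch_path"])
```

&lt;p&gt;Returning &lt;code&gt;fetch_path&lt;/code&gt; alongside the HTML is a deliberate choice in this sketch: the caller learns which route produced the page instead of having to guess.&lt;/p&gt;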

&lt;h3&gt;
  
  
  MCP Server for Claude and Cursor
&lt;/h3&gt;

&lt;p&gt;If you're building with AI agents, Haunt also has an MCP server:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"haunt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"npx"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"@hauntapi/mcp-server"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"env"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"HAUNT_API_KEY"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"your-key"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Add that to your Claude Desktop or Cursor config and your AI agent can extract data from any website natively. Zero code.&lt;/p&gt;

&lt;h3&gt;
  
  
  REST API (No SDK Needed)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST https://hauntapi.com/v1/extract &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"x-api-key: your-key"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "url": "https://news.ycombinator.com",
    "prompt": "Get the top 5 stories with titles, points, and URLs"
  }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Free Tier
&lt;/h2&gt;

&lt;p&gt;100 extractions/month for free. No credit card required. Perfect for prototyping your AI agent before scaling up.&lt;/p&gt;

&lt;p&gt;Paid plans start at £19/month for 1,000 requests and include authenticated scraping and priority support.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to Use What
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;th&gt;Reliability&lt;/th&gt;
&lt;th&gt;Setup Time&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Raw requests&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;td&gt;Low (30%)&lt;/td&gt;
&lt;td&gt;5 min&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Selenium + proxies&lt;/td&gt;
&lt;td&gt;$$$&lt;/td&gt;
&lt;td&gt;Medium (60%)&lt;/td&gt;
&lt;td&gt;Hours&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Haunt API&lt;/td&gt;
&lt;td&gt;Free tier, then from £19/mo&lt;/td&gt;
&lt;td&gt;High (95%+)&lt;/td&gt;
&lt;td&gt;5 min&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;If your AI agent needs web data and you're tired of fighting bot detection, try &lt;a href="https://hauntapi.com" rel="noopener noreferrer"&gt;Haunt API&lt;/a&gt;. It handles Cloudflare, JavaScript rendering, and data extraction in a single API call.&lt;/p&gt;

&lt;p&gt;Free to start, built for AI agents and RAG pipelines.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Disclosure: I built Haunt API because I was tired of writing the same scraping infrastructure for every project.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>scraping</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
