<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: anybrowse</title>
    <description>The latest articles on DEV Community by anybrowse (@anybrowse_dev).</description>
    <link>https://dev.to/anybrowse_dev</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3812407%2F8bd80105-f1d2-4346-9842-5d3f2d987f47.png</url>
      <title>DEV Community: anybrowse</title>
      <link>https://dev.to/anybrowse_dev</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/anybrowse_dev"/>
    <language>en</language>
    <item>
      <title>Why most scrapers fail on modern sites (and how we hit 90% success)</title>
      <dc:creator>anybrowse</dc:creator>
      <pubDate>Wed, 11 Mar 2026 23:01:35 +0000</pubDate>
      <link>https://dev.to/anybrowse_dev/why-most-scrapers-fail-on-modern-sites-and-how-we-hit-90-success-li0</link>
      <guid>https://dev.to/anybrowse_dev/why-most-scrapers-fail-on-modern-sites-and-how-we-hit-90-success-li0</guid>
      <description>&lt;h1&gt;Why most scrapers fail on modern sites (and how we hit 90% success)&lt;/h1&gt;

&lt;p&gt;I've been building web scrapers long enough to notice a pattern: most scrapers quietly fail on 30-40% of the web and return nothing useful.&lt;/p&gt;

&lt;p&gt;Not an error. Just a 403, a Cloudflare challenge page that looks like success but isn't, or an empty HTML shell with none of the content you came for.&lt;/p&gt;
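&lt;p&gt;A scraper that only checks the status code will log these challenge pages as successes. A quick heuristic is to scan the body for challenge phrases — the marker strings below are assumptions drawn from common Cloudflare interstitials, not a complete list:&lt;/p&gt;

```python
# Heuristic check for a bot-challenge page that came back as HTTP 200.
# These markers are illustrative; real challenge pages vary by vendor.
CHALLENGE_MARKERS = (
    "Just a moment",          # common Cloudflare interstitial title
    "Checking your browser",  # legacy challenge text
    "cf-challenge",           # challenge element/form ids
)

def looks_like_challenge(html: str) -> bool:
    """Return True if the body resembles a bot challenge, not real content."""
    return any(marker in html for marker in CHALLENGE_MARKERS)
```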

&lt;p&gt;Here's what's actually happening.&lt;/p&gt;




&lt;h2&gt;The three things that kill most scrapers&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;JavaScript rendering&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;About 60% of the modern web needs JavaScript to render its content. The classic approach — fire an HTTP request, parse the HTML — returns a blank page. The data you wanted was injected by JS after load.&lt;/p&gt;

&lt;p&gt;Libraries like &lt;code&gt;requests&lt;/code&gt; and &lt;code&gt;axios&lt;/code&gt; don't execute JS. If the site needs it, you get nothing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bot detection&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Sites like LinkedIn, WSJ, Bloomberg, and most e-commerce platforms run fingerprinting before serving content. They check TLS fingerprint, HTTP header order, canvas and WebGL signatures, and whether JavaScript APIs return expected values.&lt;/p&gt;
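&lt;p&gt;Header order is the easiest of these to see for yourself: a browser emits headers in a stable sequence, while many HTTP libraries use a different one. A minimal sketch of an order check — the expected list here is illustrative, not Chrome's exact order:&lt;/p&gt;

```python
# Sketch: does the relative order of the headers we sent match a browser-like
# order? EXPECTED is an illustrative sequence, not any real browser's.
EXPECTED = ["host", "user-agent", "accept", "accept-language", "accept-encoding"]

def order_matches(sent_headers):
    """True if our headers appear in the same relative order as EXPECTED."""
    sent = [h.lower() for h in sent_headers if h.lower() in EXPECTED]
    expected = [h for h in EXPECTED if h in sent]
    return sent == expected
```

A detector runs the inverse of this comparison on every request, which is why spoofing the User-Agent string alone changes nothing.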

&lt;p&gt;A headless browser with default settings fails most of these. The TLS fingerprint alone is usually enough to get blocked.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;IP reputation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Datacenter IPs are blocklisted on sight by most serious platforms. Cloudflare knows every AWS, GCP, and Azure IP range. Requests from these get challenged immediately, even if you pass the fingerprint checks.&lt;/p&gt;
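&lt;p&gt;This check is trivial to run because the cloud providers publish their ranges. A stdlib-only sketch — the two CIDR blocks are illustrative examples, not a real blocklist:&lt;/p&gt;

```python
import ipaddress

# Illustrative CIDR blocks only; production blocklists cover thousands of
# published cloud ranges (AWS, GCP, and Azure all publish theirs as JSON).
DATACENTER_RANGES = [
    ipaddress.ip_network("3.0.0.0/9"),     # example AWS EC2 block
    ipaddress.ip_network("34.64.0.0/10"),  # example GCP block
]

def is_datacenter_ip(ip: str) -> bool:
    """Return True if the address falls inside a known datacenter range."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in DATACENTER_RANGES)
```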




&lt;h2&gt;What 90% success actually requires&lt;/h2&gt;

&lt;p&gt;We built a three-tier fallback for anybrowse:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Direct HTTP&lt;/strong&gt; — Fast, works on about 50% of sites. Pure HTTP with spoofed headers. No JS, returns in under 2 seconds.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Headless browser with anti-detection&lt;/strong&gt; — For sites that need JS. Handles viewport randomization, navigator property spoofing, and session warming so the browser doesn't look freshly launched. Covers another 30%.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Real browser over residential proxy&lt;/strong&gt; — The last resort. Routes through a real Chrome instance on a residential IP. This is what gets through Cloudflare's hardest challenges and paywalls. Slow (10-30s), but it works on things the other two can't touch.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
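&lt;p&gt;The chain reduces to a simple pattern: try the cheapest tier first, treat an empty or challenge response as failure, escalate. A minimal sketch — the tier callables and result shape are illustrative, not anybrowse's internal API:&lt;/p&gt;

```python
# Each tier is a callable taking a URL and returning extracted content,
# or None on failure. Tiers are tried in order, cheapest first.
def scrape_with_fallback(url, tiers):
    for name, tier in tiers:
        result = tier(url)
        if result:  # non-empty content counts as success
            return {"tier": name, "content": result}
    return {"tier": None, "content": None}  # every tier failed
```

In practice the "failure" test is where the subtlety lives: a 200 with a challenge page in the body has to count as a miss, or tier 1 silently swallows everything.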

&lt;p&gt;The 90% figure comes from production logs across diverse URLs — news sites, e-commerce, social platforms, paywalls. Not benchmarks on example.com.&lt;/p&gt;




&lt;h2&gt;What still fails&lt;/h2&gt;

&lt;p&gt;I'll be straight: 10% of the web still beats us. Login-walled content that requires account age and interaction history, sites that CAPTCHA every request, pages that detect residential proxies by behavioral patterns rather than IP.&lt;/p&gt;

&lt;p&gt;If a site requires you to be logged in and browsing for 30 minutes before showing content, no scraper solves that cleanly.&lt;/p&gt;




&lt;h2&gt;Quick test&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST https://anybrowse.dev/scrape &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"url": "https://techcrunch.com"}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No key needed for the first 10 requests per day. The response tells you which tier handled it and how long it took.&lt;/p&gt;
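&lt;p&gt;Pulling the tier and timing out of the response is one line of JSON handling. The field names below are hypothetical, for illustration only — check the docs for the real schema:&lt;/p&gt;

```python
import json

# Hypothetical response shape; "tier" and "duration_ms" are assumed field
# names, not confirmed by the API docs.
sample = json.dumps({"markdown": "# TechCrunch", "tier": "direct_http", "duration_ms": 1420})

data = json.loads(sample)
print(data["tier"], data["duration_ms"])
```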

&lt;p&gt;If you're building an AI agent that needs to read the web, the MCP config is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"anybrowse"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"npx"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"-y"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"anybrowse-mcp"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;p&gt;Bot detection keeps getting more aggressive as AI traffic increases. A single-tier scraper that worked fine two years ago now fails on a big chunk of the web. A fallback chain that tries multiple approaches is the only thing that keeps success rates above 50%.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>python</category>
      <category>ai</category>
      <category>javascript</category>
    </item>
    <item>
      <title>How I built a scraper that actually works on Cloudflare sites</title>
      <dc:creator>anybrowse</dc:creator>
      <pubDate>Sun, 08 Mar 2026 07:15:21 +0000</pubDate>
      <link>https://dev.to/anybrowse_dev/how-i-built-a-scraper-that-actually-works-on-cloudflare-sites-89i</link>
      <guid>https://dev.to/anybrowse_dev/how-i-built-a-scraper-that-actually-works-on-cloudflare-sites-89i</guid>
      <description>&lt;p&gt;I was building a research agent. It needed to read news sites, pull earnings reports, scrape job listings. Three hours in, half my URLs were returning empty strings or Cloudflare challenge pages. Not errors. Just nothing useful.&lt;/p&gt;

&lt;p&gt;That is when I realized the scraping ecosystem is mostly broken for anything that is not a static blog.&lt;/p&gt;

&lt;h2&gt;Why scraping keeps failing&lt;/h2&gt;

&lt;p&gt;There are three things killing most scrapers right now.&lt;/p&gt;

&lt;p&gt;JavaScript rendering. A lot of sites ship an empty HTML shell and hydrate via React or Vue. Fetch the URL directly and you get a div with an id and nothing else.&lt;/p&gt;

&lt;p&gt;Bot detection. Cloudflare, PerimeterX, DataDome -- they fingerprint your browser. Missing plugins, wrong screen resolution, suspiciously perfect mouse timing. A vanilla Playwright script fails all of these in about 30 seconds.&lt;/p&gt;

&lt;p&gt;IP reputation. Datacenter IPs are flagged before your code even runs. AWS, Hetzner, DigitalOcean -- blocked by default on half the sites worth scraping.&lt;/p&gt;

&lt;p&gt;You can fight each of these individually. Or you can just not deal with it.&lt;/p&gt;

&lt;h2&gt;What I built&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://anybrowse.dev" rel="noopener noreferrer"&gt;anybrowse&lt;/a&gt; takes a URL and gives you clean Markdown. That is the whole API.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;pip&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="n"&gt;anybrowse&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;anybrowse&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AnybrowseClient&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AnybrowseClient&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;scrape&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://techcrunch.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;markdown&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Under the hood it runs patched Chromium with randomized fingerprints, falls back to residential ISP proxies when the first attempt fails, and uses a Firefox-based engine (Camoufox) for sites that specifically profile Chrome. CAPTCHA solving is built in via CapSolver.&lt;/p&gt;

&lt;p&gt;For AI agents, there is an MCP server that works out of the box with Claude Desktop, Cursor, and Windsurf:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"anybrowse"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"streamable-http"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://anybrowse.dev/mcp"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Your agent gets scrape, crawl, search, batch scrape, and structured extraction tools. The search endpoint goes through Brave Search API so it actually returns results instead of timing out on Google.&lt;/p&gt;
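&lt;p&gt;If you only need the Python client, batching is easy to do yourself on top of the &lt;code&gt;scrape&lt;/code&gt; call shown above. Only &lt;code&gt;client.scrape&lt;/code&gt; comes from the post; the concurrency wrapper here is my own sketch:&lt;/p&gt;

```python
from concurrent.futures import ThreadPoolExecutor

# Client-side batching sketch: fan out scrape calls over a thread pool and
# return a url-to-result mapping. client.scrape is the call from the post;
# everything else is illustrative.
def batch_scrape(client, urls, workers=4):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(urls, pool.map(client.scrape, urls)))
```

Threads are fine here because each call is network-bound; the server's own batch tool would avoid the extra round trips, but this works with nothing beyond the stdlib.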

&lt;h2&gt;Honest numbers&lt;/h2&gt;

&lt;p&gt;90% success rate on general websites. LinkedIn and Twitter are still hard because they require login for most content. Paywalls are a separate problem that scraping does not solve.&lt;/p&gt;

&lt;p&gt;The 10% that fails is mostly aggressive per-request CAPTCHAs and strict login walls. CapSolver helps but it is not magic.&lt;/p&gt;

&lt;h2&gt;Try it&lt;/h2&gt;

&lt;p&gt;10 free scrapes per day, no signup, no credit card. Credit packs start at $5 for 3,000 scrapes if you need more.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://anybrowse.dev" rel="noopener noreferrer"&gt;anybrowse.dev&lt;/a&gt; | &lt;a href="https://github.com/kc23go/anybrowse" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; | &lt;a href="https://anybrowse.dev/docs" rel="noopener noreferrer"&gt;Docs&lt;/a&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>webdev</category>
      <category>ai</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
