anybrowse

Why most scrapers fail on modern sites (and how we hit 90% success)

I've been building web scrapers long enough to notice a pattern: most scrapers quietly fail on 30-40% of the web and return nothing useful.

No error, either. Just a 403, or a Cloudflare challenge page that returns a 200 and looks like success, or an empty HTML shell with none of the content you wanted.

Here's what's actually happening.


The three things that kill most scrapers

JavaScript rendering

About 60% of the modern web needs JavaScript to render its content. The classic approach — fire an HTTP request, parse the HTML — returns a blank page. The data you wanted was injected by JS after load.

Libraries like requests and axios don't execute JS. If the site needs it, you get nothing.
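To make the failure mode concrete, here's a minimal sketch (plain Python, no network) of what a requests-style client sees on a JS-rendered page. The HTML below stands in for the static response of a single-page app; the scrape "succeeds" but the parse finds nothing, because the data only exists after client-side JS runs.

```python
import re

# Stand-in for what `requests.get(url).text` returns on a JS-rendered site:
# an empty mount point plus a script bundle. The product data is injected
# by JavaScript after load, so it never appears in the static HTML.
STATIC_HTML = """
<html>
  <body>
    <div id="root"></div>
    <script src="/bundle.js"></script>
  </body>
</html>
"""

def extract_prices(html: str) -> list[str]:
    # Naive parse for price markup that only exists post-render.
    return re.findall(r'class="price">([^<]+)<', html)

print(extract_prices(STATIC_HTML))  # [] -- no error raised, just no data
```

This is the "quiet failure" above: status 200, valid HTML, zero results.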

Bot detection

Sites like LinkedIn, WSJ, Bloomberg, and most e-commerce platforms run fingerprinting before serving content. They check TLS fingerprint, HTTP header order, canvas and WebGL signatures, and whether JavaScript APIs return expected values.

A headless browser with default settings fails most of these. The TLS fingerprint alone is usually enough to get blocked.
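As one concrete example of these signals, here's a sketch of a header-order check a detection layer might run. The ordering below is illustrative, not an exact browser spec; the point is that clients which emit headers alphabetically (or in arbitrary insertion order) stand out from a real browser's stable emission order.

```python
# Illustrative reference order for a browser's HTTP headers. Real detection
# systems use measured orderings per browser version; this list is a sketch.
BROWSER_ORDER = ["host", "connection", "user-agent", "accept",
                 "accept-encoding", "accept-language"]

def looks_like_browser(header_names: list[str]) -> bool:
    # Keep only headers we have a reference position for, then check that
    # their relative order matches the browser's typical emission order.
    ranked = [BROWSER_ORDER.index(h.lower()) for h in header_names
              if h.lower() in BROWSER_ORDER]
    return ranked == sorted(ranked)

# An HTTP library that sorts headers alphabetically gets flagged:
print(looks_like_browser(["Accept", "Accept-Encoding", "Host", "User-Agent"]))
# A browser-ordered request passes this particular check:
print(looks_like_browser(["Host", "Connection", "User-Agent", "Accept"]))
```

TLS fingerprinting works the same way one layer down: the order of cipher suites and extensions in the ClientHello identifies the library, before any HTTP is exchanged.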

IP reputation

Datacenter IPs are blocklisted on sight by most serious platforms. Cloudflare knows every AWS, GCP, and Azure IP range. Requests from these get challenged immediately, even if you pass the fingerprint checks.
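This is cheap to do at scale because providers publish their IP ranges, so a WAF can reject a datacenter IP with a single prefix lookup. Here's a sketch using Python's `ipaddress` module; the CIDRs are illustrative stand-ins for a real, regularly refreshed list.

```python
import ipaddress

# Illustrative cloud-provider prefixes. A production blocklist would be
# pulled from published sources (e.g. AWS's ip-ranges feed) and refreshed.
DATACENTER_RANGES = [ipaddress.ip_network(cidr) for cidr in
                     ["3.0.0.0/8", "13.52.0.0/14", "52.0.0.0/10"]]

def is_datacenter(ip: str) -> bool:
    # One containment check per prefix; real systems use a radix tree.
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in DATACENTER_RANGES)

print(is_datacenter("52.23.61.7"))  # True  -- challenge immediately
print(is_datacenter("98.97.10.2"))  # False -- proceed to fingerprint checks
```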


What 90% success actually requires

We built a three-tier fallback for anybrowse:

  1. Direct HTTP — Fast, works on about 50% of sites. Pure HTTP with spoofed headers. No JS, returns in under 2 seconds.

  2. Headless browser with anti-detection — For sites that need JS. Handles viewport randomization, navigator property spoofing, and session warming so the browser doesn't look freshly launched. Covers another 30%.

  3. Real browser over residential proxy — The last resort. Routes through a real Chrome instance on a residential IP. This is what gets through Cloudflare's hardest challenges and paywalls. Slow (10-30s), but it works on things the other two can't touch.
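The chain above can be sketched as follows. The fetchers are stubs and the names are hypothetical, not anybrowse's actual internals; the shape is what matters: try the cheapest tier first, record why it failed, escalate.

```python
class ScrapeError(Exception):
    pass

# Stubs simulating a hard target: the cheap tiers fail, the last one works.
def direct_http(url):         raise ScrapeError("403")        # tier 1
def stealth_browser(url):     raise ScrapeError("challenge")  # tier 2
def residential_browser(url): return "<html>content</html>"   # tier 3

TIERS = [("direct_http", direct_http),
         ("stealth_browser", stealth_browser),
         ("residential_browser", residential_browser)]

def scrape(url: str) -> dict:
    errors = {}
    for name, fetch in TIERS:
        try:
            # Success at any tier short-circuits; report which tier handled it.
            return {"tier": name, "html": fetch(url)}
        except ScrapeError as exc:
            errors[name] = str(exc)  # remember why this tier failed, escalate
    raise ScrapeError(f"all tiers failed: {errors}")

print(scrape("https://example.com")["tier"])  # residential_browser
```

Reporting which tier succeeded is worth keeping in the response: it tells you, per domain, whether you're paying browser-and-proxy costs for pages plain HTTP could have fetched.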

The 90% figure comes from production logs across diverse URLs — news sites, e-commerce, social platforms, paywalls. Not benchmarks on example.com.


What still fails

I'll be straight: 10% of the web still beats us. Login-walled content that requires account age and interaction history, sites that CAPTCHA every request, pages that detect residential proxies by behavioral patterns rather than IP.

If a site requires you to be logged in and browsing for 30 minutes before showing content, no scraper solves that cleanly.


Quick test

curl -X POST https://anybrowse.dev/scrape \
  -H "Content-Type: application/json" \
  -d '{"url": "https://techcrunch.com"}'

No key needed for the first 10 requests per day. The response tells you which tier handled it and how long it took.

If you're building an AI agent that needs to read the web, the MCP config is:

{
  "mcpServers": {
    "anybrowse": {
      "command": "npx",
      "args": ["-y", "anybrowse-mcp"]
    }
  }
}

Bot detection keeps getting more aggressive as AI traffic increases. A single-tier scraper that worked fine two years ago now fails on a big chunk of the web. A fallback chain that tries multiple approaches is the only thing that keeps success rates above 50%.
