How I built a scraper that actually works on Cloudflare sites

#ai #python #webdev #tutorial

I was building a research agent. It needed to read news sites, pull earnings reports, scrape job listings. Three hours in, half my URLs were returning empty strings or Cloudflare challenge pages. Not errors. Just nothing useful.

That is when I realized the scraping ecosystem is mostly broken for anything that is not a static blog.

Why scraping keeps failing

There are three things killing most scrapers right now.

JavaScript rendering. A lot of sites ship an empty HTML shell and hydrate via React or Vue. Fetch the URL directly and you get a div with an id and nothing else.

Bot detection. Cloudflare, PerimeterX, and DataDome all fingerprint your browser. They look for missing plugins, an implausible screen resolution, suspiciously perfect mouse timing. A vanilla Playwright script trips every one of these checks within about 30 seconds.

IP reputation. Datacenter IPs are flagged before your code even runs. AWS, Hetzner, DigitalOcean -- blocked by default on half the sites worth scraping.
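The first two failure modes are easy to miss because the request itself succeeds. Here is a minimal sketch of a check that labels a fetched page as a challenge page, an empty JavaScript shell, or real content. The names `classify_response` and `visible_text` and the 200-character threshold are assumptions made for illustration, not part of any library. The `cf-mitigated: challenge` header is what Cloudflare documents for its challenge responses; the "Just a moment" title match is only a heuristic.

```python
from html.parser import HTMLParser


class _VisibleText(HTMLParser):
    """Collects text that sits outside script, style, and similar tags."""

    _SKIP = {"script", "style", "noscript", "template"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self._SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self._SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    """Return the human-readable text of an HTML document."""
    parser = _VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def classify_response(status, headers, body, min_chars=200):
    """Return 'challenge', 'empty_shell', or 'ok' for a fetched page.

    min_chars is an arbitrary threshold (assumption); tune it per site.
    """
    lowered = {k.lower(): v.lower() for k, v in headers.items()}
    # Header Cloudflare sets on challenge responses.
    if lowered.get("cf-mitigated") == "challenge":
        return "challenge"
    # Heuristic fallback: blocked status plus the usual interstitial title.
    if status in (403, 503) and "just a moment" in body.lower():
        return "challenge"
    # An SPA shell has markup and scripts but almost no readable text.
    if len(visible_text(body)) < min_chars:
        return "empty_shell"
    return "ok"
```

Running a check like this on every response turns "nothing useful" into a label you can count and act on. An `empty_shell` result suggests the page needs a real browser, while a `challenge` result suggests it needs a different fingerprint or IP.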

You can fight each of these individually. Or you can just not deal with any of it.

What I built

anybrowse takes a URL and gives you clean Markdown. That is the whole API.

pip install anybrowse

from anybrowse import AnybrowseClient

# Create a client and fetch a single page
client = AnybrowseClient()
result = client.scrape("https://techcrunch.com")

# The page content, converted to Markdown
print(result.markdown)

Under the hood it runs patched Chromium with randomized fingerprints, falls back to residential ISP proxies when the first attempt fails, and uses a Firefox-based engine (Camoufox) for sites that specifically profile Chrome. CAPTCHA solving is built in via CapSolver.

For AI agents, there is an MCP server that works out of the box with Claude Desktop, Cursor, and Windsurf:

{
  "mcpServers": {
    "anybrowse": {
      "type": "streamable-http",
      "url": "https://anybrowse.dev/mcp"
    }
  }
}

Your agent gets scrape, crawl, search, batch scrape, and structured extraction tools. The search endpoint goes through the Brave Search API, so it actually returns results instead of timing out on Google.

Honest numbers

90% success rate on general websites. LinkedIn and Twitter are still hard because they require login for most content. Paywalls are a separate problem that scraping does not solve.

The 10% that fails is mostly sites with aggressive per-request CAPTCHAs or strict login walls. CapSolver helps, but it is not magic.
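The post does not say how these numbers were measured. To get a comparable figure for your own URL list, one simple approach is to label each result and tally the labels. In the sketch below, the `summarize` helper and the label strings are assumptions chosen for illustration.

```python
from collections import Counter


def summarize(outcomes):
    """Return (success_rate, failure_counts) for a list of outcome labels.

    Any label other than "ok" counts as a failure (assumed convention).
    """
    counts = Counter(outcomes)
    total = sum(counts.values())
    if total == 0:
        return 0.0, {}
    rate = counts.get("ok", 0) / total
    failures = {label: n for label, n in counts.most_common() if label != "ok"}
    return rate, failures


# Made-up labels for a 20-URL run, just to show the shape of the output.
rate, failures = summarize(["ok"] * 18 + ["captcha", "login_wall"])
print(f"{rate:.0%} success, failures: {failures}")
```

Breaking the failures down by cause is what makes the headline rate useful. It shows whether the remaining misses are CAPTCHAs worth retrying or login walls that no amount of scraping will get past.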