I Scanned 8 Popular Sites for Bot Protection — Here's What Actually Stops Scrapers

#webdev #security #scraping #javascript

If you've ever built a web scraper, you know the pain: your code works perfectly in dev, then the moment you point it at a real production site — blocked. 403s, CAPTCHA walls, infinite "checking your browser" loops.

After spending an embarrassing number of hours fighting anti-bot systems (and writing about it here), I got curious about the other side of the table:

If I were defending a site, would I actually know how strong my bot protection is?

Most engineers don't. So I built a tool to find out — and ran it against 8 well-known sites. The results were more revealing than I expected.

What "anti-bot protection" actually means

When people say a site is "protected," they usually mean one of these layers is in play:

Edge WAF / CDN: Cloudflare, Akamai, AWS WAF, Imperva. These fingerprint your TLS handshake and request headers before your request even reaches the app.
Behavioral bot managers: PerimeterX (now HUMAN), DataDome, F5. These score your behavior: mouse movement, timing, navigation patterns.
Challenge layers: reCAPTCHA, hCaptcha, JS challenges. The visible "prove you're human" walls.

Here's the thing most people miss: these layers leave fingerprints of their own. A site behind Cloudflare returns a cf-ray response header. PerimeterX sets cookies with a _px prefix. DataDome sets a datadome cookie. You don't need to bypass the protection to detect it; the defense announces itself.
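To make the cookie tells concrete, here's a minimal sketch of a cookie-name check. The `COOKIE_TELLS` table and `detectFromCookies` are hypothetical names for this example, and the prefixes are commonly reported vendor defaults rather than an authoritative list, so treat any match as a hint and not as proof.

```javascript
// Cookie-name prefixes commonly associated with each vendor.
// Assumption: these are widely reported defaults and can change or be renamed.
const COOKIE_TELLS = [
  { vendor: 'Cloudflare', prefixes: ['__cf_bm', 'cf_clearance'] },
  { vendor: 'PerimeterX', prefixes: ['_px'] },
  { vendor: 'DataDome', prefixes: ['datadome'] },
  { vendor: 'Akamai', prefixes: ['_abck', 'bm_sz'] },
  { vendor: 'Imperva/Incapsula', prefixes: ['visid_incap_', 'incap_ses_'] },
];

// setCookies: array of raw Set-Cookie header strings, one per cookie.
function detectFromCookies(setCookies) {
  // Keep only the cookie name (text before the first "="), lowercased.
  const names = setCookies.map((c) => c.split('=')[0].trim().toLowerCase());

  const found = [];
  for (const { vendor, prefixes } of COOKIE_TELLS) {
    const hit = names.some((n) => prefixes.some((p) => n.startsWith(p)));
    if (hit) found.push(vendor);
  }
  return found;
}
```

A short prefix like `_px` can collide with unrelated cookies, which is one reason to combine this with other signals instead of trusting it alone.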

What I found scanning real sites

I checked response headers and body signatures across a handful of sites. A few patterns stood out:

E-commerce marketplaces were all over the map. Some had aggressive PerimeterX/DataDome behavioral scoring. Others? Basically wide open — full product data, prices, seller info, all extractable with a standard browser fingerprint.
Sites that felt secure often weren't. A polished UI says nothing about backend protection. I saw consumer brands with zero detectable bot defense.
The strongest defenders layered edge WAF + behavioral scoring + challenges. That combination is genuinely hard to get past: it triggers human-verification friction even for legitimate automation.

The uncomfortable takeaway: plenty of production sites show no meaningful bot protection at all, at least none that is detectable from the outside. If you run one, "we haven't been scraped yet" is not a strategy.

How detection works (the useful part)

You can build a basic detector yourself. The core idea:

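function detectFromHeaders(headers) {
  // Normalize header names to lowercase so lookups are case-insensitive.
  const h = {};
  for (const [k, v] of Object.entries(headers)) h[k.toLowerCase()] = String(v);

  // Server values vary in casing ("cloudflare", "Cloudflare"), so compare lowercased.
  const server = (h['server'] || '').toLowerCase();

  const found = [];
  if (h['cf-ray'] || server.includes('cloudflare'))
    found.push('Cloudflare');
  // x-amz-cf-id and awselb only show the site sits behind CloudFront or an AWS
  // load balancer; they suggest, but do not prove, that AWS WAF/Shield is on.
  if (h['x-amz-cf-id'] || server.includes('awselb'))
    found.push('AWS WAF/Shield');
  if (server.includes('incapsula'))
    found.push('Imperva/Incapsula');
  // PerimeterX, DataDome, Akamai, F5 each have their own tells...
  return found;
}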
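The function above expects a plain object, while `fetch` hands back a `Headers` instance. Here's a rough sketch of the glue, assuming a runtime with a global `fetch` and `Headers` (modern browsers, Node 18+). `headersToObject` and `probe` are hypothetical helper names for this example.

```javascript
// Convert a WHATWG Headers object (what fetch returns) into a plain object.
// Iterating Headers yields [name, value] pairs with lowercased names.
function headersToObject(headers) {
  const out = {};
  for (const [name, value] of headers) {
    // If a name repeats (e.g. set-cookie), join the values so none are lost.
    out[name] = name in out ? `${out[name]}, ${value}` : value;
  }
  return out;
}

// Hypothetical helper: fetch a URL and run a detector over its response headers.
// `detect` is any function that takes a plain headers object and returns an array.
async function probe(url, detect) {
  const res = await fetch(url, { redirect: 'follow' });
  return { status: res.status, found: detect(headersToObject(res.headers)) };
}

// Example usage (performs a real network request):
// probe('https://example.com', detectFromHeaders).then(console.log);
```

Passing the detector in as an argument keeps the fetch step separate from the signature logic, so either one can be swapped out or tested on its own.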

Layer on top of that:

Body analysis — challenge pages have recognizable markers (_px, datadome, cf-challenge).
Status patterns — a 403 with a challenge body vs. a clean 200 tells you a lot.
Confidence scoring — combine signals into a single 0–100 "protection score" instead of a binary yes/no.
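Here's one way those three signals could be combined. This is a sketch, not the scoring my tool uses: the weights are made-up numbers for illustration, `detectFromBody` and `protectionScore` are hypothetical names, and `headerVendors` is assumed to be the array returned by a header detector like the one above.

```javascript
// Body markers from the list above, matched case-insensitively.
const BODY_MARKERS = {
  PerimeterX: '_px',
  DataDome: 'datadome',
  Cloudflare: 'cf-challenge',
};

// Return the vendors whose marker string appears in the response body.
function detectFromBody(body) {
  const text = body.toLowerCase();
  return Object.keys(BODY_MARKERS).filter((v) => text.includes(BODY_MARKERS[v]));
}

// Combine header, body, and status signals into a 0-100 score.
// Assumption: all weights below are illustrative, not calibrated.
function protectionScore({ status, headerVendors, body }) {
  const bodyVendors = detectFromBody(body);
  const vendors = new Set([...headerVendors, ...bodyVendors]);

  let score = 0;
  score += Math.min(vendors.size, 2) * 30;            // up to 60 for distinct vendors
  if (bodyVendors.length > 0) score += 20;            // challenge marker in the body
  if (status === 403 || status === 429) score += 20;  // request was actively refused
  return { score: Math.min(score, 100), vendors: [...vendors] };
}

// A 403 with a Cloudflare challenge body scores 30 + 20 + 20 = 70.
console.log(protectionScore({
  status: 403,
  headerVendors: ['Cloudflare'],
  body: '<div id="cf-challenge">Checking your browser</div>',
}).score); // 70
```

Capping the vendor contribution at two means a site can't reach a high score on header fingerprints alone; it also has to show a challenge or a refusal.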

This is exactly the logic I packaged into a tool.

The free tool

I put this into a small service: paste a URL, get back which anti-bot systems it's running, a protection score (0–100), and prioritized recommendations if you're underprotected.

It's free for a single scan — no signup wall to see the verdict:

👉 Run a free bot-protection scan

It tests for the big nine: Cloudflare, AWS WAF, Imperva, Akamai, PerimeterX, DataDome, F5, reCAPTCHA, and hCaptcha — then tells you where you stand.

If you run a site with real data worth protecting (pricing, inventory, user content), it's worth 30 seconds to see what an attacker would see first.

What I'd love to know

If you scan your own site — what did you get? Drop the protection score in the comments. I'm especially curious how many "0 / no protection" results show up, because in my testing it was way more than I'd have guessed.

And if you're on the offensive side building scrapers: detecting the defense layer first saves enormous time. Know what you're up against before you write a single line of bypass logic.

Built on top of XCrawl for the fetch layer. Detection logic is open to feedback — tell me what signatures I'm missing.