DEV Community

Alex Chen

The Complete Anti-Bot Detection Stack: Every Technique Sites Use to Catch Your Scraper

You've handled CAPTCHAs, rotated proxies, and spoofed your User-Agent. Your scraper still gets blocked. Why?

Because modern anti-bot systems don't rely on a single detection method. They stack multiple layers — and you need to understand all of them to build a scraper that survives in production.

This guide maps out every detection technique sites use today, from basic to advanced.

The Detection Layers

Layer 7 ──── Application Logic
              ├── Rate limiting
              ├── Behavioral analysis
              └── Business logic traps

Layer 6 ──── CAPTCHA Challenges
              ├── reCAPTCHA v2/v3
              ├── hCaptcha / Enterprise
              ├── Cloudflare Turnstile
              └── FunCaptcha

Layer 5 ──── JavaScript Challenges
              ├── Browser fingerprinting
              ├── Canvas/WebGL hashing
              └── Proof-of-work puzzles

Layer 4 ──── HTTP Analysis
              ├── Header order/consistency
              ├── TLS fingerprint (JA3/JA4)
              └── HTTP/2 fingerprint

Layer 3 ──── Network Layer
              ├── IP reputation
              ├── ASN classification
              ├── Geo-location matching
              └── DNS analysis

Each layer catches different kinds of bots. Let's go through them all.

Layer 3: Network Analysis

IP Reputation

Anti-bot services maintain databases of known bad IPs:

import os
import ipaddress

import httpx

# Assumes an AbuseIPDB API key is set in the environment
ABUSEIPDB_KEY = os.environ["ABUSEIPDB_KEY"]

def check_ip_reputation(ip: str) -> dict:
    """Check if an IP is flagged in common databases."""

    checks = {}

    # AbuseIPDB
    resp = httpx.get(
        "https://api.abuseipdb.com/api/v2/check",
        params={"ipAddress": ip},
        headers={"Key": ABUSEIPDB_KEY},
        timeout=10,
    ).json()
    checks["abuse_score"] = resp["data"]["abuseConfidenceScore"]

    # Check if datacenter IP
    checks["is_datacenter"] = is_datacenter_ip(ip)

    return checks

def is_datacenter_ip(ip: str) -> bool:
    """Check if IP belongs to a known hosting provider."""
    import ipaddress

    # Illustrative prefixes only; real providers announce many
    # more, narrower ranges, so use a maintained dataset in production
    DATACENTER_RANGES = [
        "13.0.0.0/8",     # AWS
        "34.0.0.0/8",     # GCP
        "40.0.0.0/8",     # Azure
        "104.16.0.0/12",  # Cloudflare
        "157.240.0.0/16", # Meta
    ]

    addr = ipaddress.ip_address(ip)
    return any(
        addr in ipaddress.ip_network(cidr)
        for cidr in DATACENTER_RANGES
    )

How to handle it:

  • Use residential proxies for sensitive targets
  • Rotate IPs from different subnets
  • Avoid datacenter IPs for sites with strict anti-bot
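
Subnet diversity matters because reputation systems often flag a whole /24 block at once, not single addresses. Here's a minimal sketch of subnet-aware rotation (the "ip:port" pool format and the `pick_diverse_proxies` helper are assumptions for illustration, not a real library API):

```python
import random
from ipaddress import ip_address

def pick_diverse_proxies(proxies: list[str], n: int) -> list[str]:
    """Pick up to n proxies, no two sharing a /24 subnet."""
    chosen, seen = [], set()
    # Shuffle so we don't always prefer the same pool entries
    for proxy in random.sample(proxies, len(proxies)):
        ip = proxy.split(":")[0]
        subnet = int(ip_address(ip)) >> 8  # drop the last octet
        if subnet not in seen:
            seen.add(subnet)
            chosen.append(proxy)
        if len(chosen) == n:
            break
    return chosen
```

If two pool entries share a /24, only one survives the pick, so a burst of bans on one subnet can't take out your whole rotation.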

ASN Classification

Sites check your IP's Autonomous System Number (ASN) to identify hosting providers:

# What anti-bot services see:
# ✅ AS7922 (Comcast) → Residential
# ✅ AS7018 (AT&T) → Residential
# ❌ AS16509 (Amazon/AWS) → Datacenter
# ❌ AS14061 (DigitalOcean) → Datacenter
# ⚠️ AS13335 (Cloudflare) → CDN/VPN
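
You can't change your ASN, only pick infrastructure whose ASN looks right. As a sketch of the classification logic itself (toy table using the ASNs above; real systems resolve IP to ASN with BGP or GeoIP data such as MaxMind's GeoLite2-ASN, which this example skips):

```python
# Toy ASN -> category table; the ASNs are real, the list is tiny
ASN_CATEGORIES = {
    7922: "residential",   # Comcast
    7018: "residential",   # AT&T
    16509: "datacenter",   # Amazon/AWS
    14061: "datacenter",   # DigitalOcean
    13335: "cdn_vpn",      # Cloudflare
}

def classify_asn(asn: int) -> str:
    """Classify an ASN; unknown ASNs need a real lookup."""
    return ASN_CATEGORIES.get(asn, "unknown")
```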

Layer 4: HTTP Analysis

TLS Fingerprinting (JA3/JA4)

Every HTTP client has a unique TLS handshake signature:

# Different clients, different fingerprints:
# Chrome 121:  JA3 = 771,4865-4866-4867-49195...
# Python httpx: JA3 = 771,4866-4867-4865-49196...
# curl:         JA3 = 771,4865-4867-4866-49195...

# Solution: curl_cffi mimics browser TLS
from curl_cffi import requests

resp = requests.get(
    "https://target.com",
    impersonate="chrome120"
)
# Now your JA3 matches Chrome 120 exactly

Header Order

Browsers send headers in a specific order. Python libraries don't:

# Chrome sends:
# Host, Connection, sec-ch-ua, sec-ch-ua-mobile,
# sec-ch-ua-platform, Upgrade-Insecure-Requests,
# User-Agent, Accept, Sec-Fetch-Site...

# httpx sends:
# Host, User-Agent, Accept, Accept-Encoding,
# Connection...

# Fix: plain dicts keep insertion order (Python 3.7+), but your
# HTTP library may still reorder or add headers; curl_cffi
# reproduces the browser's exact order
headers = {
    "Host": "target.com",
    "Connection": "keep-alive",
    "sec-ch-ua": '"Chromium";v="121"',
    "sec-ch-ua-mobile": "?0",
    "sec-ch-ua-platform": '"Windows"',
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0...",
    "Accept": "text/html,...",
}

HTTP/2 Fingerprinting

HTTP/2 settings (SETTINGS frame, WINDOW_UPDATE, PRIORITY) also create a unique fingerprint:

# Chrome's HTTP/2 settings:
# HEADER_TABLE_SIZE: 65536
# MAX_CONCURRENT_STREAMS: 1000
# INITIAL_WINDOW_SIZE: 6291456
# MAX_HEADER_LIST_SIZE: 262144

# Python's default: different values
# → Detectable

# curl_cffi handles this automatically
# when you use impersonate="chrome120"
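
To get a feel for how rigid this signal is, here's a toy comparison of an observed SETTINGS frame against the Chrome values quoted above. Real fingerprints (like Akamai's HTTP/2 fingerprint string) also encode frame order and priority data; this sketch checks values only:

```python
# Chrome's SETTINGS values, as listed above
CHROME_H2_SETTINGS = {
    "HEADER_TABLE_SIZE": 65536,
    "MAX_CONCURRENT_STREAMS": 1000,
    "INITIAL_WINDOW_SIZE": 6291456,
    "MAX_HEADER_LIST_SIZE": 262144,
}

def matches_chrome_h2(observed: dict) -> bool:
    """True only if every Chrome SETTINGS value matches exactly."""
    return all(
        observed.get(key) == value
        for key, value in CHROME_H2_SETTINGS.items()
    )
```

One off-by-default value (say, a library's 65535 initial window) is enough to fail the match, which is why hand-rolled HTTP/2 clients get flagged.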

Layer 5: JavaScript Challenges

Canvas Fingerprinting

Sites render hidden canvas elements and hash the result:

// What anti-bot JS does:
const canvas = document.createElement('canvas');
const ctx = canvas.getContext('2d');
ctx.textBaseline = "top";
ctx.font = "14px Arial";
ctx.fillText("Hello, world!", 2, 2);
// toDataURL() returns a string, which has no hashCode method;
// anti-bot scripts bundle their own hash (fnvHash is a stand-in)
const hash = fnvHash(canvas.toDataURL());
// Different GPUs/drivers = different hash

WebGL Fingerprinting

const gl = canvas.getContext('webgl');
const debugInfo = gl.getExtension(
    'WEBGL_debug_renderer_info'
);

// Real browser:
gl.getParameter(debugInfo.UNMASKED_RENDERER_WEBGL);
// → "ANGLE (NVIDIA GeForce GTX 1080 Ti...)"

// Headless Chrome:
// → "Google SwiftShader" ← DETECTED

Behavioral Analysis

Advanced systems track mouse movements, scroll patterns, and typing speed:

// What sites collect:
document.addEventListener('mousemove', (e) => {
    events.push({
        type: 'mouse',
        x: e.clientX, 
        y: e.clientY,
        t: Date.now()
    });
});

// Bot detection signals:
// - No mouse movement before form submit
// - Perfect straight-line mouse paths
// - Zero scroll events
// - Instant form fill (< 1 second)
// - No focus/blur events on inputs

How to simulate human behavior:

import random
import asyncio

async def human_like_interaction(page):
    """Simulate realistic user behavior."""

    # Random mouse movements
    for _ in range(random.randint(3, 7)):
        x = random.randint(100, 800)
        y = random.randint(100, 600)
        await page.mouse.move(x, y, steps=random.randint(5, 15))
        await asyncio.sleep(random.uniform(0.1, 0.5))

    # Scroll down naturally
    for _ in range(random.randint(2, 4)):
        await page.mouse.wheel(
            0, random.randint(100, 300)
        )
        await asyncio.sleep(random.uniform(0.5, 1.5))

    # Helper: type with human-like delays (call once per input field)
    async def human_type(selector, text):
        await page.click(selector)
        await asyncio.sleep(random.uniform(0.3, 0.8))
        for char in text:
            await page.keyboard.type(
                char, delay=random.randint(50, 150)
            )

Layer 6: CAPTCHA Challenges

The most visible layer. Sites deploy CAPTCHAs when other signals suggest bot activity:

# Unified CAPTCHA handler
async def handle_any_captcha(
    page, solver
) -> bool:
    """Detect and solve any CAPTCHA type."""

    detectors = {
        ".g-recaptcha": {
            "type": "recaptcha_v2",
            "key_attr": "data-sitekey"
        },
        ".h-captcha": {
            "type": "hcaptcha",
            "key_attr": "data-sitekey"
        },
        ".cf-turnstile": {
            "type": "turnstile",
            "key_attr": "data-sitekey"
        },
        "[data-pkey]": {
            "type": "funcaptcha",
            "key_attr": "data-pkey"
        },
    }

    for selector, config in detectors.items():
        el = await page.query_selector(selector)
        if el:
            sitekey = await el.get_attribute(
                config["key_attr"]
            )
            token = await solver.solve(
                captcha_type=config["type"],
                sitekey=sitekey,
                url=page.url,
            )
            await inject_token(page, config["type"], token)
            return True

    return False
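
The handler above leaves `inject_token` undefined. Here's a minimal sketch, assuming Playwright's `page.evaluate` and the standard hidden response fields each provider adds to the page (FunCaptcha is omitted because its token delivery varies by integration):

```python
# Well-known hidden response field for each provider
RESPONSE_FIELDS = {
    "recaptcha_v2": "g-recaptcha-response",
    "hcaptcha": "h-captcha-response",
    "turnstile": "cf-turnstile-response",
}

def token_injection_js(captcha_type: str, token: str) -> str:
    """Build the JS that writes the solved token into the page."""
    field = RESPONSE_FIELDS[captcha_type]
    return (
        f'document.querySelector(\'[name="{field}"]\')'
        f'.value = "{token}";'
    )

async def inject_token(page, captcha_type: str, token: str):
    # Write the token, then let the site's own callback pick it up
    await page.evaluate(token_injection_js(captcha_type, token))
```

Some sites also expect a JS callback to fire after injection; check the widget's `data-callback` attribute if a bare token isn't accepted.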

Layer 7: Application Logic

Honeypot Fields

Hidden form fields that real users never fill:

<!-- Trap for bots -->
<input type="text" 
       name="website" 
       style="display:none" 
       tabindex="-1">
# Don't fill fields that are hidden!
async def fill_form_safely(page, data: dict):
    for field, value in data.items():
        el = await page.query_selector(
            f'input[name="{field}"]'
        )
        if el:
            visible = await el.is_visible()
            if visible:
                await el.fill(value)
            # Skip hidden fields — they're traps
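
One caveat: `is_visible()` catches `display:none`, but some honeypots are positioned off-screen and still report as visible. A small heuristic you could run against each field's inline style and tabindex (the hint list is illustrative, not exhaustive):

```python
# Style fragments that commonly mark honeypot fields
HONEYPOT_STYLE_HINTS = (
    "display:none",
    "visibility:hidden",
    "opacity:0",
    "left:-9999",
    "top:-9999",
)

def looks_like_honeypot(style: str, tabindex=None) -> bool:
    """Flag fields hidden via inline styles or tabbed out of reach."""
    normalized = style.replace(" ", "").lower()
    return tabindex == "-1" or any(
        hint in normalized for hint in HONEYPOT_STYLE_HINTS
    )
```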

Timing Analysis

Sites track how fast you interact:

# Too fast → bot
# A real user takes 5-30 seconds to fill a form
# A bot fills it in < 1 second

async def realistic_form_fill(page, data):
    # Wait before starting (reading the page)
    await asyncio.sleep(random.uniform(2, 5))

    for field, value in data.items():
        await page.click(f'input[name="{field}"]')
        await asyncio.sleep(random.uniform(0.5, 1.5))
        await page.type(
            f'input[name="{field}"]', value,
            delay=random.randint(30, 100)
        )
        await asyncio.sleep(random.uniform(0.3, 0.8))

    # Pause before submitting
    await asyncio.sleep(random.uniform(1, 3))
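
Rate limiting appears in the layer diagram but deserves code too: perfectly even request spacing is itself a bot signal, so cap your rate with jitter. A simple sketch (`JitteredRateLimiter` is an illustrative helper, not a library class):

```python
import random
import time

class JitteredRateLimiter:
    """Cap request rate with randomized spacing so intervals
    between requests don't form a detectable fixed pattern."""

    def __init__(self, max_per_minute: int):
        self.base_delay = 60.0 / max_per_minute
        self.last_request = 0.0

    def wait(self):
        # Jitter the target gap to 50-150% of the base delay
        delay = self.base_delay * random.uniform(0.5, 1.5)
        elapsed = time.monotonic() - self.last_request
        if elapsed < delay:
            time.sleep(delay - elapsed)
        self.last_request = time.monotonic()
```

Call `limiter.wait()` before each request; it sleeps only as long as needed to keep the average rate under the cap.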

Building a Complete Anti-Detection Stack

class StealthScraper:
    """Scraper that handles all detection layers."""

    def __init__(self):
        self.proxy_pool = ResidentialProxyPool()  # Layer 3
        self.tls_client = TLSClient("chrome120")  # Layer 4
        self.browser_pool = BrowserPool(stealth=True)  # Layer 5
        self.captcha_solver = CaptchaSolver(  # Layer 6
            api_base="https://www.passxapi.com"
        )

    async def scrape(self, url: str) -> dict:
        # Layer 3: Get clean proxy
        proxy = await self.proxy_pool.get(
            region=get_target_region(url)
        )

        async with self.browser_pool.get_page(
            proxy=proxy
        ) as page:
            # Layer 5: Apply stealth patches
            await apply_stealth(page)

            # Navigate first, then act on the page
            await page.goto(url)

            # Layer 7: Simulate human behavior
            await human_like_interaction(page)

            # Layer 6: Handle CAPTCHAs
            await handle_any_captcha(
                page, self.captcha_solver
            )

            # Layer 7: Avoid honeypots
            await fill_form_safely(page, form_data)

            return await extract_data(page)

Detection Layer Priorities

Not all layers matter equally. Focus on what catches you:

Layer            Detection Rate   Effort to Bypass      Priority
IP reputation    30%              Low (proxies)         High
TLS fingerprint  25%              Low (curl_cffi)       High
CAPTCHA          20%              Medium (API solver)   High
JS fingerprint   15%              Medium (stealth)      Medium
Behavioral       5%               High (simulation)     Low
Honeypots        3%               Low (skip hidden)     Low
Header order     2%               Low (manual)          Low

Start from the top. Most scrapers get blocked at layers 3-4, not 5-7.

Key Takeaways

  1. Detection is layered — fixing one layer while ignoring others won't work
  2. Start with network + TLS — these catch 55% of bots before JS even runs
  3. CAPTCHAs are the visible layer — but they're triggered by invisible signals
  4. Behavioral analysis is growing — mouse movement and timing matter more each year
  5. Test your stealth — use bot detection sites to audit your setup
  6. Always have a CAPTCHA solver — even perfect stealth can't avoid all challenges

For handling the CAPTCHA layer when it triggers, check out passxapi-python — it provides a unified API for reCAPTCHA, hCaptcha, Turnstile, and FunCaptcha, so you can focus on the other layers.


Which detection layer causes you the most trouble? Share your experience in the comments.
