You've handled CAPTCHAs, rotated proxies, and spoofed your User-Agent. Your scraper still gets blocked. Why?
Because modern anti-bot systems don't rely on a single detection method. They stack multiple layers — and you need to understand all of them to build a scraper that survives in production.
This guide maps out every detection technique sites use today, from basic to advanced.
The Detection Layers
```
Layer 7 ──── Application Logic
             ├── Rate limiting
             ├── Behavioral analysis
             └── Business logic traps
Layer 6 ──── CAPTCHA Challenges
             ├── reCAPTCHA v2/v3
             ├── hCaptcha / Enterprise
             ├── Cloudflare Turnstile
             └── FunCaptcha
Layer 5 ──── JavaScript Challenges
             ├── Browser fingerprinting
             ├── Canvas/WebGL hashing
             └── Proof-of-work puzzles
Layer 4 ──── HTTP Analysis
             ├── Header order/consistency
             ├── TLS fingerprint (JA3/JA4)
             └── HTTP/2 fingerprint
Layer 3 ──── Network Layer
             ├── IP reputation
             ├── ASN classification
             ├── Geo-location matching
             └── DNS analysis
```
Each layer catches different kinds of bots. Let's go through them all.
Layer 3: Network Analysis
IP Reputation
Anti-bot services maintain databases of known bad IPs:
```python
import ipaddress

import httpx

ABUSEIPDB_KEY = "..."  # your AbuseIPDB API key

def check_ip_reputation(ip: str) -> dict:
    """Check if an IP is flagged in common databases."""
    checks = {}
    # AbuseIPDB
    resp = httpx.get(
        "https://api.abuseipdb.com/api/v2/check",
        params={"ipAddress": ip},
        headers={"Key": ABUSEIPDB_KEY, "Accept": "application/json"},
    ).json()
    checks["abuse_score"] = resp["data"]["abuseConfidenceScore"]
    # Check if datacenter IP
    checks["is_datacenter"] = is_datacenter_ip(ip)
    return checks

# Illustrative ranges only; these providers own many more
# (and narrower) blocks. Use a full IP-to-ASN database in production.
DATACENTER_RANGES = [
    "13.0.0.0/8",      # AWS (partial)
    "34.0.0.0/8",      # GCP (partial)
    "40.0.0.0/8",      # Azure (partial)
    "104.16.0.0/12",   # Cloudflare
    "157.240.0.0/16",  # Meta
]

def is_datacenter_ip(ip: str) -> bool:
    """Check if IP belongs to a known hosting provider."""
    addr = ipaddress.ip_address(ip)
    return any(
        addr in ipaddress.ip_network(cidr)
        for cidr in DATACENTER_RANGES
    )
```
How to handle it:
- Use residential proxies for sensitive targets
- Rotate IPs from different subnets
- Avoid datacenter IPs for sites with strict anti-bot
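The subnet-rotation advice can be sketched as a proxy picker that never reuses the same /24. This is a minimal illustration; the flat list of proxy IPs is a stand-in for whatever metadata your real pool exposes (country, ASN, session ID):

```python
import ipaddress
import random

def pick_diverse_proxies(proxies: list[str], count: int) -> list[str]:
    """Pick up to `count` proxies so that no two share a /24 subnet."""
    by_subnet = {}
    for ip in proxies:
        subnet = ipaddress.ip_network(f"{ip}/24", strict=False)
        by_subnet.setdefault(subnet, []).append(ip)
    # One random proxy per subnet, then sample across subnets
    candidates = [random.choice(ips) for ips in by_subnet.values()]
    return random.sample(candidates, min(count, len(candidates)))
```

Grouping by subnet first means a pool dominated by one provider block can't flood your rotation with near-identical IPs.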
ASN Classification
Sites check your IP's Autonomous System Number (ASN) to identify hosting providers:
```python
# What anti-bot services see:
# ✅ AS7922  (Comcast)      → Residential
# ✅ AS7018  (AT&T)         → Residential
# ❌ AS16509 (Amazon/AWS)   → Datacenter
# ❌ AS14061 (DigitalOcean) → Datacenter
# ⚠️ AS13335 (Cloudflare)   → CDN/VPN
```
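A toy classifier over a hand-picked ASN table shows the idea. Real services use full, continuously updated BGP/ASN databases; the table and category labels below are illustrative, not a real data source:

```python
# Tiny illustrative ASN table; production systems use a
# complete, regularly refreshed IP-to-ASN database.
ASN_CATEGORIES = {
    7922: "residential",   # Comcast
    7018: "residential",   # AT&T
    16509: "datacenter",   # Amazon/AWS
    14061: "datacenter",   # DigitalOcean
    13335: "cdn_vpn",      # Cloudflare
}

def classify_asn(asn: int) -> str:
    """Classify an ASN; unknown ASNs get a neutral label."""
    return ASN_CATEGORIES.get(asn, "unknown")
```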
Layer 4: HTTP Analysis
TLS Fingerprinting (JA3/JA4)
Every HTTP client has a unique TLS handshake signature:
```python
# Different clients, different fingerprints:
# Chrome 121:   JA3 = 771,4865-4866-4867-49195...
# Python httpx: JA3 = 771,4866-4867-4865-49196...
# curl:         JA3 = 771,4865-4867-4866-49195...

# Solution: curl_cffi mimics browser TLS
from curl_cffi import requests

resp = requests.get(
    "https://target.com",
    impersonate="chrome120",
)
# Now your JA3 matches Chrome 120's
```
Header Order
Browsers send headers in a specific order. Python libraries don't:
```python
# Chrome sends:
#   Host, Connection, sec-ch-ua, sec-ch-ua-mobile,
#   sec-ch-ua-platform, Upgrade-Insecure-Requests,
#   User-Agent, Accept, Sec-Fetch-Site...

# httpx sends:
#   Host, User-Agent, Accept, Accept-Encoding,
#   Connection...

# Fix: pass headers in Chrome's order (plain dicts preserve
# insertion order since Python 3.7) — or use curl_cffi
headers = {
    "Host": "target.com",
    "Connection": "keep-alive",
    "sec-ch-ua": '"Chromium";v="121"',
    "sec-ch-ua-mobile": "?0",
    "sec-ch-ua-platform": '"Windows"',
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0...",
    "Accept": "text/html,...",
}
```
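One way to audit your own client is to check that the relative order of the headers you send matches Chrome's. A sketch (the reference list is abbreviated from the comment above; extra headers are simply ignored):

```python
def order_matches(sent: list[str], reference: list[str]) -> bool:
    """True if headers in `sent` appear in the same relative
    order as in `reference`; headers not in the reference are ignored."""
    ref_pos = {name.lower(): i for i, name in enumerate(reference)}
    positions = [ref_pos[h.lower()] for h in sent if h.lower() in ref_pos]
    return positions == sorted(positions)

# Abbreviated Chrome order for illustration
CHROME_ORDER = [
    "Host", "Connection", "sec-ch-ua", "sec-ch-ua-mobile",
    "sec-ch-ua-platform", "Upgrade-Insecure-Requests",
    "User-Agent", "Accept",
]
```

Run it against the headers your library actually puts on the wire (e.g. captured with a local proxy), not the dict you passed in; some clients reorder or inject headers.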
HTTP/2 Fingerprinting
HTTP/2 settings (SETTINGS frame, WINDOW_UPDATE, PRIORITY) also create a unique fingerprint:
```python
# Chrome's HTTP/2 settings:
#   HEADER_TABLE_SIZE:      65536
#   MAX_CONCURRENT_STREAMS: 1000
#   INITIAL_WINDOW_SIZE:    6291456
#   MAX_HEADER_LIST_SIZE:   262144

# Python's defaults: different values → detectable

# curl_cffi handles this automatically
# when you use impersonate="chrome120"
```
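These SETTINGS values are exactly what HTTP/2 fingerprinting schemes (e.g. Akamai-style fingerprints, which also fold in WINDOW_UPDATE and priority data) key on. A sketch of the settings portion, using the SETTINGS identifiers from RFC 7540:

```python
# HTTP/2 SETTINGS identifiers from RFC 7540
SETTINGS_IDS = {
    "HEADER_TABLE_SIZE": 1,
    "MAX_CONCURRENT_STREAMS": 3,
    "INITIAL_WINDOW_SIZE": 4,
    "MAX_HEADER_LIST_SIZE": 6,
}

def settings_fingerprint(settings: dict) -> str:
    """Render SETTINGS as an 'id:value' list, in the style of the
    settings portion of Akamai-type HTTP/2 fingerprints."""
    return ";".join(
        f"{SETTINGS_IDS[name]}:{value}"
        for name, value in settings.items()
        if name in SETTINGS_IDS
    )

chrome_like = {
    "HEADER_TABLE_SIZE": 65536,
    "MAX_CONCURRENT_STREAMS": 1000,
    "INITIAL_WINDOW_SIZE": 6291456,
    "MAX_HEADER_LIST_SIZE": 262144,
}
# settings_fingerprint(chrome_like) → "1:65536;3:1000;4:6291456;6:262144"
```

If your client emits different values, or the same values in a different order, the fingerprint changes, which is why patching the User-Agent alone never helps at this layer.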
Layer 5: JavaScript Challenges
Canvas Fingerprinting
Sites render hidden canvas elements and hash the result:
```javascript
// What anti-bot JS does:
const canvas = document.createElement('canvas');
const ctx = canvas.getContext('2d');
ctx.textBaseline = "top";
ctx.font = "14px Arial";
ctx.fillText("Hello, world!", 2, 2);
// hashCode() is illustrative — real scripts ship their own hash
const hash = hashCode(canvas.toDataURL());
// Different GPUs/drivers = different hash
```
WebGL Fingerprinting
```javascript
const gl = canvas.getContext('webgl');
const debugInfo = gl.getExtension(
  'WEBGL_debug_renderer_info'
);
// Real browser:
gl.getParameter(debugInfo.UNMASKED_RENDERER_WEBGL);
// → "ANGLE (NVIDIA GeForce GTX 1080 Ti...)"

// Classic headless Chrome (software rendering):
// → "Google SwiftShader" ← DETECTED
// (newer headless builds can expose a real GPU,
//  but the defaults still leak)
```
Behavioral Analysis
Advanced systems track mouse movements, scroll patterns, and typing speed:
```javascript
// What sites collect:
const events = [];
document.addEventListener('mousemove', (e) => {
  events.push({
    type: 'mouse',
    x: e.clientX,
    y: e.clientY,
    t: Date.now()
  });
});

// Bot detection signals:
// - No mouse movement before form submit
// - Perfect straight-line mouse paths
// - Zero scroll events
// - Instant form fill (< 1 second)
// - No focus/blur events on inputs
```
How to simulate human behavior:
```python
import asyncio
import random

async def human_like_interaction(page):
    """Simulate realistic user behavior (Playwright page)."""
    # Random mouse movements
    for _ in range(random.randint(3, 7)):
        x = random.randint(100, 800)
        y = random.randint(100, 600)
        await page.mouse.move(x, y, steps=random.randint(5, 15))
        await asyncio.sleep(random.uniform(0.1, 0.5))
    # Scroll down naturally
    for _ in range(random.randint(2, 4)):
        await page.mouse.wheel(0, random.randint(100, 300))
        await asyncio.sleep(random.uniform(0.5, 1.5))

async def human_type(page, selector, text):
    """Type with human-like delays."""
    await page.click(selector)
    await asyncio.sleep(random.uniform(0.3, 0.8))
    for char in text:
        await page.keyboard.type(char, delay=random.randint(50, 150))
```
Layer 6: CAPTCHA Challenges
The most visible layer. Sites deploy CAPTCHAs when other signals suggest bot activity:
```python
# Unified CAPTCHA handler
async def handle_any_captcha(page, solver) -> bool:
    """Detect and solve any CAPTCHA type."""
    detectors = {
        ".g-recaptcha": {
            "type": "recaptcha_v2",
            "key_attr": "data-sitekey",
        },
        ".h-captcha": {
            "type": "hcaptcha",
            "key_attr": "data-sitekey",
        },
        ".cf-turnstile": {
            "type": "turnstile",
            "key_attr": "data-sitekey",
        },
        "[data-pkey]": {
            "type": "funcaptcha",
            "key_attr": "data-pkey",
        },
    }
    for selector, config in detectors.items():
        el = await page.query_selector(selector)
        if el:
            sitekey = await el.get_attribute(config["key_attr"])
            token = await solver.solve(
                captcha_type=config["type"],
                sitekey=sitekey,
                url=page.url,
            )
            # inject_token: your helper that writes the token
            # into the page (varies by CAPTCHA type)
            await inject_token(page, config["type"], token)
            return True
    return False
```
Layer 7: Application Logic
Honeypot Fields
Hidden form fields that real users never fill:
```html
<!-- Trap for bots -->
<input type="text"
       name="website"
       style="display:none"
       tabindex="-1">
```

```python
# Don't fill fields that are hidden!
async def fill_form_safely(page, data: dict):
    for field, value in data.items():
        el = await page.query_selector(f'input[name="{field}"]')
        if el and await el.is_visible():
            await el.fill(value)
        # Skip hidden fields — they're traps
```
Timing Analysis
Sites track how fast you interact:
```python
# Too fast → bot
# A real user takes 5-30 seconds to fill a form
# A bot fills it in < 1 second

async def realistic_form_fill(page, data):
    # Wait before starting (reading the page)
    await asyncio.sleep(random.uniform(2, 5))
    for field, value in data.items():
        await page.click(f'input[name="{field}"]')
        await asyncio.sleep(random.uniform(0.5, 1.5))
        await page.type(
            f'input[name="{field}"]', value,
            delay=random.randint(30, 100),
        )
        await asyncio.sleep(random.uniform(0.3, 0.8))
    # Pause before submitting
    await asyncio.sleep(random.uniform(1, 3))
```
Building a Complete Anti-Detection Stack
```python
class StealthScraper:
    """Scraper that handles all detection layers.

    ResidentialProxyPool, TLSClient, BrowserPool, apply_stealth,
    get_target_region and extract_data are placeholders for your
    own infrastructure.
    """

    def __init__(self):
        self.proxy_pool = ResidentialProxyPool()       # Layer 3
        self.tls_client = TLSClient("chrome120")       # Layer 4
        self.browser_pool = BrowserPool(stealth=True)  # Layer 5
        self.captcha_solver = CaptchaSolver(           # Layer 6
            api_base="https://www.passxapi.com"
        )

    async def scrape(self, url: str, form_data: dict | None = None) -> dict:
        # Layer 3: Get clean proxy
        proxy = await self.proxy_pool.get(region=get_target_region(url))
        async with self.browser_pool.get_page(proxy=proxy) as page:
            # Layer 5: Apply stealth patches
            await apply_stealth(page)
            # Layer 7: Simulate human behavior
            await human_like_interaction(page)
            # Navigate
            await page.goto(url)
            # Layer 6: Handle CAPTCHAs
            await handle_any_captcha(page, self.captcha_solver)
            # Layer 7: Avoid honeypots
            if form_data:
                await fill_form_safely(page, form_data)
            return await extract_data(page)
```
Detection Layer Priorities
Not all layers matter equally. Focus on what catches you:
| Layer | Detection Rate | Effort to Bypass | Priority |
|---|---|---|---|
| IP reputation | 30% | Low (proxies) | High |
| TLS fingerprint | 25% | Low (curl_cffi) | High |
| CAPTCHA | 20% | Medium (API solver) | High |
| JS fingerprint | 15% | Medium (stealth) | Medium |
| Behavioral | 5% | High (simulation) | Low |
| Honeypots | 3% | Low (skip hidden) | Low |
| Header order | 2% | Low (manual) | Low |
Start from the top. Most scrapers get blocked at layers 3-4, not 5-7.
Key Takeaways
- Detection is layered — fixing one layer while ignoring others won't work
- Start with network + TLS — these catch 55% of bots before JS even runs
- CAPTCHAs are the visible layer — but they're triggered by invisible signals
- Behavioral analysis is growing — mouse movement and timing matter more each year
- Test your stealth — use bot detection sites to audit your setup
- Always have a CAPTCHA solver — even perfect stealth can't avoid all challenges
For handling the CAPTCHA layer when it triggers, check out passxapi-python — it provides a unified API for reCAPTCHA, hCaptcha, Turnstile, and FunCaptcha, so you can focus on the other layers.
Which detection layer causes you the most trouble? Share your experience in the comments.