You've handled CAPTCHAs, rotated proxies, and spoofed your User-Agent. Your scraper still gets blocked. Why?
Because modern anti-bot systems don't rely on a single detection method. They stack multiple layers — and you need to understand all of them to build a scraper that survives in production.
This guide maps out every detection technique sites use today, from basic to advanced.
The Detection Layers
```
Layer 7 ──── Application Logic
             ├── Rate limiting
             ├── Behavioral analysis
             └── Business logic traps
Layer 6 ──── CAPTCHA Challenges
             ├── reCAPTCHA v2/v3
             ├── hCaptcha / Enterprise
             ├── Cloudflare Turnstile
             └── FunCaptcha
Layer 5 ──── JavaScript Challenges
             ├── Browser fingerprinting
             ├── Canvas/WebGL hashing
             └── Proof-of-work puzzles
Layer 4 ──── HTTP Analysis
             ├── Header order/consistency
             ├── TLS fingerprint (JA3/JA4)
             └── HTTP/2 fingerprint
Layer 3 ──── Network Layer
             ├── IP reputation
             ├── ASN classification
             ├── Geo-location matching
             └── DNS analysis
```
Each layer catches different kinds of bots. Let's go through them all.
Layer 3: Network Analysis
IP Reputation
Anti-bot services maintain databases of known bad IPs:
```python
import ipaddress

import httpx

ABUSEIPDB_KEY = "..."  # your AbuseIPDB API key

def check_ip_reputation(ip: str) -> dict:
    """Check if an IP is flagged in common databases."""
    checks = {}
    # AbuseIPDB
    resp = httpx.get(
        "https://api.abuseipdb.com/api/v2/check",
        params={"ipAddress": ip},
        headers={"Key": ABUSEIPDB_KEY, "Accept": "application/json"},
    ).json()
    checks["abuse_score"] = resp["data"]["abuseConfidenceScore"]
    # Check if datacenter IP
    checks["is_datacenter"] = is_datacenter_ip(ip)
    return checks

# Illustrative ranges only; these providers own many more
# (and narrower) blocks. Use a full IP-to-ASN database in production.
DATACENTER_RANGES = [
    "13.0.0.0/8",      # AWS (partial)
    "34.0.0.0/8",      # GCP (partial)
    "40.0.0.0/8",      # Azure (partial)
    "104.16.0.0/12",   # Cloudflare
    "157.240.0.0/16",  # Meta
]

def is_datacenter_ip(ip: str) -> bool:
    """Check if IP belongs to a known hosting provider."""
    addr = ipaddress.ip_address(ip)
    return any(
        addr in ipaddress.ip_network(cidr)
        for cidr in DATACENTER_RANGES
    )
```
How to handle it:
- Use residential proxies for sensitive targets
- Rotate IPs from different subnets
- Avoid datacenter IPs for sites with strict anti-bot
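The subnet-rotation advice can be sketched as a proxy picker that never reuses the same /24. This is a minimal illustration; the flat list of proxy IPs is a stand-in for whatever metadata your real pool exposes (country, ASN, session ID):

```python
import ipaddress
import random

def pick_diverse_proxies(proxies: list[str], count: int) -> list[str]:
    """Pick up to `count` proxies so that no two share a /24 subnet."""
    by_subnet = {}
    for ip in proxies:
        subnet = ipaddress.ip_network(f"{ip}/24", strict=False)
        by_subnet.setdefault(subnet, []).append(ip)
    # One random proxy per subnet, then sample across subnets
    candidates = [random.choice(ips) for ips in by_subnet.values()]
    return random.sample(candidates, min(count, len(candidates)))
```

Grouping by subnet first means a pool dominated by one provider block can't flood your rotation with near-identical IPs.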
ASN Classification
Sites check your IP's Autonomous System Number (ASN) to identify hosting providers:
```python
# What anti-bot services see:
# ✅ AS7922  (Comcast)      → Residential
# ✅ AS7018  (AT&T)         → Residential
# ❌ AS16509 (Amazon/AWS)   → Datacenter
# ❌ AS14061 (DigitalOcean) → Datacenter
# ⚠️ AS13335 (Cloudflare)   → CDN/VPN
```
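A toy classifier over a hand-picked ASN table shows the idea. Real services use full, continuously updated BGP/ASN databases; the table and category labels below are illustrative, not a real data source:

```python
# Tiny illustrative ASN table; production systems use a
# complete, regularly refreshed IP-to-ASN database.
ASN_CATEGORIES = {
    7922: "residential",   # Comcast
    7018: "residential",   # AT&T
    16509: "datacenter",   # Amazon/AWS
    14061: "datacenter",   # DigitalOcean
    13335: "cdn_vpn",      # Cloudflare
}

def classify_asn(asn: int) -> str:
    """Classify an ASN; unknown ASNs get a neutral label."""
    return ASN_CATEGORIES.get(asn, "unknown")
```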
Layer 4: HTTP Analysis
TLS Fingerprinting (JA3/JA4)
Every HTTP client has a unique TLS handshake signature:
```python
# Different clients, different fingerprints:
# Chrome 121:   JA3 = 771,4865-4866-4867-49195...
# Python httpx: JA3 = 771,4866-4867-4865-49196...
# curl:         JA3 = 771,4865-4867-4866-49195...

# Solution: curl_cffi mimics browser TLS
from curl_cffi import requests

resp = requests.get(
    "https://target.com",
    impersonate="chrome120",
)
# Now your JA3 matches Chrome 120's
```
Header Order
Browsers send headers in a specific order. Python libraries don't:
```python
# Chrome sends:
#   Host, Connection, sec-ch-ua, sec-ch-ua-mobile,
#   sec-ch-ua-platform, Upgrade-Insecure-Requests,
#   User-Agent, Accept, Sec-Fetch-Site...

# httpx sends:
#   Host, User-Agent, Accept, Accept-Encoding,
#   Connection...

# Fix: pass headers in Chrome's order (plain dicts preserve
# insertion order since Python 3.7) — or use curl_cffi
headers = {
    "Host": "target.com",
    "Connection": "keep-alive",
    "sec-ch-ua": '"Chromium";v="121"',
    "sec-ch-ua-mobile": "?0",
    "sec-ch-ua-platform": '"Windows"',
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0...",
    "Accept": "text/html,...",
}
```
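One way to audit your own client is to check that the relative order of the headers you send matches Chrome's. A sketch (the reference list is abbreviated from the comment above; extra headers are simply ignored):

```python
def order_matches(sent: list[str], reference: list[str]) -> bool:
    """True if headers in `sent` appear in the same relative
    order as in `reference`; headers not in the reference are ignored."""
    ref_pos = {name.lower(): i for i, name in enumerate(reference)}
    positions = [ref_pos[h.lower()] for h in sent if h.lower() in ref_pos]
    return positions == sorted(positions)

# Abbreviated Chrome order for illustration
CHROME_ORDER = [
    "Host", "Connection", "sec-ch-ua", "sec-ch-ua-mobile",
    "sec-ch-ua-platform", "Upgrade-Insecure-Requests",
    "User-Agent", "Accept",
]
```

Run it against the headers your library actually puts on the wire (e.g. captured with a local proxy), not the dict you passed in; some clients reorder or inject headers.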
HTTP/2 Fingerprinting
HTTP/2 settings (SETTINGS frame, WINDOW_UPDATE, PRIORITY) also create a unique fingerprint:
```python
# Chrome's HTTP/2 settings:
#   HEADER_TABLE_SIZE:      65536
#   MAX_CONCURRENT_STREAMS: 1000
#   INITIAL_WINDOW_SIZE:    6291456
#   MAX_HEADER_LIST_SIZE:   262144

# Python's defaults: different values → detectable

# curl_cffi handles this automatically
# when you use impersonate="chrome120"
```
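These SETTINGS values are exactly what HTTP/2 fingerprinting schemes (e.g. Akamai-style fingerprints, which also fold in WINDOW_UPDATE and priority data) key on. A sketch of the settings portion, using the SETTINGS identifiers from RFC 7540:

```python
# HTTP/2 SETTINGS identifiers from RFC 7540
SETTINGS_IDS = {
    "HEADER_TABLE_SIZE": 1,
    "MAX_CONCURRENT_STREAMS": 3,
    "INITIAL_WINDOW_SIZE": 4,
    "MAX_HEADER_LIST_SIZE": 6,
}

def settings_fingerprint(settings: dict) -> str:
    """Render SETTINGS as an 'id:value' list, in the style of the
    settings portion of Akamai-type HTTP/2 fingerprints."""
    return ";".join(
        f"{SETTINGS_IDS[name]}:{value}"
        for name, value in settings.items()
        if name in SETTINGS_IDS
    )

chrome_like = {
    "HEADER_TABLE_SIZE": 65536,
    "MAX_CONCURRENT_STREAMS": 1000,
    "INITIAL_WINDOW_SIZE": 6291456,
    "MAX_HEADER_LIST_SIZE": 262144,
}
# settings_fingerprint(chrome_like) → "1:65536;3:1000;4:6291456;6:262144"
```

If your client emits different values, or the same values in a different order, the fingerprint changes, which is why patching the User-Agent alone never helps at this layer.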
Layer 5: JavaScript Challenges
Canvas Fingerprinting
Sites render hidden canvas elements and hash the result:
```javascript
// What anti-bot JS does:
const canvas = document.createElement('canvas');
const ctx = canvas.getContext('2d');
ctx.textBaseline = "top";
ctx.font = "14px Arial";
ctx.fillText("Hello, world!", 2, 2);
// hashCode() is illustrative — real scripts ship their own hash
const hash = hashCode(canvas.toDataURL());
// Different GPUs/drivers = different hash
```
WebGL Fingerprinting
```javascript
const gl = canvas.getContext('webgl');
const debugInfo = gl.getExtension(
  'WEBGL_debug_renderer_info'
);
// Real browser:
gl.getParameter(debugInfo.UNMASKED_RENDERER_WEBGL);
// → "ANGLE (NVIDIA GeForce GTX 1080 Ti...)"

// Classic headless Chrome (software rendering):
// → "Google SwiftShader" ← DETECTED
// (newer headless builds can expose a real GPU,
//  but the defaults still leak)
```
Behavioral Analysis
Advanced systems track mouse movements, scroll patterns, and typing speed:
```javascript
// What sites collect:
const events = [];
document.addEventListener('mousemove', (e) => {
  events.push({
    type: 'mouse',
    x: e.clientX,
    y: e.clientY,
    t: Date.now()
  });
});

// Bot detection signals:
// - No mouse movement before form submit
// - Perfect straight-line mouse paths
// - Zero scroll events
// - Instant form fill (< 1 second)
// - No focus/blur events on inputs
```
How to simulate human behavior:
```python
import asyncio
import random

async def human_like_interaction(page):
    """Simulate realistic user behavior (Playwright page)."""
    # Random mouse movements
    for _ in range(random.randint(3, 7)):
        x = random.randint(100, 800)
        y = random.randint(100, 600)
        await page.mouse.move(x, y, steps=random.randint(5, 15))
        await asyncio.sleep(random.uniform(0.1, 0.5))
    # Scroll down naturally
    for _ in range(random.randint(2, 4)):
        await page.mouse.wheel(0, random.randint(100, 300))
        await asyncio.sleep(random.uniform(0.5, 1.5))

async def human_type(page, selector, text):
    """Type with human-like delays."""
    await page.click(selector)
    await asyncio.sleep(random.uniform(0.3, 0.8))
    for char in text:
        await page.keyboard.type(char, delay=random.randint(50, 150))
```
Layer 6: CAPTCHA Challenges
The most visible layer. Sites deploy CAPTCHAs when other signals suggest bot activity:
```python
# Unified CAPTCHA handler
async def handle_any_captcha(page, solver) -> bool:
    """Detect and solve any CAPTCHA type."""
    detectors = {
        ".g-recaptcha": {
            "type": "recaptcha_v2",
            "key_attr": "data-sitekey",
        },
        ".h-captcha": {
            "type": "hcaptcha",
            "key_attr": "data-sitekey",
        },
        ".cf-turnstile": {
            "type": "turnstile",
            "key_attr": "data-sitekey",
        },
        "[data-pkey]": {
            "type": "funcaptcha",
            "key_attr": "data-pkey",
        },
    }
    for selector, config in detectors.items():
        el = await page.query_selector(selector)
        if el:
            sitekey = await el.get_attribute(config["key_attr"])
            token = await solver.solve(
                captcha_type=config["type"],
                sitekey=sitekey,
                url=page.url,
            )
            # inject_token: your helper that writes the token
            # into the page (varies by CAPTCHA type)
            await inject_token(page, config["type"], token)
            return True
    return False
```
Layer 7: Application Logic
Honeypot Fields
Hidden form fields that real users never fill:
```html
<!-- Trap for bots -->
<input type="text"
       name="website"
       style="display:none"
       tabindex="-1">
```

```python
# Don't fill fields that are hidden!
async def fill_form_safely(page, data: dict):
    for field, value in data.items():
        el = await page.query_selector(f'input[name="{field}"]')
        if el and await el.is_visible():
            await el.fill(value)
        # Skip hidden fields — they're traps
```
Timing Analysis
Sites track how fast you interact:
```python
# Too fast → bot
# A real user takes 5-30 seconds to fill a form
# A bot fills it in < 1 second

async def realistic_form_fill(page, data):
    # Wait before starting (reading the page)
    await asyncio.sleep(random.uniform(2, 5))
    for field, value in data.items():
        await page.click(f'input[name="{field}"]')
        await asyncio.sleep(random.uniform(0.5, 1.5))
        await page.type(
            f'input[name="{field}"]', value,
            delay=random.randint(30, 100),
        )
        await asyncio.sleep(random.uniform(0.3, 0.8))
    # Pause before submitting
    await asyncio.sleep(random.uniform(1, 3))
```
Building a Complete Anti-Detection Stack
```python
class StealthScraper:
    """Scraper that handles all detection layers.

    ResidentialProxyPool, TLSClient, BrowserPool, apply_stealth,
    get_target_region and extract_data are placeholders for your
    own infrastructure.
    """

    def __init__(self):
        self.proxy_pool = ResidentialProxyPool()       # Layer 3
        self.tls_client = TLSClient("chrome120")       # Layer 4
        self.browser_pool = BrowserPool(stealth=True)  # Layer 5
        self.captcha_solver = CaptchaSolver(           # Layer 6
            api_base="https://www.passxapi.com"
        )

    async def scrape(self, url: str, form_data: dict | None = None) -> dict:
        # Layer 3: Get clean proxy
        proxy = await self.proxy_pool.get(region=get_target_region(url))
        async with self.browser_pool.get_page(proxy=proxy) as page:
            # Layer 5: Apply stealth patches
            await apply_stealth(page)
            # Layer 7: Simulate human behavior
            await human_like_interaction(page)
            # Navigate
            await page.goto(url)
            # Layer 6: Handle CAPTCHAs
            await handle_any_captcha(page, self.captcha_solver)
            # Layer 7: Avoid honeypots
            if form_data:
                await fill_form_safely(page, form_data)
            return await extract_data(page)
```
Detection Layer Priorities
Not all layers matter equally. Focus on what catches you:
| Layer | Detection Rate | Effort to Bypass | Priority |
|---|---|---|---|
| IP reputation | 30% | Low (proxies) | High |
| TLS fingerprint | 25% | Low (curl_cffi) | High |
| CAPTCHA | 20% | Medium (API solver) | High |
| JS fingerprint | 15% | Medium (stealth) | Medium |
| Behavioral | 5% | High (simulation) | Low |
| Honeypots | 3% | Low (skip hidden) | Low |
| Header order | 2% | Low (manual) | Low |
Start from the top. Most scrapers get blocked at layers 3-4, not 5-7.
Key Takeaways
- Detection is layered — fixing one layer while ignoring others won't work
- Start with network + TLS — these catch 55% of bots before JS even runs
- CAPTCHAs are the visible layer — but they're triggered by invisible signals
- Behavioral analysis is growing — mouse movement and timing matter more each year
- Test your stealth — use bot detection sites to audit your setup
- Always have a CAPTCHA solver — even perfect stealth can't avoid all challenges
For handling the CAPTCHA layer when it triggers, check out passxapi-python — it provides a unified API for reCAPTCHA, hCaptcha, Turnstile, and FunCaptcha, so you can focus on the other layers.
Which detection layer causes you the most trouble? Share your experience in the comments.