Web Scraping Without Bans: The Definitive 2026 Anti-Detection Playbook
You built a scraper. It works. Then, slowly or suddenly, it stops. 403s. CAPTCHAs. Infinite redirects. The target site learned your pattern.
This is the reality of web scraping at scale. The techniques in this guide are the result of building and running 30+ production scrapers — from contact info extractors handling 831 runs to LinkedIn job scrapers navigating aggressive bot protection. This is what actually works to stay operational.
How Sites Detect Scrapers (The Attacker's View)
Before defending, you need to understand the detection stack:
Layer 1: Network Layer
- IP reputation — Datacenter IPs are flagged immediately. AWS, DigitalOcean, Linode IPs are in known ranges.
- Geographic inconsistency — If your IP claims to be in Germany but your TLS fingerprint is from a VPN exit in Romania, that's a signal.
- ASN history — Cloudflare and Google maintain lists of ASN patterns for cloud providers.
Layer 2: HTTP Protocol Layer
- TLS fingerprint — Every HTTP client (Python requests, Go net/http, Node axios) has a unique TLS handshake signature. Cloudflare and Akamai fingerprint these.
- HTTP/2 frame ordering — The sequence of HTTP/2 frames differs between clients.
- Header ordering and casing — Real browsers send headers in specific orders with specific casing. `Content-Type` vs `content-type` matters.
- Missing headers — Real browsers send Accept, Accept-Language, Accept-Encoding, Connection, Upgrade-Insecure-Requests. Missing any of these is a bot signal.
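You can see the problem by inspecting what a bare HTTP client sends when you don't override anything — a user agent that announces the library, and no browser-like language header at all:

```python
import requests

# The default header set requests sends out of the box
defaults = dict(requests.utils.default_headers())

print(defaults["User-Agent"])          # "python-requests/<version>" — an instant bot signal
print("Accept-Language" in defaults)   # False — real browsers always send this
```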
Layer 3: Application Layer
- Request rate — Humans don't load 50 pages in 3 seconds.
- Navigation patterns — Real users click links. Scrapers request URLs directly.
- Missing referrer — Opening a product page without a referrer is unusual.
- No mouse/click events — JavaScript-heavy sites track actual user interaction.
Layer 4: Behavioral Layer (Hardest to Fake)
- Mouse movement patterns — Bots move the mouse in straight lines. Real humans move in curves with micro-corrections.
- Scroll behavior — Instant scrolls vs human scroll deceleration.
- Time between actions — Real users read content. Bots don't.
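To make the mouse-movement point concrete, here is a minimal sketch (my own illustration, not any library's API) of how automation tools typically approximate a human path: a quadratic Bézier curve through a random control point, with per-step jitter standing in for micro-corrections:

```python
import random

def curved_path(start, end, steps=20, jitter=2.0):
    """Quadratic Bezier from start to end through a random control point,
    with small jitter on each step to mimic human micro-corrections."""
    (x0, y0), (x2, y2) = start, end
    # Random control point near the midpoint bends the path into a curve
    cx = (x0 + x2) / 2 + random.uniform(-100, 100)
    cy = (y0 + y2) / 2 + random.uniform(-100, 100)
    points = []
    for i in range(steps + 1):
        t = i / steps
        x = (1 - t) ** 2 * x0 + 2 * (1 - t) * t * cx + t ** 2 * x2
        y = (1 - t) ** 2 * y0 + 2 * (1 - t) * t * cy + t ** 2 * y2
        points.append((x + random.uniform(-jitter, jitter),
                       y + random.uniform(-jitter, jitter)))
    points[0], points[-1] = start, end  # land exactly on the endpoints
    return points
```

Feed these points to whatever drives your browser (e.g. successive mouse-move calls) instead of jumping straight from A to B.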
The Anti-Detection Stack (In Order of Impact)
1. Rotate Your User-Agent
This is free and defeats roughly 10–15% of naive bot detection:
```python
import random

BROWSER_UAS = [
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.3 Safari/605.1.15",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:124.0) Gecko/20100101 Firefox/124.0",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
]

HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "Connection": "keep-alive",
    "Upgrade-Insecure-Requests": "1",
    "Sec-Fetch-Dest": "document",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Site": "none",
    "Sec-Fetch-User": "?1",
}

def get_random_headers():
    headers = HEADERS.copy()
    headers["User-Agent"] = random.choice(BROWSER_UAS)
    return headers
```
2. Session Rotation and Cookie Management
Websites track sessions. A single session making 200 requests in 10 minutes is obviously a bot:
```python
import requests
import time
import random

class RotatingSession:
    def __init__(self, max_requests_per_session=30):
        self.max_requests = max_requests_per_session
        self.current_session = None
        self.request_count = 0
        self._new_session()

    def _new_session(self):
        self.current_session = requests.Session()
        self.current_session.headers.update(get_random_headers())
        self.request_count = 0

    def get(self, url, **kwargs):
        if self.request_count >= self.max_requests:
            self._new_session()
        # Add a human-like delay before every request
        time.sleep(random.uniform(1.5, 4.0))
        self.request_count += 1
        return self.current_session.get(url, **kwargs)
```
3. Rate Limiting — The Most Underrated Fix
The single most effective anti-ban technique is also the simplest: slow down:
```python
import time
import random
from collections import deque
from datetime import datetime, timedelta

class AdaptiveRateLimiter:
    def __init__(self, base_delay=2.0, max_delay=30.0):
        self.base_delay = base_delay
        self.max_delay = max_delay
        self.current_delay = base_delay
        self.success_times = deque(maxlen=20)
        self.failure_times = deque(maxlen=10)

    def wait(self):
        jitter = random.uniform(-0.5, 0.5)
        actual = max(0.5, self.current_delay + jitter)
        time.sleep(actual)

    def record_success(self):
        self.success_times.append(datetime.now())
        if len(self.success_times) >= 10:
            # Gradually reduce delay on sustained success
            self.current_delay = max(self.base_delay, self.current_delay * 0.9)

    def record_failure(self, status_code=None):
        self.failure_times.append(datetime.now())
        if status_code in (403, 429):
            # Sharp increase on blocks
            self.current_delay = min(self.max_delay, self.current_delay * 3)
        else:
            self.current_delay = min(self.max_delay, self.current_delay * 1.5)

    def should_wait_longer(self):
        """Check if the last failure was recent."""
        if not self.failure_times:
            return False
        return datetime.now() - self.failure_times[-1] < timedelta(minutes=5)
```
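The backoff math is worth tracing by hand. Starting from the 2-second base, three consecutive 403s multiply the delay by 3 each time until the 30-second cap kicks in:

```python
def next_delay(current, max_delay=30.0, hard_block=True):
    # Mirrors record_failure above: x3 on 403/429, x1.5 otherwise, capped
    factor = 3 if hard_block else 1.5
    return min(max_delay, current * factor)

delay = 2.0
seq = []
for _ in range(3):          # three 403s in a row
    delay = next_delay(delay)
    seq.append(delay)
print(seq)  # [6.0, 18.0, 30.0]
```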
4. Proxy Rotation (Non-Negotiable for Scale)
If you're scraping more than 50 pages/hour from a single domain, you need proxies. Not debatable.
Proxy hierarchy:
| Type | Success Rate | Cost | Use Case |
|---|---|---|---|
| Datacenter | 5–20% | Free–$0.10/IP | Testing only |
| Shared residential | 40–60% | $5–$15/GB | Light scraping |
| Dedicated residential | 70–85% | $10–$30/GB | Production scraping |
| Mobile 4G | 85–95% | $25–$50/GB | Hard targets (LinkedIn, Google) |
| ISP (static residential) | 60–75% | $5–$15/IP/mo | Sustained sessions |
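Most providers expose rotation through a single gateway URL with credentials embedded. The hostname, port, and credentials below are placeholders, not a real provider:

```python
# Hypothetical gateway — substitute your provider's host, port, and credentials
USER, PASSWORD = "scraper1", "s3cret"
GATEWAY = f"http://{USER}:{PASSWORD}@gw.proxy-provider.example:8000"

# requests takes the same URL for both schemes; the gateway assigns a fresh
# exit IP per request (or per sticky session, depending on your plan)
proxies = {"http": GATEWAY, "https": GATEWAY}
print(proxies["https"])
```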
Integration:
```python
import json
import requests

class ProxyRotator:
    def __init__(self, proxy_provider_api):
        self.api = proxy_provider_api
        self.proxy_list = []
        self.current_index = 0

    def get_proxy(self):
        if not self.proxy_list:
            self._refresh_proxies()
        if not self.proxy_list:
            raise RuntimeError("proxy provider returned no proxies")
        proxy = self.proxy_list[self.current_index]
        self.current_index = (self.current_index + 1) % len(self.proxy_list)
        return proxy

    def _refresh_proxies(self):
        # Example: fetch from your proxy provider's API.
        # The response shape varies by provider (Bright Data, Oxylabs, ScraperAPI, etc.)
        response = requests.get(self.api, timeout=10)
        data = json.loads(response.text)
        self.proxy_list = data.get("proxies", [])

    def get_with_proxy(self, url, **kwargs):
        proxy = self.get_proxy()
        proxies = {"http": proxy, "https": proxy}
        return requests.get(url, proxies=proxies, **kwargs)
```
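One thing a plain round-robin rotator doesn't do is drop dead proxies. A small failure-count wrapper (my own sketch, not a provider API) keeps the pool healthy:

```python
class HealthyProxyPool:
    """Round-robin over proxies, evicting any that fail too often."""

    def __init__(self, proxies, max_failures=3):
        self.failures = {p: 0 for p in proxies}
        self.max_failures = max_failures
        self._order = list(proxies)
        self._i = 0

    def get(self):
        if not self._order:
            raise RuntimeError("all proxies evicted — refresh the pool")
        proxy = self._order[self._i % len(self._order)]
        self._i += 1
        return proxy

    def report_failure(self, proxy):
        # Called on 403/429/timeouts; evicts the proxy past the threshold
        self.failures[proxy] += 1
        if self.failures[proxy] >= self.max_failures and proxy in self._order:
            self._order.remove(proxy)
```

Call `report_failure` whenever a request through a proxy gets blocked or times out, and refresh the pool from your provider once it runs dry.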
5. Headless Browser for JavaScript-Heavy Sites
For sites that render content with JavaScript, you need a real browser engine:
```python
import random
import time
from playwright.sync_api import sync_playwright

def scrape_browser(url, anti_detection=True):
    with sync_playwright() as p:
        browser = p.chromium.launch(
            headless=True,
            args=[
                "--disable-blink-features=AutomationControlled",
                "--no-sandbox",
                "--disable-setuid-sandbox",
                "--disable-dev-shm-usage",
                "--disable-accelerated-2d-canvas",
                "--no-first-run",
                "--no-zygote",
                "--disable-gpu",
            ],
        )
        context = browser.new_context(
            user_agent=random.choice(BROWSER_UAS),
            viewport={"width": random.randint(1280, 1920), "height": random.randint(720, 1080)},
            locale="en-US",
            timezone_id="America/New_York",
        )
        page = context.new_page()
        page.goto(url, wait_until="networkidle", timeout=30000)
        if anti_detection:
            # Human-like mouse movement after the page loads
            page.mouse.move(random.randint(100, 700), random.randint(100, 500))
            page.mouse.move(random.randint(200, 800), random.randint(150, 600))
            # Human-like scrolling with pauses
            for _ in range(random.randint(1, 3)):
                page.mouse.wheel(0, random.randint(200, 500))
                time.sleep(random.uniform(0.3, 0.8))
        content = page.content()
        browser.close()
        return content
```
6. Error Handling and Graceful Degradation
```python
import time
import random
import requests

def scrape_with_fallback(url, max_attempts=4):
    """
    Escalate through scraping methods on failure.
    Method 1: Simple requests (fastest, most likely to work)
    Method 2: Requests with full browser headers and a rotating session
    Method 3: Playwright headless browser
    Method 4: Bypass API (ScraperAPI, ZenRows) — not shown here
    """
    # Method 1: Simple
    for attempt in range(max_attempts):
        try:
            r = requests.get(url, timeout=10)
            if r.status_code == 200:
                return {"success": True, "method": "simple", "content": r.text}
            elif r.status_code in (403, 429):
                time.sleep(random.uniform(5, 15))
                continue
            else:
                return {"success": False, "status": r.status_code}
        except Exception:
            time.sleep(random.uniform(2, 5))

    # Method 2: Full headers + session
    for attempt in range(2):
        try:
            session = RotatingSession(max_requests_per_session=5)
            r = session.get(url, timeout=15)
            if r.status_code == 200:
                return {"success": True, "method": "headers+session", "content": r.text}
        except Exception:
            time.sleep(random.uniform(3, 7))

    # Method 3: Browser (expensive but reliable)
    try:
        content = scrape_browser(url)
        return {"success": True, "method": "browser", "content": content}
    except Exception:
        pass

    # Method 4 would call a paid bypass API here as the last resort
    return {"success": False, "error": "all methods failed"}
```
The Apify Approach: Pay for Reliability
All of the above takes time to build and maintain. If your time is worth anything, use Apify actors — they handle the entire anti-detection stack for you.
Our actors use headless browser automation with integrated proxy rotation, session management, and automatic retry logic. You pass in a URL or search query; you get back clean structured data.
contact-info-scraper (831 runs)
Extracts emails, phone numbers, LinkedIn URLs, and social profiles from any business website. Handles Cloudflare, SiteLock, and other common protection systems. Best for B2B lead generation and sales intelligence pipelines.
```python
import requests

# Start an actor run; the request body is the actor's input
result = requests.post(
    "https://api.apify.com/v2/acts/lanky_quantifier~contact-info-scraper/runs",
    json={"url": "https://example.com"},
    headers={"Authorization": f"Bearer {APIFY_API_TOKEN}"},  # your Apify API token
).json()

# Wait for the run to finish, then fetch its dataset
# Returns: emails, phones, social_links, company_info
```
linkedin-job-scraper (14 runs)
Extracts job postings from LinkedIn with salary ranges, requirements, and company info. Handles LinkedIn's aggressive bot protection through integrated residential proxy rotation.
google-serp-scraper (30 runs)
Returns structured search results from Google without triggering rate limiting or CAPTCHA. Returns titles, URLs, snippets, and rich results.
google-maps-scraper (8 runs)
Scrapes business listings from Google Maps including reviews, ratings, phone numbers, and addresses. Bypasses Maps' anti-bot layer.
Architecture: What a Production Pipeline Looks Like
```
Target Site
     │
     ├──► Cloudflare / anti-bot layer
     │
     ▼
Proxy Layer (residential + mobile IPs, rotating)
     │
     ▼
Apify Actor (headless browser + built-in retry)
     │
     ▼
Your Database (clean structured data)
     │
     ▼
Your Application (dashboards, alerts, integrations)
```
For 95%+ of scraping use cases:
- Apify actor handles the hard part ($0.05–$0.50/run)
- You get clean structured JSON, not HTML you have to parse
- No proxy management, no browser automation maintenance
- Actors update when sites change their anti-bot measures
Cost Reality Check
| Approach | Setup Time | Monthly Cost | Reliability | Best For |
|---|---|---|---|---|
| requests + headers | 1 hour | $0 | ~30% success at scale | Single pages, one-time |
| requests + proxies | 1 day | $30–$100 | ~70% success | Light production |
| Playwright + proxies | 2 days | $50–$150 | ~85% success | JS-heavy sites |
| Apify actors | 1 hour | $10–$50 | ~90% success | Production at any scale |
| DIY full stack | 2–4 weeks | $200–$500 | ~95% success | Enterprise, custom needs |
The Pains You Avoid
When your scraper gets blocked, you lose:
- Data freshness — Stale data is often useless data
- Engineering time — Debugging blocks, rotating proxies, updating headers
- Reliability — A scraper that works 60% of the time isn't a business tool
- Scale — You can't grow if you're constantly fighting bans
The anti-detection techniques in this guide solve these problems. The investment is in setup and maintenance. For most teams, the right answer is Apify actors for the infrastructure and internal engineering focused on data processing, not bot fighting.
Quick Wins Checklist
Before you build anything complex, verify you're doing these:
- [ ] User-Agent set to a real browser version (and rotating)
- [ ] All standard headers present (Accept, Accept-Language, Connection)
- [ ] Minimum 1–2 second delay between requests
- [ ] Session cookies reused, not a fresh session per request
- [ ] HTTP status codes logged — 403/429 triggers immediate backoff
- [ ] HTTPS only — sites track protocol downgrade as a signal
- [ ] Referrer header set to a plausible previous page
- [ ] For more than 50 pages/hour: residential proxies configured
- [ ] For JavaScript-heavy sites: Playwright or Apify actor
These nine items take about 2 hours to implement and will eliminate 80% of the blocking issues most scrapers face.
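Several of those items fit in one small helper. The User-Agent string and default referrer below are examples — swap in your own rotation list:

```python
import random
import time

def polite_headers(referer="https://www.google.com/"):
    """Browser-like headers with a plausible referrer (checklist items 1, 2, and 7)."""
    return {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Connection": "keep-alive",
        "Referer": referer,
    }

def polite_delay(minimum=1.0, maximum=2.5):
    """Checklist item 3: never fire requests back-to-back."""
    time.sleep(random.uniform(minimum, maximum))
```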
Take the next step
Skip the setup. Production-ready tools for anti-detection scraping:
Apify Scrapers Bundle — $29 one-time
Instant download. Documented. Ready to deploy.