Vhub Systems

Web Scraping Without Bans: The Definitive 2026 Anti-Detection Playbook

You built a scraper. It works. Then, slowly or suddenly, it stops. 403s. CAPTCHAs. Infinite redirects. The target site learned your pattern.

This is the reality of web scraping at scale. The techniques in this guide are the result of building and running 30+ production scrapers — from contact info extractors handling 831 runs to LinkedIn job scrapers navigating aggressive bot protection. This is what actually works to stay operational.


How Sites Detect Scrapers (The Attacker's View)

Before defending, you need to understand the detection stack:

Layer 1: Network Layer

  • IP reputation — Datacenter IPs are flagged immediately. AWS, DigitalOcean, Linode IPs are in known ranges.
  • Geographic inconsistency — If your IP claims to be in Germany but your TLS fingerprint is from a VPN exit in Romania, that's a signal.
  • ASN history — Cloudflare and Google maintain lists of ASN patterns for cloud providers.

Layer 2: HTTP Protocol Layer

  • TLS fingerprint — Every HTTP client (Python requests, Go net/http, Node axios) has a unique TLS handshake signature. Cloudflare and Akamai fingerprint these.
  • HTTP/2 frame ordering — The sequence of HTTP/2 frames differs between clients.
  • Header ordering and casing — Real browsers send headers in a specific order with specific casing. Content-Type versus content-type matters.
  • Missing headers — Real browsers send Accept, Accept-Language, Accept-Encoding, Connection, Upgrade-Insecure-Requests. Missing any of these is a bot signal.
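You can see exactly what a naive client leaks by inspecting the defaults a stock requests session sends (the version number in the User-Agent will vary with your install):

```python
import requests

# A fresh Session with no customization — these headers go out on every request.
session = requests.Session()
print(dict(session.headers))
# The "python-requests/x.y.z" User-Agent, plus the absence of Accept-Language
# and the Sec-Fetch-* family, makes this client trivially identifiable.
```

Detection systems don't need anything fancy here; matching on the default User-Agent alone catches a large share of unmodified scripts.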

Layer 3: Application Layer

  • Request rate — Humans don't load 50 pages in 3 seconds.
  • Navigation patterns — Real users click links. Scrapers request URLs directly.
  • Missing referrer — Opening a product page without a referrer is unusual.
  • No mouse/click events — JavaScript-heavy sites track actual user interaction.

Layer 4: Behavioral Layer (Hardest to Fake)

  • Mouse movement patterns — Bots move the mouse in straight lines. Real humans move in curves with micro-corrections.
  • Scroll behavior — Instant scrolls vs human scroll deceleration.
  • Time between actions — Real users read content. Bots don't.
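To make the straight-line point concrete, here is a minimal sketch of generating a curved, slightly noisy path between two points — a quadratic Bezier with jitter. The function name and constants are my own, not from any library:

```python
import random

def human_mouse_path(start, end, steps=25):
    """Quadratic Bezier from start to end with small random jitter,
    approximating the curved, imperfect path of a human mouse move."""
    (x0, y0), (x1, y1) = start, end
    # A random control point offset from the midpoint bends the path.
    cx = (x0 + x1) / 2 + random.uniform(-100, 100)
    cy = (y0 + y1) / 2 + random.uniform(-100, 100)
    points = []
    for i in range(steps + 1):
        t = i / steps
        x = (1 - t) ** 2 * x0 + 2 * (1 - t) * t * cx + t ** 2 * x1
        y = (1 - t) ** 2 * y0 + 2 * (1 - t) * t * cy + t ** 2 * y1
        # Micro-jitter mimics hand tremor and small corrections.
        points.append((x + random.uniform(-2, 2), y + random.uniform(-2, 2)))
    return points

path = human_mouse_path((100, 100), (600, 400))
```

Feeding each point to a browser automation tool's mouse-move call, with small random delays between steps, is far harder to distinguish from a human than a single straight-line move.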

The Anti-Detection Stack (In Order of Impact)

1. Rotate Your User-Agent

This is free and gets past 10–15% of naive bot detection:

import random

BROWSER_UAS = [
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.3 Safari/605.1.15",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:124.0) Gecko/20100101 Firefox/124.0",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
]

HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "Connection": "keep-alive",
    "Upgrade-Insecure-Requests": "1",
    "Sec-Fetch-Dest": "document",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Site": "none",
    "Sec-Fetch-User": "?1",
}

def get_random_headers():
    headers = HEADERS.copy()
    headers["User-Agent"] = random.choice(BROWSER_UAS)
    return headers

2. Session Rotation and Cookie Management

Websites track sessions. A single session making 200 requests in 10 minutes is obviously a bot:

import requests
import time
import random

class RotatingSession:
    """Rotates to a fresh requests.Session (new cookies, new User-Agent)
    after a fixed number of requests. Uses get_random_headers() from above."""

    def __init__(self, max_requests_per_session=30):
        self.max_requests = max_requests_per_session
        self.current_session = None
        self.request_count = 0
        self._new_session()

    def _new_session(self):
        self.current_session = requests.Session()
        self.current_session.headers.update(get_random_headers())
        self.request_count = 0

    def get(self, url, **kwargs):
        if self.request_count >= self.max_requests:
            self._new_session()

        # Add human-like delay
        time.sleep(random.uniform(1.5, 4.0))

        self.request_count += 1
        return self.current_session.get(url, **kwargs)

3. Rate Limiting — The Most Underrated Fix

The single most effective anti-ban technique is also the simplest: slow down:

import time
import random
from collections import deque
from datetime import datetime, timedelta

class AdaptiveRateLimiter:
    def __init__(self, base_delay=2.0, max_delay=30.0):
        self.base_delay = base_delay
        self.max_delay = max_delay
        self.current_delay = base_delay
        self.success_times = deque(maxlen=20)
        self.failure_times = deque(maxlen=10)

    def wait(self):
        jitter = random.uniform(-0.5, 0.5)
        actual = max(0.5, self.current_delay + jitter)
        time.sleep(actual)

    def record_success(self):
        self.success_times.append(datetime.now())
        if len(self.success_times) >= 10:
            # Gradually reduce delay on sustained success
            self.current_delay = max(self.base_delay, self.current_delay * 0.9)

    def record_failure(self, status_code=None):
        self.failure_times.append(datetime.now())
        if status_code in (403, 429):
            # Sharp increase on blocks
            self.current_delay = min(self.max_delay, self.current_delay * 3)
        else:
            self.current_delay = min(self.max_delay, self.current_delay * 1.5)

    def should_wait_longer(self):
        """Check if last failure was recent."""
        if not self.failure_times:
            return False
        return datetime.now() - self.failure_times[-1] < timedelta(minutes=5)
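The core idea in record_failure — delays that grow multiplicatively, capped, with randomness — is worth knowing in its simplest stand-alone form: exponential backoff with full jitter per retry attempt. The function name and constants below are my own:

```python
import random

def backoff_delay(attempt, base=2.0, cap=30.0):
    """Exponential backoff with full jitter: the ceiling doubles each
    attempt (capped), and the actual sleep is drawn uniformly below it
    so that many workers don't retry in lockstep."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))

# Attempt 0 sleeps up to 2s, attempt 1 up to 4s, ... attempt 4+ up to the 30s cap.
for attempt in range(5):
    print(round(backoff_delay(attempt), 2))
```

The jitter matters as much as the growth: a fleet of scrapers that all retry exactly 30 seconds after a 429 produces a synchronized traffic spike that is itself a detection signal.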

4. Proxy Rotation (Non-Negotiable for Scale)

If you're scraping more than 50 pages/hour from a single domain, you need proxies. Not debatable.

Proxy hierarchy:

Type                   | Success Rate | Cost           | Use Case
Datacenter             | 5–20%        | Free–$0.10/IP  | Testing only
Shared residential     | 40–60%       | $5–$15/GB      | Light scraping
Dedicated residential  | 70–85%       | $10–$30/GB     | Production scraping
Mobile 4G              | 85–95%       | $25–$50/GB     | Hard targets (LinkedIn, Google)
ISP (static datacenter)| 60–75%       | $5–$15/IP/mo   | Sustained sessions

Integration:

import requests

class ProxyRotator:
    def __init__(self, proxy_provider_api):
        self.api = proxy_provider_api
        self.proxy_list = []
        self.current_index = 0

    def get_proxy(self):
        if not self.proxy_list:
            self._refresh_proxies()

        proxy = self.proxy_list[self.current_index]
        self.current_index = (self.current_index + 1) % len(self.proxy_list)
        return proxy

    def _refresh_proxies(self):
        # Example: fetch from your proxy provider's API.
        # The endpoint and response shape vary by provider
        # (Bright Data, Oxylabs, ScraperAPI, etc.).
        response = requests.get(self.api, timeout=10)
        self.proxy_list = response.json().get("proxies", [])

    def get_with_proxy(self, url, **kwargs):
        proxy = self.get_proxy()
        proxies = {"http": proxy, "https": proxy}
        return requests.get(url, proxies=proxies, **kwargs)
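ProxyRotator above cycles blindly; in production you also want to retire proxies that keep failing, since a burned IP drags down your success rate on every pass through the list. A minimal sketch — the class name and failure threshold are my own:

```python
import random
from collections import defaultdict

class ProxyHealth:
    """Track per-proxy failures and retire proxies that fail repeatedly.
    Illustrative sketch; tune max_failures to your target's strictness."""

    def __init__(self, proxies, max_failures=3):
        self.proxies = list(proxies)
        self.failures = defaultdict(int)
        self.max_failures = max_failures

    def pick(self):
        # Random choice avoids the predictable round-robin ordering.
        return random.choice(self.proxies)

    def report_failure(self, proxy):
        self.failures[proxy] += 1
        if self.failures[proxy] >= self.max_failures and len(self.proxies) > 1:
            self.proxies.remove(proxy)  # retire the burned exit

    def report_success(self, proxy):
        self.failures[proxy] = 0  # a success resets the strike count
```

Call report_failure on 403/429 responses and connection errors, and report_success on 200s; the pool self-cleans as proxies burn out.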

5. Headless Browser for JavaScript-Heavy Sites

For sites that render content with JavaScript, you need a real browser engine:

from playwright.sync_api import sync_playwright
import random
import time

def scrape_browser(url, anti_detection=True):
    with sync_playwright() as p:
        browser = p.chromium.launch(
            headless=True,
            args=[
                "--disable-blink-features=AutomationControlled",
                "--no-sandbox",
                "--disable-setuid-sandbox",
                "--disable-dev-shm-usage",
                "--disable-accelerated-2d-canvas",
                "--no-first-run",
                "--no-zygote",
                "--disable-gpu",
            ]
        )

        # Randomize viewport and UA so repeated runs don't share a fingerprint
        context_args = {
            "user_agent": random.choice(BROWSER_UAS),
            "viewport": {"width": random.randint(1280, 1920), "height": random.randint(720, 1080)},
            "locale": "en-US",
            "timezone_id": "America/New_York",
        }

        context = browser.new_context(**context_args)
        page = context.new_page()

        if anti_detection:
            # Hide the navigator.webdriver flag that automated Chromium exposes
            page.add_init_script(
                "Object.defineProperty(navigator, 'webdriver', {get: () => undefined})"
            )

        # Human-like mouse movement
        page.mouse.move(random.randint(100, 700), random.randint(100, 500))
        page.mouse.move(random.randint(200, 800), random.randint(150, 600))

        page.goto(url, wait_until="networkidle", timeout=30000)

        # Human-like scroll
        for _ in range(random.randint(1, 3)):
            page.mouse.wheel(0, random.randint(200, 500))
            time.sleep(random.uniform(0.3, 0.8))

        content = page.content()
        browser.close()
        return content

6. Error Handling and Graceful Degradation

import time
import random
import requests

def scrape_with_fallback(url, max_attempts=4):
    """
    Escalate through scraping methods on failure.
    Method 1: Simple requests (fastest, most likely to work)
    Method 2: Requests with full browser headers and session rotation
    Method 3: Playwright headless browser
    (A fourth tier — a bypass API such as ScraperAPI or ZenRows — can be
    bolted on after Method 3 for the hardest targets.)
    """

    # Method 1: Simple
    for attempt in range(max_attempts):
        try:
            r = requests.get(url, timeout=10)
            if r.status_code == 200:
                return {"success": True, "method": "simple", "content": r.text}
            elif r.status_code in (403, 429):
                time.sleep(random.uniform(5, 15))
                continue
            else:
                return {"success": False, "status": r.status_code}
        except requests.RequestException:
            time.sleep(random.uniform(2, 5))

    # Method 2: Full headers + session (RotatingSession from section 2)
    for attempt in range(2):
        try:
            session = RotatingSession(max_requests_per_session=5)
            r = session.get(url, timeout=15)
            if r.status_code == 200:
                return {"success": True, "method": "headers+session", "content": r.text}
        except requests.RequestException:
            time.sleep(random.uniform(3, 7))

    # Method 3: Browser (expensive but reliable)
    try:
        content = scrape_browser(url)
        return {"success": True, "method": "browser", "content": content}
    except Exception:
        pass

    return {"success": False, "error": "all methods failed"}

The Apify Approach: Pay for Reliability

All of the above takes time to build and maintain. If your time is worth anything, use Apify actors — they handle the entire anti-detection stack for you.

Our actors use headless browser automation with integrated proxy rotation, session management, and automatic retry logic. You pass in a URL or search query; you get back clean structured data.

contact-info-scraper (831 runs)

Extracts emails, phone numbers, LinkedIn URLs, and social profiles from any business website. Handles Cloudflare, SiteLock, and other common protection systems. Best for B2B lead generation and sales intelligence pipelines.

import requests

result = requests.post(
    "https://api.apify.com/v2/acts/lanky_quantifier~contact-info-scraper/runs",
    json={"input": {"url": "https://example.com"}},
    headers={"Authorization": f"Bearer {APIFY_API_TOKEN}"}
).json()

# Wait for completion, fetch dataset
# Returns: emails, phones, social_links, company_info

linkedin-job-scraper (14 runs)

Extracts job postings from LinkedIn with salary ranges, requirements, and company info. Handles LinkedIn's aggressive bot protection through integrated residential proxy rotation.

google-serp-scraper (30 runs)

Returns structured search results from Google without triggering rate limiting or CAPTCHA. Returns titles, URLs, snippets, and rich results.

google-maps-scraper (8 runs)

Scrapes business listings from Google Maps including reviews, ratings, phone numbers, and addresses. Bypasses Maps' anti-bot layer.


Architecture: What a Production Pipeline Looks Like

Target Site
    │
    ├──► Cloudflare / anti-bot layer
    │
    ▼
Proxy Layer (residential + mobile IPs, rotating)
    │
    ▼
Apify Actor (headless browser + built-in retry)
    │
    ▼
Your Database (clean structured data)
    │
    ▼
Your Application (dashboards, alerts, integrations)

For 95%+ of scraping use cases:

  1. Apify actor handles the hard part ($0.05–$0.50/run)
  2. You get clean structured JSON, not HTML you have to parse
  3. No proxy management, no browser automation maintenance
  4. Actors update when sites change their anti-bot measures

Cost Reality Check

Approach              | Setup Time | Monthly Cost | Reliability           | Best For
requests + headers    | 1 hour     | $0           | ~30% success at scale | Single pages, one-time
requests + proxies    | 1 day      | $30–$100     | ~70% success          | Light production
Playwright + proxies  | 2 days     | $50–$150     | ~85% success          | JS-heavy sites
Apify actors          | 1 hour     | $10–$50      | ~90% success          | Production at any scale
DIY full stack        | 2–4 weeks  | $200–$500    | ~95% success          | Enterprise, custom needs

The Pains You Avoid

When your scraper gets blocked, you lose:

  • Data freshness — Stale data is often useless data
  • Engineering time — Debugging blocks, rotating proxies, updating headers
  • Reliability — A scraper that works 60% of the time isn't a business tool
  • Scale — You can't grow if you're constantly fighting bans

The anti-detection techniques in this guide solve these problems. The investment is in setup and maintenance. For most teams, the right answer is Apify actors for the infrastructure and internal engineering focused on data processing, not bot fighting.


Quick Wins Checklist

Before you build anything complex, verify you're doing these:

  • [ ] User-Agent set to a real browser version (and rotating)
  • [ ] All standard headers present (Accept, Accept-Language, Connection)
  • [ ] Minimum 1–2 second delay between requests
  • [ ] Session cookies reused, not a fresh session per request
  • [ ] HTTP status codes logged — 403/429 triggers immediate backoff
  • [ ] HTTPS only — sites track protocol downgrade as a signal
  • [ ] Referrer header set to a plausible previous page
  • [ ] For more than 50 pages/hour: residential proxies configured
  • [ ] For JavaScript-heavy sites: Playwright or Apify actor

These nine items will take you 2 hours to implement and will eliminate 80% of the blocking issues most scrapers face.
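Most of the checklist fits in a dozen lines. A minimal sketch — the URL, Referer, and delay constants below are placeholders, not recommendations for any specific site:

```python
import random
import time
import requests

# One long-lived Session: cookies are reused instead of a fresh session per request.
session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Referer": "https://example.com/",  # plausible previous page
})

def polite_get(url):
    time.sleep(random.uniform(1.0, 2.0))    # minimum delay between requests
    resp = session.get(url, timeout=10)     # HTTPS URLs only
    if resp.status_code in (403, 429):      # block signals: back off immediately
        time.sleep(random.uniform(10, 30))
    return resp
```

This alone won't get you through Cloudflare on a hard target, but it covers the free wins before you reach for proxies or a headless browser.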


Take the next step

Skip the setup. Production-ready tools for anti-detection scraping:

Apify Scrapers Bundle — $29 one-time

Instant download. Documented. Ready to deploy.
