Vhub Systems

Web Scraping Without Getting Banned in 2026: The Complete Anti-Bot Bypass Guide

Getting blocked is the #1 frustration for web scrapers. You write the code, it works for 10 minutes, then you're staring at a 403 or a Cloudflare challenge page. This guide covers every technique that actually works in 2026 — from basic rate limiting to defeating Turnstile — so you can scrape 100 to 50,000 records without getting banned.

Why You're Getting Blocked: The Real Reasons

Before fixes, you need to understand what's detecting you. Modern anti-bot systems check multiple signals simultaneously:

TLS Fingerprint — Your Python requests library sends a TLS handshake that looks nothing like a browser's. Anti-bot services like Cloudflare identify it in milliseconds.

HTTP/2 Fingerprint — Browsers use HTTP/2 with specific frame ordering. requests uses HTTP/1.1 by default, which is an instant giveaway.

Browser fingerprint — Headless Chrome has detectable properties: navigator.webdriver=true, missing plugins, wrong screen dimensions, no GPU renderer.

Behavioral signals — Too fast, too regular, no mouse movement, no scroll events, straight-line navigation patterns.

IP reputation — Datacenter IPs (AWS, GCP, Azure, Hetzner) are pre-blocked on most serious sites. Even residential IPs get flagged if they hit too fast.

Knowing which layer is detecting you tells you what to fix.
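As a rough first diagnostic, the response you get back often hints at which layer caught you. Below is a minimal heuristic sketch; the status-code-to-layer mapping is an assumption based on common patterns, not an exhaustive classifier:

```python
def diagnose_block(status_code, body=""):
    """Heuristic guess at which detection layer fired, based on the response.

    The mapping below is a rule of thumb, not a guarantee — always confirm
    by inspecting the actual response in your browser or logs.
    """
    body = body.lower()
    if status_code == 429:
        return "rate-limit"      # behavioral: you're simply too fast
    if status_code == 403 and ("cloudflare" in body or "cf-ray" in body):
        return "tls-or-ip"       # rejected before your request logic even ran
    if status_code == 503 and "challenge" in body:
        return "js-challenge"    # a browser-based check is required
    if status_code == 200 and len(body) < 500 and "captcha" in body:
        return "captcha"         # served a challenge page instead of content
    return "unknown"

print(diagnose_block(429))               # rate-limit
print(diagnose_block(403, "cf-ray: 1"))  # tls-or-ip
```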


Layer 1: TLS and HTTP Fingerprinting (Fix This First)

The single biggest win for most scrapers: stop using plain requests.

Use curl-cffi to impersonate real browsers

from curl_cffi import requests

session = requests.Session()

# Impersonate Chrome 120 — matches real browser TLS fingerprint
response = session.get(
    "https://target-site.com/data",
    impersonate="chrome120"
)

# Other options: chrome110, safari17_0, edge101
print(response.status_code)  # 200 instead of 403

Install: pip install curl-cffi

curl-cffi uses libcurl under the hood and replicates the exact TLS cipher suite order, extension list, and HTTP/2 SETTINGS frame that Chrome sends. Many sites that block requests outright will pass curl-cffi straight through.

httpx with HTTP/2

For sites that check HTTP version but not TLS fingerprint deeply:

import httpx

with httpx.Client(http2=True) as client:
    headers = {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Accept-Encoding": "gzip, deflate, br",
        "Sec-Fetch-Dest": "document",
        "Sec-Fetch-Mode": "navigate",
        "Sec-Fetch-Site": "none",
    }
    response = client.get("https://target-site.com", headers=headers)

Layer 2: Rotating Proxies the Right Way

Most scrapers use proxies wrong. Here's what actually works:

Proxy type matters

Proxy Type                 Detection Risk   Cost          Good For
Datacenter                 Very High        Low           Public data, no protection
Residential                Low              Medium        Most protected sites
Mobile (4G)                Very Low         High          Strictest anti-bot
ISP (static residential)   Low              Medium-High   Consistent sessions

For 100-5000 records on a protected site, residential proxies are the sweet spot.
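The table above can be folded into a small selection helper. This is purely illustrative — the function name and the `protection` labels are mine, but the tiers follow the table:

```python
def pick_proxy_type(protection, need_sticky_session=False):
    """Map a site's protection level to a proxy tier, following the table above.

    `protection` is one of: "none", "moderate", "strict" (labels are illustrative).
    """
    if protection == "none":
        return "datacenter"      # cheap; fine when nothing is checking IPs
    if protection == "strict":
        return "mobile"          # 4G IPs sit behind carrier NAT, hardest to block
    if need_sticky_session:
        return "isp"             # static residential: same IP across requests
    return "residential"         # the sweet spot for most protected sites

print(pick_proxy_type("moderate"))  # residential
```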

Python proxy rotation with backoff

import requests
import time
import random
from itertools import cycle

PROXIES = [
    "http://user:pass@proxy1:8080",
    "http://user:pass@proxy2:8080",
    "http://user:pass@proxy3:8080",
]

proxy_pool = cycle(PROXIES)

def scrape_with_retry(url, max_retries=3):
    for attempt in range(max_retries):
        proxy = next(proxy_pool)
        try:
            response = requests.get(
                url,
                proxies={"http": proxy, "https": proxy},
                timeout=15,
                headers={"User-Agent": get_random_ua()}
            )
            if response.status_code == 200:
                return response
            if response.status_code in (403, 429):
                # Blocked or rate limited — exponential backoff with jitter,
                # then retry through the next proxy in the pool
                time.sleep(2 ** attempt + random.uniform(0, 1))
        except requests.RequestException:
            # Dead proxy or timeout — brief pause, then rotate
            time.sleep(1)
    return None

def get_random_ua():
    agents = [
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    ]
    return random.choice(agents)

Request timing — the key that most guides skip

import time
import random

def human_delay(min_sec=1.5, max_sec=4.0):
    """Simulate human browsing pace"""
    time.sleep(random.uniform(min_sec, max_sec))

# Between pages
human_delay(2, 5)

# Between sites (session warm-up)
human_delay(5, 10)

  • At 100 records: comfortable at 1 request/2 seconds
  • At 1000 records: use 3-5 second delays + proxy rotation
  • At 5000+ records: rotate proxies every 50-100 requests, add session reuse
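The volume guidance above can be wrapped into one jittered scheduler. The thresholds mirror the numbers above; treat them as starting points, not rules:

```python
import random

def delay_for(total_records):
    """Return a randomized per-request delay in seconds, based on job size."""
    if total_records <= 100:
        lo, hi = 1.5, 2.5     # comfortable at ~1 request / 2 seconds
    elif total_records <= 1000:
        lo, hi = 3.0, 5.0     # pair this with proxy rotation
    else:
        lo, hi = 5.0, 10.0    # 5000+: also rotate proxies every 50-100 requests
    return random.uniform(lo, hi)

d = delay_for(1000)
print(3.0 <= d <= 5.0)  # True
```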


Layer 3: Playwright for JavaScript-Heavy Sites

When the target site requires JavaScript execution (React, Vue, Angular), you need a real browser:

Stealth Playwright setup

from playwright.sync_api import sync_playwright
import time, random

def create_stealth_browser():
    p = sync_playwright().start()
    browser = p.chromium.launch(
        headless=True,
        args=[
            "--disable-blink-features=AutomationControlled",
            "--no-sandbox",
            "--disable-dev-shm-usage",
        ]
    )
    context = browser.new_context(
        user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
        viewport={"width": 1280, "height": 800},
        locale="en-US",
        timezone_id="America/New_York",
        # Add proxy here if needed:
        # proxy={"server": "http://proxy:8080", "username": "u", "password": "p"}
    )

    # Patch webdriver detection
    context.add_init_script("""
        Object.defineProperty(navigator, 'webdriver', {get: () => undefined});
        Object.defineProperty(navigator, 'plugins', {get: () => [1, 2, 3]});
        window.chrome = {runtime: {}};
    """)

    return p, browser, context

def scrape_page(url):
    p, browser, context = create_stealth_browser()
    page = context.new_page()

    # Human-like: navigate, wait, scroll
    page.goto(url, wait_until="networkidle")
    time.sleep(random.uniform(1.5, 3))

    # Scroll to simulate reading
    page.evaluate("window.scrollBy(0, 300)")
    time.sleep(random.uniform(0.5, 1.5))

    content = page.content()
    browser.close()
    p.stop()
    return content

Install: pip install playwright && playwright install chromium


Layer 4: Defeating Cloudflare Turnstile

Cloudflare Turnstile (the non-interactive "I'm not a robot" check) is the hardest challenge in 2026. It runs JavaScript fingerprinting, behavior analysis, and sometimes visual challenges. There are three viable approaches:

Option A: Avoid it entirely (fastest, free)

Many sites have unprotected API endpoints even when the HTML is protected:

import requests, json

# Instead of scraping the HTML page:
# https://shop.example.com/products

# Try the API directly:
response = requests.get(
    "https://shop.example.com/api/products",
    headers={"Accept": "application/json"}
)

# Or the GraphQL endpoint:
response = requests.post(
    "https://shop.example.com/graphql",
    json={"query": "{ products { id name price } }"}
)

Open your browser's Network tab, filter for XHR/Fetch requests, and look for JSON responses. A surprisingly large share of protected sites expose clean internal APIs this way.
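One way to speed up that hunt is to generate a list of conventional endpoint paths to probe before reaching for a browser. The path list below is a guess at common conventions — not something every site exposes — and the helper name is mine:

```python
from urllib.parse import urlparse

# Common API path conventions — an assumption, verify each in the Network tab
COMMON_API_PATHS = [
    "/api/products",
    "/api/v1/products",
    "/graphql",
    "/wp-json/wp/v2/posts",
]

def api_candidates(page_url):
    """Build likely JSON endpoint URLs for a page from common conventions."""
    parsed = urlparse(page_url)
    base = f"{parsed.scheme}://{parsed.netloc}"
    return [base + path for path in COMMON_API_PATHS]

print(api_candidates("https://shop.example.com/products")[0])
# https://shop.example.com/api/products
```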

Option B: Turnstile solving services ($0.001-$0.002 per solve)

When you must solve the challenge:

import requests
import time

SOLVER_API_KEY = "your_2captcha_or_anticaptcha_key"

def solve_turnstile(page_url, sitekey):
    # Submit task
    r = requests.post("https://api.2captcha.com/createTask", json={
        "clientKey": SOLVER_API_KEY,
        "task": {
            "type": "TurnstileTaskProxyless",
            "websiteURL": page_url,
            "websiteKey": sitekey
        }
    })
    task_id = r.json()["taskId"]

    # Poll for result
    for _ in range(30):
        time.sleep(5)
        result = requests.post("https://api.2captcha.com/getTaskResult", json={
            "clientKey": SOLVER_API_KEY,
            "taskId": task_id
        }).json()

        if result["status"] == "ready":
            return result["solution"]["token"]

    raise Exception("Solving timeout")

# Use the token in your request
token = solve_turnstile("https://target.com", "0x4AAAAAAABxxxxxxx")
response = requests.post(
    "https://target.com/submit",
    data={"cf-turnstile-response": token, "other_field": "value"}
)

Cost at scale: 1000 solves = ~$1-2. Services: 2captcha, Anti-Captcha, CapSolver.

Option C: Headless browser with stealth

For low-volume scraping where you need to interact with the full page:

from playwright.sync_api import sync_playwright
import time

def bypass_turnstile_page(url):
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)  # False helps pass more checks
        context = browser.new_context(
            user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)...",
        )
        context.add_init_script("""
            Object.defineProperty(navigator, 'webdriver', {get: () => undefined});
        """)
        page = context.new_page()
        page.goto(url)

        # Wait for the Turnstile token to populate (it auto-solves for real
        # browsers); on success the token lands in a hidden input on the page
        page.wait_for_function(
            """() => {
                const el = document.querySelector('input[name="cf-turnstile-response"]');
                return el && el.value.length > 0;
            }""",
            timeout=30000
        )

        # Now extract whatever you need
        data = page.evaluate("() => document.querySelector('#data').innerText")
        browser.close()
        return data

Layer 5: Session Management and Cookies

The pattern that breaks most scrapers: treating every request as stateless.

import requests
import json
import time
import random

class PersistentSession:
    def __init__(self, proxy=None):
        self.session = requests.Session()
        if proxy:
            self.session.proxies = {"http": proxy, "https": proxy}
        self.session.headers.update({
            "User-Agent": "Mozilla/5.0 (Macintosh...) Chrome/120.0.0.0 Safari/537.36",
            "Accept-Language": "en-US,en;q=0.9",
        })

    def warm_up(self, base_url):
        """Visit homepage first to build cookie session"""
        self.session.get(base_url)
        time.sleep(random.uniform(2, 4))

    def get(self, url):
        return self.session.get(url)

    def save_cookies(self, path):
        # Cookie objects aren't JSON-serializable — persist just the fields we need
        with open(path, "w") as f:
            json.dump(
                [{"name": c.name, "value": c.value, "domain": c.domain, "path": c.path}
                 for c in self.session.cookies],
                f,
            )

    def load_cookies(self, path):
        with open(path) as f:
            for cookie in json.load(f):
                self.session.cookies.set(**cookie)

# Usage
scraper = PersistentSession(proxy="http://user:pass@residential-proxy:8080")
scraper.warm_up("https://target-site.com")  # Build session
data = scraper.get("https://target-site.com/data/page/1")

Key points:

  • Always visit the homepage before the target page
  • Reuse the same session for a site (keep cookies)
  • Add Referer headers that match the actual navigation path
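The Referer point can be automated: derive each request's Referer from the previous URL in your crawl path, so every hop looks like a click-through. A small sketch (the helper name is mine):

```python
def headers_for_path(urls):
    """Yield (url, headers) pairs where each Referer matches the page visited before it."""
    previous = None
    for url in urls:
        headers = {"Accept-Language": "en-US,en;q=0.9"}
        if previous:
            headers["Referer"] = previous  # looks like a real click-through
        yield url, headers
        previous = url

path = ["https://target-site.com", "https://target-site.com/data/page/1"]
for url, h in headers_for_path(path):
    print(url, h.get("Referer"))
```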

Practical Rate Limits by Site Protection Level

Protection Level                Examples                 Safe Request Rate
None                            Most blogs, news         1 req/sec
Basic (rate limit only)         Small e-commerce         1 req/3-5s
Moderate (Cloudflare Basic)     Mid-size retail          1 req/5-10s + proxy rotation
Heavy (Turnstile + JS checks)   LinkedIn, Amazon         Solving service + 10-30s delays
Maximum (behavioral AI)         Ticketmaster, airlines   Mobile proxies + full browser

Quick Decision Tree: Which Approach to Use

Is the data in a JSON API?
  YES → Use requests + proxy, skip browser entirely
  NO ↓

Does the page require JavaScript to render content?
  NO → Use curl-cffi (impersonate Chrome)
  YES ↓

Is there a Cloudflare Turnstile challenge?
  NO → Use Playwright with stealth patches
  YES ↓

Volume > 1000 requests?
  YES → Use solving service (2captcha/CapSolver)
  NO → Use headless=False Playwright (often auto-solves)

Tools Reference

Tool                      Use Case                 Install
curl-cffi                 TLS fingerprint bypass   pip install curl-cffi
httpx[http2]              HTTP/2 support           pip install httpx[http2]
playwright                JavaScript rendering     pip install playwright
undetected-chromedriver   Selenium alternative     pip install undetected-chromedriver
scrapy-rotating-proxies   Scrapy proxy rotation    pip install scrapy-rotating-proxies

When to Use a Managed Scraping Service

Writing all this yourself makes sense for 1-3 targets you know well. For production scraping across many sites, maintaining anti-bot bypass code becomes its own full-time job — Cloudflare updates every few weeks, browser fingerprints shift, proxy IPs get burned.

A managed approach lets you focus on the data pipeline while the infrastructure handles detection. The tradeoff is cost vs. maintenance time.


Summary

The 80/20 of not getting banned:

  1. Use curl-cffi — fixes TLS fingerprinting immediately, often the single biggest win
  2. Add residential proxies — fixes IP reputation on protected sites
  3. Slow down — 2-5 second delays eliminate most rate-limit blocks
  4. Warm up sessions — visit homepage first, reuse cookies, add Referer
  5. Check for hidden APIs — often cleaner than scraping HTML at all

If you're hitting Cloudflare Turnstile specifically, solving services cost roughly $1-2 per 1000 solves and integrate in under 20 lines of Python.

The hardest targets (airlines, ticketing, financial data) need mobile proxies and full behavioral simulation. For most business use cases — competitor data, lead generation, market research — the techniques above are more than enough.


Take the next step

Skip the setup. Production-ready tools for scraping without bans:

Apify Scrapers Bundle — $29 one-time

Instant download. Documented. Ready to deploy.

