DEV Community

agenthustler

How to Avoid Getting Blocked While Web Scraping in 2026: Complete Guide

Getting blocked is the number one frustration in web scraping. You write a perfect parser, test it on 10 pages, deploy it — and within an hour, every request returns a 403 or a CAPTCHA page.

After scraping millions of pages across hundreds of sites, here's everything I've learned about staying unblocked in 2026. These techniques work whether you're using Python, Node.js, or any other language.

Understanding Why You Get Blocked

Before diving into solutions, understand what you're up against. Modern anti-bot systems detect scrapers through:

  • IP reputation: Too many requests from one IP
  • Browser fingerprinting: Missing or inconsistent browser signatures
  • Behavioral analysis: Inhuman request patterns (too fast, too regular)
  • TLS fingerprinting: HTTP clients have different TLS signatures than real browsers
  • JavaScript challenges: Checking if a real browser engine is executing JS

Each technique below addresses one or more of these detection vectors.
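You can see the first two giveaways for yourself without touching a target site: a stock `requests` session announces itself in its default headers. A quick local check (no network needed):

```python
import requests

# A fresh Session shows exactly what a naive scraper sends by default.
session = requests.Session()
print(session.headers["User-Agent"])    # something like "python-requests/2.31.0"
print(sorted(session.headers.keys()))   # only a handful of headers, no Sec-Fetch-*
```

That user agent string is the first thing most anti-bot systems look at.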


1. Rotate User Agents (and Do It Properly)

The most basic mistake: using the default python-requests/2.31.0 user agent. Most anti-bot systems flag this immediately.

But just setting a Chrome user agent isn't enough either. You need to rotate through realistic, current user agents.

import random
import requests

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:134.0) "
    "Gecko/20100101 Firefox/134.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/18.2 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
]

headers = {
    "User-Agent": random.choice(USER_AGENTS),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "Connection": "keep-alive",
}

response = requests.get("https://example.com", headers=headers)

Pro tip: Use ScrapeOps' free Fake Browser Headers API to get always-updated, realistic header sets instead of maintaining your own list.


2. Implement Smart Rate Limiting

Hitting a site with 100 requests per second is the fastest way to get banned. Real users don't browse that fast.

import time
import random

def polite_request(url, session, min_delay=1.0, max_delay=3.0):
    delay = random.uniform(min_delay, max_delay)
    time.sleep(delay)

    try:
        response = session.get(url, timeout=15)
        if response.status_code == 429:
            retry_after = int(response.headers.get("Retry-After", 60))
            print(f"Rate limited. Waiting {retry_after}s...")
            time.sleep(retry_after)
            return session.get(url, timeout=15)
        return response
    except Exception as e:
        print(f"Request failed: {e}")
        return None

Key rules:

  • Randomize delays (don't use fixed time.sleep(2))
  • Respect Retry-After headers
  • Back off exponentially on repeated failures
  • Scrape during off-peak hours for the target site's timezone
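The third rule, exponential backoff, fits in a tiny helper. This is a sketch using "full jitter" (pick uniformly from zero up to the exponential ceiling), capped so waits never grow unbounded:

```python
import random

def backoff_delay(base: float, failures: int, cap: float = 60.0) -> float:
    """Exponential backoff with full jitter: uniform in [0, min(cap, base * 2**failures)]."""
    return random.uniform(0, min(cap, base * (2 ** failures)))

# After each failed request: time.sleep(backoff_delay(1.0, consecutive_failures))
```

Full jitter beats a fixed multiplier because a fleet of scrapers backing off in lockstep would otherwise all retry at the same instant.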

3. Use Residential Proxies

Datacenter IPs are cheap but easily detected. Residential proxies route through real ISP addresses, making your requests look like they come from regular home internet users.

import requests

proxy_url = "http://USER:PASS@proxy.thordata.com:9000"

proxies = {
    "http": proxy_url,
    "https": proxy_url
}

response = requests.get(
    "https://target-site.com/data",
    proxies=proxies,
    headers=headers,
    timeout=30
)

For cost-effective residential proxies, ThorData offers rates starting at $0.60/GB — significantly cheaper than enterprise alternatives while maintaining good IP quality.

If you want managed proxy rotation without configuring it yourself, ScraperAPI handles rotation, retries, and geo-targeting automatically through a simple API call.
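If you manage a pool yourself, rotation can be as simple as cycling through a list. Here's a minimal sketch of the `get_next_proxy()` helper used in the CAPTCHA example later in this post; the hosts and credentials are placeholders:

```python
from itertools import cycle

PROXY_POOL = cycle([  # placeholder endpoints; substitute your own
    "http://USER:PASS@proxy-1.example.com:9000",
    "http://USER:PASS@proxy-2.example.com:9000",
    "http://USER:PASS@proxy-3.example.com:9000",
])

def get_next_proxy() -> dict:
    """Return a requests-style proxies dict, advancing through the pool."""
    proxy = next(PROXY_POOL)
    return {"http": proxy, "https": proxy}
```

`itertools.cycle` loops forever, so every call hands out the next proxy and wraps around at the end of the list.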


4. Set Complete HTTP Headers

A real browser sends 10-15 headers with every request. A scraper using requests.get(url) sends 2-3. Anti-bot systems notice this discrepancy.

import random
from urllib.parse import urlparse

def get_browser_headers(url):
    parsed = urlparse(url)

    return {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Accept-Encoding": "gzip, deflate, br",
        "DNT": "1",
        "Connection": "keep-alive",
        "Upgrade-Insecure-Requests": "1",
        "Sec-Fetch-Dest": "document",
        "Sec-Fetch-Mode": "navigate",
        "Sec-Fetch-Site": "none",
        "Sec-Fetch-User": "?1",
        "Cache-Control": "max-age=0",
        "Referer": f"{parsed.scheme}://{parsed.netloc}/",
    }

The Sec-Fetch-* headers are particularly important in 2026 — many anti-bot systems check for these. Missing them is a dead giveaway.


5. Handle JavaScript Rendering

A large share of modern websites render their content with JavaScript. If you're only getting empty pages or "Please enable JavaScript" messages, you need a headless browser.

from playwright.sync_api import sync_playwright

def scrape_js_page(url):
    with sync_playwright() as p:
        browser = p.chromium.launch(
            headless=True,
            args=[
                "--disable-blink-features=AutomationControlled",
                "--no-sandbox"
            ]
        )
        context = browser.new_context(
            viewport={"width": 1920, "height": 1080},
            user_agent=(
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                "AppleWebKit/537.36 Chrome/131.0.0.0 Safari/537.36"
            )
        )
        page = context.new_page()

        page.add_init_script(
            "Object.defineProperty(navigator, 'webdriver', "
            "{get: () => undefined});"
        )

        page.goto(url, wait_until="networkidle")
        content = page.content()
        browser.close()
        return content

Important stealth tips:

  • Remove the navigator.webdriver flag
  • Set a realistic viewport size
  • Use --disable-blink-features=AutomationControlled
  • Add random mouse movements for heavily protected sites
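For that last tip, Playwright's `page.mouse.move(x, y)` can replay a jittered path between two points. A hypothetical path generator (the step count and jitter range are arbitrary choices):

```python
import random

def human_mouse_path(x1, y1, x2, y2, steps=20):
    """Interpolate from (x1, y1) to (x2, y2) with small random wobble per step."""
    path = []
    for i in range(1, steps + 1):
        t = i / steps
        path.append((
            x1 + (x2 - x1) * t + random.uniform(-3, 3),
            y1 + (y2 - y1) * t + random.uniform(-3, 3),
        ))
    path[-1] = (float(x2), float(y2))  # land exactly on the target
    return path

# Then drive the browser: for x, y in human_mouse_path(0, 0, 640, 360): page.mouse.move(x, y)
```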

For sites with aggressive anti-bot (Cloudflare, Akamai), consider using ScraperAPI with render=true — they maintain browser farms optimized for bypassing these protections.
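ScraperAPI's interface is a plain GET request with your API key and target URL as query parameters. A sketch of building that request URL (parameter names are per their docs at the time of writing; verify against the current documentation):

```python
from urllib.parse import urlencode

def scraperapi_url(api_key: str, target: str, render: bool = True) -> str:
    """Build a ScraperAPI request URL; render=true asks for full browser rendering."""
    params = {"api_key": api_key, "url": target}
    if render:
        params["render"] = "true"
    return "http://api.scraperapi.com/?" + urlencode(params)

# response = requests.get(scraperapi_url("YOUR_KEY", "https://target-site.com/data"))
```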


6. Solve CAPTCHAs Gracefully

When you hit a CAPTCHA, you have three options:

  1. Avoid it entirely — Better proxies and headers often prevent CAPTCHAs from triggering
  2. Use a CAPTCHA solving service — Services like 2Captcha or Anti-Captcha solve them for $1-3 per 1,000
  3. Use a managed scraping API — Services like ScraperAPI handle CAPTCHAs automatically

The first option in practice: detect the CAPTCHA page, rotate the proxy, and retry:
import requests
import time
import random

def scrape_with_captcha_retry(url, max_retries=3):
    for attempt in range(max_retries):
        response = requests.get(
            url,
            proxies=get_next_proxy(),  # your proxy-rotation helper (returns a proxies dict)
            headers=get_browser_headers(url)
        )

        if "captcha" not in response.text.lower() and response.status_code == 200:
            return response

        print(f"CAPTCHA detected, rotating proxy (attempt {attempt + 1})")
        time.sleep(random.uniform(5, 15))

    return None

7. Respect robots.txt (Mostly)

This isn't just about ethics — it's practical. Sites that see you ignoring robots.txt are more likely to deploy aggressive blocking.

from urllib.robotparser import RobotFileParser
from urllib.parse import urlparse

def can_scrape(url, user_agent="*"):
    parsed = urlparse(url)
    robots_url = f"{parsed.scheme}://{parsed.netloc}/robots.txt"

    rp = RobotFileParser()
    rp.set_url(robots_url)
    try:
        rp.read()
        return rp.can_fetch(user_agent, url)
    except Exception:
        return True

8. Use Session Persistence

Real browsers maintain cookies across requests. Scrapers that start from a blank slate on every request look suspicious.

import requests
import time
import random

session = requests.Session()

# First, visit the homepage to get cookies
session.get(
    "https://target-site.com",
    headers=get_browser_headers("https://target-site.com")
)
time.sleep(random.uniform(1, 3))

# Then navigate to the page you actually want
response = session.get(
    "https://target-site.com/data/page-1",
    headers=get_browser_headers("https://target-site.com/data/page-1")
)

This mimics how real users browse: they don't land directly on page 47 of search results — they start from the homepage and navigate.


9. Handle Errors and Adapt

The best scrapers adapt to blocking in real-time:

import requests
import time
import random

def adaptive_scraper(urls, session):
    consecutive_failures = 0
    base_delay = 1.0

    for url in urls:
        delay = base_delay * (2 ** min(consecutive_failures, 5))
        delay += random.uniform(0, delay * 0.5)
        time.sleep(delay)

        headers = get_browser_headers(url)
        response = session.get(url, headers=headers, timeout=15)

        if response.status_code == 200:
            consecutive_failures = 0
            yield url, response
        elif response.status_code in (403, 429, 503):
            consecutive_failures += 1
            print(
                f"Blocked ({response.status_code}). "
                f"Backing off {delay:.1f}s. "
                f"Failures: {consecutive_failures}"
            )
            if consecutive_failures >= 5:
                print("Too many failures. Rotating proxy/session...")
                session = create_new_session()
                consecutive_failures = 0
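The `create_new_session()` call above is assumed rather than shown. A minimal version: a fresh `Session` gives you a clean cookie jar, and pairing it with a newly picked proxy gives you a clean identity (proxy endpoints below are placeholders):

```python
import random
import requests

PROXY_URLS = [  # placeholder endpoints; substitute your own
    "http://USER:PASS@proxy-1.example.com:9000",
    "http://USER:PASS@proxy-2.example.com:9000",
]

def create_new_session() -> requests.Session:
    """Fresh cookie jar plus a newly chosen proxy: a clean identity for the next batch."""
    session = requests.Session()
    proxy = random.choice(PROXY_URLS)
    session.proxies = {"http": proxy, "https": proxy}
    return session
```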

Putting It All Together

Here's a production-ready scraping template combining all techniques above:

import requests
import random
import time
from dataclasses import dataclass

@dataclass
class ScraperConfig:
    min_delay: float = 1.5
    max_delay: float = 4.0
    max_retries: int = 3
    proxy_url: str | None = None

def create_scraper(config: ScraperConfig):
    session = requests.Session()
    if config.proxy_url:
        session.proxies = {
            "http": config.proxy_url,
            "https": config.proxy_url,
        }
    return session

def scrape_url(url, session, config):
    for attempt in range(config.max_retries):
        delay = random.uniform(config.min_delay, config.max_delay)
        time.sleep(delay)

        headers = get_browser_headers(url)
        try:
            response = session.get(url, headers=headers, timeout=15)
            if response.status_code == 200:
                return response
            if response.status_code == 429:
                wait = int(response.headers.get("Retry-After", 30))
                time.sleep(wait)
        except requests.RequestException as e:
            print(f"Attempt {attempt + 1} failed: {e}")

    return None

# Usage: target_urls is your URL list, process() your own handler
config = ScraperConfig(
    proxy_url="http://USER:PASS@proxy.thordata.com:9000"
)
session = create_scraper(config)

for url in target_urls:
    result = scrape_url(url, session, config)
    if result:
        process(result)

When to Use Managed Scraping Platforms

If you're scraping at scale (thousands of pages daily), managing proxies, headers, and anti-bot evasion yourself becomes a full-time job. That's where managed platforms shine.

Apify provides ready-made scraping actors with built-in proxy rotation, retry logic, and data storage. For common scraping targets, using a pre-built actor is faster and cheaper than building from scratch.

For proxy-specific management, ThorData gives you affordable residential proxies, while ScrapeOps adds monitoring on top so you can see exactly which proxies and techniques are working.


Summary Checklist

Before deploying any scraper, verify you have:

  • Rotating, up-to-date user agents
  • Complete browser-like headers (including Sec-Fetch headers)
  • Randomized delays between requests
  • Residential proxy rotation for sensitive targets
  • JavaScript rendering capability for SPA sites
  • Error handling with exponential backoff
  • Session persistence with cookies
  • CAPTCHA handling strategy
  • robots.txt awareness

Web scraping is an arms race, but these fundamentals haven't changed much over the years. Master them, and you'll scrape the vast majority of websites without issues.


What's your biggest scraping challenge in 2026? Drop a comment below — happy to help troubleshoot.
