Anna lilith

Posted on Jul 5 • Edited on Jul 6

Python Web Scraping Without Getting Blocked: Complete 2026 Guide

#python #programming

Last updated: July 2026

Web scraping is widely used for data collection, price monitoring, and research, but many websites actively detect and block scrapers. Here's how to scrape effectively without getting banned.

Why Websites Block Scrapers

Websites detect and respond to scraping through:

Rate limiting: Too many requests in too short a time
User-Agent detection: Missing or suspicious request headers
IP fingerprinting: The same IP address producing unusual request patterns
Behavior analysis: No mouse movement, unnaturally rapid page loads
CAPTCHA challenges: Served once activity looks suspicious
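
From the scraper's side, most of these mechanisms show up as a particular kind of response. The sketch below shows one way to recognize them. The helper name `looks_blocked`, the set of status codes, and the marker strings are all illustrative assumptions rather than any standard, and real sites vary widely.

```python
# Hypothetical helper: the status codes and marker strings are assumptions
# chosen for illustration; tune them for the sites you actually scrape.
BLOCK_STATUS_CODES = {403, 429, 503}
CAPTCHA_MARKERS = ("captcha", "unusual traffic", "are you a robot")


def looks_blocked(status_code, body=""):
    """Return a short reason string if the response looks like a block, else None."""
    if status_code in BLOCK_STATUS_CODES:
        return f"status {status_code}"
    lowered = body.lower()
    for marker in CAPTCHA_MARKERS:
        if marker in lowered:
            return f"marker {marker!r}"
    return None
```

With requests, this could be called as `looks_blocked(resp.status_code, resp.text)` before parsing a page.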

The Right Way to Scrape

1. Respect robots.txt

Always check robots.txt first:

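from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

import requests

def can_scrape(url, user_agent="*"):
    """Check whether robots.txt allows user_agent to fetch url."""
    parsed = urlparse(url)
    robots_url = f"{parsed.scheme}://{parsed.netloc}/robots.txt"

    try:
        resp = requests.get(robots_url, timeout=5)
    except requests.RequestException:
        # robots.txt could not be fetched; treat that as "allowed"
        return True
    if resp.status_code != 200:
        return True

    # Parse the rules properly: a substring test for "Disallow: /"
    # would also match narrower rules such as "Disallow: /admin"
    parser = RobotFileParser()
    parser.parse(resp.text.splitlines())
    return parser.can_fetch(user_agent, url)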
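
robots.txt can carry more than allow and disallow rules. As a small illustration, the standard library's `urllib.robotparser` can also report a `Crawl-delay` value when a site declares one. The robots.txt content below is made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content, used only for illustration
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /admin/
Crawl-delay: 5
"""

parser = RobotFileParser()
parser.parse(SAMPLE_ROBOTS.splitlines())

# Per-path checks instead of an all-or-nothing answer
allowed_public = parser.can_fetch("*", "https://example.com/products")
allowed_admin = parser.can_fetch("*", "https://example.com/admin/users")

# None when no Crawl-delay line applies to this user agent
delay = parser.crawl_delay("*")
```

If `delay` is not None, it is a reasonable lower bound for the pause between requests to that site.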

2. Use Proper Headers

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
    "Accept-Encoding": "gzip, deflate",
    "Connection": "keep-alive",
    "Upgrade-Insecure-Requests": "1",
}

def scrape(url):
    """Scrape with proper headers."""
    return requests.get(url, headers=HEADERS, timeout=10)

3. Add Delays Between Requests

import time
import random

def polite_scrape(urls):
    """Scrape multiple URLs with delays."""
    results = []
    for url in urls:
        if not can_scrape(url):
            print(f"Skipping {url} (not allowed)")
            continue

        result = scrape(url)
        results.append(result)

        # Random delay between 1-3 seconds
        delay = random.uniform(1, 3)
        time.sleep(delay)

    return results
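
The loop above pauses after every URL, even when consecutive URLs point at different hosts. One possible refinement is to track the last request time per host and sleep only when the same host is hit again too soon. The class name `DomainThrottle` and its parameters are hypothetical; the clock and sleep functions are injectable only to make the sketch easy to test.

```python
import time
from urllib.parse import urlparse


class DomainThrottle:
    """Enforce a minimum gap between requests to the same host (illustrative sketch)."""

    def __init__(self, min_interval=2.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_request = {}  # host -> time of the most recent request

    def wait(self, url):
        """Sleep if this host was hit too recently. Returns the seconds slept."""
        host = urlparse(url).netloc
        now = self._clock()
        last = self._last_request.get(host)
        pause = 0.0
        if last is not None:
            pause = max(0.0, self.min_interval - (now - last))
        if pause > 0:
            self._sleep(pause)
        # Record when the request is actually sent, i.e. after the pause
        self._last_request[host] = now + pause
        return pause
```

Calling `throttle.wait(url)` right before each request would replace the fixed `time.sleep(delay)`; adding a small random offset to `min_interval` keeps the timing from looking mechanical.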

4. Rotate User Agents

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36...",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36...",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36...",
]

def get_random_headers():
    """Get headers with random User-Agent."""
    headers = HEADERS.copy()
    headers["User-Agent"] = random.choice(USER_AGENTS)
    return headers

5. Handle Rate Limits

from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def create_session():
    """Create a session with retry logic."""
    session = requests.Session()
    retry = Retry(total=3, backoff_factor=1, status_forcelist=[429, 500, 502, 503, 504])
    adapter = HTTPAdapter(max_retries=retry)
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session
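
When a 429 response is handled by hand rather than by the retry adapter, the server's `Retry-After` header says how long to wait. HTTP allows it to be either a number of seconds or an HTTP date. The following is a minimal sketch of parsing both forms with the standard library; the function name `retry_after_seconds` is an assumption made for this example.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(value, now=None):
    """Convert a Retry-After header value to seconds to wait, or None if unparseable."""
    if value is None:
        return None
    value = value.strip()
    if value.isdigit():
        # Delay-seconds form, e.g. "120"
        return float(value)
    try:
        # HTTP-date form, e.g. "Wed, 21 Oct 2026 07:28:00 GMT"
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    # A date in the past means there is no need to wait
    return max(0.0, (when - now).total_seconds())
```

With requests, the header would be read as `resp.headers.get("Retry-After")` and the result passed to `time.sleep()` when it is not None.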

Advanced Techniques

Using Proxies

PROXIES = [
    "http://proxy1:8080",
    "http://proxy2:8080",
    "http://proxy3:8080",
]

def scrape_with_proxy(url):
    """Scrape using a random proxy."""
    proxy = random.choice(PROXIES)
    return requests.get(url, headers=get_random_headers(), 
                       proxies={"http": proxy, "https": proxy}, timeout=10)
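
Picking a proxy at random works until one of them dies, at which point a share of requests keeps failing. A possible extension is to count failures per proxy and stop choosing the ones that fail repeatedly. The `ProxyPool` class, its method names, and the `max_failures` threshold are hypothetical and only sketch the idea.

```python
import random


class ProxyPool:
    """Pick proxies at random and skip ones that keep failing (illustrative sketch)."""

    def __init__(self, proxies, max_failures=3, rng=None):
        self._failures = {proxy: 0 for proxy in proxies}
        self.max_failures = max_failures
        self._rng = rng or random.Random()

    def pick(self):
        """Return a proxy whose failure count is still below the threshold."""
        healthy = [p for p, n in self._failures.items() if n < self.max_failures]
        if not healthy:
            raise RuntimeError("no healthy proxies left")
        return self._rng.choice(healthy)

    def report_failure(self, proxy):
        """Record one failed request through this proxy."""
        self._failures[proxy] += 1

    def report_success(self, proxy):
        """Reset the failure count after a successful request."""
        self._failures[proxy] = 0
```

In `scrape_with_proxy`, `pool.pick()` would take the place of `random.choice(PROXIES)`, with `report_failure` called when the request raises and `report_success` otherwise.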

Session Management

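# Distinct from create_session() in step 5, which adds retry logic
def create_browser_session(homepage="https://example.com/"):
    """Create a browser-like session that visits the homepage first."""
    session = requests.Session()
    session.headers.update(get_random_headers())

    # Visit the homepage first so the session picks up any cookies
    session.get(homepage, timeout=10)
    time.sleep(2)

    return session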

Legal Considerations

Check Terms of Service: Some sites explicitly prohibit scraping
Respect rate limits: Don't overwhelm servers
Don't scrape personal data: Privacy laws such as the GDPR and CCPA may apply
Use public data only: Don't bypass authentication

Get the Production-Ready Version

We have a complete web scraping toolkit with all these techniques built-in at our store.

What's included:

Rotating user agents and proxies
Automatic rate limiting
Session management
Retry logic
CAPTCHA detection

Browse the collection →
