DEV Community

Anna lilith

Python Web Scraping Without Getting Blocked: Complete 2026 Guide

Last updated: July 2026

Web scraping is essential for data collection, price monitoring, and research. But many websites actively block scrapers. Here's how to scrape effectively without getting banned.

Why Websites Block Scrapers

Websites detect and deter scraping through:

  • Rate limiting: Too many requests too fast
  • User-Agent detection: Missing or suspicious headers
  • IP fingerprinting: The same IP address producing unusual request patterns
  • Behavior analysis: No mouse movements, rapid page loads
  • CAPTCHA challenges: Triggered by suspicious activity
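
From the scraper's side, these defenses usually surface as particular status codes or as a challenge page in place of the content you asked for. A minimal sketch of spotting that, assuming the status codes and text markers below are representative (they are illustrative guesses, not a standard, and `looks_blocked` is a hypothetical helper name):

```python
# Status codes commonly returned when a site refuses or throttles a client.
# Assumption: this set is illustrative; individual sites differ.
BLOCK_STATUS_CODES = {403, 429, 503}

# Assumption: hypothetical phrases that often appear on challenge pages.
CAPTCHA_MARKERS = ("captcha", "verify you are human", "unusual traffic")


def looks_blocked(status_code, body):
    """Return True if a response looks like a block or a CAPTCHA challenge."""
    if status_code in BLOCK_STATUS_CODES:
        return True
    lowered = body.lower()
    return any(marker in lowered for marker in CAPTCHA_MARKERS)
```

With `requests`, this could be called as `looks_blocked(resp.status_code, resp.text)`. When it returns True, slowing down or stopping is usually safer than retrying immediately.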

The Right Way to Scrape

1. Respect robots.txt

Always check robots.txt first:

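import requests
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def can_scrape(url, user_agent="*"):
    """Check whether robots.txt allows user_agent to fetch url."""
    parsed = urlparse(url)
    robots_url = f"{parsed.scheme}://{parsed.netloc}/robots.txt"

    try:
        resp = requests.get(robots_url, timeout=5)
    except requests.RequestException:
        # robots.txt unreachable: treated as allowed here; be stricter if you prefer
        return True
    if resp.status_code != 200:
        return True

    # Parse the rules instead of searching for "Disallow: /", which would
    # also match partial rules such as "Disallow: /admin"
    parser = RobotFileParser()
    parser.parse(resp.text.splitlines())
    return parser.can_fetch(user_agent, url)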
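
Beyond allow and deny rules, some sites also publish a `Crawl-delay` in robots.txt. Here is a minimal sketch of reading per-path rules and the crawl delay with the standard library's `urllib.robotparser`, run against a made-up robots.txt. The `SAMPLE_ROBOTS` text, the `my-scraper` agent name, and the example.com URLs are assumptions for illustration only:

```python
from urllib.robotparser import RobotFileParser

# Assumption: a made-up robots.txt used only for this illustration.
SAMPLE_ROBOTS = """\
User-agent: *
Crawl-delay: 5
Disallow: /private/
"""


def load_rules(robots_text):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


rules = load_rules(SAMPLE_ROBOTS)

# Rules are evaluated per path, not for the whole site at once.
allowed_public = rules.can_fetch("my-scraper", "https://example.com/products")
allowed_private = rules.can_fetch("my-scraper", "https://example.com/private/report")

# crawl_delay() returns None when the site does not declare one.
delay = rules.crawl_delay("my-scraper")
```

If `delay` is not None, it is a reasonable lower bound for the pause between requests to that site.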

2. Use Proper Headers

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
    "Accept-Encoding": "gzip, deflate",
    "Connection": "keep-alive",
    "Upgrade-Insecure-Requests": "1",
}

def scrape(url):
    """Scrape with proper headers."""
    return requests.get(url, headers=HEADERS, timeout=10)

3. Add Delays Between Requests

import time
import random

def polite_scrape(urls):
    """Scrape multiple URLs with delays."""
    results = []
    for url in urls:
        if not can_scrape(url):
            print(f"Skipping {url} (not allowed)")
            continue

        result = scrape(url)
        results.append(result)

        # Random delay between 1-3 seconds
        delay = random.uniform(1, 3)
        time.sleep(delay)

    return results
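
A fixed sleep after every request works for a single site. When a URL list mixes several hosts, one option is to track the delay per host, so a slow pace on one site does not hold up the others. This is a minimal sketch under that assumption. `HostThrottle` and its parameters are hypothetical names, and the `clock` and `sleep` arguments exist only so the behavior can be tested without real waiting:

```python
import time
from urllib.parse import urlparse


class HostThrottle:
    """Enforce a minimum interval between requests to the same host."""

    def __init__(self, min_interval=2.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_request = {}  # host -> time of the most recent request

    def wait(self, url):
        """Sleep if the host was contacted too recently; return seconds slept."""
        host = urlparse(url).netloc
        last = self._last_request.get(host)
        waited = 0.0
        if last is not None:
            remaining = self.min_interval - (self._clock() - last)
            if remaining > 0:
                self._sleep(remaining)
                waited = remaining
        # Record the time after any sleep, just before the request goes out.
        self._last_request[host] = self._clock()
        return waited
```

In a loop like the one above, `throttle.wait(url)` would be called just before each request instead of sleeping unconditionally afterwards.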

4. Rotate User Agents

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36...",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36...",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36...",
]

def get_random_headers():
    """Get headers with random User-Agent."""
    headers = HEADERS.copy()
    headers["User-Agent"] = random.choice(USER_AGENTS)
    return headers

5. Handle Rate Limits

from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def create_session():
    """Create a session with retry logic."""
    session = requests.Session()
    retry = Retry(total=3, backoff_factor=1, status_forcelist=[429, 500, 502, 503, 504])
    adapter = HTTPAdapter(max_retries=retry)
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session
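
A 429 or 503 response may carry a `Retry-After` header telling the client how long to wait, either as a number of seconds or as an HTTP date. urllib3's `Retry` honors that header by default. The sketch below is for code that handles such responses by hand. `retry_after_seconds` and its `default` and `now` parameters are hypothetical names introduced for illustration:

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(header_value, default=5.0, now=None):
    """Convert a Retry-After header value into a number of seconds to wait."""
    if not header_value:
        return default
    value = header_value.strip()
    # Form 1: a non-negative integer number of seconds.
    if value.isdigit():
        return float(value)
    # Form 2: an HTTP date such as "Wed, 21 Oct 2015 07:28:00 GMT".
    try:
        retry_at = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return default  # unparseable value: fall back to the default
    if retry_at.tzinfo is None:
        retry_at = retry_at.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    # Never return a negative wait if the date is already in the past.
    return max(0.0, (retry_at - now).total_seconds())
```

With `requests`, the header can be read as `resp.headers.get("Retry-After")` and the result passed to `time.sleep`.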

Advanced Techniques

Using Proxies

PROXIES = [
    "http://proxy1:8080",
    "http://proxy2:8080",
    "http://proxy3:8080",
]

def scrape_with_proxy(url):
    """Scrape using a random proxy."""
    proxy = random.choice(PROXIES)
    return requests.get(url, headers=get_random_headers(), 
                       proxies={"http": proxy, "https": proxy}, timeout=10)
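
Picking a proxy at random keeps reusing proxies that have stopped working. One possible refinement is to cycle through the list and set aside any proxy that fails repeatedly. This is a minimal sketch under that assumption. `ProxyPool`, `max_failures`, and the method names are hypothetical, and the proxy addresses in any real use are placeholders you would supply:

```python
class ProxyPool:
    """Cycle through proxies, skipping ones that have failed too often."""

    def __init__(self, proxies, max_failures=2):
        self._proxies = list(proxies)
        self._failures = {proxy: 0 for proxy in self._proxies}
        self._max_failures = max_failures
        self._index = 0

    def next_proxy(self):
        """Return the next healthy proxy, or raise if none are left."""
        healthy = [p for p in self._proxies
                   if self._failures[p] < self._max_failures]
        if not healthy:
            raise RuntimeError("no healthy proxies left")
        proxy = healthy[self._index % len(healthy)]
        self._index += 1
        return proxy

    def report_failure(self, proxy):
        """Record one failed request made through this proxy."""
        self._failures[proxy] += 1

    @staticmethod
    def as_requests_proxies(proxy):
        """Build the mapping that requests expects for its proxies argument."""
        return {"http": proxy, "https": proxy}
```

A caller would wrap the request in a `try` block and call `report_failure(proxy)` when it raises `requests.RequestException`.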

Session Management

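def create_browser_session(homepage="https://example.com/"):
    """Create a browser-like session that warms up on the homepage."""
    # Named differently from create_session() above so it does not replace it
    session = requests.Session()
    session.headers.update(get_random_headers())

    # Visit the homepage first so the session picks up any cookies the site sets
    session.get(homepage, timeout=10)
    time.sleep(2)

    return session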

Legal Considerations

  • Check Terms of Service: Some sites explicitly prohibit scraping
  • Respect rate limits: Don't overwhelm servers
  • Don't scrape personal data: Privacy laws apply
  • Use public data only: Don't bypass authentication

Get the Production-Ready Version

We have a complete web scraping toolkit with all these techniques built-in at our store.

What's included:

  • Rotating user agents and proxies
  • Automatic rate limiting
  • Session management
  • Retry logic
  • CAPTCHA detection

Browse the collection →
