Python Web Scraping Without Getting Blocked: Complete 2026 Guide
Last updated: July 2026
Web scraping is essential for data collection, price monitoring, and research, but many websites actively block scrapers. Here's how to scrape effectively without getting banned.
Why Websites Block Scrapers
Websites detect scraping through:
- Rate limiting: Too many requests too fast
- User-Agent detection: Missing or suspicious headers
- IP-based detection: One IP address generating unusual request patterns
- Behavior analysis: No mouse movements, rapid page loads
- CAPTCHA challenges: Triggered by suspicious activity
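From the scraper's side, these defenses usually show up as a particular status code or a challenge page in the response. The snippet below is a minimal sketch of a heuristic check for that. The name `looks_blocked`, the status codes, and the marker strings are assumptions chosen for illustration, not an authoritative list.

```python
# Hypothetical heuristic: none of these names come from a library.
BLOCK_STATUS_CODES = {403, 429, 503}                 # assumed typical "blocked" codes
CHALLENGE_MARKERS = ("captcha", "unusual traffic")   # assumed marker strings


def looks_blocked(status_code, body):
    """Return a short reason string if the response looks like a block, else None."""
    if status_code in BLOCK_STATUS_CODES:
        return f"status {status_code}"
    lowered = body.lower()
    for marker in CHALLENGE_MARKERS:
        if marker in lowered:
            return f"challenge page ({marker})"
    return None
```

With `requests`, this would be called as `looks_blocked(resp.status_code, resp.text)`. Any non-`None` result is a reasonable cue to slow down or stop instead of retrying immediately.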
The Right Way to Scrape
1. Respect Robots.txt
Always check robots.txt first:
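import requests
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def can_scrape(url, user_agent="*"):
    """Check whether robots.txt allows fetching this URL."""
    parsed = urlparse(url)
    robots_url = f"{parsed.scheme}://{parsed.netloc}/robots.txt"
    try:
        resp = requests.get(robots_url, timeout=5)
    except requests.RequestException:
        # robots.txt could not be fetched; this guide treats that as allowed
        return True
    if resp.status_code != 200:
        return True
    # Parse the rules properly: a substring check for "Disallow: /" would
    # also match path-specific rules such as "Disallow: /admin"
    parser = RobotFileParser()
    parser.parse(resp.text.splitlines())
    return parser.can_fetch(user_agent, url)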
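Robots rules apply per path and per user agent, and a file may also declare a crawl delay. The standard library's `urllib.robotparser` can evaluate all of this. The snippet below is a small sketch that runs on an invented robots.txt (the `SAMPLE_ROBOTS_TXT` content and the example.com URLs are assumptions), so it needs no network access.

```python
from urllib.robotparser import RobotFileParser

# Invented robots.txt content, used so the example needs no network access
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Crawl-delay: 5
"""

parser = RobotFileParser()
parser.parse(SAMPLE_ROBOTS_TXT.splitlines())

# Only paths under /admin/ are disallowed; other pages remain fetchable
allowed_products = parser.can_fetch("*", "https://example.com/products")
allowed_admin = parser.can_fetch("*", "https://example.com/admin/users")

# crawl_delay() returns the Crawl-delay value for the agent, or None if absent
delay = parser.crawl_delay("*")

print(allowed_products, allowed_admin, delay)
```

When a site declares a crawl delay, using it as the minimum pause between requests is a safer default than a delay you picked yourself.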
2. Use Proper Headers
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
    "Accept-Encoding": "gzip, deflate",
    "Connection": "keep-alive",
    "Upgrade-Insecure-Requests": "1",
}

def scrape(url):
    """Scrape with proper headers."""
    return requests.get(url, headers=HEADERS, timeout=10)
3. Add Delays Between Requests
import time
import random

def polite_scrape(urls):
    """Scrape multiple URLs with delays."""
    results = []
    for url in urls:
        if not can_scrape(url):
            print(f"Skipping {url} (not allowed)")
            continue
        result = scrape(url)
        results.append(result)
        # Random delay between 1-3 seconds
        delay = random.uniform(1, 3)
        time.sleep(delay)
    return results
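A fixed pause after every request also slows down requests that go to different sites. One alternative is to track the last request time per host and wait only when the same host is hit again too soon. The class below is a minimal sketch of that idea. `DomainThrottle` and its parameters are hypothetical names, and the `clock` and `sleep` arguments exist only so the behavior can be checked without actually waiting.

```python
import time
from urllib.parse import urlparse


class DomainThrottle:
    """Enforce a minimum interval between requests to the same host."""

    def __init__(self, min_interval=2.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock      # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last_request = {}  # host -> time of the most recent request

    def wait(self, url):
        """Sleep just long enough for this URL's host, then record the request time."""
        host = urlparse(url).netloc
        now = self._clock()
        last = self._last_request.get(host)
        waited = 0.0
        if last is not None:
            remaining = self.min_interval - (now - last)
            if remaining > 0:
                self._sleep(remaining)
                waited = remaining
        self._last_request[host] = self._clock()
        return waited
```

In a scraping loop, calling `throttle.wait(url)` right before each request would take the place of the unconditional `time.sleep`.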
4. Rotate User Agents
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36...",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36...",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36...",
]

def get_random_headers():
    """Get headers with random User-Agent."""
    headers = HEADERS.copy()
    headers["User-Agent"] = random.choice(USER_AGENTS)
    return headers
5. Handle Rate Limits
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def create_session():
    """Create a session with retry logic."""
    session = requests.Session()
    retry = Retry(
        total=3,
        backoff_factor=1,
        status_forcelist=[429, 500, 502, 503, 504],
    )
    adapter = HTTPAdapter(max_retries=retry)
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session
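A 429 or 503 response often carries a `Retry-After` header telling the client how long to wait, either as a number of seconds or as an HTTP date. If you handle such responses yourself instead of relying on the retry adapter, the header can be converted to a wait time as in the sketch below. The function name `retry_after_seconds` and the `now` parameter are assumptions for illustration.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(header_value, now=None):
    """Convert a Retry-After header value to a wait time in seconds, or None."""
    if header_value is None:
        return None
    value = header_value.strip()
    if value.isdigit():
        # Form 1: a non-negative integer number of seconds
        return float(value)
    try:
        # Form 2: an HTTP date such as "Wed, 21 Oct 2026 07:28:00 GMT"
        target = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        # Unparseable value: let the caller fall back to its own delay
        return None
    if target.tzinfo is None:
        target = target.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    # A date in the past means no further waiting is needed
    return max(0.0, (target - now).total_seconds())
```

With `requests`, the input would be `resp.headers.get("Retry-After")`. A `None` result means the header was missing or unparseable.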
Advanced Techniques
Using Proxies
PROXIES = [
    "http://proxy1:8080",
    "http://proxy2:8080",
    "http://proxy3:8080",
]

def scrape_with_proxy(url):
    """Scrape using a random proxy."""
    proxy = random.choice(PROXIES)
    return requests.get(url, headers=get_random_headers(),
                        proxies={"http": proxy, "https": proxy}, timeout=10)
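Picking a proxy purely at random keeps selecting proxies that are dead or already banned. A small pool that counts failures per proxy avoids this. The class below is a sketch under that assumption. `ProxyPool`, `max_failures`, and the method names are hypothetical and not part of `requests`.

```python
import random


class ProxyPool:
    """Pick proxies at random and stop using ones that keep failing."""

    def __init__(self, proxies, max_failures=3):
        self.max_failures = max_failures
        self._failures = {proxy: 0 for proxy in proxies}

    def available(self):
        """Return proxies that have not reached the failure limit."""
        return [p for p, n in self._failures.items() if n < self.max_failures]

    def pick(self):
        """Return a random usable proxy, or raise if none are left."""
        candidates = self.available()
        if not candidates:
            raise RuntimeError("no usable proxies left")
        return random.choice(candidates)

    def report_failure(self, proxy):
        """Count one failed request against this proxy."""
        self._failures[proxy] += 1

    def report_success(self, proxy):
        """Reset the failure count after a request that worked."""
        self._failures[proxy] = 0
```

A caller would `pick()` a proxy, make the request, and then call `report_failure` on a connection error or block response and `report_success` otherwise.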
Session Management
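def create_browser_session(homepage="https://example.com/"):
    """Create a browser-like session that visits the homepage first."""
    # Named differently from create_session() above so it does not replace it
    session = requests.Session()
    session.headers.update(get_random_headers())
    # Visit the homepage first so the session picks up any cookies
    session.get(homepage, timeout=10)
    time.sleep(2)
    return session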
Legal Considerations
- Check Terms of Service: Some sites explicitly prohibit scraping
- Respect rate limits: Don't overwhelm servers
- Don't scrape personal data: Privacy laws apply
- Use public data only: Don't bypass authentication
Get the Production-Ready Version
We have a complete web scraping toolkit with all these techniques built-in at our store.
What's included:
- Rotating user agents and proxies
- Automatic rate limiting
- Session management
- Retry logic
- CAPTCHA detection