Web Scraping With Rotating Proxies: Complete Setup Guide
If you have ever had your scraper blocked after a few hundred requests, you know the pain. Rotating proxies are the solution — they automatically cycle through different IP addresses so your requests appear to come from different users.
This guide covers proxy types, rotation strategies, and working Python code to build a robust scraping setup.
Why You Need Rotating Proxies
Websites detect and block scrapers using several signals:
- IP frequency: Too many requests from one IP
- Geographic patterns: Requests from data center IPs
- Behavioral analysis: Non-human request patterns
- Rate limiting: Hard caps on requests per IP
Rotating proxies solve all of these by distributing your requests across hundreds or thousands of IPs.
Residential vs Datacenter Proxies
| Feature | Residential | Datacenter |
|---|---|---|
| IP Source | Real ISP connections | Cloud servers |
| Detection Rate | Very low | Higher |
| Speed | Moderate | Fast |
| Cost | $5-15/GB | $1-3/GB |
| Best For | Protected sites | Simple targets |
| Reliability | High | Moderate |
Bottom line: Use residential proxies for sites with anti-bot protection. Use datacenter proxies for simple targets where speed matters more than stealth.
ThorData offers both residential and datacenter proxies with automatic rotation, making it easy to switch between them based on your target.
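The table above can be reduced to a simple selection rule in code. Here is a minimal sketch; the endpoint URLs and the list of protected domains are placeholders for illustration, not real provider values:

```python
from urllib.parse import urlparse

# Placeholder endpoints; substitute your provider's real proxy URLs
RESIDENTIAL = "http://user:pass@residential.proxy.example:9000"
DATACENTER = "http://user:pass@datacenter.proxy.example:8000"

# Domains known to run anti-bot protection (hypothetical examples)
PROTECTED_DOMAINS = {"tickets.example", "www.protected-shop.example"}

def choose_proxy(url):
    """Residential for known-protected domains, datacenter otherwise."""
    host = urlparse(url).netloc
    return RESIDENTIAL if host in PROTECTED_DOMAINS else DATACENTER
```

This keeps fast, cheap datacenter IPs as the default and only pays the residential premium where stealth is actually needed.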
Basic Proxy Rotation in Python
Here is a simple proxy rotator using a list of proxies:
```python
import time
from itertools import cycle

import requests


class ProxyRotator:
    def __init__(self, proxies):
        self.proxies = proxies
        self.proxy_pool = cycle(proxies)
        self.failed = set()

    def get_next_proxy(self):
        """Get the next working proxy from the pool."""
        for _ in range(len(self.proxies)):
            proxy = next(self.proxy_pool)
            if proxy not in self.failed:
                return proxy
        # All proxies have failed; reset and start over
        self.failed.clear()
        return next(self.proxy_pool)

    def fetch(self, url, max_retries=3):
        """Fetch a URL, rotating to a new proxy on failure."""
        for _ in range(max_retries):
            proxy = self.get_next_proxy()
            try:
                resp = requests.get(
                    url,
                    proxies={"http": proxy, "https": proxy},
                    timeout=15,
                    headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"},
                )
                if resp.status_code == 200:
                    return resp
                if resp.status_code == 429:
                    # Rate limited: bench this proxy and back off briefly
                    self.failed.add(proxy)
                    time.sleep(2)
            except requests.exceptions.RequestException:
                self.failed.add(proxy)
        return None


# Usage
proxies = [
    "http://user:pass@proxy1.example.com:8080",
    "http://user:pass@proxy2.example.com:8080",
    "http://user:pass@proxy3.example.com:8080",
]
rotator = ProxyRotator(proxies)
response = rotator.fetch("https://example.com")
```
Smart Rotation With Backoff
Naive round-robin rotation is not enough for serious scraping. Here is a smarter approach with exponential backoff and proxy scoring:
```python
import random
import time
from collections import defaultdict

import requests


class SmartProxyRotator:
    def __init__(self, proxies):
        self.proxies = proxies
        self.scores = defaultdict(lambda: 100)  # Every proxy starts at 100
        self.last_used = defaultdict(float)
        self.cooldown = 2.0  # Minimum seconds between uses of one proxy

    def select_proxy(self):
        """Select a proxy weighted by score, respecting cooldowns."""
        now = time.time()
        available = [
            p for p in self.proxies
            if now - self.last_used[p] >= self.cooldown
        ]
        if not available:
            time.sleep(self.cooldown)
            available = self.proxies
        # Weighted random selection: higher-scoring proxies win more often
        weights = [max(self.scores[p], 1) for p in available]
        proxy = random.choices(available, weights=weights, k=1)[0]
        self.last_used[proxy] = time.time()
        return proxy

    def report_success(self, proxy):
        """Raise a proxy's score on success (capped at 100)."""
        self.scores[proxy] = min(self.scores[proxy] + 10, 100)

    def report_failure(self, proxy):
        """Lower a proxy's score on failure (floored at 0)."""
        self.scores[proxy] = max(self.scores[proxy] - 30, 0)

    def fetch(self, url, max_retries=5):
        """Fetch with score-weighted selection and exponential backoff."""
        for attempt in range(max_retries):
            proxy = self.select_proxy()
            try:
                resp = requests.get(
                    url,
                    proxies={"http": proxy, "https": proxy},
                    timeout=15,
                )
                if resp.status_code == 200:
                    self.report_success(proxy)
                    return resp
                self.report_failure(proxy)
            except requests.exceptions.RequestException:
                self.report_failure(proxy)
            time.sleep(min(2 ** attempt, 10))  # Exponential backoff between retries
        return None
```
Using ThorData Residential Proxies
Instead of managing proxy lists yourself, ThorData handles rotation automatically. You connect to a single endpoint and each request gets a different residential IP:
```python
import time

import requests

THORDATA_PROXY = "http://username:password@proxy.thordata.com:9000"

def scrape_with_thordata(urls):
    """Scrape multiple URLs with automatic IP rotation."""
    session = requests.Session()
    session.proxies = {
        "http": THORDATA_PROXY,
        "https": THORDATA_PROXY,
    }
    session.headers.update({
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
    })
    results = []
    for url in urls:
        try:
            resp = session.get(url, timeout=20)
            results.append({
                "url": url,
                "status": resp.status_code,
                "content": resp.text[:500],
            })
        except requests.exceptions.RequestException as e:
            results.append({"url": url, "error": str(e)})
        time.sleep(1)  # Respectful pacing
    return results

# Scrape 100 product pages
urls = [f"https://example.com/product/{i}" for i in range(100)]
data = scrape_with_thordata(urls)
print(f"Success rate: {sum(1 for d in data if 'content' in d)}/{len(data)}")
```
Rate Limiting Best Practices
Even with rotating proxies, you should implement rate limiting to be respectful and avoid detection:
```python
import random
import time

class RateLimiter:
    def __init__(self, requests_per_minute=30, jitter=0.5):
        self.interval = 60.0 / requests_per_minute
        self.jitter = jitter
        self.last_request = 0.0

    def wait(self):
        """Sleep just long enough to stay under the request budget."""
        elapsed = time.time() - self.last_request
        delay = self.interval - elapsed
        if delay > 0:
            # Add random jitter so the traffic pattern looks less mechanical
            time.sleep(delay + random.uniform(0, self.jitter))
        self.last_request = time.time()

# Usage
limiter = RateLimiter(requests_per_minute=20)
for url in urls:
    limiter.wait()
    response = rotator.fetch(url)
```
Combining Proxies With ScraperAPI
For sites with heavy anti-bot protection (Cloudflare, DataDome), proxy rotation alone may not be enough. ScraperAPI combines proxy rotation with browser rendering and CAPTCHA solving:
```python
import requests

SCRAPERAPI_KEY = "your_key"

def scrape_protected_site(url):
    """Use ScraperAPI for heavily protected sites."""
    # Pass the target URL via params so it gets URL-encoded correctly
    resp = requests.get(
        "http://api.scraperapi.com",
        params={"api_key": SCRAPERAPI_KEY, "url": url, "render": "true"},
        timeout=60,
    )
    return resp.text if resp.status_code == 200 else None
```
Proxy Rotation Checklist
- Choose the right proxy type — residential for protected sites, datacenter for simple ones
- Implement smart rotation — score-based selection, not just round-robin
- Add rate limiting — 20-30 requests per minute is a safe starting point
- Use random delays — jitter makes your traffic pattern look more natural
- Monitor success rates — if yours drops below 90%, adjust your rotation strategy or proxy type
- Rotate user agents — combine IP rotation with header rotation
- Handle failures gracefully — retry with different proxies, not the same one
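The user-agent rotation item from the checklist can be sketched in a few lines. The User-Agent strings below are a small illustrative pool; in practice you would maintain a larger, regularly refreshed list:

```python
import random

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
]

def random_headers():
    """Build request headers with a randomly chosen User-Agent."""
    return {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept-Language": "en-US,en;q=0.9",
    }

# Pass fresh headers on each request, e.g.:
# requests.get(url, headers=random_headers(), proxies={...})
```

Pairing a new User-Agent with each rotated IP avoids the giveaway of hundreds of IPs all sending an identical browser fingerprint.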
Conclusion
Rotating proxies are essential for any serious web scraping project. Start with ThorData residential proxies for automatic rotation, add smart scoring and rate limiting in your Python code, and use ScraperAPI when you need CAPTCHA solving and JavaScript rendering on top.
The key is combining good proxy infrastructure with respectful scraping practices — rotate IPs, add delays, and handle errors gracefully.
Follow me for more web scraping tutorials and proxy management guides.