agenthustler

Web Scraping With Rotating Proxies: Complete Setup Guide

If you have ever had your scraper blocked after a few hundred requests, you know the pain. Rotating proxies are the solution — they automatically cycle through different IP addresses so your requests appear to come from different users.

This guide covers proxy types, rotation strategies, and working Python code to build a robust scraping setup.

Why You Need Rotating Proxies

Websites detect and block scrapers using several signals:

  • IP frequency: Too many requests from one IP
  • Geographic patterns: Requests from data center IPs
  • Behavioral analysis: Non-human request patterns
  • Rate limiting: Hard caps on requests per IP

Rotating proxies address the IP-based signals by distributing your requests across hundreds or thousands of IPs. Behavioral signals still require realistic pacing, which is covered in the rate limiting section below.
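To see why distribution works, consider the first signal from the server's side. All a site needs is a per-IP request counter — a minimal sketch below, with a hypothetical cap and no time window (real anti-bot systems are far more elaborate):

```python
from collections import Counter

REQUESTS_PER_WINDOW = 100  # hypothetical per-IP cap

hits = Counter()

def is_blocked(ip):
    """Count requests per source IP and block once the cap is exceeded."""
    hits[ip] += 1
    return hits[ip] > REQUESTS_PER_WINDOW
```

One IP making 101 requests gets blocked; the same 101 requests spread across even two IPs sail through. That asymmetry is the entire premise of rotation.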

Residential vs Datacenter Proxies

Feature         Residential           Datacenter
IP source       Real ISP connections  Cloud servers
Detection rate  Very low              Higher
Speed           Moderate              Fast
Cost            $5-15/GB              $1-3/GB
Best for        Protected sites       Simple targets
Reliability     High                  Moderate

Bottom line: Use residential proxies for sites with anti-bot protection. Use datacenter proxies for simple targets where speed matters more than stealth.

ThorData offers both residential and datacenter proxies with automatic rotation, making it easy to switch between them based on your target.

Basic Proxy Rotation in Python

Here is a simple proxy rotator using a list of proxies:

import requests
import random
import time
from itertools import cycle

class ProxyRotator:
    def __init__(self, proxies):
        self.proxies = proxies
        self.proxy_pool = cycle(proxies)
        self.failed = set()

    def get_next_proxy(self):
        """Get next working proxy from the pool."""
        for _ in range(len(self.proxies)):
            proxy = next(self.proxy_pool)
            if proxy not in self.failed:
                return proxy
        # All proxies failed, reset and try again
        self.failed.clear()
        return next(self.proxy_pool)

    def fetch(self, url, max_retries=3):
        """Fetch URL with automatic proxy rotation on failure."""
        for attempt in range(max_retries):
            proxy = self.get_next_proxy()
            try:
                resp = requests.get(
                    url,
                    proxies={"http": proxy, "https": proxy},
                    timeout=15,
                    headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"}
                )
                if resp.status_code == 200:
                    return resp
                elif resp.status_code in (403, 429):
                    # Blocked or rate limited: mark this proxy as failed
                    self.failed.add(proxy)
                    time.sleep(2)
            except requests.exceptions.RequestException:
                self.failed.add(proxy)
        return None

# Usage
proxies = [
    "http://user:pass@proxy1.example.com:8080",
    "http://user:pass@proxy2.example.com:8080",
    "http://user:pass@proxy3.example.com:8080",
]
rotator = ProxyRotator(proxies)
response = rotator.fetch("https://example.com")

Smart Rotation With Backoff

Naive round-robin rotation is not enough for serious scraping. Here is a smarter approach with exponential backoff and proxy scoring:

import random
import time
from collections import defaultdict

import requests

class SmartProxyRotator:
    def __init__(self, proxies):
        self.proxies = proxies
        self.scores = defaultdict(lambda: 100)  # Start at 100
        self.last_used = defaultdict(float)
        self.cooldown = 2.0  # Minimum seconds between uses

    def select_proxy(self):
        """Select best available proxy based on score and cooldown."""
        now = time.time()
        available = [
            p for p in self.proxies 
            if now - self.last_used[p] >= self.cooldown
        ]
        if not available:
            time.sleep(self.cooldown)
            available = self.proxies

        # Weighted random selection based on scores
        weights = [max(self.scores[p], 1) for p in available]
        proxy = random.choices(available, weights=weights, k=1)[0]
        self.last_used[proxy] = time.time()  # Record actual use time (may differ from `now` after a cooldown sleep)
        return proxy

    def report_success(self, proxy):
        """Increase proxy score on success."""
        self.scores[proxy] = min(self.scores[proxy] + 10, 100)

    def report_failure(self, proxy):
        """Decrease proxy score on failure."""
        self.scores[proxy] = max(self.scores[proxy] - 30, 0)

    def fetch(self, url):
        """Fetch with smart proxy selection."""
        for _ in range(5):
            proxy = self.select_proxy()
            try:
                resp = requests.get(
                    url,
                    proxies={"http": proxy, "https": proxy},
                    timeout=15
                )
                if resp.status_code == 200:
                    self.report_success(proxy)
                    return resp
                else:
                    self.report_failure(proxy)
            except requests.exceptions.RequestException:
                self.report_failure(proxy)
        return None
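The scoring class above handles proxy selection; the backoff half of the strategy can be sketched separately. On repeated failures against the same target, double the wait each attempt (capped), and add jitter so retries from different workers don't synchronize. The base, cap, and jitter values below are illustrative defaults, not recommendations from any particular provider:

```python
import random

def backoff_delay(attempt, base=1.0, cap=60.0, jitter=0.5):
    """Exponential backoff: base * 2^attempt seconds, capped, plus random jitter."""
    delay = min(base * (2 ** attempt), cap)
    return delay + random.uniform(0, jitter)

# attempt 0 -> ~1s, attempt 3 -> ~8s, attempt 10 -> capped at ~60s
```

Call `time.sleep(backoff_delay(attempt))` between retries in `fetch` instead of a fixed `time.sleep(2)` to back off progressively from a struggling target.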

Using ThorData Residential Proxies

Instead of managing proxy lists yourself, ThorData handles rotation automatically. You connect to a single endpoint and each request gets a different residential IP:

import time

import requests

THORDATA_PROXY = "http://username:password@proxy.thordata.com:9000"

def scrape_with_thordata(urls):
    """Scrape multiple URLs with automatic IP rotation."""
    session = requests.Session()
    session.proxies = {
        "http": THORDATA_PROXY,
        "https": THORDATA_PROXY
    }
    session.headers.update({
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
    })

    results = []
    for url in urls:
        try:
            resp = session.get(url, timeout=20)
            results.append({
                "url": url,
                "status": resp.status_code,
                "content": resp.text[:500]
            })
        except requests.exceptions.RequestException as e:
            results.append({"url": url, "error": str(e)})
        time.sleep(1)  # Respectful pacing

    return results

# Scrape 100 product pages
urls = [f"https://example.com/product/{i}" for i in range(100)]
data = scrape_with_thordata(urls)
print(f"Success rate: {sum(1 for d in data if 'content' in d)}/{len(data)}")

Rate Limiting Best Practices

Even with rotating proxies, you should implement rate limiting to be respectful and avoid detection:

import time
import random

class RateLimiter:
    def __init__(self, requests_per_minute=30, jitter=0.5):
        self.interval = 60.0 / requests_per_minute
        self.jitter = jitter
        self.last_request = 0

    def wait(self):
        """Wait appropriate time before next request."""
        elapsed = time.time() - self.last_request
        delay = self.interval - elapsed
        if delay > 0:
            # Add random jitter to look more human
            actual_delay = delay + random.uniform(0, self.jitter)
            time.sleep(actual_delay)
        self.last_request = time.time()

# Usage
limiter = RateLimiter(requests_per_minute=20)
for url in urls:
    limiter.wait()
    response = rotator.fetch(url)

Combining Proxies With ScraperAPI

For sites with heavy anti-bot protection (Cloudflare, DataDome), proxy rotation alone may not be enough. ScraperAPI combines proxy rotation with browser rendering and CAPTCHA solving:

import requests
from urllib.parse import quote

SCRAPERAPI_KEY = "your_key"

def scrape_protected_site(url):
    """Use ScraperAPI for heavily protected sites."""
    # URL-encode the target so its own query string doesn't break the API call
    api_url = (
        "http://api.scraperapi.com"
        f"?api_key={SCRAPERAPI_KEY}&url={quote(url, safe='')}&render=true"
    )
    resp = requests.get(api_url, timeout=60)
    return resp.text if resp.status_code == 200 else None

Proxy Rotation Checklist

  1. Choose the right proxy type — residential for protected sites, datacenter for simple ones
  2. Implement smart rotation — score-based selection, not just round-robin
  3. Add rate limiting — 20-30 requests per minute is a safe starting point
  4. Use random delays — jitter makes your traffic pattern look more natural
  5. Monitor success rates — a drop below 90% means something needs adjusting
  6. Rotate user agents — combine IP rotation with header rotation
  7. Handle failures gracefully — retry with different proxies, not the same one
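Item 6 deserves a concrete sketch: rotating IPs while sending an identical User-Agent on every request is itself a fingerprint. A minimal header rotator (the UA strings below are truncated examples, not a curated list):

```python
import random

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
]

def build_headers():
    """Pick a random User-Agent and pair it with common browser headers."""
    return {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept-Language": "en-US,en;q=0.9",
        "Accept": "text/html,application/xhtml+xml",
    }
```

Pass `headers=build_headers()` alongside the rotated proxy on each request so the IP and the browser fingerprint change together.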

Conclusion

Rotating proxies are essential for any serious web scraping project. Start with ThorData residential proxies for automatic rotation, add smart scoring and rate limiting in your Python code, and use ScraperAPI when you need CAPTCHA solving and JavaScript rendering on top.

The key is combining good proxy infrastructure with respectful scraping practices — rotate IPs, add delays, and handle errors gracefully.


Follow me for more web scraping tutorials and proxy management guides.
