DEV Community

Vhub Systems

Proxy Rotation for Web Scraping: Residential, Datacenter, Sticky Sessions Explained

Every serious web scraper eventually hits the same problem: your IP gets blocked after 100 requests. Here's how to build proxy rotation that actually works in 2026.

Why IPs get blocked

Websites ban IPs when they detect:

  • Too many requests from one IP in a short window (rate limiting)
  • Requests with no browser fingerprint (pure HTTP clients)
  • Requests from known datacenter IP ranges
  • Missing cookies or session context
  • Behavioral anomalies (too fast, too regular)

Proxy rotation solves the first problem. It doesn't fully solve the others — but it's the foundation.
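Rotation spreads load across IPs, but pacing still matters: hammering the same host through fresh IPs at machine speed trips the behavioral checks above. A common companion technique is exponential backoff with full jitter between retries — a minimal sketch (the function name is my own, not from any library):

```python
import random

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Exponential backoff with full jitter: a random delay between
    0 and min(cap, base * 2**attempt) seconds."""
    return random.uniform(0, min(cap, base * 2 ** attempt))

# The ceiling doubles on each consecutive 429/403 but never exceeds the cap
for attempt in range(5):
    print(f"attempt {attempt}: wait up to {min(60.0, 2 ** attempt):.0f}s")
```

Jitter matters because many scrapers retrying on a fixed schedule re-synchronize into detectable bursts; randomizing the delay breaks that pattern.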

Types of proxies (and which to use)

Datacenter proxies: Fast, cheap (~$1-5/GB), but blocked by most major sites. LinkedIn, Amazon, and Cloudflare-protected sites detect these instantly. Use them only for sites without serious anti-bot protection.

Residential proxies: IPs from real home internet connections. Much harder to detect. Expensive (~$5-15/GB). Required for major platforms.

Mobile proxies: 4G/5G IPs. Highest trust, most expensive (~$15-30/GB). Use only when residential isn't working.

ISP proxies: Residential IPs that behave like datacenter (more stable). Good middle ground for sites that check IP reputation but not behavior.

For 2026 scraping, you need residential for anything serious.

Basic rotation in Python

```python
import time
from itertools import cycle

import requests

# Your proxy list
proxies = [
    "http://user:pass@proxy1.provider.com:8080",
    "http://user:pass@proxy2.provider.com:8080",
    "http://user:pass@proxy3.provider.com:8080",
]

proxy_pool = cycle(proxies)

def make_request(url: str, retries: int = 3) -> requests.Response:
    for attempt in range(retries):
        proxy = next(proxy_pool)
        try:
            response = requests.get(
                url,
                proxies={"http": proxy, "https": proxy},
                timeout=10,
                headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/122.0.0.0"},
            )
            if response.status_code == 200:
                return response
            if response.status_code in (403, 429):
                print(f"Blocked on proxy {proxy[:30]}... trying next")
                time.sleep(2 ** attempt)  # brief backoff before the next proxy
                continue
            return response  # other statuses: let the caller decide
        except (requests.exceptions.ProxyError, requests.exceptions.Timeout):
            print(f"Proxy failed: {proxy[:30]}...")
            continue

    raise RuntimeError(f"All {retries} retries failed for {url}")
```

Smarter rotation with health tracking

Track which proxies work and skip dead ones:

```python
import random
from collections import defaultdict

import requests

class ProxyPool:
    def __init__(self, proxy_list: list[str]):
        self.proxies = proxy_list
        self.failures: defaultdict[str, int] = defaultdict(int)
        self.successes: defaultdict[str, int] = defaultdict(int)
        self.MAX_FAILURES = 3

    def get_proxy(self) -> str:
        # Skip proxies that have failed too many times
        available = [
            p for p in self.proxies
            if self.failures[p] < self.MAX_FAILURES
        ]

        if not available:
            # Every proxy is marked dead — reset and start over
            self.failures.clear()
            available = self.proxies

        # Weight selection by observed success rate
        weights = []
        for p in available:
            success = self.successes[p] + 1  # +1 avoids zero weights
            failure = self.failures[p] + 1
            weights.append(success / (success + failure))

        return random.choices(available, weights=weights)[0]

    def report_success(self, proxy: str):
        self.successes[proxy] += 1

    def report_failure(self, proxy: str):
        self.failures[proxy] += 1

    def request(self, url: str, **kwargs) -> requests.Response:
        proxy = self.get_proxy()
        proxy_dict = {"http": proxy, "https": proxy}

        try:
            response = requests.get(url, proxies=proxy_dict, **kwargs)
            if response.status_code in (403, 429):
                self.report_failure(proxy)
            else:
                self.report_success(proxy)
            return response
        except requests.RequestException:
            self.report_failure(proxy)
            raise

# Usage
pool = ProxyPool([
    "http://user:pass@residential1.example.com:8080",
    "http://user:pass@residential2.example.com:8080",
])

for url in target_urls:  # your list of URLs to scrape
    response = pool.request(url, timeout=15)
    # Process response
```

Provider rotation (rotating gateway)

Most residential proxy providers offer a "rotating gateway" — one endpoint that automatically cycles IPs:

```python
import requests

# Dataimpulse rotating gateway example
PROXY = "http://username:password@gw.dataimpulse.com:823"

def scrape_with_rotating_proxy(url: str) -> str:
    response = requests.get(
        url,
        proxies={"http": PROXY, "https": PROXY},
        headers={"User-Agent": "Mozilla/5.0 Chrome/122.0.0.0"},
        timeout=30,
    )
    response.raise_for_status()
    return response.text

# Every request exits through a different IP automatically
for url in urls_to_scrape:
    html = scrape_with_rotating_proxy(url)
```

This is simpler than managing a proxy list — the provider handles rotation.

Session-based rotation (sticky proxies)

Some scraping requires the same IP across multiple requests (login flows, multi-page sessions):

```python
import requests

# Sticky session — same IP for the provider's session window (e.g. 10 minutes)
STICKY_PROXY = "http://username-country-US-session-abc123:password@gw.provider.com:823"

session = requests.Session()
session.proxies = {"http": STICKY_PROXY, "https": STICKY_PROXY}

# Step 1: Get homepage (establishes cookies)
session.get("https://target-site.com")

# Step 2: Log in (same IP as step 1)
session.post("https://target-site.com/login", data={"email": "...", "password": "..."})

# Step 3: Scrape protected pages (same IP, authenticated session)
data = session.get("https://target-site.com/data")
```

Combined with TLS fingerprinting

Proxies fix the IP problem. But many sites also check the TLS fingerprint, which gives away Python's requests library regardless of which IP the request comes from:

```python
from curl_cffi import requests as cf_requests

# curl_cffi: residential proxy + Chrome TLS fingerprint
response = cf_requests.get(
    "https://protected-site.com",
    proxies={"https": "http://user:pass@residential.provider.com:8080"},
    impersonate="chrome124",  # mimic Chrome's TLS handshake
    timeout=30,
)
```

This combination (residential proxy + Chrome TLS fingerprint) handles ~80% of anti-bot systems.

What to expect at scale

Rough success rates by proxy type and target:

| Target | Datacenter | Residential | Mobile |
| --- | --- | --- | --- |
| Simple sites | 90%+ | 99%+ | 99%+ |
| Amazon | 5-20% | 85-95% | 95%+ |
| LinkedIn | <5% | 60-75% | 80-90% |
| Cloudflare sites | 10-30% | 70-85% | 85-95% |

Success rates drop significantly without session warm-up and realistic browsing patterns.
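Success rate and price interact: a cheap proxy tier that mostly fails can cost more per usable page than an expensive one that succeeds. A rough cost-per-successful-request calculation (prices and page sizes here are illustrative, not quotes):

```python
def cost_per_success(price_per_gb: float, mb_per_request: float,
                     success_rate: float) -> float:
    """Bandwidth cost of one *successful* request, assuming failed
    requests still consume the same bandwidth."""
    cost_per_request = price_per_gb * mb_per_request / 1024
    return cost_per_request / success_rate

# Illustrative: residential at $8/GB, 0.5 MB per page, 90% success
print(f"${cost_per_success(8.0, 0.5, 0.90):.4f} per successful page")
# Datacenter at $2/GB but only 15% success on a hard target
print(f"${cost_per_success(2.0, 0.5, 0.15):.4f} per successful page")
```

With these example numbers the datacenter tier ends up costing more per successful page despite the lower $/GB — which is why residential tends to win on hard targets.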

The managed alternative

If managing proxy pools feels like a full-time job, managed actors handle it:

The Apify Scrapers Bundle ($29) includes pre-built actors for the major platforms that handle proxy rotation, TLS fingerprinting, and session management internally. Pay-per-result means you don't pay for failed requests.

Key takeaways

  1. Datacenter proxies: fine for most sites, blocked by major platforms
  2. Residential proxies: required for Amazon, LinkedIn, Cloudflare
  3. Rotating gateways > managing proxy lists (simpler, more reliable)
  4. Sticky sessions: use when scraping multi-page flows
  5. Pair with curl_cffi for TLS fingerprinting
  6. Track proxy health to skip dead endpoints

