Every serious web scraper eventually hits the same problem: your IP gets blocked after 100 requests. Here's how to build proxy rotation that actually works in 2026.
Why IPs get blocked
Websites ban IPs when they detect:
- Too many requests from one IP in a short window (rate limiting)
- Requests with no browser fingerprint (pure HTTP clients)
- Requests from known datacenter IP ranges
- Missing cookies or session context
- Behavioral anomalies (too fast, too regular)
Proxy rotation solves the first problem. It doesn't fully solve the others — but it's the foundation.
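The behavioral checks are worth addressing even before proxies: randomized pacing between requests breaks the "too fast, too regular" pattern. A minimal sketch (the delay range is illustrative, and `fetch` is any request function you supply):

```python
import random
import time

def paced_requests(urls, fetch, min_delay=2.0, max_delay=6.0):
    """Fetch URLs with randomized delays so timing looks less bot-like.

    `fetch` is any callable that takes a URL. The delay range is a
    starting point, not tuned for any specific site.
    """
    results = []
    for i, url in enumerate(urls):
        results.append(fetch(url))
        if i < len(urls) - 1:
            # Uniform jitter: avoids the perfectly regular intervals
            # that behavioral detection flags as automated traffic.
            time.sleep(random.uniform(min_delay, max_delay))
    return results
```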
Types of proxies (and which to use)
Datacenter proxies: Fast, cheap (~$1-5/GB), but blocked by most major sites. LinkedIn, Amazon, and Cloudflare-protected sites detect these instantly. Use them for sites without serious anti-bot protection.
Residential proxies: IPs from real home internet connections. Much harder to detect. Expensive (~$5-15/GB). Required for major platforms.
Mobile proxies: 4G/5G IPs. Highest trust, most expensive (~$15-30/GB). Use only when residential isn't working.
ISP proxies: Residential IPs that behave like datacenter (more stable). Good middle ground for sites that check IP reputation but not behavior.
For 2026 scraping, you need residential for anything serious.
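Since residential and mobile pricing is per-GB, a quick cost estimate helps pick a tier before committing. A back-of-envelope sketch (page size and price are example inputs, not provider quotes):

```python
def scraping_cost_usd(pages: int, avg_page_kb: float, price_per_gb: float) -> float:
    """Estimate bandwidth cost for a scraping job.

    Assumes pricing is purely bandwidth-based, which is typical for
    residential proxies; inputs are estimates you supply.
    """
    gb = pages * avg_page_kb / (1024 * 1024)  # KB -> GB
    return gb * price_per_gb

# 10,000 pages at ~500 KB each, residential at $10/GB:
# ~4.77 GB of transfer, so roughly $48.
```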
Basic rotation in Python
```python
import requests
from itertools import cycle

# Your proxy list
proxies = [
    "http://user:pass@proxy1.provider.com:8080",
    "http://user:pass@proxy2.provider.com:8080",
    "http://user:pass@proxy3.provider.com:8080",
]

proxy_pool = cycle(proxies)

def make_request(url: str, retries: int = 3) -> requests.Response:
    for attempt in range(retries):
        proxy = next(proxy_pool)
        try:
            response = requests.get(
                url,
                proxies={"http": proxy, "https": proxy},
                timeout=10,
                headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/122.0.0.0"},
            )
            if response.status_code == 200:
                return response
            elif response.status_code == 403:
                print(f"Blocked on proxy {proxy[:30]}... trying next")
                continue
        except requests.exceptions.ProxyError:
            print(f"Proxy failed: {proxy[:30]}...")
            continue
    raise Exception(f"All retries failed for {url}")
```
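One refinement worth adding to a retry loop like this: a growing, jittered delay between attempts so a temporarily rate-limited target has time to recover. A sketch (the base and cap values are arbitrary starting points):

```python
import random

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Exponential backoff with full jitter.

    The ceiling doubles each attempt (1s, 2s, 4s, ...) up to `cap`,
    and the actual delay is random so parallel retries don't
    synchronize into bursts.
    """
    return random.uniform(0, min(cap, base * (2 ** attempt)))

# Inside the retry loop, before `continue`:
#     time.sleep(backoff_delay(attempt))
```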
Smarter rotation with health tracking
Track which proxies work and skip dead ones:
```python
import requests
import random
from collections import defaultdict
from typing import Optional

class ProxyPool:
    def __init__(self, proxy_list: list):
        self.proxies = proxy_list
        self.failures = defaultdict(int)
        self.successes = defaultdict(int)
        self.MAX_FAILURES = 3

    def get_proxy(self) -> Optional[str]:
        # Filter out proxies with too many failures
        available = [
            p for p in self.proxies
            if self.failures[p] < self.MAX_FAILURES
        ]
        if not available:
            # Reset all proxies if every one has failed
            self.failures.clear()
            available = self.proxies
        # Weight by success rate
        weights = []
        for p in available:
            success = self.successes[p] + 1  # +1 to avoid division by zero
            failure = self.failures[p] + 1
            weights.append(success / (success + failure))
        return random.choices(available, weights=weights)[0]

    def report_success(self, proxy: str):
        self.successes[proxy] += 1

    def report_failure(self, proxy: str):
        self.failures[proxy] += 1

    def request(self, url: str, **kwargs) -> requests.Response:
        proxy = self.get_proxy()
        proxy_dict = {"http": proxy, "https": proxy}
        try:
            response = requests.get(url, proxies=proxy_dict, **kwargs)
            if response.status_code in (403, 429):
                self.report_failure(proxy)
            else:
                self.report_success(proxy)
            return response
        except Exception:
            self.report_failure(proxy)
            raise

# Usage
pool = ProxyPool([
    "http://user:pass@residential1.example.com:8080",
    "http://user:pass@residential2.example.com:8080",
])

for url in target_urls:  # target_urls: your list of URLs to scrape
    response = pool.request(url, timeout=15)
    # Process response
```
Provider rotation (rotating gateway)
Most residential proxy providers offer a "rotating gateway" — one endpoint that automatically cycles IPs:
```python
import requests

# Dataimpulse rotating gateway example
PROXY = "http://username:password@gw.dataimpulse.com:823"

def scrape_with_rotating_proxy(url: str) -> str:
    response = requests.get(
        url,
        proxies={"http": PROXY, "https": PROXY},
        headers={"User-Agent": "Mozilla/5.0 Chrome/122.0.0.0"},
        timeout=30,
    )
    return response.text

# Every request gets a different IP automatically
for url in urls_to_scrape:
    html = scrape_with_rotating_proxy(url)
```
This is simpler than managing a proxy list — the provider handles rotation.
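Because the gateway assigns a fresh IP per connection, you can also fan requests out in parallel without worrying about per-proxy rate limits. A sketch using a thread pool (the worker count is arbitrary, and the gateway URL is a placeholder):

```python
import concurrent.futures
import requests

# Placeholder gateway endpoint, same shape as the example above.
PROXY = "http://username:password@gw.example.com:823"

def fetch(url: str):
    """Fetch one URL through the rotating gateway."""
    resp = requests.get(
        url,
        proxies={"http": PROXY, "https": PROXY},
        timeout=30,
    )
    return url, resp.status_code

def scrape_parallel(urls, fetch_fn, workers: int = 8):
    # Each worker opens its own connection, so the gateway hands
    # each request a different exit IP.
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch_fn, urls))

# Usage: scrape_parallel(urls_to_scrape, fetch)
```

Keep the worker count modest at first; even with rotating IPs, a burst of perfectly simultaneous requests is itself a behavioral signal.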
Session-based rotation (sticky proxies)
Some scraping requires the same IP across multiple requests (login flows, multi-page sessions):
```python
import requests

# Sticky session — same IP for 10 minutes
STICKY_PROXY = "http://username-country-US-session-abc123:password@gw.provider.com:823"

session = requests.Session()
session.proxies = {"http": STICKY_PROXY, "https": STICKY_PROXY}

# Step 1: Get homepage (establishes cookies)
session.get("https://target-site.com")

# Step 2: Log in (same IP as step 1)
session.post("https://target-site.com/login", data={"email": "...", "password": "..."})

# Step 3: Scrape protected pages (same IP, authenticated session)
data = session.get("https://target-site.com/data")
```
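When each job needs its own sticky IP, generate a fresh session ID per job instead of hardcoding one. The `-session-<id>` username convention varies by provider; this sketch assumes the format shown above:

```python
import secrets

def sticky_proxy_url(user: str, password: str, host: str, port: int,
                     country: str = "US") -> str:
    """Build a sticky-session proxy URL with a fresh random session ID.

    Reusing the same ID keeps the same exit IP; a new ID gets a new
    one. The username format here is one common provider convention,
    not a standard -- check your provider's docs.
    """
    session_id = secrets.token_hex(4)  # e.g. "a1b2c3d4"
    return f"http://{user}-country-{country}-session-{session_id}:{password}@{host}:{port}"
```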
Combined with TLS fingerprinting
Proxies fix the IP problem. But many sites also check TLS fingerprint (which identifies Python's requests library):
```python
from curl_cffi import requests as cf_requests

# curl_cffi: residential proxy + Chrome TLS fingerprint
response = cf_requests.get(
    "https://protected-site.com",
    proxies={"https": "http://user:pass@residential.provider.com:8080"},
    impersonate="chrome124",  # Chrome TLS fingerprint
    timeout=30,
)
```
This combination (residential proxy + Chrome TLS fingerprint) handles ~80% of anti-bot systems.
What to expect at scale
Rough success rates by proxy type + target:
| Target | Datacenter | Residential | Mobile |
|---|---|---|---|
| Simple sites | 90%+ | 99%+ | 99%+ |
| Amazon | 5-20% | 85-95% | 95%+ |
| LinkedIn | <5% | 60-75% | 80-90% |
| Cloudflare sites | 10-30% | 70-85% | 85-95% |
Success rates drop significantly without session warm-up and realistic browsing patterns.
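A minimal warm-up helper along those lines, assuming a cookie-based target (the URLs and User-Agent are placeholders):

```python
import requests

def warmed_session(proxy: str, warmup_urls: list) -> requests.Session:
    """Create a session and visit low-value pages before the real request.

    The target then sees cookies and a plausible browsing history.
    `warmup_urls` might be the homepage and a category page -- pages a
    human would naturally hit before the one you actually want.
    """
    session = requests.Session()
    session.proxies = {"http": proxy, "https": proxy}
    session.headers["User-Agent"] = (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/122.0.0.0"
    )
    for url in warmup_urls:
        session.get(url, timeout=15)
    return session
```

Pair this with a sticky proxy so the warm-up and the real request share one IP; warming up on one IP and scraping from another defeats the purpose.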
The managed alternative
If managing proxy pools feels like a full-time job, managed actors handle it:
The Apify Scrapers Bundle ($29) includes pre-built actors for the major platforms that handle proxy rotation, TLS fingerprinting, and session management internally. Pay-per-result means you don't pay for failed requests.
Key takeaways
- Datacenter proxies: fine for most sites, blocked by major platforms
- Residential proxies: required for Amazon, LinkedIn, Cloudflare
- Rotating gateways > managing proxy lists (simpler, more reliable)
- Sticky sessions: use when scraping multi-page flows
- Pair with `curl_cffi` for TLS fingerprinting
- Track proxy health to skip dead endpoints