Web scraping at scale is an arms race. Websites deploy increasingly sophisticated anti-bot measures, and scrapers need equally sophisticated evasion techniques. One of the most effective tools in a scraper's arsenal is the residential proxy — and specifically, mobile residential proxies.
In this guide, I'll walk through the architecture of a production-grade scraping system using residential proxies, common pitfalls I've encountered, and battle-tested best practices.
## Why Residential Proxies for Scraping?
Not all proxies are created equal. Here's the hierarchy of trust from a website's perspective:
| Proxy Type | Detection Risk | Speed | Cost/GB | Best For |
|---|---|---|---|---|
| Datacenter | High | Very Fast | $0.50-1 | Low-value targets |
| Residential | Low | Medium | $2-5 | Most websites |
| Mobile | Very Low | Slower | $3-8 | High-security targets |
Datacenter IPs are easily identified because they belong to known hosting providers (AWS, DigitalOcean, etc.). Websites maintain databases of these IP ranges and block them aggressively.
Residential proxies use IPs assigned by ISPs to real homes and mobile devices. They're legitimate consumer IPs, making them nearly impossible to distinguish from real users.
For the toughest targets (social media platforms, sneaker sites, ticketing platforms), mobile proxies are the gold standard because mobile IPs are shared by thousands of users via carrier-grade NAT.
## Architecture: Building a Scraping Pipeline
Here's how I structure a production scraping system:
```
┌─────────────────┐
│    URL Queue    │
│ (Redis/RabbitMQ)│
└────────┬────────┘
         │
┌────────▼────────┐
│  Scraper Pool   │
│   (N workers)   │
└────────┬────────┘
         │
┌────────▼────────┐
│  Proxy Manager  │
│  (rotation +    │
│   health check) │
└────────┬────────┘
         │
  ┌──────────────┼──────────────┐
  │              │              │
┌────────▼───┐ ┌──────▼─────┐ ┌────▼────────┐
│ Residential│ │   Mobile   │ │ Datacenter  │
│ Proxy Pool │ │ Proxy Pool │ │ Proxy Pool  │
└────────────┘ └────────────┘ └─────────────┘
```
The key component is the Proxy Manager. It handles:
- Rotation strategy (per-request or per-session)
- Health checking (removing dead/blocked proxies)
- Pool selection (residential for tough sites, datacenter for easy ones)
- Rate limiting (respecting per-IP request limits)
### The Proxy Manager in Code

```python
import time
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class ProxyConfig:
    host: str
    port: int
    username: str
    password: str
    proxy_type: str  # 'residential', 'mobile', 'datacenter'
    last_used: float = 0.0
    failures: int = 0

    @property
    def url(self):
        return f"http://{self.username}:{self.password}@{self.host}:{self.port}"


class ProxyManager:
    def __init__(self, proxies: list[ProxyConfig], min_delay: float = 2.0):
        self.proxies = {p.proxy_type: [] for p in proxies}
        for p in proxies:
            self.proxies[p.proxy_type].append(p)
        self.min_delay = min_delay
        self.blocked_domains = defaultdict(set)  # domain -> set of blocked proxy URLs

    def get_proxy(self, domain: str, prefer_type: str = 'residential') -> ProxyConfig:
        """Get the best available proxy for a given domain."""
        pool = self.proxies.get(prefer_type, self.proxies.get('residential', []))
        if not pool:
            raise ValueError(f"No proxies of type {prefer_type!r} configured")
        # Filter out blocked, recently-used, and repeatedly-failing proxies
        available = [
            p for p in pool
            if p.url not in self.blocked_domains[domain]
            and (time.time() - p.last_used) > self.min_delay
            and p.failures < 3
        ]
        if not available:
            # Fallback: reset failure counts and reuse the whole pool
            for p in pool:
                p.failures = 0
            available = pool
        # Select the proxy that has been idle the longest
        proxy = min(available, key=lambda p: p.last_used)
        proxy.last_used = time.time()
        return proxy

    def report_failure(self, proxy: ProxyConfig, domain: str):
        """Record a failure; after three, block this proxy for the domain."""
        proxy.failures += 1
        if proxy.failures >= 3:
            self.blocked_domains[domain].add(proxy.url)

    def report_success(self, proxy: ProxyConfig):
        """Reset the failure count on success."""
        proxy.failures = 0
```
## Choosing the Right Rotation Strategy
This is where most people get it wrong. The rotation strategy should match your target:
### Per-Request Rotation
Every request gets a new IP. Good for:
- Scraping search results
- Collecting product listings
- Any stateless data collection
Use a rotating proxy with backconnect architecture for this — you connect to one endpoint and the gateway handles rotation automatically.
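With a backconnect setup, the client-side code is trivial: one gateway URL stands in for the whole pool. A minimal sketch — the gateway host, port, and credentials are hypothetical placeholders for your provider's values:

```python
# Hypothetical backconnect gateway -- substitute your provider's endpoint.
GATEWAY = "gate.example-proxy.com:7777"
USERNAME = "customer_user"
PASSWORD = "secret"

def backconnect_proxies() -> dict:
    """Build a requests-style proxies mapping pointing at the rotating
    gateway; the gateway assigns a fresh exit IP to every request, so no
    client-side rotation logic is needed."""
    url = f"http://{USERNAME}:{PASSWORD}@{GATEWAY}"
    return {"http": url, "https": url}

# Usage with requests (each call exits from a different IP):
# requests.get("https://httpbin.org/ip", proxies=backconnect_proxies(), timeout=10)
```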
### Session-Based Rotation
Same IP for a series of related requests, then rotate. Good for:
- Scraping paginated results (page 1, 2, 3... same session)
- Following links within a site
- Any multi-step extraction
This requires sticky sessions — most providers offer this with a session parameter in the proxy URL.
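The exact syntax varies by provider, but a common pattern is to embed a session ID in the proxy username: keep the ID and you keep the IP, change it and you rotate. A sketch using a made-up credential format:

```python
import random
import string

def new_session_id(length: int = 8) -> str:
    """Random session token; any stable string works."""
    return "".join(random.choices(string.ascii_lowercase + string.digits, k=length))

def sticky_proxy_url(session_id: str) -> str:
    """Sticky-session proxy URL. The `user-session-<id>` username format
    here is illustrative -- check your provider's docs for the real syntax."""
    return f"http://customer_user-session-{session_id}:secret@gate.example-proxy.com:7777"

# One session per pagination run: pages 1, 2, 3... share the same exit IP.
sid = new_session_id()
page_proxies = [sticky_proxy_url(sid) for _ in range(3)]
```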
### Time-Based Rotation
IP changes at fixed intervals. Good for:
- Long-running monitoring tasks
- Price tracking
- Availability checking
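Time-based rotation falls out of the sticky-session pattern almost for free: derive the session ID from the clock, and the gateway hands you a new IP each time the interval rolls over. A minimal sketch, assuming a provider whose sticky sessions are keyed by username:

```python
import time

class TimedRotator:
    """Produce a session ID that changes once per fixed interval."""

    def __init__(self, interval: float = 600.0):
        self.interval = interval            # seconds between rotations
        self._started = time.monotonic()

    def session_id(self) -> str:
        # Integer-divide elapsed time by the interval: the quotient only
        # changes when a full interval has passed, which rotates the session
        # (and thus the exit IP) on the provider side.
        elapsed = time.monotonic() - self._started
        return f"monitor-{int(elapsed // self.interval)}"

# rotator = TimedRotator(interval=600)  # new IP every 10 minutes;
# embed rotator.session_id() in the username per your provider's syntax.
```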
For more on rotation strategies, see this guide on IP rotation for web scraping.
## Common Pitfalls (and How to Avoid Them)
### 1. Ignoring Request Fingerprinting
Changing your IP but sending identical headers is like wearing a disguise but keeping your name tag on.
```python
# BAD: Same headers every request
headers = {"User-Agent": "Mozilla/5.0"}

# GOOD: Rotate realistic headers
import random

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/121.0.0.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 Chrome/120.0.0.0",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 Chrome/121.0.0.0",
]

def get_headers():
    return {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.5",
        "Accept-Encoding": "gzip, deflate, br",
        "Connection": "keep-alive",
    }
```
### 2. Scraping Too Fast
Even with proxies, hitting a site 100 times per second from different IPs will trigger rate limits. Sites analyze traffic patterns, not just individual IPs.
```python
import asyncio
import random

async def respectful_scrape(urls, proxy_manager):
    """Scrape with human-like delays.

    `fetch` and `extract_domain` are placeholders for your async HTTP
    client (e.g. aiohttp or httpx) and a URL-parsing helper.
    """
    for url in urls:
        proxy = proxy_manager.get_proxy(domain=extract_domain(url))
        # Add a random delay (1-5 seconds) between requests
        await asyncio.sleep(random.uniform(1.0, 5.0))
        response = await fetch(url, proxy=proxy.url)
        yield response
```
### 3. Not Handling Proxy Failures Gracefully
Proxies fail. Mobile connections drop. Residential IPs get rotated by the provider. Your scraper needs to handle this:
```python
async def fetch_with_fallback(url, proxy_manager, max_retries=3):
    """Fetch with proxy fallback and type escalation (cheapest pool first)."""
    proxy_types = ['datacenter', 'residential', 'mobile']
    for proxy_type in proxy_types:
        for attempt in range(max_retries):
            proxy = proxy_manager.get_proxy(
                domain=extract_domain(url),
                prefer_type=proxy_type,
            )
            try:
                response = await fetch(url, proxy=proxy.url, timeout=30)
                if response.status_code == 200:
                    proxy_manager.report_success(proxy)
                    return response
                elif response.status_code in (403, 429):
                    # Blocked or rate-limited: penalize this proxy
                    proxy_manager.report_failure(proxy, extract_domain(url))
            except Exception:
                proxy_manager.report_failure(proxy, extract_domain(url))
            await asyncio.sleep(2 ** attempt)  # exponential backoff between attempts
    raise Exception(f"All proxy types exhausted for {url}")
```
### 4. Using the Wrong Proxy Type
Don't use expensive mobile proxies for scraping sites that don't even block datacenter IPs. Start cheap and escalate:
- Try datacenter proxies first
- If blocked, switch to residential
- For the toughest targets, use mobile proxies for web scraping
### 5. Neglecting Proxy Health Monitoring
Always verify your proxies are working before and during scraping sessions. Implement health checks:
```python
async def check_proxy_health(proxy: ProxyConfig) -> bool:
    """Check that a proxy is reachable and actually masking our IP."""
    try:
        response = await fetch(
            "https://httpbin.org/ip",
            proxy=proxy.url,
            timeout=10,
        )
        data = response.json()
        # OUR_REAL_IP is your machine's public IP, looked up once at startup.
        # If it appears in the response, the proxy is leaking.
        return data.get('origin') != OUR_REAL_IP
    except Exception:
        return False
```
For a comprehensive proxy testing checklist, see how to check if your proxy is working.
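The single-proxy check above scales to the whole pool with `asyncio.gather`. A sketch of a pool-wide sweep, where `check` is any async predicate such as `check_proxy_health`:

```python
import asyncio

async def filter_healthy(proxies: list, check) -> list:
    """Run health checks concurrently and keep only the working proxies."""
    results = await asyncio.gather(*(check(p) for p in proxies))
    return [p for p, ok in zip(proxies, results) if ok]

# Run a sweep before each scraping session:
# healthy = asyncio.run(filter_healthy(pool, check_proxy_health))
```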
## Bandwidth Optimization
Residential and mobile proxies charge by bandwidth. Here's how to minimize costs:
- Disable images and CSS: If you only need text data, skip resources
- Use HEAD requests first: Check if content has changed before fetching
- Compress responses: Always send Accept-Encoding: gzip
- Cache aggressively: Don't re-scrape unchanged pages
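The last three points combine nicely: store each page's `ETag` and `Last-Modified` from the previous fetch, then send them back as revalidation headers — a `304 Not Modified` costs a few hundred bytes instead of a full page. A minimal sketch, with the cache as a plain dict keyed by URL:

```python
def conditional_headers(cache: dict, url: str) -> dict:
    """Build revalidation headers from a previously cached response."""
    headers = {"Accept-Encoding": "gzip, deflate, br"}  # always request compression
    entry = cache.get(url, {})
    if entry.get("etag"):
        headers["If-None-Match"] = entry["etag"]
    if entry.get("last_modified"):
        headers["If-Modified-Since"] = entry["last_modified"]
    return headers

# After each 200 response, record the validators for next time:
# cache[url] = {"etag": resp.headers.get("ETag"),
#               "last_modified": resp.headers.get("Last-Modified")}
```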
If bandwidth costs are a concern, look into rotating proxies with unlimited bandwidth plans — they exist but usually trade speed for cost savings.
For a full pricing breakdown of different proxy types, check this proxy pricing guide.
## When to Upgrade to Mobile Proxies
You should consider mobile proxies when:
- Residential IPs are getting flagged
- You're targeting social media platforms (Instagram, TikTok, etc.)
- You need the highest possible trust score
- You're dealing with sneaker sites or high-security targets
Mobile proxies work because IP reputation and trust scores are highest for mobile carrier IPs — platforms know that blocking a mobile IP means potentially blocking thousands of legitimate users.
## Production Checklist
Before deploying your scraper:
- [ ] Proxy rotation configured (per-request or session-based)
- [ ] User-agent rotation enabled
- [ ] Request delays implemented (1-5 second random delays)
- [ ] Retry logic with proxy escalation (datacenter → residential → mobile)
- [ ] Proxy health monitoring active
- [ ] Error handling for 403, 429, and timeout responses
- [ ] Bandwidth optimization (gzip, no images, caching)
- [ ] Logging for debugging and monitoring
- [ ] Respectful crawl rate (check robots.txt)
## Resources
For deeper dives into specific topics:
- Residential vs. Datacenter vs. Mobile Proxies — full comparison
- Residential Backconnect Proxies — how they work
- Best Mobile Proxies for 2026 — provider comparison
- Data Research Tools — comprehensive proxy guides library
What does your scraping stack look like? Any proxy tricks I missed? Let me know in the comments.