Xavier Fok
Web Scraping with Residential Proxies: Architecture, Pitfalls, and Best Practices

Web scraping at scale is an arms race. Websites deploy increasingly sophisticated anti-bot measures, and scrapers need equally sophisticated evasion techniques. One of the most effective tools in a scraper's arsenal is the residential proxy — and specifically, mobile residential proxies.

In this guide, I'll walk through the architecture of a production-grade scraping system using residential proxies, common pitfalls I've encountered, and battle-tested best practices.

Why Residential Proxies for Scraping?

Not all proxies are created equal. Here's the hierarchy of trust from a website's perspective:

| Proxy Type  | Detection Risk | Speed     | Cost/GB   | Best For              |
|-------------|----------------|-----------|-----------|-----------------------|
| Datacenter  | High           | Very Fast | $0.50-$1  | Low-value targets     |
| Residential | Low            | Medium    | $2-$5     | Most websites         |
| Mobile      | Very Low       | Slower    | $3-$8     | High-security targets |

Datacenter IPs are easily identified because they belong to known hosting providers (AWS, DigitalOcean, etc.). Websites maintain databases of these IP ranges and block them aggressively.

Residential proxies use IPs assigned by ISPs to real homes and mobile devices. They're legitimate consumer IPs, making them nearly impossible to distinguish from real users.

For the toughest targets (social media platforms, sneaker sites, ticketing platforms), mobile proxies are the gold standard because mobile IPs are shared by thousands of users via carrier-grade NAT.

Architecture: Building a Scraping Pipeline

Here's how I structure a production scraping system:

                    ┌─────────────────┐
                    │   URL Queue     │
                    │ (Redis/RabbitMQ)│
                    └────────┬────────┘
                             │
                    ┌────────▼────────┐
                    │  Scraper Pool   │
                    │  (N workers)    │
                    └────────┬────────┘
                             │
                    ┌────────▼────────┐
                    │  Proxy Manager  │
                    │  (rotation +    │
                    │   health check) │
                    └────────┬────────┘
                             │
              ┌──────────────┼──────────────┐
              │              │              │
     ┌────────▼───┐  ┌──────▼─────┐  ┌────▼────────┐
     │ Residential│  │   Mobile   │  │  Datacenter │
     │ Proxy Pool │  │ Proxy Pool │  │  Proxy Pool │
     └────────────┘  └────────────┘  └─────────────┘

The key component is the Proxy Manager. It handles:

  • Rotation strategy (per-request or per-session)
  • Health checking (removing dead/blocked proxies)
  • Pool selection (residential for tough sites, datacenter for easy ones)
  • Rate limiting (respecting per-IP request limits)

The Proxy Manager in Code

import time
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class ProxyConfig:
    host: str
    port: int
    username: str
    password: str
    proxy_type: str  # 'residential', 'mobile', 'datacenter'
    last_used: float = 0
    failures: int = 0

    @property
    def url(self):
        return f"http://{self.username}:{self.password}@{self.host}:{self.port}"

class ProxyManager:
    def __init__(self, proxies: list[ProxyConfig], min_delay: float = 2.0):
        self.proxies = {p.proxy_type: [] for p in proxies}
        for p in proxies:
            self.proxies[p.proxy_type].append(p)
        self.min_delay = min_delay
        self.blocked_domains = defaultdict(set)  # domain -> set of blocked proxy URLs

    def get_proxy(self, domain: str, prefer_type: str = 'residential') -> ProxyConfig:
        """Get the best available proxy for a given domain"""
        pool = self.proxies.get(prefer_type, self.proxies.get('residential', []))

        # Filter out blocked and recently-used proxies
        available = [
            p for p in pool 
            if p.url not in self.blocked_domains[domain]
            and (time.time() - p.last_used) > self.min_delay
            and p.failures < 3
        ]

        if not available:
            # Fallback: reset failures and try again
            for p in pool:
                p.failures = 0
            available = pool

        # Select proxy with longest idle time
        proxy = min(available, key=lambda p: p.last_used)
        proxy.last_used = time.time()
        return proxy

    def report_failure(self, proxy: ProxyConfig, domain: str):
        """Report a proxy failure for a specific domain"""
        proxy.failures += 1
        if proxy.failures >= 3:
            self.blocked_domains[domain].add(proxy.url)

    def report_success(self, proxy: ProxyConfig):
        """Reset failure count on success"""
        proxy.failures = 0

Choosing the Right Rotation Strategy

This is where most people get it wrong. The rotation strategy should match your target:

Per-Request Rotation

Every request gets a new IP. Good for:

  • Scraping search results
  • Collecting product listings
  • Any stateless data collection

Use a rotating proxy with backconnect architecture for this — you connect to one endpoint and the gateway handles rotation automatically.
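With a backconnect gateway, there is no rotation logic on your side at all: every request points at the same endpoint, and the provider swaps the exit IP behind it. A minimal sketch (the gateway host, port, and credentials below are placeholders, not a real endpoint):

```python
# Per-request rotation via a backconnect gateway: one fixed endpoint,
# the provider rotates the exit IP behind it on every request.
# Host, port, and credentials are placeholders.
GATEWAY = "http://user:pass@gateway.example.com:10000"

def proxies_for_request() -> dict[str, str]:
    """Return a requests-style proxy mapping pointing at the rotating gateway."""
    return {"http": GATEWAY, "https": GATEWAY}

# Each requests.get(url, proxies=proxies_for_request()) call can exit
# from a different IP -- no pool management needed on our side.
```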

Session-Based Rotation

Same IP for a series of related requests, then rotate. Good for:

  • Scraping paginated results (page 1, 2, 3... same session)
  • Following links within a site
  • Any multi-step extraction

This requires sticky sessions — most providers offer this with a session parameter in the proxy URL.
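The exact sticky-session syntax differs between providers; the `user-session-<id>` username suffix below is just one common convention, and the gateway host is a placeholder. The idea is the same everywhere: reuse one session ID for a run of related requests, then generate a new one to rotate.

```python
import random
import string

def new_session_id(length: int = 8) -> str:
    """Generate a random session ID to pin a sticky session to."""
    return "".join(random.choices(string.ascii_lowercase + string.digits, k=length))

def sticky_proxy_url(session_id: str) -> str:
    """Build a sticky-session proxy URL.

    The 'user-session-<id>' username convention varies by provider --
    check your provider's docs for the exact format.
    """
    return f"http://user-session-{session_id}:pass@gateway.example.com:10000"

# Same session ID -> same exit IP for a whole pagination run:
session = new_session_id()
urls = [f"https://example.com/products?page={n}" for n in range(1, 4)]
proxy_url = sticky_proxy_url(session)  # use for all three pages, then rotate
```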

Time-Based Rotation

IP changes at fixed intervals. Good for:

  • Long-running monitoring tasks
  • Price tracking
  • Availability checking

For more on rotation strategies, see this guide on IP rotation for web scraping.
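Time-based rotation can be implemented client-side with a simple timer: hold the current proxy until the interval elapses, then step to the next one. A sketch (proxy URLs are placeholders):

```python
import time

class TimedRotator:
    """Rotate through a list of proxy URLs at a fixed interval.

    Useful for long-running monitors: each IP is used for `interval`
    seconds, then the next one takes over.
    """

    def __init__(self, proxy_urls: list[str], interval: float = 600.0):
        self.proxy_urls = proxy_urls
        self.interval = interval
        self.index = 0
        self.rotated_at = time.monotonic()  # monotonic clock: immune to wall-clock jumps

    def current(self) -> str:
        """Return the active proxy URL, rotating if the interval has elapsed."""
        if time.monotonic() - self.rotated_at >= self.interval:
            self.index = (self.index + 1) % len(self.proxy_urls)
            self.rotated_at = time.monotonic()
        return self.proxy_urls[self.index]
```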

Common Pitfalls (and How to Avoid Them)

1. Ignoring Request Fingerprinting

Changing your IP but sending identical headers is like wearing a disguise but keeping your name tag on.

# BAD: Same headers every request
headers = {"User-Agent": "Mozilla/5.0"}

# GOOD: Rotate realistic headers
import random

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/121.0.0.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 Chrome/120.0.0.0",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 Chrome/121.0.0.0",
]

def get_headers():
    return {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.5",
        "Accept-Encoding": "gzip, deflate, br",
        "Connection": "keep-alive",
    }

2. Scraping Too Fast

Even with proxies, hitting a site 100 times per second from different IPs will trigger rate limits. Sites analyze traffic patterns, not just individual IPs.

import asyncio
import random

async def respectful_scrape(urls, proxy_manager):
    """Scrape with human-like delays.

    fetch() and extract_domain() are placeholder helpers: an async HTTP
    wrapper (e.g. around aiohttp) and a urllib.parse-based domain parser.
    """
    for url in urls:
        proxy = proxy_manager.get_proxy(domain=extract_domain(url))

        # Add random delay (1-5 seconds)
        await asyncio.sleep(random.uniform(1.0, 5.0))

        response = await fetch(url, proxy=proxy.url)
        yield response

3. Not Handling Proxy Failures Gracefully

Proxies fail. Mobile connections drop. Residential IPs get rotated by the provider. Your scraper needs to handle this:

async def fetch_with_fallback(url, proxy_manager, max_retries=3):
    """Fetch with proxy fallback and type escalation"""
    proxy_types = ['datacenter', 'residential', 'mobile']

    for proxy_type in proxy_types:
        for attempt in range(max_retries):
            proxy = proxy_manager.get_proxy(
                domain=extract_domain(url), 
                prefer_type=proxy_type
            )
            try:
                response = await fetch(url, proxy=proxy.url, timeout=30)
                if response.status_code == 200:
                    proxy_manager.report_success(proxy)
                    return response
                elif response.status_code in (403, 429):
                    proxy_manager.report_failure(proxy, extract_domain(url))
            except Exception:
                proxy_manager.report_failure(proxy, extract_domain(url))
                await asyncio.sleep(2 ** attempt)

    raise Exception(f"All proxy types exhausted for {url}")

4. Using the Wrong Proxy Type

Don't use expensive mobile proxies for scraping sites that don't even block datacenter IPs. Start cheap and escalate:

  1. Try datacenter proxies first
  2. If blocked, switch to residential
  3. For the toughest targets, use mobile proxies for web scraping

5. Neglecting Proxy Health Monitoring

Always verify your proxies are working before and during scraping sessions. Implement health checks:

async def check_proxy_health(proxy: ProxyConfig) -> bool:
    """Check if a proxy is working and not blocked"""
    try:
        response = await fetch(
            "https://httpbin.org/ip", 
            proxy=proxy.url, 
            timeout=10
        )
        data = response.json()
        # Verify we're getting the proxy IP, not our real IP.
        # OUR_REAL_IP should be detected once at startup, e.g. by
        # calling the same endpoint without a proxy.
        return data.get('origin') != OUR_REAL_IP
    except Exception:
        return False

For a comprehensive proxy testing checklist, see how to check if your proxy is working.

Bandwidth Optimization

Residential and mobile proxies charge by bandwidth. Here's how to minimize costs:

  1. Disable images and CSS: If you only need text data, skip resources
  2. Use HEAD requests first: Check if content has changed before fetching
  3. Compress responses: Always send Accept-Encoding: gzip
  4. Cache aggressively: Don't re-scrape unchanged pages
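Points 2-4 combine naturally as conditional requests: cache the `ETag` the server returns, send it back as `If-None-Match`, and the server answers `304 Not Modified` with no body, costing almost no bandwidth. A minimal sketch (the cache here is an in-memory dict; in production you'd persist it):

```python
# Conditional fetching: skip re-downloading pages that haven't changed.
etag_cache: dict[str, str] = {}

def conditional_headers(url: str) -> dict[str, str]:
    """Headers asking the server to omit the body if the page is unchanged."""
    headers = {"Accept-Encoding": "gzip, deflate, br"}  # always accept compression
    if url in etag_cache:
        # Server replies 304 (no body) if the ETag still matches
        headers["If-None-Match"] = etag_cache[url]
    return headers

def remember_etag(url: str, response_headers: dict):
    """Store the ETag from a successful response for the next request."""
    etag = response_headers.get("ETag")
    if etag:
        etag_cache[url] = etag
```

On a `304` response, serve the page from your local cache; on a `200`, process the body and call `remember_etag` again.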

If bandwidth costs are a concern, look into rotating proxies with unlimited bandwidth plans — they exist but usually trade speed for cost savings.

For a full pricing breakdown of different proxy types, check this proxy pricing guide.

When to Upgrade to Mobile Proxies

You should consider mobile proxies when:

  • Residential IPs are getting flagged
  • You're targeting social media platforms (Instagram, TikTok, etc.)
  • You need the highest possible trust score
  • You're dealing with sneaker sites or high-security targets

Mobile proxies work because IP reputation and trust scores are highest for mobile carrier IPs — platforms know that blocking a mobile IP means potentially blocking thousands of legitimate users.

Production Checklist

Before deploying your scraper:

  • [ ] Proxy rotation configured (per-request or session-based)
  • [ ] User-agent rotation enabled
  • [ ] Request delays implemented (1-5 second random delays)
  • [ ] Retry logic with proxy escalation (datacenter → residential → mobile)
  • [ ] Proxy health monitoring active
  • [ ] Error handling for 403, 429, and timeout responses
  • [ ] Bandwidth optimization (gzip, no images, caching)
  • [ ] Logging for debugging and monitoring
  • [ ] Respectful crawl rate (check robots.txt)

What does your scraping stack look like? Any proxy tricks I missed? Let me know in the comments.
