
Why Your Scraper's Location Matters: Geo-Distributed Scraping with CAPTCHA Handling

You built a scraper that works perfectly from your laptop in San Francisco. You deploy it to a server in Frankfurt and suddenly CAPTCHAs appear on every request.

What changed? Your geographic location. CAPTCHAs are location-aware, and understanding this can make or break your scraping infrastructure.

How Location Affects CAPTCHA Difficulty

Anti-bot systems consider your IP's geographic context:

| Factor | Low Risk | High Risk |
| --- | --- | --- |
| IP location vs site audience | Same country | Different continent |
| Datacenter vs residential | Residential | Known datacenter |
| IP reputation | Clean history | Previous abuse |
| Region consistency | Stable | Jumping between countries |

A residential IP in the US scraping a US e-commerce site? Low CAPTCHA rate. A Hetzner IP in Germany scraping the same site? CAPTCHA on every page.

Measuring the Impact

Before optimizing, measure your baseline:

import httpx
import time
from dataclasses import dataclass

@dataclass
class ProbeResult:
    region: str
    url: str
    captcha_detected: bool
    response_time: float
    status_code: int

async def probe_from_region(
    proxy: str, 
    region: str, 
    target_url: str
) -> ProbeResult:
    """Test CAPTCHA rate from a specific region."""
    start = time.monotonic()

    # `proxies=` works on httpx < 0.26; newer releases
    # replace it with `proxy=` / `mounts=`
    async with httpx.AsyncClient(
        proxies={"all://": proxy}
    ) as client:
        resp = await client.get(target_url)
        html = resp.text

        captcha = any(keyword in html for keyword in [
            "g-recaptcha", "h-captcha", 
            "cf-turnstile", "arkoselabs",
            "challenge-platform"
        ])

        return ProbeResult(
            region=region,
            url=target_url,
            captcha_detected=captcha,
            response_time=time.monotonic() - start,
            status_code=resp.status_code
        )

# Test from multiple regions
import asyncio

proxies = {
    "us-east": "socks5://proxy-us-east:1080",
    "us-west": "socks5://proxy-us-west:1080",
    "eu-west": "socks5://proxy-eu-west:1080",
    "ap-south": "socks5://proxy-ap-south:1080",
}

async def main():
    results = []
    for region, proxy in proxies.items():
        for _ in range(10):  # 10 probes per region
            result = await probe_from_region(
                proxy, region, "https://target-site.com"
            )
            results.append(result)

    # Analyze
    for region in proxies:
        region_results = [r for r in results if r.region == region]
        captcha_rate = sum(
            1 for r in region_results if r.captcha_detected
        ) / len(region_results)
        avg_time = sum(
            r.response_time for r in region_results
        ) / len(region_results)
        print(
            f"{region}: CAPTCHA rate={captcha_rate:.0%}, "
            f"avg={avg_time:.2f}s"
        )

asyncio.run(main())

Typical output:

us-east:  CAPTCHA rate=10%, avg=0.45s
us-west:  CAPTCHA rate=15%, avg=0.52s
eu-west:  CAPTCHA rate=60%, avg=0.89s
ap-south: CAPTCHA rate=85%, avg=1.23s

Architecture: Geo-Distributed Scraper

Option 1: Region-Aware Proxy Selection

Route requests through proxies close to the target site's audience:

import asyncio
from dataclasses import dataclass

@dataclass
class RegionProxy:
    region: str
    proxy_url: str
    captcha_rate: float = 0.0
    total_requests: int = 0
    captcha_hits: int = 0

    def update_stats(self, had_captcha: bool):
        self.total_requests += 1
        if had_captcha:
            self.captcha_hits += 1
        self.captcha_rate = (
            self.captcha_hits / self.total_requests
        )


class GeoRouter:
    """Route requests to the best region 
    based on CAPTCHA rates."""

    def __init__(self, proxies: list[RegionProxy]):
        self.proxies = proxies
        self._lock = asyncio.Lock()

    async def get_best_proxy(
        self, target_domain: str
    ) -> RegionProxy:
        async with self._lock:
            # Explore: give every proxy a minimum number
            # of samples before trusting its stats
            unsampled = [
                p for p in self.proxies
                if p.total_requests < 5
            ]
            if unsampled:
                return min(
                    unsampled, key=lambda p: p.total_requests
                )

            # Exploit: lowest observed CAPTCHA rate wins
            return min(
                self.proxies, key=lambda p: p.captcha_rate
            )

    async def report_result(
        self, proxy: RegionProxy, had_captcha: bool
    ):
        async with self._lock:
            proxy.update_stats(had_captcha)


# Setup
router = GeoRouter([
    RegionProxy("us-east", "socks5://us-east:1080"),
    RegionProxy("us-west", "socks5://us-west:1080"),
    RegionProxy("eu-west", "socks5://eu-west:1080"),
])

Option 2: Multi-Region Workers

Deploy workers in multiple regions and distribute work based on target site location:

import asyncio
from enum import Enum

class Region(Enum):
    US = "us"
    EU = "eu"
    ASIA = "asia"

# Map target domains to preferred regions
DOMAIN_REGION_MAP = {
    "amazon.com": Region.US,
    "amazon.co.uk": Region.EU,
    "amazon.co.jp": Region.ASIA,
    "ebay.com": Region.US,
    "zalando.de": Region.EU,
}

def get_preferred_region(url: str) -> Region:
    from urllib.parse import urlparse
    domain = urlparse(url).netloc

    # Check exact match
    if domain in DOMAIN_REGION_MAP:
        return DOMAIN_REGION_MAP[domain]

    # Check TLD
    tld = domain.split(".")[-1]
    tld_map = {
        "uk": Region.EU, "de": Region.EU,
        "fr": Region.EU, "it": Region.EU,
        "jp": Region.ASIA, "kr": Region.ASIA,
        "cn": Region.ASIA, "in": Region.ASIA,
    }
    return tld_map.get(tld, Region.US)


class DistributedScraper:
    def __init__(self):
        # WorkerPool wraps a region-pinned job queue
        # (one pool of workers per deployment region)
        self.workers = {
            Region.US: WorkerPool("us-queue"),
            Region.EU: WorkerPool("eu-queue"),
            Region.ASIA: WorkerPool("asia-queue"),
        }

    async def scrape(self, url: str) -> dict:
        region = get_preferred_region(url)
        worker = self.workers[region]
        return await worker.submit(url)
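WorkerPool is referenced above but never defined in this post. As a sketch of the interface it needs to satisfy, here is a hypothetical in-process version built on asyncio.Queue; real deployments would back each pool with a broker such as Redis or SQS running in the worker's region:

```python
import asyncio

class WorkerPool:
    """Minimal in-process stand-in for a region-pinned
    worker queue (illustration only)."""

    def __init__(self, queue_name: str, concurrency: int = 2):
        self.queue_name = queue_name
        self.concurrency = concurrency
        self._queue: asyncio.Queue = asyncio.Queue()
        self._started = False

    async def _worker(self):
        while True:
            url, future = await self._queue.get()
            try:
                # Placeholder result; a real worker would fetch
                # the URL through this region's proxy here
                future.set_result(
                    {"url": url, "queue": self.queue_name}
                )
            finally:
                self._queue.task_done()

    async def submit(self, url: str) -> dict:
        # Lazily start the worker tasks on first submit
        if not self._started:
            for _ in range(self.concurrency):
                asyncio.create_task(self._worker())
            self._started = True
        future = asyncio.get_running_loop().create_future()
        await self._queue.put((url, future))
        return await future
```

The future-per-job pattern is what lets `DistributedScraper.scrape` simply `await worker.submit(url)` regardless of which region executes the work.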

Handling CAPTCHAs When They Do Appear

Even with the right region, some CAPTCHAs are inevitable. Handle them without blocking:
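The `detect_captcha` helper used below isn't defined in the post. A minimal regex-based sketch might look like this; the widget class names and response field names are the standard ones for each provider, but production detection usually needs to be sturdier than a few patterns:

```python
import re
from typing import Optional

# Map each widget's sitekey pattern to the form field its
# solved token is normally submitted under
CAPTCHA_PATTERNS = {
    "recaptcha_v2": (
        r'class="g-recaptcha"[^>]*data-sitekey="([^"]+)"',
        "g-recaptcha-response",
    ),
    "hcaptcha": (
        r'class="h-captcha"[^>]*data-sitekey="([^"]+)"',
        "h-captcha-response",
    ),
    "turnstile": (
        r'class="cf-turnstile"[^>]*data-sitekey="([^"]+)"',
        "cf-turnstile-response",
    ),
}

def detect_captcha(html: str) -> Optional[dict]:
    """Return challenge details, or None if the page is clean."""
    for captcha_type, (pattern, field_name) in CAPTCHA_PATTERNS.items():
        match = re.search(pattern, html)
        if match:
            return {
                "type": captcha_type,
                "sitekey": match.group(1),
                "field_name": field_name,
            }
    return None
```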

from urllib.parse import urlparse

import httpx

class GeoAwareScraper:
    def __init__(self, router: GeoRouter):
        self.router = router
        # CaptchaSolver: client for your solving API
        self.solver = CaptchaSolver(
            api_base="https://www.passxapi.com"
        )

    async def fetch(self, url: str) -> dict:
        proxy = await self.router.get_best_proxy(
            urlparse(url).netloc
        )

        async with httpx.AsyncClient(
            proxies={"all://": proxy.proxy_url}
        ) as client:
            resp = await client.get(url)
            html = resp.text

            captcha = detect_captcha(html)

            if captcha:
                # Report CAPTCHA hit to router
                await self.router.report_result(
                    proxy, had_captcha=True
                )

                # Solve it
                token = await self.solver.solve(
                    captcha_type=captcha["type"],
                    sitekey=captcha["sitekey"],
                    url=url
                )

                # Resubmit with the token (the exact submit
                # endpoint and form fields are site-specific)
                resp = await client.post(url, data={
                    captcha["field_name"]: token
                })
                html = resp.text
            else:
                await self.router.report_result(
                    proxy, had_captcha=False
                )

            return {
                "url": url, 
                "html": html,
                "region": proxy.region,
                "captcha_solved": captcha is not None
            }

Cost Optimization: Regional Pricing

Different proxy types have different costs. Optimize by mixing:

class CostOptimizedRouter:
    """Use cheap datacenter proxies when possible,
    expensive residential only when needed."""

    def __init__(self):
        self.tiers = {
            "datacenter": {
                "proxies": [...],
                "cost_per_gb": 0.50,
                "avg_captcha_rate": 0.40,
            },
            "residential": {
                "proxies": [...],
                "cost_per_gb": 8.00,
                "avg_captcha_rate": 0.05,
            },
        }

    def select_tier(self, target_url: str) -> str:
        """Choose the cheaper proxy tier given the
        CAPTCHA-cost tradeoff."""
        # Assumes ~1 MB transferred per request
        # (hence cost_per_gb / 1000) and a CAPTCHA
        # solve cost of ~$0.001.
        # Datacenter at a 40% CAPTCHA rate adds
        #   0.4 * $0.001 = $0.0004/req in solves;
        # residential avoids most CAPTCHAs but its
        # bandwidth costs 16x more per GB.

        dc_total = (
            self.tiers["datacenter"]["cost_per_gb"] / 1000
            + self.tiers["datacenter"]["avg_captcha_rate"]
            * 0.001
        )
        res_total = (
            self.tiers["residential"]["cost_per_gb"] / 1000
            + self.tiers["residential"]["avg_captcha_rate"]
            * 0.001
        )

        # Use whichever is cheaper overall
        if dc_total < res_total:
            return "datacenter"
        return "residential"
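Plugging the example numbers in shows how lopsided this tradeoff usually is. Assuming ~1 MB transferred per request and ~$0.001 per CAPTCHA solve (the same figures as above):

```python
SOLVE_COST = 0.001         # $ per CAPTCHA solve
GB_PER_REQUEST = 1 / 1000  # ~1 MB transferred per request

def per_request_cost(cost_per_gb: float, captcha_rate: float) -> float:
    """Bandwidth cost plus expected solve cost, per request."""
    return cost_per_gb * GB_PER_REQUEST + captcha_rate * SOLVE_COST

dc = per_request_cost(0.50, 0.40)   # datacenter tier
res = per_request_cost(8.00, 0.05)  # residential tier

# Break-even point: how high would the datacenter CAPTCHA
# rate have to climb before residential gets cheaper?
break_even = (res - 0.50 * GB_PER_REQUEST) / SOLVE_COST

print(f"datacenter:  ${dc:.5f}/req")
print(f"residential: ${res:.5f}/req")
print(f"break-even datacenter CAPTCHA rate: {break_even:.0%}")
```

At these prices the break-even CAPTCHA rate is roughly 755%, i.e. unreachable: datacenter plus a solver wins on raw cost. Residential earns its keep when solves fail outright, pages are much heavier than 1 MB, or the target blocks datacenter ranges entirely rather than challenging them.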

Deployment: Docker Compose for Multi-Region

# docker-compose.yml
services:
  scraper-us:
    build: .
    environment:
      - REGION=us-east
      - PROXY_URL=socks5://us-proxy:1080
      - QUEUE_NAME=scrape-us
    deploy:
      replicas: 3

  scraper-eu:
    build: .
    environment:
      - REGION=eu-west
      - PROXY_URL=socks5://eu-proxy:1080
      - QUEUE_NAME=scrape-eu
    deploy:
      replicas: 2

  coordinator:
    build: .
    command: python coordinator.py
    environment:
      - REDIS_URL=redis://redis:6379
    depends_on:
      - redis

  redis:
    image: redis:7-alpine
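The coordinator's core job is just mapping each incoming URL to the right regional queue. A sketch of that routing, re-stating `Region` so the snippet stands alone (the US/EU queue names match the compose file; `scrape-asia` is a hypothetical third queue, and the TLD lists mirror `get_preferred_region` above):

```python
from enum import Enum
from urllib.parse import urlparse

class Region(Enum):
    US = "us"
    EU = "eu"
    ASIA = "asia"

QUEUE_NAMES = {
    Region.US: "scrape-us",
    Region.EU: "scrape-eu",
    Region.ASIA: "scrape-asia",  # hypothetical; not in the compose file
}
EU_TLDS = {"uk", "de", "fr", "it"}
ASIA_TLDS = {"jp", "kr", "cn", "in"}

def queue_for(url: str) -> str:
    """Pick the regional queue a scrape job should land on."""
    tld = urlparse(url).netloc.split(".")[-1]
    if tld in EU_TLDS:
        return QUEUE_NAMES[Region.EU]
    if tld in ASIA_TLDS:
        return QUEUE_NAMES[Region.ASIA]
    return QUEUE_NAMES[Region.US]
```

In coordinator.py you would then push each URL onto `queue_for(url)` with a Redis client pointed at `REDIS_URL`, and each regional worker consumes only its own queue.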

Key Takeaways

  1. Measure first — probe CAPTCHA rates from different regions before choosing
  2. Match your proxy location to the target audience — US site? Use US proxies
  3. Adaptive routing — let your system learn which regions work best
  4. Mix proxy tiers — datacenter for low-risk, residential for high-risk targets
  5. Always have a CAPTCHA solver ready — even perfect geo-matching won't avoid all CAPTCHAs

For the CAPTCHA-solving integration used in these examples, check out passxapi-python — it works from any region with consistent solve rates.


What regions do you deploy your scrapers in? Share your geo-strategy in the comments.
