
Why Your Scraper's Location Matters: Geo-Distributed Scraping with CAPTCHA Handling

You built a scraper that works perfectly from your laptop in San Francisco. You deploy it to a server in Frankfurt and suddenly CAPTCHAs appear on every request.

What changed? Your geographic location. CAPTCHAs are location-aware, and understanding this can make or break your scraping infrastructure.

How Location Affects CAPTCHA Difficulty

Anti-bot systems consider your IP's geographic context:

| Factor | Low Risk | High Risk |
| --- | --- | --- |
| IP location vs site audience | Same country | Different continent |
| Datacenter vs residential | Residential | Known datacenter |
| IP reputation | Clean history | Previous abuse |
| Region consistency | Stable | Jumping between countries |

A residential IP in the US scraping a US e-commerce site? Low CAPTCHA rate. A Hetzner IP in Germany scraping the same site? CAPTCHA on every page.

Measuring the Impact

Before optimizing, measure your baseline:

import httpx
import time
from dataclasses import dataclass

@dataclass
class ProbeResult:
    region: str
    url: str
    captcha_detected: bool
    response_time: float
    status_code: int

async def probe_from_region(
    proxy: str, 
    region: str, 
    target_url: str
) -> ProbeResult:
    """Test CAPTCHA rate from a specific region."""
    start = time.monotonic()

    # `proxies=` works on httpx < 0.26; newer releases
    # replace it with `proxy=` / `mounts=`
    async with httpx.AsyncClient(
        proxies={"all://": proxy}
    ) as client:
        resp = await client.get(target_url)
        html = resp.text

        captcha = any(keyword in html for keyword in [
            "g-recaptcha", "h-captcha", 
            "cf-turnstile", "arkoselabs",
            "challenge-platform"
        ])

        return ProbeResult(
            region=region,
            url=target_url,
            captcha_detected=captcha,
            response_time=time.monotonic() - start,
            status_code=resp.status_code
        )

# Test from multiple regions
import asyncio

proxies = {
    "us-east": "socks5://proxy-us-east:1080",
    "us-west": "socks5://proxy-us-west:1080",
    "eu-west": "socks5://proxy-eu-west:1080",
    "ap-south": "socks5://proxy-ap-south:1080",
}

async def main():
    results = []
    for region, proxy in proxies.items():
        for _ in range(10):  # 10 probes per region
            result = await probe_from_region(
                proxy, region, "https://target-site.com"
            )
            results.append(result)

    # Analyze
    for region in proxies:
        region_results = [r for r in results if r.region == region]
        captcha_rate = sum(
            1 for r in region_results if r.captcha_detected
        ) / len(region_results)
        avg_time = sum(
            r.response_time for r in region_results
        ) / len(region_results)
        print(
            f"{region}: CAPTCHA rate={captcha_rate:.0%}, "
            f"avg={avg_time:.2f}s"
        )

asyncio.run(main())

Typical output:

us-east:  CAPTCHA rate=10%, avg=0.45s
us-west:  CAPTCHA rate=15%, avg=0.52s
eu-west:  CAPTCHA rate=60%, avg=0.89s
ap-south: CAPTCHA rate=85%, avg=1.23s

Architecture: Geo-Distributed Scraper

Option 1: Region-Aware Proxy Selection

Route requests through proxies close to the target site's audience:

import asyncio
from dataclasses import dataclass

@dataclass
class RegionProxy:
    region: str
    proxy_url: str
    captcha_rate: float = 0.0
    total_requests: int = 0
    captcha_hits: int = 0

    def update_stats(self, had_captcha: bool):
        self.total_requests += 1
        if had_captcha:
            self.captcha_hits += 1
        self.captcha_rate = (
            self.captcha_hits / self.total_requests
        )


class GeoRouter:
    """Route requests to the best region 
    based on CAPTCHA rates."""

    def __init__(self, proxies: list[RegionProxy]):
        self.proxies = proxies
        self._lock = asyncio.Lock()

    async def get_best_proxy(
        self, target_domain: str
    ) -> RegionProxy:
        async with self._lock:
            # Explore: give every proxy a minimum number
            # of samples before trusting its stats
            unsampled = [
                p for p in self.proxies
                if p.total_requests < 5
            ]
            if unsampled:
                return min(
                    unsampled, key=lambda p: p.total_requests
                )

            # Exploit: lowest observed CAPTCHA rate wins
            return min(
                self.proxies, key=lambda p: p.captcha_rate
            )

    async def report_result(
        self, proxy: RegionProxy, had_captcha: bool
    ):
        async with self._lock:
            proxy.update_stats(had_captcha)


# Setup
router = GeoRouter([
    RegionProxy("us-east", "socks5://us-east:1080"),
    RegionProxy("us-west", "socks5://us-west:1080"),
    RegionProxy("eu-west", "socks5://eu-west:1080"),
])

Option 2: Multi-Region Workers

Deploy workers in multiple regions and distribute work based on target site location:

import asyncio
from enum import Enum

class Region(Enum):
    US = "us"
    EU = "eu"
    ASIA = "asia"

# Map target domains to preferred regions
DOMAIN_REGION_MAP = {
    "amazon.com": Region.US,
    "amazon.co.uk": Region.EU,
    "amazon.co.jp": Region.ASIA,
    "ebay.com": Region.US,
    "zalando.de": Region.EU,
}

def get_preferred_region(url: str) -> Region:
    from urllib.parse import urlparse
    domain = urlparse(url).netloc

    # Check exact match
    if domain in DOMAIN_REGION_MAP:
        return DOMAIN_REGION_MAP[domain]

    # Check TLD
    tld = domain.split(".")[-1]
    tld_map = {
        "uk": Region.EU, "de": Region.EU,
        "fr": Region.EU, "it": Region.EU,
        "jp": Region.ASIA, "kr": Region.ASIA,
        "cn": Region.ASIA, "in": Region.ASIA,
    }
    return tld_map.get(tld, Region.US)


class DistributedScraper:
    def __init__(self):
        # WorkerPool wraps a region-pinned job queue
        # (one pool of workers per deployment region)
        self.workers = {
            Region.US: WorkerPool("us-queue"),
            Region.EU: WorkerPool("eu-queue"),
            Region.ASIA: WorkerPool("asia-queue"),
        }

    async def scrape(self, url: str) -> dict:
        region = get_preferred_region(url)
        worker = self.workers[region]
        return await worker.submit(url)
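WorkerPool is referenced above but never defined in this post. As a sketch of the interface it needs to satisfy, here is a hypothetical in-process version built on asyncio.Queue; real deployments would back each pool with a broker such as Redis or SQS running in the worker's region:

```python
import asyncio

class WorkerPool:
    """Minimal in-process stand-in for a region-pinned
    worker queue (illustration only)."""

    def __init__(self, queue_name: str, concurrency: int = 2):
        self.queue_name = queue_name
        self.concurrency = concurrency
        self._queue: asyncio.Queue = asyncio.Queue()
        self._started = False

    async def _worker(self):
        while True:
            url, future = await self._queue.get()
            try:
                # Placeholder result; a real worker would fetch
                # the URL through this region's proxy here
                future.set_result(
                    {"url": url, "queue": self.queue_name}
                )
            finally:
                self._queue.task_done()

    async def submit(self, url: str) -> dict:
        # Lazily start the worker tasks on first submit
        if not self._started:
            for _ in range(self.concurrency):
                asyncio.create_task(self._worker())
            self._started = True
        future = asyncio.get_running_loop().create_future()
        await self._queue.put((url, future))
        return await future
```

The future-per-job pattern is what lets `DistributedScraper.scrape` simply `await worker.submit(url)` regardless of which region executes the work.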

Handling CAPTCHAs When They Do Appear

Even with the right region, some CAPTCHAs are inevitable. Handle them without blocking:
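The `detect_captcha` helper used below isn't defined in the post. A minimal regex-based sketch might look like this; the widget class names and response field names are the standard ones for each provider, but production detection usually needs to be sturdier than a few patterns:

```python
import re
from typing import Optional

# Map each widget's sitekey pattern to the form field its
# solved token is normally submitted under
CAPTCHA_PATTERNS = {
    "recaptcha_v2": (
        r'class="g-recaptcha"[^>]*data-sitekey="([^"]+)"',
        "g-recaptcha-response",
    ),
    "hcaptcha": (
        r'class="h-captcha"[^>]*data-sitekey="([^"]+)"',
        "h-captcha-response",
    ),
    "turnstile": (
        r'class="cf-turnstile"[^>]*data-sitekey="([^"]+)"',
        "cf-turnstile-response",
    ),
}

def detect_captcha(html: str) -> Optional[dict]:
    """Return challenge details, or None if the page is clean."""
    for captcha_type, (pattern, field_name) in CAPTCHA_PATTERNS.items():
        match = re.search(pattern, html)
        if match:
            return {
                "type": captcha_type,
                "sitekey": match.group(1),
                "field_name": field_name,
            }
    return None
```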

from urllib.parse import urlparse

import httpx

class GeoAwareScraper:
    def __init__(self, router: GeoRouter):
        self.router = router
        # CaptchaSolver: client for your solving API
        self.solver = CaptchaSolver(
            api_base="https://www.passxapi.com"
        )

    async def fetch(self, url: str) -> dict:
        proxy = await self.router.get_best_proxy(
            urlparse(url).netloc
        )

        async with httpx.AsyncClient(
            proxies={"all://": proxy.proxy_url}
        ) as client:
            resp = await client.get(url)
            html = resp.text

            captcha = detect_captcha(html)

            if captcha:
                # Report CAPTCHA hit to router
                await self.router.report_result(
                    proxy, had_captcha=True
                )

                # Solve it
                token = await self.solver.solve(
                    captcha_type=captcha["type"],
                    sitekey=captcha["sitekey"],
                    url=url
                )

                # Resubmit with the token (the exact submit
                # endpoint and form fields are site-specific)
                resp = await client.post(url, data={
                    captcha["field_name"]: token
                })
                html = resp.text
            else:
                await self.router.report_result(
                    proxy, had_captcha=False
                )

            return {
                "url": url, 
                "html": html,
                "region": proxy.region,
                "captcha_solved": captcha is not None
            }

Cost Optimization: Regional Pricing

Different proxy types have different costs. Optimize by mixing:

class CostOptimizedRouter:
    """Use cheap datacenter proxies when possible,
    expensive residential only when needed."""

    def __init__(self):
        self.tiers = {
            "datacenter": {
                "proxies": [...],
                "cost_per_gb": 0.50,
                "avg_captcha_rate": 0.40,
            },
            "residential": {
                "proxies": [...],
                "cost_per_gb": 8.00,
                "avg_captcha_rate": 0.05,
            },
        }

    def select_tier(self, target_url: str) -> str:
        """Choose the cheaper proxy tier given the
        CAPTCHA-cost tradeoff."""
        # Assumes ~1 MB transferred per request
        # (hence cost_per_gb / 1000) and a CAPTCHA
        # solve cost of ~$0.001.
        # Datacenter at a 40% CAPTCHA rate adds
        #   0.4 * $0.001 = $0.0004/req in solves;
        # residential avoids most CAPTCHAs but its
        # bandwidth costs 16x more per GB.

        dc_total = (
            self.tiers["datacenter"]["cost_per_gb"] / 1000
            + self.tiers["datacenter"]["avg_captcha_rate"]
            * 0.001
        )
        res_total = (
            self.tiers["residential"]["cost_per_gb"] / 1000
            + self.tiers["residential"]["avg_captcha_rate"]
            * 0.001
        )

        # Use whichever is cheaper overall
        if dc_total < res_total:
            return "datacenter"
        return "residential"
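Plugging the example numbers in shows how lopsided this tradeoff usually is. Assuming ~1 MB transferred per request and ~$0.001 per CAPTCHA solve (the same figures as above):

```python
SOLVE_COST = 0.001         # $ per CAPTCHA solve
GB_PER_REQUEST = 1 / 1000  # ~1 MB transferred per request

def per_request_cost(cost_per_gb: float, captcha_rate: float) -> float:
    """Bandwidth cost plus expected solve cost, per request."""
    return cost_per_gb * GB_PER_REQUEST + captcha_rate * SOLVE_COST

dc = per_request_cost(0.50, 0.40)   # datacenter tier
res = per_request_cost(8.00, 0.05)  # residential tier

# Break-even point: how high would the datacenter CAPTCHA
# rate have to climb before residential gets cheaper?
break_even = (res - 0.50 * GB_PER_REQUEST) / SOLVE_COST

print(f"datacenter:  ${dc:.5f}/req")
print(f"residential: ${res:.5f}/req")
print(f"break-even datacenter CAPTCHA rate: {break_even:.0%}")
```

At these prices the break-even CAPTCHA rate is roughly 755%, i.e. unreachable: datacenter plus a solver wins on raw cost. Residential earns its keep when solves fail outright, pages are much heavier than 1 MB, or the target blocks datacenter ranges entirely rather than challenging them.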

Deployment: Docker Compose for Multi-Region

# docker-compose.yml
services:
  scraper-us:
    build: .
    environment:
      - REGION=us-east
      - PROXY_URL=socks5://us-proxy:1080
      - QUEUE_NAME=scrape-us
    deploy:
      replicas: 3

  scraper-eu:
    build: .
    environment:
      - REGION=eu-west
      - PROXY_URL=socks5://eu-proxy:1080
      - QUEUE_NAME=scrape-eu
    deploy:
      replicas: 2

  coordinator:
    build: .
    command: python coordinator.py
    environment:
      - REDIS_URL=redis://redis:6379
    depends_on:
      - redis

  redis:
    image: redis:7-alpine
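The coordinator's core job is just mapping each incoming URL to the right regional queue. A sketch of that routing, re-stating `Region` so the snippet stands alone (the US/EU queue names match the compose file; `scrape-asia` is a hypothetical third queue, and the TLD lists mirror `get_preferred_region` above):

```python
from enum import Enum
from urllib.parse import urlparse

class Region(Enum):
    US = "us"
    EU = "eu"
    ASIA = "asia"

QUEUE_NAMES = {
    Region.US: "scrape-us",
    Region.EU: "scrape-eu",
    Region.ASIA: "scrape-asia",  # hypothetical; not in the compose file
}
EU_TLDS = {"uk", "de", "fr", "it"}
ASIA_TLDS = {"jp", "kr", "cn", "in"}

def queue_for(url: str) -> str:
    """Pick the regional queue a scrape job should land on."""
    tld = urlparse(url).netloc.split(".")[-1]
    if tld in EU_TLDS:
        return QUEUE_NAMES[Region.EU]
    if tld in ASIA_TLDS:
        return QUEUE_NAMES[Region.ASIA]
    return QUEUE_NAMES[Region.US]
```

In coordinator.py you would then push each URL onto `queue_for(url)` with a Redis client pointed at `REDIS_URL`, and each regional worker consumes only its own queue.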

Key Takeaways

  1. Measure first — probe CAPTCHA rates from different regions before choosing
  2. Match your proxy location to the target audience — US site? Use US proxies
  3. Adaptive routing — let your system learn which regions work best
  4. Mix proxy tiers — datacenter for low-risk, residential for high-risk targets
  5. Always have a CAPTCHA solver ready — even perfect geo-matching won't avoid all CAPTCHAs

For the CAPTCHA-solving integration used in these examples, check out passxapi-python — it works from any region with consistent solve rates.


What regions do you deploy your scrapers in? Share your geo-strategy in the comments.
