DEV Community

Xavier Fok


Proxy Failover: Building Resilient Infrastructure That Never Goes Down

Your proxy provider goes down at 3 AM. Your scraping pipeline stops. Your accounts cannot log in. Your monitoring goes blind. Single points of failure in proxy infrastructure are unacceptable for serious operations. Here is how to build resilient systems.

Common Failure Modes

Provider-Level Failures

  • Complete outage — Provider gateway goes offline
  • Partial degradation — Slow responses, increased error rates
  • Pool exhaustion — All IPs in your target geo are used or blocked
  • Authentication issues — API key expired, billing problems

Network-Level Failures

  • DNS resolution failure — Cannot resolve proxy gateway hostname
  • Connection timeouts — Network path to proxy is congested
  • SSL/TLS errors — Certificate issues on proxy gateway

IP-Level Failures

  • IP burned — Target platform blocked the proxy IP
  • IP misclassified — Proxy IP detected as datacenter instead of residential
  • Geographic mismatch — IP geolocates to wrong location
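The three levels above call for different reactions: a provider-level failure means switching providers, a network-level failure may only need a retry, and an IP-level failure means rotating to another IP. Here is a minimal sketch of a classifier, assuming raw standard-library exceptions and HTTP status codes as input. The function name, the constants, and the mapping itself (for example, treating 403 and 429 as a burned IP) are illustrative assumptions, not fixed rules. HTTP client libraries usually wrap these exceptions in their own types, so a real version would need to inspect the underlying cause.

```python
import socket
import ssl

PROVIDER, NETWORK, IP = "provider", "network", "ip"


def classify_failure(exc=None, status=None):
    """Map an exception or HTTP status code to a failure level.

    Returns PROVIDER, NETWORK, IP, or None when nothing looks wrong.
    The mapping is a heuristic assumption, not a definitive rule set.
    """
    # DNS resolution failure: the gateway hostname could not be resolved
    if isinstance(exc, socket.gaierror):
        return NETWORK
    # Certificate or handshake problem on the proxy gateway
    if isinstance(exc, ssl.SSLError):
        return NETWORK
    # Congested or broken network path
    if isinstance(exc, (TimeoutError, socket.timeout)):
        return NETWORK
    # Connection refused or reset: the gateway itself is likely offline
    if isinstance(exc, ConnectionError):
        return PROVIDER
    # 407 Proxy Authentication Required: expired key or billing problem
    if status == 407:
        return PROVIDER
    # Target platform rejecting or throttling this particular IP
    if status in (403, 429):
        return IP
    # Server-side errors are attributed to the gateway in this sketch
    if status is not None and status >= 500:
        return PROVIDER
    return None
```

Geographic mismatch and misclassification cannot be detected from status codes alone; they need a separate IP lookup, which this sketch leaves out.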

Failover Architecture

Your Application
        |
        v
   Proxy Router
   /     |     \
Provider Provider Provider
   A       B       C
(Primary) (Secondary) (Tertiary)

Multi-Provider Setup

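import threading


class ResilientProxyManager:
    def __init__(self):
        # Providers are tried in ascending priority order (1 = first choice)
        self.providers = [
            {
                "name": "primary",
                "gateway": "gateway.provider-a.com:8080",
                "auth": "user_a:pass_a",
                "priority": 1,
                "healthy": True,
                "consecutive_failures": 0
            },
            {
                "name": "secondary",
                "gateway": "gateway.provider-b.com:8080",
                "auth": "user_b:pass_b",
                "priority": 2,
                "healthy": True,
                "consecutive_failures": 0
            },
            {
                "name": "tertiary",
                "gateway": "gateway.provider-c.com:8080",
                "auth": "user_c:pass_c",
                "priority": 3,
                "healthy": True,
                "consecutive_failures": 0
            }
        ]

    def get_provider(self, provider_name):
        # Look up a provider record by its name
        for provider in self.providers:
            if provider["name"] == provider_name:
                return provider
        raise KeyError(f"unknown provider: {provider_name}")

    def reset_provider(self, provider):
        # Give a single provider a fresh start
        provider["healthy"] = True
        provider["consecutive_failures"] = 0

    def reset_all(self):
        for provider in self.providers:
            self.reset_provider(provider)

    def schedule_health_check(self, provider, delay=60):
        # Minimal stand-in: re-admit the provider after `delay` seconds so
        # live traffic can probe it again. An active probe, as in the
        # Health Check System section, is the more robust option.
        timer = threading.Timer(delay, self.reset_provider, args=(provider,))
        timer.daemon = True
        timer.start()

    def get_proxy(self):
        # Healthy providers only, best priority first
        available = sorted(
            [p for p in self.providers if p["healthy"]],
            key=lambda p: p["priority"]
        )

        if not available:
            # All providers down: reset and start again from the primary
            self.reset_all()
            available = sorted(self.providers, key=lambda p: p["priority"])

        provider = available[0]
        # Single quotes inside the f-string keep this valid before Python 3.12
        return f"http://{provider['auth']}@{provider['gateway']}"

    def report_failure(self, provider_name):
        provider = self.get_provider(provider_name)
        provider["consecutive_failures"] += 1

        # Take the provider out of rotation once, on the third straight failure
        if provider["consecutive_failures"] >= 3 and provider["healthy"]:
            provider["healthy"] = False
            self.schedule_health_check(provider, delay=60)

    def report_success(self, provider_name):
        provider = self.get_provider(provider_name)
        provider["consecutive_failures"] = 0
        provider["healthy"] = True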
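One gap worth noting: get_proxy returns only a proxy URL, while report_failure and report_success expect a provider name, so the caller needs some way to connect the two. The sketch below sidesteps this by working on the provider records directly. It assumes the same dictionary layout as above; fetch_with_failover, the fetch callable, and max_failures are hypothetical names introduced for illustration.

```python
def fetch_with_failover(providers, fetch, max_failures=3):
    """Try healthy providers in priority order until one succeeds.

    `fetch(proxy_url)` is a caller-supplied callable that performs the
    request through the given proxy and raises an exception on failure.
    Returns a (provider_name, result) tuple.
    """
    for provider in sorted(providers, key=lambda p: p["priority"]):
        if not provider["healthy"]:
            continue

        proxy_url = f"http://{provider['auth']}@{provider['gateway']}"
        try:
            result = fetch(proxy_url)
        except Exception:
            # Count the failure and fall through to the next provider
            provider["consecutive_failures"] += 1
            if provider["consecutive_failures"] >= max_failures:
                provider["healthy"] = False
            continue

        # Any success clears the failure streak
        provider["consecutive_failures"] = 0
        return provider["name"], result

    raise RuntimeError("all providers failed or are marked unhealthy")
```

Because the failed request is retried on the next provider within the same call, a single outage costs one extra attempt rather than a lost request.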

Circuit Breaker Pattern

The circuit breaker prevents hammering a failed provider:

import time

class CircuitBreaker:
    CLOSED = "closed"      # Normal operation
    OPEN = "open"          # Provider failed, blocking requests
    HALF_OPEN = "half_open"  # Testing if provider recovered

    def __init__(self, failure_threshold=5, recovery_timeout=60):
        self.state = self.CLOSED
        self.failures = 0
        self.threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.last_failure_time = 0

    def can_execute(self):
        if self.state == self.CLOSED:
            return True

        if self.state == self.OPEN:
            if time.time() - self.last_failure_time > self.recovery_timeout:
                self.state = self.HALF_OPEN
                return True
            return False

        if self.state == self.HALF_OPEN:
            return True

        return False

    def record_success(self):
        self.failures = 0
        self.state = self.CLOSED

    def record_failure(self):
        self.failures += 1
        self.last_failure_time = time.time()

        if self.failures >= self.threshold:
            self.state = self.OPEN
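The breaker only tracks state; something still has to consult it before each request and report the outcome afterwards. One way to wire that up is a small wrapper, sketched here under the assumption that the breaker object exposes the can_execute, record_success, and record_failure methods shown above. CircuitOpenError and call_through_breaker are hypothetical names.

```python
class CircuitOpenError(Exception):
    """Raised when the breaker refuses a call because the circuit is open."""


def call_through_breaker(breaker, fn, *args, **kwargs):
    """Run fn(*args, **kwargs) only if the breaker allows it.

    The outcome is reported back to the breaker so that it can open
    after repeated failures and close again after a success.
    """
    if not breaker.can_execute():
        # Fail fast instead of waiting on a provider known to be down
        raise CircuitOpenError("circuit is open; request skipped")

    try:
        result = fn(*args, **kwargs)
    except Exception:
        breaker.record_failure()
        raise

    breaker.record_success()
    return result
```

Keeping one breaker per provider lets the caller catch CircuitOpenError and move straight to the next provider without spending a timeout on the failed one.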

Health Check System

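import threading
import time

import requests  # third-party: pip install requests


def send_alert(message):
    # Placeholder alert hook: swap in your pager, Slack, or email integration
    print(f"[ALERT] {message}")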

class HealthChecker:
    def __init__(self, proxy_manager, check_interval=30):
        self.proxy_manager = proxy_manager
        self.interval = check_interval
        self.running = True

    def start(self):
        thread = threading.Thread(target=self.run, daemon=True)
        thread.start()

    def run(self):
        while self.running:
            for provider in self.proxy_manager.providers:
                self.check_provider(provider)
            time.sleep(self.interval)

    def check_provider(self, provider):
        # Single quotes inside the f-string keep this valid before Python 3.12
        proxy = f"http://{provider['auth']}@{provider['gateway']}"

        try:
            start = time.time()
            response = requests.get(
                "https://httpbin.org/ip",
                proxies={"http": proxy, "https": proxy},
                timeout=10
            )
            latency = time.time() - start

            if response.status_code == 200:
                provider["healthy"] = True
                provider["latency"] = latency
                provider["consecutive_failures"] = 0
            else:
                self.handle_check_failure(provider)

        except Exception:
            self.handle_check_failure(provider)

    def handle_check_failure(self, provider):
        provider["consecutive_failures"] += 1
        # Alert only on the healthy-to-down transition, not on every failed check
        if provider["consecutive_failures"] >= 3 and provider["healthy"]:
            provider["healthy"] = False
            send_alert(f"Provider {provider['name']} is DOWN")
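The checker records a latency value for each provider, but nothing above uses it. A possible use, sketched below, is to skip a healthy but slow provider, which covers the partial degradation case. The pick_provider name and the 2-second max_latency default are assumptions for illustration; a provider with no measurement yet is treated as acceptable.

```python
def pick_provider(providers, max_latency=2.0):
    """Return the preferred healthy provider record, or None if none is healthy.

    Providers whose last measured latency exceeds max_latency (seconds)
    are used only when every healthy provider is that slow.
    """
    healthy = [p for p in providers if p["healthy"]]
    if not healthy:
        return None

    def sort_key(provider):
        latency = provider.get("latency")  # absent until the first check runs
        too_slow = latency is not None and latency > max_latency
        # False sorts before True, so fast providers win; then lowest priority number
        return (too_slow, provider["priority"])

    return min(healthy, key=sort_key)
```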

Failover Best Practices

  1. Use at least 2 providers — Single provider is single point of failure
  2. Test failover regularly — Simulate provider failures to verify your system works
  3. Set up monitoring — Know about failures before they impact operations
  4. Maintain warm standby — Secondary providers should have active credentials and tested connectivity
  5. Document provider SLAs — Know what uptime each provider guarantees
  6. Budget for redundancy — Failover providers cost money even when idle
  7. Automate recovery — Manual failover at 3 AM is not reliable
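Point 2 is the easiest to postpone, so it helps to make the drill cheap to run. A minimal sketch of a failover drill follows, assuming proxy URLs listed in priority order and a caller-supplied fetch callable. simulate_outage and run_drill are hypothetical helpers, and the outage is simulated in-process rather than by touching a real provider.

```python
def simulate_outage(fetch, down_gateways):
    """Wrap fetch so that requests through the listed gateways fail."""
    def wrapped(proxy_url):
        if any(gateway in proxy_url for gateway in down_gateways):
            # Keep the message generic: proxy URLs contain credentials
            raise ConnectionError("simulated provider outage")
        return fetch(proxy_url)
    return wrapped


def run_drill(proxy_urls, fetch, down_gateways):
    """Return the first proxy URL that still works during the outage, or None."""
    drill_fetch = simulate_outage(fetch, down_gateways)
    for proxy_url in proxy_urls:
        try:
            drill_fetch(proxy_url)
        except ConnectionError:
            continue
        return proxy_url
    return None
```

A scheduled job could run the drill with the primary gateway listed as down and raise an alert if the result is None, which would mean no standby provider is actually usable.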

Cost of Downtime vs Cost of Redundancy

Metric               Single Provider   Multi-Provider
Monthly proxy cost   $200              $250-300
Expected uptime      99%               99.9%+
Monthly downtime     ~7 hours          ~43 minutes
Impact of outage     Total stop        Seamless failover

The extra 25-50% in proxy spend ($50-100 per month in this example) cuts expected downtime from roughly 7 hours to about 43 minutes per month.
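The downtime figures follow directly from the uptime percentages. Here is a short sketch of the arithmetic, assuming a 30-day month (720 hours). The combined figure additionally assumes that provider outages are independent and that failover is instant, both of which are optimistic; the function names are illustrative.

```python
HOURS_PER_MONTH = 30 * 24  # 720 hours, assuming a 30-day month


def monthly_downtime_hours(uptime_pct):
    """Expected downtime per month, in hours, for a given uptime percentage."""
    return (100 - uptime_pct) / 100 * HOURS_PER_MONTH


def combined_uptime_pct(*uptime_pcts):
    """Uptime when any one of several independent providers is sufficient.

    The system is down only when every provider is down at the same time.
    """
    all_down = 1.0
    for pct in uptime_pcts:
        all_down *= (100 - pct) / 100
    return (1 - all_down) * 100
```

At 99% uptime this gives 7.2 hours of downtime per month, and at 99.9% it gives 0.72 hours, or about 43 minutes. Two independent 99% providers would reach 99.99% on paper; correlated outages and detection delay pull the real number down, which is consistent with the more cautious 99.9%+ in the table.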

For proxy failover architecture and infrastructure resilience guides, visit DataResearchTools.
