Xavier Fok

Posted on Mar 8

Proxy Failover: Building Resilient Infrastructure That Never Goes Down

#proxy #devops #automation #tutorial

Your proxy provider goes down at 3 AM. Your scraping pipeline stops. Your accounts cannot log in. Your monitoring goes blind. Single points of failure in proxy infrastructure are unacceptable for serious operations. Here is how to build resilient systems.

Common Failure Modes

Provider-Level Failures

Complete outage — Provider gateway goes offline
Partial degradation — Slow responses, increased error rates
Pool exhaustion — All IPs in your target geo are used or blocked
Authentication issues — API key expired, billing problems

Network-Level Failures

DNS resolution failure — Cannot resolve proxy gateway hostname
Connection timeouts — Network path to proxy is congested
SSL/TLS errors — Certificate issues on proxy gateway

IP-Level Failures

IP burned — Target platform blocked the proxy IP
IP misclassified — Proxy IP detected as datacenter instead of residential
Geographic mismatch — IP geolocates to wrong location
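To act on these categories, it helps to map each raw error to one of the three levels above. The following is a minimal sketch using only standard-library exception types; the labels, the function name `classify_failure`, and the status-code mapping are assumptions for illustration, not something a particular provider guarantees. HTTP clients such as `requests` wrap these errors in their own exception types, so the checks would need adapting.

```python
import socket
import ssl

# Hypothetical labels mirroring the three failure levels listed above
PROVIDER = "provider"
NETWORK = "network"
IP = "ip"
UNKNOWN = "unknown"


def classify_failure(exc=None, status_code=None):
    """Map an exception or an HTTP status code to (level, reason)."""
    if exc is not None:
        # Order matters: all of these are OSError subclasses
        if isinstance(exc, socket.gaierror):
            return NETWORK, "dns_resolution_failure"
        if isinstance(exc, ssl.SSLError):
            return NETWORK, "tls_error"
        if isinstance(exc, (TimeoutError, socket.timeout)):
            return NETWORK, "connection_timeout"
        if isinstance(exc, ConnectionError):
            return PROVIDER, "gateway_unreachable"
        return UNKNOWN, type(exc).__name__

    if status_code == 407:
        # 407 Proxy Authentication Required: bad credentials or billing
        return PROVIDER, "authentication"
    if status_code in (403, 429):
        # Assumption: the target signals a blocked or throttled IP this way
        return IP, "ip_burned"
    if status_code in (502, 503, 504):
        return PROVIDER, "degraded"
    return UNKNOWN, f"status_{status_code}"
```

A provider-level result suggests switching providers, while an IP-level result usually only calls for rotating to a new IP within the same provider.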

Failover Architecture

Your Application
        |
        v
   Proxy Router
   /     |     \
Provider Provider Provider
   A       B       C
(Primary) (Secondary) (Tertiary)

Multi-Provider Setup

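import threading


class ResilientProxyManager:
    def __init__(self):
        self.providers = [
            {
                "name": "primary",
                "gateway": "gateway.provider-a.com:8080",
                "auth": "user_a:pass_a",
                "priority": 1,
                "healthy": True,
                "consecutive_failures": 0
            },
            {
                "name": "secondary",
                "gateway": "gateway.provider-b.com:8080",
                "auth": "user_b:pass_b",
                "priority": 2,
                "healthy": True,
                "consecutive_failures": 0
            },
            {
                "name": "tertiary",
                "gateway": "gateway.provider-c.com:8080",
                "auth": "user_c:pass_c",
                "priority": 3,
                "healthy": True,
                "consecutive_failures": 0
            }
        ]

    def get_provider(self, provider_name):
        # Look up a provider record by name
        for provider in self.providers:
            if provider["name"] == provider_name:
                return provider
        raise KeyError(f"Unknown provider: {provider_name}")

    def reset_all(self):
        # Last resort: mark every provider healthy so requests can be retried
        for provider in self.providers:
            provider["healthy"] = True
            provider["consecutive_failures"] = 0

    def get_proxy(self):
        # Healthy providers only, lowest priority number first
        available = sorted(
            [p for p in self.providers if p["healthy"]],
            key=lambda p: p["priority"]
        )

        if not available:
            # All providers down - reset and try again in priority order
            self.reset_all()
            available = sorted(self.providers, key=lambda p: p["priority"])

        provider = available[0]
        return f"http://{provider['auth']}@{provider['gateway']}"

    def report_failure(self, provider_name):
        provider = self.get_provider(provider_name)
        provider["consecutive_failures"] += 1

        # Three failures in a row take the provider out of rotation
        if provider["consecutive_failures"] >= 3:
            provider["healthy"] = False
            self.schedule_health_check(provider, delay=60)

    def report_success(self, provider_name):
        provider = self.get_provider(provider_name)
        provider["consecutive_failures"] = 0
        provider["healthy"] = True

    def schedule_health_check(self, provider, delay=60):
        # Simplest possible recovery (an assumption, not a real probe):
        # after `delay` seconds, let traffic try the provider again
        timer = threading.Timer(
            delay, self.report_success, args=[provider["name"]]
        )
        timer.daemon = True
        timer.start()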
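To show how a caller might combine provider selection with failure reporting, here is a minimal sketch of a request loop that falls through to the next provider when one fails. It uses plain provider dictionaries with the same keys as above. The function name `fetch_with_failover` and the caller-supplied `send` callable are hypothetical; `send` stands in for whatever actually performs the request through a given provider.

```python
def fetch_with_failover(providers, send, max_attempts=3):
    """Try healthy providers in priority order until one succeeds.

    `send(provider)` is a caller-supplied function (hypothetical) that
    performs the real request through that provider and raises on failure.
    Returns (provider_name, result).
    """
    ordered = sorted(
        (p for p in providers if p["healthy"]),
        key=lambda p: p["priority"],
    )
    last_error = None
    for provider in ordered[:max_attempts]:
        try:
            result = send(provider)
        except Exception as exc:  # broad on purpose for this sketch
            provider["consecutive_failures"] += 1
            if provider["consecutive_failures"] >= 3:
                provider["healthy"] = False
            last_error = exc
            continue
        provider["consecutive_failures"] = 0
        return provider["name"], result
    raise RuntimeError("all providers failed") from last_error
```

Returning the provider name alongside the result lets the caller log which provider served each request, which is useful when checking that failover actually happened.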

Circuit Breaker Pattern

The circuit breaker prevents hammering a failed provider:

import time

class CircuitBreaker:
    CLOSED = "closed"      # Normal operation
    OPEN = "open"          # Provider failed, blocking requests
    HALF_OPEN = "half_open"  # Testing if provider recovered

    def __init__(self, failure_threshold=5, recovery_timeout=60):
        self.state = self.CLOSED
        self.failures = 0
        self.threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.last_failure_time = 0

    def can_execute(self):
        if self.state == self.CLOSED:
            return True

        if self.state == self.OPEN:
            if time.time() - self.last_failure_time > self.recovery_timeout:
                self.state = self.HALF_OPEN
                return True
            return False

        if self.state == self.HALF_OPEN:
            return True

        return False

    def record_success(self):
        self.failures = 0
        self.state = self.CLOSED

    def record_failure(self):
        self.failures += 1
        self.last_failure_time = time.time()

        if self.failures >= self.threshold:
            self.state = self.OPEN
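A breaker only helps if every request goes through it. As a rough sketch, a small wrapper can check the breaker before each call and record the outcome afterwards. The helper name `guarded_call` is hypothetical; it works with any object that exposes `can_execute()`, `record_success()` and `record_failure()`, such as the CircuitBreaker above.

```python
def guarded_call(breaker, func, *args, **kwargs):
    """Run func only if the breaker allows it, and record the outcome.

    Returns (executed, result); result is None when the call was skipped.
    """
    if not breaker.can_execute():
        # Circuit is open: skip this provider instead of waiting on a timeout
        return False, None
    try:
        result = func(*args, **kwargs)
    except Exception:
        # Count the failure, then let the caller decide how to handle it
        breaker.record_failure()
        raise
    breaker.record_success()
    return True, result
```

With one breaker per provider, a router can move on to the next provider as soon as `executed` comes back `False`.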

Health Check System

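import threading
import time

import requests


def send_alert(message):
    # Placeholder alert hook (assumption): swap in email, Slack, PagerDuty, etc.
    print(f"ALERT: {message}")
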

class HealthChecker:
    def __init__(self, proxy_manager, check_interval=30):
        self.proxy_manager = proxy_manager
        self.interval = check_interval
        self.running = True

    def start(self):
        thread = threading.Thread(target=self.run, daemon=True)
        thread.start()

    def run(self):
        while self.running:
            for provider in self.proxy_manager.providers:
                self.check_provider(provider)
            time.sleep(self.interval)

    def check_provider(self, provider):
        proxy = f"http://{provider['auth']}@{provider['gateway']}"

        try:
            start = time.time()
            response = requests.get(
                "https://httpbin.org/ip",
                proxies={"http": proxy, "https": proxy},
                timeout=10
            )
            latency = time.time() - start

            if response.status_code == 200:
                provider["healthy"] = True
                provider["latency"] = latency
                provider["consecutive_failures"] = 0
            else:
                self.handle_check_failure(provider)

        except Exception:
            self.handle_check_failure(provider)

    def handle_check_failure(self, provider):
        provider["consecutive_failures"] += 1
        if provider["consecutive_failures"] >= 3:
            # Alert only on the healthy -> unhealthy transition to avoid repeats
            if provider["healthy"]:
                send_alert(f"Provider {provider['name']} is DOWN")
            provider["healthy"] = False
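The health check above catches complete outages, but the partial degradation described earlier (slow responses, rising error rates) is easier to spot from real traffic. One possible approach, sketched below, is a rolling window of recent request outcomes per provider. The class name `DegradationTracker` and its default thresholds are illustrative assumptions and would need tuning for each provider.

```python
from collections import deque


class DegradationTracker:
    """Rolling window of recent request outcomes for one provider."""

    def __init__(self, window=50, max_error_rate=0.2, max_latency=5.0):
        # Thresholds are illustrative assumptions; tune them per provider
        self.results = deque(maxlen=window)
        self.max_error_rate = max_error_rate
        self.max_latency = max_latency

    def record(self, ok, latency):
        # ok: whether the request succeeded; latency: seconds it took
        self.results.append((ok, latency))

    def error_rate(self):
        if not self.results:
            return 0.0
        failures = sum(1 for ok, _ in self.results if not ok)
        return failures / len(self.results)

    def mean_latency(self):
        # Only successful requests count towards latency
        latencies = [lat for ok, lat in self.results if ok]
        return sum(latencies) / len(latencies) if latencies else 0.0

    def is_degraded(self):
        return (
            self.error_rate() > self.max_error_rate
            or self.mean_latency() > self.max_latency
        )
```

A degraded provider does not have to be removed outright; lowering its priority for a while is often enough.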

Failover Best Practices

Use at least 2 providers: a single provider is a single point of failure
Test failover regularly: simulate provider failures to verify that your system works
Set up monitoring: know about failures before they impact operations
Maintain a warm standby: secondary providers should have active credentials and tested connectivity
Document provider SLAs: know what uptime each provider guarantees
Budget for redundancy: failover providers cost money even when idle
Automate recovery: manual failover at 3 AM is not reliable

Cost of Downtime vs Cost of Redundancy

Metric	Single Provider	Multi-Provider
Monthly proxy cost	$200	$250-300
Expected uptime	99%	99.9%+
Monthly downtime	~7 hours	~43 minutes
Impact of outage	Total stop	Seamless failover

The extra 25-50% in monthly cost cuts expected downtime from roughly 7 hours to under 45 minutes per month.
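The downtime figures in the table follow directly from the uptime percentages. As a quick check, assuming a 30-day month, 99% uptime allows 7.2 hours of downtime and 99.9% allows 43.2 minutes. The sketch below also estimates the combined uptime of several providers under the simplifying assumptions that their failures are independent and failover is instant; the function names are made up for illustration.

```python
HOURS_PER_MONTH = 30 * 24  # assumes a 30-day month


def monthly_downtime_hours(uptime):
    """Expected downtime per month for an uptime fraction such as 0.99."""
    return (1 - uptime) * HOURS_PER_MONTH


def combined_uptime(uptimes):
    """Uptime of a pool that is only down when every provider is down.

    Assumes independent failures and instant failover, which real
    deployments only approximate.
    """
    all_down = 1.0
    for uptime in uptimes:
        all_down *= 1 - uptime
    return 1 - all_down


if __name__ == "__main__":
    print(monthly_downtime_hours(0.99))        # single provider, hours
    print(monthly_downtime_hours(0.999) * 60)  # multi-provider, minutes
    print(combined_uptime([0.99, 0.99]))       # two independent providers
```

In theory two independent 99% providers give about 99.99%, but provider failures are often correlated (shared upstream networks, the same target blocking both) and failover takes time, so a more conservative figure like the table's 99.9%+ is a reasonable planning number.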

For proxy failover architecture and infrastructure resilience guides, visit DataResearchTools.
