Miller James

Building a Proxy Failover Layer With Circuit Breaker Logic

Your scraper keeps firing requests at a proxy that's already dead: each one waits out the full timeout, retries pile up, and one bad residential proxy sends the whole job's latency through the roof. A circuit breaker fixes this by giving every proxy a three-state switch: after N failures it trips Open and fails fast instead of waiting, then after a cooldown it sends one trial request to test recovery before trusting the proxy again. Put a failover selector in front that skips any proxy whose circuit is Open, and a single dead endpoint stops poisoning the pool. This guide builds both in Python, including the error-classification logic that decides what should actually trip a breaker, which is the part most failover code gets wrong.

What circuit breaker logic adds over retry-and-rotate

A circuit breaker beats plain retry-and-rotate because it remembers which proxies are failing and stops calling them. Retry-and-rotate is stateless: it catches an error, moves to the next proxy, and on the very next request happily routes back to the proxy that just failed. A circuit breaker is a state machine that holds failure history per proxy, so a dead endpoint gets benched instead of retried in a loop.

The pattern has three states, originally formalized for fault-tolerant systems in Michael Nygard's Release It! and popularized in Martin Fowler's "CircuitBreaker" article (2014). In the Closed state, requests flow and the breaker counts failures; once failures hit a threshold it trips to Open, where every call fails fast without touching the proxy; after a reset timeout it moves to Half-Open and lets one trial request through to decide whether to close again or re-open. That trial is the self-healing part retry loops lack — the proxy gets re-tested automatically, not permanently blacklisted.

The payoff is concrete: fail-fast in the Open state means a dead proxy costs microseconds, not a 15-second timeout per request. So instead of N retries × full timeout burning while a proxy is down, the breaker short-circuits after the threshold and your failover selector spends that time on a working IP. That difference compounds hard at scale, where a handful of dead residential proxies in a large pool would otherwise eat most of your request budget.
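To put rough numbers on that, here is a back-of-the-envelope sketch. The 15-second timeout comes from the paragraph above and the threshold of 4 matches the default fail_threshold used later in this guide; the figure of 200 requests routed to the dead proxy is an assumption chosen purely for illustration, and the helper name wasted_seconds is hypothetical.

```python
from typing import Optional

def wasted_seconds(requests_to_dead_proxy: int, timeout_s: float,
                   fail_threshold: Optional[int] = None) -> float:
    """Seconds spent waiting on a dead proxy.

    Without a breaker, every request waits out the full timeout.
    With one, only the first `fail_threshold` requests do; later calls
    fail fast (treated as zero cost here, which is an approximation).
    """
    if fail_threshold is None:
        slow_calls = requests_to_dead_proxy
    else:
        slow_calls = min(requests_to_dead_proxy, fail_threshold)
    return slow_calls * timeout_s

# Assumed workload: 200 requests would have been routed to the dead proxy.
without_breaker = wasted_seconds(200, 15.0)                   # 200 * 15 = 3000.0 s
with_breaker = wasted_seconds(200, 15.0, fail_threshold=4)    # 4 * 15 = 60.0 s
```

The model ignores the one Half-Open trial per reset window, which adds a single timeout each time the dead proxy is re-tested, so treat the second figure as a lower bound.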

[Image: State diagram of a proxy circuit breaker — three nodes Closed/Open/Half-Open with transitions labeled "fail_threshold reached", "reset_timeout elapsed", "trial succeeds", "trial fails" | Purpose: anchor the three-state model before the code | Alt: Circuit breaker state machine for residential proxy failover showing Closed, Open, and Half-Open transitions]

Decide what actually counts as a proxy failure

Classify every error before you let it trip a breaker — counting the wrong failures is how good proxies get wrongly retired. The critical distinction: a failure is the proxy's fault only when the proxy failed to deliver a response. A 404 or a 500 that came back through the proxy means the proxy worked perfectly; the target had the problem. Trip a breaker on those and you'll bench healthy residential proxies for errors they didn't cause.

Sort responses into four buckets, because each one needs a different action:

  • Proxy fault — connection refused, connect timeout, HTTP 502/503/504 from the gateway, or a ProxyError. The proxy itself failed. Trip the breaker.
  • Proxy auth error — HTTP 407. This is a config bug (bad credentials or session-token format), not a transient fault. Trip and alert; rotating won't fix it.
  • IP banned — HTTP 429, or a 403 carrying a block/CAPTCHA page. The exit IP got flagged, not the endpoint. Trip the per-IP breaker and rotate the IP, but don't penalize the gateway.
  • Target outcome — 200, 301, 404, or even a target-origin 500 returned through the proxy. The proxy did its job. Never trip; hand the response back to your application logic.
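import requests

def classify(response=None, exc=None):
    """Bucket one request outcome; pass either a response or an exception."""
    if exc is not None:
        # ProxyError and ConnectTimeout are subclasses of ConnectionError in
        # requests; they are listed explicitly to document intent.
        if isinstance(exc, (requests.exceptions.ProxyError,
                            requests.exceptions.ConnectTimeout,
                            requests.exceptions.ConnectionError,
                            requests.exceptions.ReadTimeout)):
            return "proxy_fault"      # nothing came back -> proxy's fault
        return "unknown"
    if response is None:
        raise ValueError("classify() needs a response or an exception")

    code = response.status_code
    if code == 407:
        return "proxy_auth"           # credentials/session-token wrong
    if code in (502, 503, 504):
        return "proxy_fault"          # gateway/upstream failure
    # looks_like_block() is a per-target helper you supply (described below)
    if code == 429 or (code == 403 and looks_like_block(response)):
        return "ip_banned"            # rotate the IP, not the endpoint
    return "target_outcome"           # 200/301/404/target-5xx -> not the proxy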

looks_like_block is yours to define per target — match the CAPTCHA marker or block-page signature you actually see, since a generic 403 can be either a real block or ordinary authorization. Get this classifier right and the rest of the layer almost configures itself; get it wrong and no threshold tuning will save you.
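As a starting point, a looks_like_block helper might be sketched like this. The marker strings are hypothetical placeholders, not signatures from any real target, so replace them with what you actually observe in blocked responses. The stand-in objects only mimic the .text attribute of a requests response so the snippet runs without network access.

```python
from types import SimpleNamespace

# HYPOTHETICAL block-page signatures; swap in markers seen on your own target.
BLOCK_MARKERS = ("captcha", "access denied", "unusual traffic")

def looks_like_block(response, markers=BLOCK_MARKERS) -> bool:
    """Return True when the response body contains a known block-page marker."""
    body = (getattr(response, "text", "") or "").lower()
    return any(marker in body for marker in markers)

# Stand-ins exposing the same `.text` attribute as requests.Response
blocked = SimpleNamespace(status_code=403,
                          text="<h1>Please complete the CAPTCHA</h1>")
forbidden = SimpleNamespace(status_code=403,
                            text='{"error": "token lacks scope"}')
```

Substring matching is deliberately crude. If a target serves block pages with a distinctive header or redirect instead, keying on that is likely to produce fewer false positives than scanning the body.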

Build the circuit breaker state machine

Build the breaker as a small object that owns its state, thresholds, and timing — one instance per proxy. Use a monotonic clock for the timeouts so a system clock adjustment can't corrupt the Open-state window.

import time

class CircuitBreaker:
    """CLOSED -> OPEN -> HALF_OPEN -> CLOSED/OPEN."""

    def __init__(self, *, fail_threshold=4, reset_timeout=30.0, half_open_max=1):
        self.fail_threshold = fail_threshold
        self.reset_timeout = reset_timeout
        self.half_open_max = half_open_max
        self.state = "CLOSED"
        self.failures = 0
        self.opened_at = 0.0
        self.half_open_calls = 0

    def allows(self) -> bool:
        """True if a request may pass right now."""
        if self.state == "OPEN":
            if time.monotonic() - self.opened_at >= self.reset_timeout:
                self.state = "HALF_OPEN"
                self.half_open_calls = 0
            else:
                return False          # fail fast — don't touch a dead proxy
        if self.state == "HALF_OPEN" and self.half_open_calls >= self.half_open_max:
            return False              # only N trial calls while recovering
        if self.state == "HALF_OPEN":
            self.half_open_calls += 1
        return True

    def record_success(self):
        self.failures = 0
        self.half_open_calls = 0
        self.state = "CLOSED"

    def record_failure(self):
        self.failures += 1
        if self.state == "HALF_OPEN" or self.failures >= self.fail_threshold:
            self.state = "OPEN"
            self.opened_at = time.monotonic()

The two non-obvious safeguards: a single failure in Half-Open re-opens the circuit immediately (a recovering proxy doesn't get the full threshold again), and half_open_max caps trial traffic so you never flood a proxy that's just coming back. Verify the machine in isolation before wiring it to network calls — feed it failures until it opens, advance a fake clock past reset_timeout, confirm allows() returns True exactly once, then feed it a success and confirm it closes.
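One way to run that isolation check is to patch time.monotonic with a fake clock. The sketch below carries a condensed copy of the breaker so it runs standalone; the function name check_breaker_lifecycle and the threshold of 3 are illustrative choices, not part of the original design.

```python
import time
from unittest import mock

class CircuitBreaker:
    """Condensed copy of the class above so this snippet runs standalone."""
    def __init__(self, *, fail_threshold=4, reset_timeout=30.0, half_open_max=1):
        self.fail_threshold = fail_threshold
        self.reset_timeout = reset_timeout
        self.half_open_max = half_open_max
        self.state, self.failures = "CLOSED", 0
        self.opened_at, self.half_open_calls = 0.0, 0

    def allows(self):
        if self.state == "OPEN":
            if time.monotonic() - self.opened_at < self.reset_timeout:
                return False
            self.state, self.half_open_calls = "HALF_OPEN", 0
        if self.state == "HALF_OPEN":
            if self.half_open_calls >= self.half_open_max:
                return False
            self.half_open_calls += 1
        return True

    def record_success(self):
        self.failures, self.half_open_calls, self.state = 0, 0, "CLOSED"

    def record_failure(self):
        self.failures += 1
        if self.state == "HALF_OPEN" or self.failures >= self.fail_threshold:
            self.state, self.opened_at = "OPEN", time.monotonic()


def check_breaker_lifecycle():
    """Walk CLOSED -> OPEN -> HALF_OPEN -> CLOSED on a fake clock."""
    now = [1000.0]                                     # mutable fake clock value
    with mock.patch.object(time, "monotonic", lambda: now[0]):
        cb = CircuitBreaker(fail_threshold=3, reset_timeout=30.0)
        for _ in range(3):
            assert cb.allows()
            cb.record_failure()
        assert cb.state == "OPEN" and not cb.allows()  # fails fast while Open

        now[0] += 31.0                                 # jump past reset_timeout
        assert cb.allows()                             # the single Half-Open trial
        assert not cb.allows()                         # a second caller is refused
        cb.record_success()
        assert cb.state == "CLOSED" and cb.allows()
    return True
```

Patching the clock keeps the test instant and deterministic. Passing a clock callable into the constructor would achieve the same thing without mock, at the cost of one more parameter.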

Put a failover selector in front of your residential proxy pool

The failover selector is the layer that turns per-proxy breakers into a resilient pool — it picks a proxy whose circuit allows traffic, runs the request, and reports the outcome back to that proxy's breaker. Round-robin across the healthy circuits so traffic spreads instead of pinning the first available proxy.

class FailoverPool:
    def __init__(self, proxies, breaker_factory):
        self.entries = [{"proxy": p, "breaker": breaker_factory()} for p in proxies]
        self.cursor = 0

    def _pick(self):
        n = len(self.entries)
        for i in range(n):
            entry = self.entries[(self.cursor + i) % n]
            if entry["breaker"].allows():
                self.cursor = (self.cursor + i + 1) % n   # rotate for fairness
                return entry
        return None                                       # every circuit is open

    def execute(self, fn):
        """Run fn(proxy) through healthy proxies until one succeeds."""
        last = None
        for _ in range(len(self.entries)):
            entry = self._pick()
            if entry is None:
                raise RuntimeError("All proxy circuits are open")
            breaker = entry["breaker"]
            try:
                resp = fn(entry["proxy"])
            except Exception as exc:
                if classify(exc=exc) == "proxy_fault":
                    breaker.record_failure()
                    last = exc
                    continue
                if breaker.state == "HALF_OPEN":
                    # Give back the trial slot; otherwise the breaker is stuck
                    # in HALF_OPEN with no trials left and never recovers.
                    breaker.half_open_calls = max(0, breaker.half_open_calls - 1)
                raise                                     # not the proxy's fault
            kind = classify(response=resp)
            if kind in ("proxy_fault", "proxy_auth", "ip_banned"):
                breaker.record_failure()
                last = resp
                continue
            breaker.record_success()
            return resp                                   # target_outcome -> done
        raise RuntimeError(f"Failover exhausted; last={last}")

Notice what execute returns on a target_outcome: the raw response, even if it's a 404, because the proxy succeeded and your application — not the failover layer — owns target errors. When _pick returns None, every circuit is Open; that's your signal to wait, degrade to a fallback tier, or fail the job rather than spin. The next sections decide where these breakers should live and how to size their thresholds.
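For the "wait" option, a bounded backoff wrapper is one possibility. This is a sketch under assumptions: execute_with_backoff and its parameters are hypothetical names, and the pool is anything exposing execute(fn) that raises RuntimeError when no proxy is usable, as FailoverPool does. The stand-in pool exists only to make the snippet runnable without network access.

```python
import random
import time

def execute_with_backoff(pool, fn, *, attempts=4, base_delay=1.0,
                         max_delay=30.0, sleep=time.sleep):
    """Retry pool.execute(fn) with capped exponential backoff."""
    for attempt in range(attempts):
        try:
            return pool.execute(fn)
        except RuntimeError:
            if attempt == attempts - 1:
                raise                      # out of attempts: fail the job loudly
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay * random.uniform(0.8, 1.2))   # jittered wait, no tight loop


class _FlakyPool:
    """Stand-in pool: raises twice, then hands fn a proxy."""
    def __init__(self):
        self.calls = 0

    def execute(self, fn):
        self.calls += 1
        if self.calls < 3:
            raise RuntimeError("All proxy circuits are open")
        return fn("proxy-a")


waits = []                                 # record waits instead of sleeping
result = execute_with_backoff(_FlakyPool(), lambda p: f"ok via {p}",
                              sleep=waits.append)
```

Because an Open circuit cannot move to Half-Open before reset_timeout elapses, waits much shorter than that timeout are unlikely to help; sizing base_delay and max_delay around your reset_timeout is a reasonable first guess.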

Choose where the breaker lives: per-IP, per-endpoint, or per-provider

Place the breaker at the level you can actually control — that depends entirely on whether you hold individual IPs or hit a gateway. This single decision determines whether your failover layer even works, so make it before writing thresholds.

With dedicated residential proxies — a static list of discrete host:port IPs you own — put one breaker per IP. You can see and retire each IP independently, which is exactly what the FailoverPool above assumes.

With a backconnect residential proxy gateway — one endpoint that rotates exit IPs for you — you can't breaker individual IPs, because you don't address them. Here the breaker belongs at the endpoint level (is the gateway responding at all?), and IP-level failure is handled by classification: a 429 or block triggers a session rotation (new IP from the gateway), not a tripped endpoint breaker.

With multiple residential proxy providers, add a breaker per provider on top, so you can fail over from one network to another when an entire provider degrades. A backconnect gateway already does IP-level failover internally, so your job there is endpoint- and provider-level breaking, not per-IP — building per-IP logic on top of a pool you can't address is wasted code. Match the breaker granularity to the addressability you have, and the layer stays simple.
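For the gateway case, the split between "rotate the session" and "trip the endpoint" might be sketched as follows. Everything named here is an assumption: GatewaySession and handle_outcome are hypothetical helpers, gateway.example.com:8000 is a placeholder address, and the user-session-<id> username layout is invented for illustration, since each provider defines its own session syntax.

```python
import uuid

class GatewaySession:
    """Tracks the sticky-session id for one backconnect gateway.

    The `user-session-<id>` username layout is HYPOTHETICAL; check your
    provider's documentation for its real session syntax.
    """
    def __init__(self, host, port, user, password):
        self.host, self.port = host, port
        self.user, self.password = user, password
        self.session_id = uuid.uuid4().hex[:8]

    def rotate(self):
        """Request a fresh exit IP by switching to a new session id."""
        self.session_id = uuid.uuid4().hex[:8]

    @property
    def proxy_url(self):
        return (f"http://{self.user}-session-{self.session_id}:"
                f"{self.password}@{self.host}:{self.port}")


def handle_outcome(kind, session, endpoint_breaker):
    """Bans rotate the session; gateway faults trip the endpoint breaker."""
    if kind == "ip_banned":
        session.rotate()                     # new exit IP; the gateway is fine
    elif kind in ("proxy_fault", "proxy_auth"):
        endpoint_breaker.record_failure()    # the gateway itself is the problem
    else:
        endpoint_breaker.record_success()    # target_outcome


class _CountingBreaker:
    """Stand-in for the endpoint's CircuitBreaker."""
    def __init__(self):
        self.failures = self.successes = 0

    def record_failure(self):
        self.failures += 1

    def record_success(self):
        self.successes += 1


session = GatewaySession("gateway.example.com", 8000, "user", "secret")
breaker = _CountingBreaker()
before = session.session_id
handle_outcome("ip_banned", session, breaker)   # rotates, does not trip
```

The point of the sketch is the asymmetry: a ban changes which exit IP you ask for, while only transport and auth failures count against the gateway's breaker.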

Tune the three thresholds

Set thresholds to your proxy economics, then verify against real trip logs — there's no universal number. Three knobs control the breaker, and each trades sensitivity against tolerance.

  • fail_threshold (start at 3–5). Lower trips faster and suits large pools where benching a proxy costs nothing; higher tolerates transient blips and suits small pools where you can't afford to bench IPs. Below 3, normal network noise trips healthy proxies.
  • reset_timeout (start at 30–60s for transport faults; 5–15 min for IP bans). This is how long a proxy stays benched. Transient gateway errors recover fast; a flagged exit IP needs far longer, which is why ban-class failures deserve their own longer-timeout breaker.
  • half_open_max (start at 1). Keep this at one trial. More just risks re-flooding a proxy that hasn't actually recovered.

Add jitter to reset_timeout (randomize it by ±20%) so breakers that opened together don't all transition to Half-Open at the same instant and stampede a recovering gateway. This thundering-herd guard matters most right after a shared upstream outage, when dozens of circuits would otherwise retry in lockstep. Verify your settings by logging every state transition for a day: if proxies trip and recover repeatedly (flapping), raise fail_threshold or reset_timeout; if dead proxies keep getting traffic, lower fail_threshold or lengthen reset_timeout so they trip sooner and stay benched longer. These are engineering starting points to tune against your own data, not measured constants.
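The ±20% jitter can be sketched as a small helper. The name jittered_timeout and its spread parameter are hypothetical; the breaker class in this guide does not apply jitter on its own, so this shows one way you might add it.

```python
import random

def jittered_timeout(base: float, spread: float = 0.2, rng=random) -> float:
    """Return base scaled by a random factor in [1 - spread, 1 + spread]."""
    return base * rng.uniform(1.0 - spread, 1.0 + spread)

# A 30 s reset_timeout lands somewhere between roughly 24 s and 36 s per trip.
samples = [jittered_timeout(30.0) for _ in range(1000)]
```

To use it, draw one jittered value each time the breaker opens (in record_failure), store it on the instance, and compare against that stored value in allows(). Re-drawing on every allows() call would make the Open window shift between checks.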

Layer fallback tiers for graceful degradation

Stack providers into tiers so the layer degrades gracefully instead of failing hard when a whole pool goes Open. When every circuit in your primary residential proxy network is tripped, the request should fall through to a secondary provider before it ever fails the job.

class TieredFailover:
    def __init__(self, tiers):           # tiers = [primary_pool, secondary_pool, ...]
        self.tiers = tiers

    def execute(self, fn):
        last = None
        for tier in self.tiers:
            try:
                return tier.execute(fn)
            except RuntimeError as exc:   # this tier's circuits all open / exhausted
                last = exc
                continue
        raise RuntimeError(f"All tiers exhausted; last={last}")

Order tiers by cost and quality: a high-trust primary network for normal load, a secondary for overflow and outages, and a final tier that queues or sheds load rather than hammering already-failing infrastructure. Running two independent residential proxy providers as tiers removes the single-vendor outage as a failure mode — if one network has a bad hour, traffic shifts automatically. The boundary to respect: graceful degradation means reduced capacity, not silent data loss, so make the bottom tier visible (alert, queue, or return a clear "capacity exhausted" signal) instead of dropping requests quietly.
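A visible bottom tier could look something like the sketch below. QueueTier, CapacityExhausted, and the alert hook are hypothetical names. The exception subclasses RuntimeError on purpose, so a TieredFailover like the one above would catch it and report it in its final "All tiers exhausted" error.

```python
from collections import deque

class CapacityExhausted(RuntimeError):
    """Raised when every real tier is down and the request was parked or rejected."""

class QueueTier:
    """Bottom tier: park the work for later and signal the shortfall loudly."""
    def __init__(self, max_queued=1000, alert=print):
        self.pending = deque()
        self.max_queued = max_queued
        self.alert = alert                  # swap in your pager or metrics hook

    def execute(self, fn):
        if len(self.pending) >= self.max_queued:
            self.alert("capacity exhausted: queue full, request rejected")
            raise CapacityExhausted("queue full")
        self.pending.append(fn)             # keep the work so it is not lost
        self.alert(f"capacity exhausted: queued ({len(self.pending)} pending)")
        raise CapacityExhausted("queued for later retry")


alerts = []
tier = QueueTier(max_queued=1, alert=alerts.append)
outcomes = []
for _ in range(2):
    try:
        tier.execute(lambda proxy: None)
    except CapacityExhausted as exc:
        outcomes.append(str(exc))
```

Every path through this tier both alerts and raises, so a request is never dropped quietly. Draining tier.pending once upstream circuits close again is left to whatever scheduler the job already uses.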

Build your own failover layer or use a backconnect residential proxy network?

Build the circuit breaker either way — but build the IP-level failover yourself only if you actually hold and address individual IPs. The state-machine code above is identical regardless; the real question is who retires dead exit IPs and replenishes the pool.

| Dimension | Self-built over dedicated proxies | Backconnect residential proxy network |
|---|---|---|
| Failover granularity | Per-IP, fully under your control | Per-endpoint; IP failover is internal to the gateway |
| Who retires dead IPs | You, via your breakers and monitoring | The provider, automatically |
| IP supply / scale | Capped at the IPs you own | Large rotating pool behind one endpoint |
| Where your breaker lives | Per individual IP | Per endpoint and per provider |
| Operational load | You run, tune, and watch the layer | Provider handles IP health; you watch the gateway |
| Best fit | Few dedicated IPs, full control needed | High volume, hands-off IP management |

Choose a self-built failover layer over dedicated residential proxies if all three hold: you own a list of individually addressable residential or ISP IPs, AND you need per-IP control (specific exit IPs for specific tasks), AND you can run and monitor the breaker state machine in production.

Choose a backconnect residential proxy network if any one is true: you hit a single rotating endpoint and can't address individual IPs, OR you need dead IPs retired and replaced automatically across a large pool, OR you want failover spanning a residential proxy network far larger than any list you could source and maintain. A backconnect residential proxy service such as proxy001.com handles IP-level failover inside the gateway, so your circuit breaker only needs endpoint- and provider-level logic — the FailoverPool collapses to one entry per gateway, and the code you wrote still runs. Even then, keep a thin endpoint breaker: a backconnect network removes per-IP work, not the need to fail over when a whole gateway degrades. Be wary of "unlimited residential proxies" claims as a substitute for failover design — unmetered bandwidth doesn't mean every exit IP is healthy, and your breaker is what proves which ones are.

Mistakes that make a circuit breaker worse than none

These bugs turn a breaker into a liability — it'll bench good proxies or stampede recovering ones, which is worse than a plain retry loop.

  • Tripping on target errors. Counting 404s or target-origin 500s as proxy failures retires healthy IPs. Only proxy_fault, proxy_auth, and ip_banned should ever call record_failure.
  • One shared breaker for the whole pool. A single breaker can't tell which proxy failed, so one bad IP trips the entire pool. One breaker per addressable unit, always.
  • No jitter on reset_timeout. Breakers that opened together re-test together and stampede the recovering upstream. Randomize the timeout ±20%.
  • Unbounded Half-Open traffic. Letting many trials through at once re-floods a proxy that hasn't recovered. Cap half_open_max at 1.
  • No alert on proxy_auth (407). It's a config bug that failover masks by rotating forever. Surface it instead of retrying.
  • Spinning when all circuits are Open. Tight-looping a fully-tripped pool wastes CPU and delays recovery. Fall to a fallback tier or back off with a wait.

Quick answers

What is a circuit breaker in a proxy failover layer? A per-proxy state machine with three states — Closed (requests flow, failures counted), Open (fail fast, proxy skipped), and Half-Open (one trial request tests recovery) — that stops your client from repeatedly calling a dead residential proxy.

What should trip a proxy circuit breaker? Only proxy-side failures: connection errors, timeouts, HTTP 502/503/504, 407 auth errors, and 429/CAPTCHA bans. Target outcomes like 404 or a target-origin 500 came back through the proxy and must not trip it.

Do I need a circuit breaker with a backconnect residential proxy? Yes, but only at the endpoint and provider level — the gateway handles IP-level failover internally, so you breaker the gateway, not individual exit IPs.

What threshold values should I start with? A failure threshold of 3–5, a reset timeout of 30–60 seconds for transport faults (5–15 minutes for IP bans), and one Half-Open trial — then tune against your own trip logs.
