137Foundry

Posted on Jun 25

Why Circuit Breakers and Retry Loops Belong Together (and What Each Does On Its Own)

#dataengineering #python #devops #architecture

A retry loop without a circuit breaker keeps hammering a dead upstream until the retry budget is exhausted. A circuit breaker without a retry loop fails immediately on the first transient blip even when a retry would have succeeded. Each pattern solves part of the resilience problem; neither replaces the other; combining them gives you the system most teams actually want.

This piece is about what each one does, why the combination is more than the sum of the parts, and where the patterns most often go wrong in practice.

What a Retry Loop Does

A retry loop is the pattern for absorbing transient failures. Upstream returned a 502 or had a brief network hiccup; the retry loop waits, tries again, and recovers. The pattern works well when the underlying problem is short-lived (seconds to minutes) and the upstream is otherwise healthy.

The retry loop's weakness is the case where the upstream is genuinely down for an extended period. Every job hitting the loop spends its full retry budget before giving up. The aggregate effect on a fleet of clients is wasteful: every client is doing the same retry pattern against the same dead upstream, consuming compute and network on attempts that will not succeed.

A naive retry loop can also lengthen an upstream outage rather than shorten it. When many clients retry on the same fixed schedule, their attempts arrive in synchronized waves that hammer the upstream during its recovery window, which can extend the outage or make recovery harder. Adding jitter to the backoff spreads those waves out.
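As a rough illustration of the pattern described above, here is a minimal standard-library sketch of a retry loop with capped exponential backoff and full jitter. The names `retry_with_backoff` and `TransientFailure`, and the defaults for `attempts`, `base`, and `cap`, are assumptions made for this example rather than any library's API.

```python
import random
import time


class TransientFailure(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def retry_with_backoff(func, attempts=5, base=1.0, cap=10.0,
                       sleep=time.sleep, rng=random.random):
    """Call func until it succeeds or the attempt budget is spent."""
    for attempt in range(attempts):
        try:
            return func()
        except TransientFailure:
            if attempt == attempts - 1:
                raise  # budget exhausted: surface the last failure
            # Full jitter: wait a random fraction of the capped exponential
            # delay so clients that failed together do not retry together.
            ceiling = min(cap, base * 2 ** attempt)
            sleep(rng() * ceiling)
```

The `sleep` and `rng` parameters exist only so the loop can be exercised in tests without real waiting; in normal use the defaults apply.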

What a Circuit Breaker Does

A circuit breaker tracks recent failures against an upstream. When the failure rate crosses a threshold (often expressed as N failures in M attempts, or N consecutive failures), the breaker opens: new requests fail immediately without attempting the upstream call. The application gets a fast failure response instead of waiting for a timeout.

After a cooldown period, the breaker enters a half-open state and allows one trial request through. If it succeeds, the breaker closes and normal operation resumes. If it fails, the breaker re-opens for another cooldown (some implementations lengthen the cooldown after each failed trial) and allows another trial once that cooldown has elapsed.

The breaker's strength is in protecting the upstream from a fleet of retrying clients during an outage, and in letting the calling application know quickly when the upstream is gone so it can take a different path (queue the work for later, serve a degraded response, alert immediately rather than waiting for a long timeout).

The breaker's weakness on its own is that brief transient failures trigger the breaker before they have a chance to resolve. A breaker without a retry loop fails the first request that hits a 502, even though a single retry would have succeeded.
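The closed, open, and half-open states can be sketched in a few lines of standard-library Python. This is a simplified, single-threaded illustration that counts consecutive failures; `SimpleBreaker`, `CircuitOpenError`, and the default thresholds are hypothetical names and values chosen for the example, and a production breaker would also need locking.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying the upstream."""


class SimpleBreaker:
    def __init__(self, fail_max=3, reset_timeout=60.0, clock=time.monotonic):
        self.fail_max = fail_max
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0      # consecutive failures seen so far
        self.opened_at = None  # None means the breaker is closed

    def call(self, func):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("upstream marked down, failing fast")
            # Cooldown elapsed: half-open, so this call goes through as a trial.
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.fail_max:
                # Open the breaker, or re-open it after a failed trial.
                self.opened_at = self.clock()
            raise
        self.failures = 0
        self.opened_at = None  # any success closes the breaker
        return result
```

Note that with no retry inside `func`, a single 502 counts as a failure straight away, which is the weakness described above.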


Why the Combination Works Better

The two patterns operate at different timescales and protect against different failure modes. Retries handle short transient failures within the timeline of a single request. Circuit breakers handle longer outages across many requests.

In the combined pattern, each individual request gets a small retry budget (3 to 5 attempts with exponential backoff) to absorb transient blips. The circuit breaker tracks failures across requests. If too many requests fail despite their retries, the breaker opens and stops sending new requests until the upstream recovers.

This is what production-grade resilience looks like in practice. The application is protected from transient blips by the retry loop, and the upstream is protected from a fleet of retrying clients by the breaker. Both ends of the connection are healthier than either pattern would produce alone.
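To put a number on the "different timescales" point, the short sketch below works out how long one request's retry budget lasts under capped exponential backoff. The function name and the values (1 second base, 10 second cap, no jitter) are assumptions for the arithmetic, not measurements.

```python
def backoff_delays(attempts, base=1.0, cap=10.0):
    """Delays between attempts for capped exponential backoff, without jitter."""
    # N attempts means N - 1 waits: there is no wait after the final attempt.
    return [min(cap, base * 2 ** i) for i in range(attempts - 1)]


delays = backoff_delays(5)
print(delays)       # [1.0, 2.0, 4.0, 8.0]
print(sum(delays))  # 15.0
```

Under these assumptions a 5-attempt budget spends about 15 seconds waiting, plus the time of the calls themselves. That covers a brief blip, while an outage lasting minutes is the breaker's job.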

What Each Pattern Looks Like In Code

The retry loop is the simpler of the two. Using Tenacity in Python:

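import httpx
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential_jitter

UPSTREAM_URL = "https://upstream.example/api"  # placeholder; substitute your own endpoint

class TransientFailure(Exception):
    """A failure worth retrying, such as a 5xx response."""

# Retry 5xx responses and network-level errors, up to 5 attempts, with jittered exponential backoff.
@retry(retry=retry_if_exception_type((TransientFailure, httpx.TransportError)), wait=wait_exponential_jitter(initial=1, max=10), stop=stop_after_attempt(5))
def call_upstream(payload):
    response = httpx.post(UPSTREAM_URL, json=payload)
    if 500 <= response.status_code < 600:
        raise TransientFailure(f"Upstream returned {response.status_code}")
    return response.json()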

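The circuit breaker wraps the function above. Pybreaker is a widely used Python implementation:

import pybreaker

# Open after 10 consecutive failures; allow a trial call 60 seconds later.
breaker = pybreaker.CircuitBreaker(fail_max=10, reset_timeout=60)

@breaker
def call_upstream_protected(payload):
    # The breaker records one failure per request, after the retries run out.
    return call_upstream(payload)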

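The breaker decorator wraps the retry-decorated function, and the order matters. Because the breaker sits outside the retry loop, it records one failure per request, and only after that request has used up its retries. (Tenacity raises a RetryError at that point unless reraise=True is set; the breaker counts either exception as a failure.) Each individual request gets the retry loop's protection, and the breaker tracks failures across requests.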

Tuning the Two Together

The thresholds need to be tuned in concert. If the retry loop has a 5-attempt budget and the breaker opens after 10 failed requests, a single bad upstream period can produce 50 upstream calls (10 requests × 5 attempts each) before the breaker opens, and that is per client if each client keeps its own breaker state. That might be fine; it might be too aggressive depending on your fleet size.
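The arithmetic above is easy to script when comparing candidate settings. The helper below is a hypothetical sketch; it assumes every failed request spends its full retry budget and that each client process keeps its own in-memory breaker.

```python
def calls_before_open(fail_max, attempts_per_request, clients=1):
    """Upper bound on upstream calls made before every client's breaker opens."""
    # Each failed request spends its whole retry budget before the breaker
    # counts it, and each client process keeps its own breaker state.
    return fail_max * attempts_per_request * clients


print(calls_before_open(10, 5))              # 50
print(calls_before_open(10, 5, clients=20))  # 1000
```

Seen across a fleet of 20 clients, the same settings allow up to 1,000 calls against an upstream that is already down, which is the number to weigh when deciding whether the thresholds are too loose.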

The rule of thumb I have settled on: set the breaker to open after a number of failed requests equal to roughly twice the count of clients you have hitting the upstream in parallel. This lets a few clients hit transient failures and recover before the breaker decides the upstream is genuinely down. The exact numbers depend on traffic patterns and how strict you want the failure detection to be.

The cooldown after the breaker opens should be at least as long as the typical recovery time of the upstream. For most internal services that recovery is a few minutes; for external services it can be much longer. Set the initial cooldown conservatively (60 seconds is a reasonable default) and tune based on observed behavior.
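One way to avoid betting on a single cooldown value is to lengthen the cooldown each time a half-open trial fails. The sketch below shows only the schedule; `next_cooldown` and its 600 second cap are hypothetical, and not every breaker library supports an escalating cooldown, so treat this as an illustration of the idea rather than a configuration recipe.

```python
def next_cooldown(base, failed_trials, cap=600.0):
    """Cooldown before the next half-open trial, doubling per failed trial."""
    return min(cap, base * 2 ** failed_trials)


print([next_cooldown(60.0, n) for n in range(5)])
# [60.0, 120.0, 240.0, 480.0, 600.0]
```

Starting at 60 seconds keeps recovery detection quick for short outages, while the doubling reduces trial traffic against an upstream that stays down for longer.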

When the Combination Is Overkill

For low-volume jobs that run infrequently, the retry loop alone is usually enough. The circuit breaker pattern adds operational complexity that does not pay off when the failure rate is naturally low because the request volume is low.

For high-volume, mission-critical paths (every external API call your service makes during user-facing requests), the combination is worth the complexity. The cost of an outage propagating across the fleet, hammering an already-struggling upstream, is genuinely high in those environments.

The decision is the same shape as most architecture decisions: match the pattern complexity to the actual operational complexity of the system. Adding circuit breakers everywhere because they sound good is the kind of overengineering that creates more problems than it solves.

What Goes Wrong Most Often

The two most common failure modes I see when reviewing existing retry+breaker code:

The breaker's reset_timeout is too short. Half-open trial requests fire during the upstream's continued degradation, the trial fails, the breaker opens for another short cooldown, and the cycle repeats. The breaker spends most of its time flapping between half-open and open rather than providing useful protection.

The breaker's failure counting includes intentional 4xx responses that the application is expected to handle. A 400 Bad Request for malformed input is not a sign that the upstream is unhealthy; it is a sign that the request itself was wrong. Configure the breaker to only count 5xx responses and connection-level failures, not 4xx responses.
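The classification rule in the second item can be written as a small predicate. This is a standard-library sketch with a hypothetical function name; it uses Python's built-in `ConnectionError` and `TimeoutError` as stand-ins for whatever connection-level exceptions your HTTP client raises.

```python
def counts_as_upstream_failure(status_code=None, error=None):
    """Return True if the outcome should count against the breaker."""
    if error is not None:
        # Connection-level problems: refused connections, resets, timeouts.
        return isinstance(error, (ConnectionError, TimeoutError))
    # 5xx means the upstream is struggling; 4xx means the request was wrong.
    return status_code is not None and 500 <= status_code < 600
```

How the predicate gets wired in depends on the breaker library; many let you exclude specific exception types from the failure count, so check the documentation for the one you use.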

References

The longer walkthrough on how to build a self-healing retry strategy for data automation jobs covers the retry loop pattern end to end including failure classification, idempotency, and the circuit-breaker integration. A data automation team that builds these patterns regularly tends to settle on similar defaults across very different stacks; the convergence is a good signal that the underlying tradeoffs are well-understood.

The Wikipedia article on the circuit breaker design pattern covers the broader history and motivation. Pybreaker's documentation covers the Python implementation. Microsoft's cloud design patterns include both retry and circuit-breaker discussions worth reading.

The Pattern That Holds Up

Reach for the retry loop for any external call that might fail transiently. Reach for the circuit breaker for high-volume calls where a fleet of retrying clients could pile on a struggling upstream. The combination is the right default for production systems where on-call burden and upstream protection both matter.

Get the thresholds right by starting conservative and tuning based on observed behavior. Both patterns have well-tested library implementations; reach for those rather than building from scratch. The math is subtle and the failure modes are easy to get wrong.

The shape of the result is a system that absorbs transient failures invisibly, protects both ends of every call during longer outages, and only escalates to humans when the resilience layer has done what it can and exhausted its options. That is the system most teams actually want; it just takes building the two patterns together to get there.

DEV Community