We run a content distribution pipeline that publishes to 17 social platforms across 7 brands. Most platforms route through a single Chrome extension that drives signed-in browser sessions. When that extension dies, every platform on the bridge fails the same way: the publisher sends a payload, the bridge timeout fires, the publisher logs `bridge returned None`, and moves on.
For a long time, we let those failures accumulate freely. Each one cost 30 to 90 seconds of wall time. With 7 brands all hitting the same broken bridge, an outage burned 30+ minutes of attempt budget per hour for nothing.
So we wrapped a circuit breaker around the bridge. Five failures inside a 60-minute window, across any combination of brands, trip the gate. While the gate is open, every platform routed through the bridge skips the network call and returns a structured blocked reason in single-digit milliseconds. The first successful bridge-routed publish after the latest failure clears the gate immediately.
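As a sketch of what that fast path can look like (the names `PublishResult`, `publish`, and `bridge_circuit_open` are illustrative, not from our codebase):

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class PublishResult:
    ok: bool
    blocked_reason: Optional[str] = None
    blocked_until: Optional[datetime] = None

def publish(platform: str, payload: dict,
            blocked_until: Optional[datetime] = None) -> PublishResult:
    # Fast path: the gate is open, so skip the bridge call entirely and
    # return a structured reason in milliseconds instead of a 30-90 s timeout.
    if blocked_until is not None:
        return PublishResult(ok=False,
                             blocked_reason="bridge_circuit_open",
                             blocked_until=blocked_until)
    # ...normal bridge-routed publish would happen here...
    return PublishResult(ok=True)
```

The point of the structured reason is that downstream consumers can tell a skipped attempt from a real failure without parsing log strings.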
That is the standard pattern. Here is the part that took us three weeks of measurements to get right: the hold time.
First attempt: 15 minutes
Our first version held the gate open for 15 minutes. We chose 15 because most extension hangs we had seen recovered inside 10. The math seemed reasonable.
It was wrong. The autoflow handler that pushes content to the bridge has its own retry cadence, and it polls every 90 seconds. A 15-minute hold cleared right before the next batch of retries hit. The bridge would still be sick, the retries would burn through their attempt budget, fail again, and re-trip the breaker. We watched the dashboard ping-pong between open and tripped twelve times in one hour during a single underlying outage.
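A back-of-the-envelope simulation makes the flap cost concrete. The numbers here are assumptions for illustration: a 2-hour outage, the 90-second retry cadence, and the 5-failure trip threshold.

```python
def simulate(outage_min: float, retry_s: float = 90,
             threshold: int = 5, hold_min: float = 15):
    """Count breaker trips and real attempts burned during one outage."""
    t = 0.0          # minutes elapsed inside the outage
    fails = 0        # consecutive failures in the current cycle
    trips = attempts = 0
    while t < outage_min:
        attempts += 1
        fails += 1
        if fails >= threshold:
            trips += 1           # gate trips: no calls for hold_min
            fails = 0
            t += hold_min
        else:
            t += retry_s / 60    # wait out the retry cadence
    return trips, attempts

print(simulate(120, hold_min=15))  # → (6, 30): six trips, 30 wasted attempts
print(simulate(120, hold_min=30))  # → (4, 20): a third fewer of both
```

Each of those wasted attempts costs a 30-to-90-second timeout, so the difference between the two hold times compounds quickly over a long outage.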
The fix: 30 minutes, with success-path bypass
The fix was to extend the hold to 30 minutes. Once the hold time exceeded the autoflow cadence by a comfortable margin, the breaker stopped flapping. Persistent outages cost the same number of failed attempts they always did, but the system stopped wasting the 11 minutes between cycles. Transient outages still recover in single-digit minutes because a single successful publish anywhere in the system clears the gate immediately, regardless of the hold ceiling.
```python
def _bridge_down_blocked_until(platform):
    if platform not in BRIDGE_ROUTED_PLATFORMS:
        return None

    cutoff = now_utc() - timedelta(minutes=60)
    recent_fails = []
    latest_bridge_success = None

    for record in tail_publish_log(400):
        if record.platform not in BRIDGE_ROUTED_PLATFORMS:
            continue
        if record.ts < cutoff:
            continue
        if record.ok and record.route_used in BRIDGE_ALIVE_ROUTES:
            latest_bridge_success = max(latest_bridge_success or record.ts,
                                        record.ts)
        elif record.error and any(sig in record.error.lower()
                                  for sig in BRIDGE_FAIL_SIGNATURES):
            recent_fails.append(record.ts)

    if len(recent_fails) < 5:
        return None

    latest_fail = max(recent_fails)
    # Success-path bypass: any bridge-routed success newer than the
    # latest failure proves the extension is alive again.
    if latest_bridge_success and latest_bridge_success > latest_fail:
        return None

    block_until = latest_fail + timedelta(minutes=30)
    return block_until if block_until > now_utc() else None
```
Three things to notice in that snippet.
The breaker counts cross-brand failures. Per-brand counting on a shared extension never reaches the threshold fast enough.
The success-path check is explicit: any bridge-routed publish that succeeds on any brand on any bridge-routed platform proves the extension is alive. The breaker clears regardless of the timer.
The only thing the timer does is bound the worst case when nothing is succeeding. It is not the primary signal.
Three lessons
One. The hold time is not about the underlying outage duration. It is about how much longer than the retry cadence you can hold the door shut before retries burn through their budget. If your retry policy is 90 seconds and your hold is 15 minutes, you are inviting flap. Hold longer than the longest retry interval you actually run, with margin.
Two. Cross-brand counts beat per-brand counts on shared infrastructure. When seven brands share one extension, each brand individually only sees one or two failures before the gate would otherwise need to open. Counting cross-brand makes the breaker trip on the actual problem (one shared dependency is dead) instead of on per-brand thresholds that mask the shared failure.
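A toy illustration of the difference (brand names made up): suppose each of the seven brands logs exactly one bridge failure inside the window.

```python
from collections import Counter

THRESHOLD = 5

# One bridge failure per brand inside the 60-minute window.
window_failures = [f"brand_{i}" for i in range(1, 8)]

per_brand = Counter(window_failures)
per_brand_trips = any(count >= THRESHOLD for count in per_brand.values())
cross_brand_trips = len(window_failures) >= THRESHOLD

print(per_brand_trips)    # False: no single brand reaches the threshold
print(cross_brand_trips)  # True: the shared dependency is clearly down
```

Same seven failures, same dead extension; only the pooled count sees the outage.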
Three. Make the success path explicit. A single success from any brand on any bridge-routed platform is a definitive signal that the extension is alive. Code the cleanup against that signal. Do not wait for the timer.
The implementation is about 80 lines of Python. The hard parts were not the data structures. They were the timing parameters, and we only got those right after watching real production traffic ping-pong against a misconfigured hold time for several hours.
If you have a cross-platform bridge or a shared upstream that fails as a unit, this pattern will pay for itself the first day a transient hang lands during a content burst.