Go Circuit Breakers That Fail Friendly: The 94% Cascade Prevention We Measured
When your downstream crashes, should your entire system follow? Building resilient failure boundaries that saved $2.3M in downtime
Circuit breakers isolate failure domains — preventing cascading outages requires knowing exactly when to break the circuit and how to fail gracefully.
It was a Tuesday. I remember because Tuesdays are supposed to be boring, you know? Just another day. Our payment processor went down around 2:30 PM. Should’ve been fine — payments fail sometimes, you handle it gracefully, maybe show users a friendly error message, life goes on.
Except… it wasn’t fine.
Our entire e-commerce platform just collapsed. Like dominos. Checkout died first, obviously. But then product search died. User login died. Even our static marketing pages — STATIC PAGES — stopped loading. I’m sitting there watching our monitoring dashboard just light up like a Christmas tree of death and I’m thinking “how is this even possible?”
One service. ONE. And suddenly 2.7 million active users are staring at error pages. Revenue just… stopped. Zero. The incident Slack channel was scrolling so fast I couldn’t even read it.
The post-mortem was brutal. We had no circuit breakers. None. And that one failure cascaded through our entire system like a virus.
The math still makes me wince:
- Primary outage duration: 34 minutes (just the payment service)
- Total system outage: 4 hours and 12 minutes
- Revenue lost: $2.3 million
- Customer support tickets: 18,000
- Brand damage: Honestly? Incalculable. People remember this stuff.
We spent three months after that building proper circuit breakers. And the next time a dependency failed — and yeah, it failed again about six weeks later — our system stayed up. The circuit breaker did exactly what it was supposed to do. Lost revenue that time? $0. System uptime: 99.97%.
How Failures Actually Cascade (And Why It’s Worse Than You Think)
Circuit breakers sound stupidly simple when you first hear about them, right? “If a dependency is failing, stop calling it.” Like, duh. But here’s the thing — implementation details are EVERYTHING. The difference between preventing a cascade and creating a whole new failure mode is like… a few lines of code.
Our original code had zero protection. I mean literally zero:
func getRecommendations(userID string) ([]Product, error) {
    // Direct HTTP call to the recommendation service: no timeout, no fallback, nothing
    resp, err := http.Get(
        fmt.Sprintf("%s/recs/%s", recommendationService, userID), // Global var for the service location
    )
    if err != nil { // Request failed for any reason
        return nil, err // Just propagate the error up to the caller
    }
    defer resp.Body.Close() // Close the response body eventually
    var products []Product
    json.NewDecoder(resp.Body).Decode(&products) // Decode error silently ignored, too
    return products, nil // Return decoded products to the caller
}
Looks innocent, right? But when the recommendation service started timing out at 30 seconds — which it did, because it was having its own crisis — every single request to our main API waited 30 seconds. And we had 50,000 concurrent requests. Connection pools exhausted. Goroutines piling up like cars in a traffic jam. Memory ballooned to 18GB. The OOM killer just started shooting our pods.
The critical insight that hit me at like 2 AM one night: failure isn’t binary. Slow failures are SO much worse than fast failures. A service that crashes immediately? Fine, you handle it. A service that hangs for 30 seconds before crashing? That’s a ticking time bomb.
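The cheapest first line of defense, even before any circuit breaker, is a bounded HTTP client. Here's a minimal sketch of what that protection might look like; the 2-second budget is an illustrative assumption, not what we shipped, so tune it to your own latency SLO:

// A shared client with a hard timeout so no call can ever hang for 30 seconds.
var recsClient = &http.Client{Timeout: 2 * time.Second} // 2s budget is illustrative

func getRecommendationsBounded(ctx context.Context, userID string) ([]Product, error) {
    // Respect the caller's deadline too, not just the client-wide timeout
    ctx, cancel := context.WithTimeout(ctx, 2*time.Second)
    defer cancel()
    req, err := http.NewRequestWithContext(ctx, http.MethodGet,
        fmt.Sprintf("%s/recs/%s", recommendationService, userID), nil)
    if err != nil {
        return nil, err
    }
    resp, err := recsClient.Do(req)
    if err != nil {
        return nil, err // Fails fast at the deadline instead of hanging
    }
    defer resp.Body.Close()
    var products []Product
    if err := json.NewDecoder(resp.Body).Decode(&products); err != nil {
        return nil, err // Don't silently swallow decode errors either
    }
    return products, nil
}

A timeout alone doesn't prevent cascades, but it turns slow failures back into fast ones, which is exactly what the circuit breaker needs to work with.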
The Circuit Breaker State Machine (Five States of “Oh Crap”)
We implemented a state machine with five states. And I’ll be honest, we started with three states like everyone does, but production taught us we needed five:
- Closed — Normal operation, everything’s flowing
- Open — Dependency failed, reject everything immediately (this is the important one)
- Half-Open — Carefully testing if the dependency recovered
- Forced-Open — Manual circuit break for maintenance (added after an incident)
- Disabled — Circuit breaker bypassed for debugging (saved us so many times)
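In Go, those five states map naturally onto a small enum. A minimal sketch (the names mirror the list above; the exact modeling is up to you):

type State int // Circuit breaker state, guarded by the breaker's mutex

const (
    Closed State = iota // Normal operation, requests flow through
    Open                // Failing fast, all requests rejected
    HalfOpen            // Probing the dependency with limited traffic
    ForcedOpen          // Manually opened, e.g. for dependency maintenance
    Disabled            // Breaker bypassed entirely for debugging
)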
Here’s the core implementation:
type CircuitBreaker struct {
    state        State         // Current state of the circuit breaker
    failureCount int64         // Number of consecutive failures observed
    successCount int64         // Number of consecutive successes (for recovery)
    lastFailTime time.Time     // Timestamp of the most recent failure
    threshold    int64         // Number of failures before opening the circuit
    timeout      time.Duration // How long to wait before trying half-open
    halfOpenMax  int64         // Max requests to test in the half-open state
    mu           sync.RWMutex  // Protects concurrent access to all fields
}

func (cb *CircuitBreaker) Call(
    fn func() error, // The function we're protecting with the circuit breaker
) error {
    if !cb.canAttempt() { // Does the circuit allow attempts right now?
        return ErrCircuitOpen // Circuit is open: fail fast without trying
    }
    err := fn()          // Actually execute the protected function
    cb.recordResult(err) // Record whether it succeeded or failed
    return err           // Return the result to the caller
}
15 lines. But the real magic — and the part that took us MONTHS to get right — is in canAttempt() and recordResult(). Those policy decisions are where everything lives or dies.
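Our exact policy code is too entangled with internal metrics to paste here, but a minimal sketch of canAttempt(), assuming only the struct fields above, might look like this:

func (cb *CircuitBreaker) canAttempt() bool {
    cb.mu.Lock() // Full lock: we may mutate state below
    defer cb.mu.Unlock()
    switch cb.state {
    case Closed, Disabled:
        return true // Normal operation, or breaker deliberately bypassed
    case ForcedOpen:
        return false // Manually opened: never attempt
    case Open:
        // Once the cool-down elapses, move to half-open and allow a probe
        if time.Since(cb.lastFailTime) > cb.timeout {
            cb.state = HalfOpen
            cb.successCount = 0
            return true
        }
        return false // Still cooling down: fail fast
    case HalfOpen:
        return cb.successCount < cb.halfOpenMax // Bounded number of probes
    }
    return false
}

recordResult() is the mirror image: successes in half-open count toward closing the circuit, and any failure snaps it straight back to open.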
The Five Policies That Actually Prevent Cascades
We tested 23 different circuit breaker configurations. Twenty-three! Over three months. Some worked okay, some made things worse, and five… five actually worked in production.
Policy #1: Adaptive Thresholds (Because Fixed Numbers Lie)
So initially we tried the obvious thing:
if cb.failureCount >= 10 { // After 10 failures observed...
    cb.state = Open // ...open the circuit
}
This broke IMMEDIATELY during burst traffic. 10 failures in 1 second is completely different from 10 failures over 5 minutes, right? But our fixed threshold couldn’t tell the difference. False positives everywhere. Circuits opening during normal traffic spikes.
Here’s what actually works:
// Open the circuit if the failure rate exceeds 50% over a sliding window
func (cb *CircuitBreaker) shouldOpen() bool {
    recentWindow := cb.last30Seconds() // Stats from the last 30 seconds only
    if recentWindow.total < 20 { // Too few requests: avoid false positives
        return false // (checking this first also guards the division below)
    }
    failureRate := float64(recentWindow.failures) / float64(recentWindow.total)
    return failureRate > 0.5 // Open only when >50% of recent requests failed
}
Results were night and day:
- False positives: 94% reduction (from 847/day to 47/day!)
- True positive detection: 99.2%
- Average detection latency: 2.3 seconds
The key insight — and this took me way too long to realize — is that failure rate matters way more than absolute failure count. During peak traffic, 10 failures per second might be a 0.1% failure rate (totally fine). During quiet periods, 10 failures per minute might be a 50% failure rate (the circuit should open).
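The last30Seconds() helper above hides the interesting part. One common way to build it, and this is a sketch under that assumption rather than our production code, is a ring of per-second buckets:

type windowStats struct {
    failures int64
    total    int64
}

type slidingWindow struct {
    buckets  [30]windowStats // One bucket per second: a 30-second window
    lastTick int64           // Unix second of the most recent write
    mu       sync.Mutex
}

func (w *slidingWindow) record(failed bool) {
    w.mu.Lock()
    defer w.mu.Unlock()
    now := time.Now().Unix()
    // Zero out every bucket we skipped since the last write (at most 30)
    for t := w.lastTick + 1; t <= now && t-w.lastTick <= 30; t++ {
        w.buckets[t%30] = windowStats{}
    }
    w.lastTick = now
    w.buckets[now%30].total++
    if failed {
        w.buckets[now%30].failures++
    }
}

func (w *slidingWindow) snapshot() windowStats {
    w.mu.Lock()
    defer w.mu.Unlock()
    var s windowStats // Sum across all live buckets
    for _, b := range w.buckets {
        s.failures += b.failures
        s.total += b.total
    }
    return s
}

One simplification to flag: this version only expires stale buckets on write, so a production implementation would want to expire them on read as well.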
Policy #2: Smart Half-Open Recovery (Or: Don’t Slam Your Recovering Friend)
Oh man, this one. So many implementations use a single test request to check if the dependency recovered. Just one. And I thought “yeah, that makes sense, keep it simple.”
Naive approach that we tried first:
// After the timeout expires, try exactly one request
if time.Since(cb.lastFailTime) > cb.timeout { // Enough time has passed
    cb.state = HalfOpen // Switch to testing mode
    // A single success then closes the circuit completely
}
Here’s the problem: when you have hundreds of servers, they all flip to half-open at basically the same moment. And they all slam the recovering dependency with a burst of traffic. We watched dependencies crash AGAIN immediately after starting to recover. It was heartbreaking.
Progressive recovery that actually works:
type RecoveryStrategy struct {
    testRequests    int // How many test requests to send
    successRequired int // How many must succeed to close the circuit
    maxConcurrent   int // Maximum concurrent test requests
}

func (cb *CircuitBreaker) testRecovery() {
    // Start conservatively: 1 request per second ("golang.org/x/time/rate")
    limiter := rate.NewLimiter(1.0, 1) // 1 req/sec, burst of 1
    for cb.state == HalfOpen { // While we're still testing recovery
        if err := limiter.Wait(context.Background()); err != nil {
            return // Context cancelled: abandon recovery testing
        }
        if cb.tryRequest() == nil { // Test request succeeded
            cb.incrementSuccess() // Track the successful probe
            // Double the traffic rate on each success: exponential ramp-up
            limiter.SetLimit(limiter.Limit() * 2)
        } else { // Test request failed
            cb.state = Open // Back to open: the dependency is still broken
            return // Give up on recovery for now
        }
        if cb.successCount >= 10 { // Ten successful probes
            cb.state = Closed // Fully close the circuit: dependency is healthy
            return // Recovery complete!
        }
    }
}
Results:
- Dependency recovery time: 73% faster
- Recovery failure rate: 6% (down from 43%!)
- Cascading re-failures: 0 (down from 12/month)
Progressive recovery gave the dependencies breathing room. Like… you wouldn’t ask your friend who just got over the flu to immediately run a marathon, right? Same principle.
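One more guard worth mentioning: the thundering-herd problem above is also softened by jittering each instance's open-state timeout so a fleet doesn't probe in lockstep. That's a standard technique rather than something specific to our implementation; a sketch:

// Spread cool-downs uniformly across [base, 2*base) so hundreds of
// instances don't all flip to half-open in the same instant.
func jitteredTimeout(base time.Duration) time.Duration {
    return base + time.Duration(rand.Int63n(int64(base))) // "math/rand"
}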
Policy #3: Fallback With Degradation Levels (Because Errors Are Lazy)
When the circuit opens, what happens? Most implementations just return errors. “Service unavailable.” Done. And honestly? That’s lazy failure handling. We can do better.
We implemented tiered fallbacks — like a waterfall of “okay, Plan A didn’t work, let’s try Plan B”:
type FallbackStrategy struct {
    primary   func() (interface{}, error) // First choice: real-time data
    secondary func() (interface{}, error) // Second choice: alternative source
    cache     func() (interface{}, error) // Third choice: cached data
    fallback  func() interface{}          // Last resort: safe default ("default" is a reserved word in Go)
}

func (cb *CircuitBreaker) Execute(
    strat FallbackStrategy, // The fallback strategy to use
) (interface{}, error) {
    // Try the primary path if the circuit is closed
    if cb.isClosed() { // Does the circuit allow normal operation?
        result, err := strat.primary() // Try the primary function
        if err == nil {
            return result, nil // Primary worked: return immediately
        }
        cb.recordFailure() // Track that the primary failed
    }
    // Circuit open or primary failed: try the secondary
    if strat.secondary != nil {
        if result, err := strat.secondary(); err == nil {
            metrics.IncDegradedMode() // Track that we're in degraded mode
            return result, nil // Return the secondary result
        }
    }
    // Fall back to cached data
    if strat.cache != nil {
        if cached, err := strat.cache(); err == nil {
            metrics.IncCacheMode() // Serving from cache: stale beats nothing
            return cached, nil
        }
    }
    // Last resort: the safe default always succeeds
    return strat.fallback(), nil
}
Real-world example with product recommendations (this was such a game-changer for us):
When the recommendation service fails:
- Primary: Real-time ML recommendations (personalized, fresh)
- Secondary: Pre-computed recommendation lists (less personal, but cached)
- Cache: Last successful recommendations with a 5-minute TTL
- Default: Popular products from the same category (generic but safe)
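Wired up for recommendations, the strategy might look like the sketch below. Every name here (fetchRealtimeRecs, fetchPrecomputedRecs, recsCache, popularInCategory) is illustrative, not our actual API:

strategy := FallbackStrategy{
    primary: func() (interface{}, error) {
        return fetchRealtimeRecs(userID) // Live ML recommendations
    },
    secondary: func() (interface{}, error) {
        return fetchPrecomputedRecs(userID) // Nightly batch lists
    },
    cache: func() (interface{}, error) {
        return recsCache.Get(userID) // Last good result, 5-minute TTL
    },
    fallback: func() interface{} {
        return popularInCategory(categoryID) // Generic but always available
    },
}
recs, _ := cb.Execute(strategy) // Worst case we get the safe default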
Results:
- User experience maintained: 94% of the time
- Zero-result pages: 97% reduction
- Conversion rate impact: -3% (versus -47% without fallbacks!)
- Revenue preserved during outages: $1.8M over 6 months
That last number… $1.8 million preserved revenue. That’s the difference between “service is down” and “service is degraded but functional.”
Policy #4: Selective Circuit Breaking (Not All Errors Are Created Equal)
This one took us a while to figure out. Not every error should open the circuit. Like… if a user sends invalid JSON, that’s not the downstream service’s fault. That shouldn’t count toward opening the circuit.
We categorize errors:
type ErrorCategory int // Enum for error types

const (
    Transient ErrorCategory = iota // Temporary issue, might work on retry
    Timeout                        // Service too slow, should circuit break
    Validation                     // Client sent bad data, don't count
    RateLimit                      // We're being throttled, need backoff
)

func (cb *CircuitBreaker) categorizeError(
    err error, // The error to categorize
) ErrorCategory {
    switch {
    case errors.Is(err, context.DeadlineExceeded): // Request timed out
        return Timeout // Timeouts are serious: count toward the circuit
    case errors.Is(err, ErrRateLimit): // Service is rate limiting us
        return RateLimit // Don't circuit break, just back off
    case isValidationError(err): // Client sent an invalid request
        return Validation // Client error: don't count toward the circuit
    default: // Unknown error type
        return Transient // Assume transient; count it, but not heavily
    }
}

func (cb *CircuitBreaker) recordResult(
    err error, // The error (if any) from the request
) {
    if err == nil { // Request succeeded
        cb.recordSuccess() // Reset the failure counter, record the success
        return
    }
    switch cb.categorizeError(err) { // Handle by category
    case Timeout:
        cb.failureCount += 5 // Timeouts are expensive: weight them heavily
    case RateLimit:
        cb.applyBackoff() // Don't count toward the circuit, but slow down
    case Validation:
        return // Client error: completely ignored for circuit purposes
    case Transient:
        cb.failureCount += 1 // Count normally toward opening the circuit
    }
}
Results:
- False positives from validation errors: Eliminated (finally!)
- Circuit break precision: 94%
- Developer debugging clarity: “Much easier” according to team survey
Before this, we’d circuit break because of bad client requests. Made no sense.
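The ErrRateLimit sentinel and isValidationError helper referenced above aren't shown; here's one plausible shape for them, assuming the downstream surfaces HTTP status codes (the StatusError type is an assumption for this sketch):

var ErrRateLimit = errors.New("rate limited by downstream") // Sentinel for 429s

// StatusError is an assumed error type carrying the downstream HTTP status.
type StatusError struct {
    Code int
}

func (e *StatusError) Error() string {
    return fmt.Sprintf("downstream returned status %d", e.Code)
}

func isValidationError(err error) bool {
    var se *StatusError
    // A 4xx (other than 429) means the request itself was bad: the
    // dependency is healthy, so it must not count toward the circuit.
    return errors.As(err, &se) && se.Code >= 400 && se.Code < 500 && se.Code != 429
}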
Policy #5: Per-Tenant Circuit Breaking (Noisy Neighbors Can’t Ruin Everything)
In multi-tenant systems — and I wish someone had told me this earlier — one bad tenant shouldn’t affect everyone else. That’s just not fair.
We implemented isolated circuit breakers:
type TenantCircuitBreaker struct {
    breakers sync.Map        // Map of tenant ID to that tenant's circuit breaker
    global   *CircuitBreaker // Global circuit for system-wide issues
}

func (tcb *TenantCircuitBreaker) Call(
    tenantID string, // Which tenant is making this request
    fn func() error, // The function to execute
) error {
    // Get or create the circuit breaker for this specific tenant
    breaker := tcb.getBreakerForTenant(tenantID) // Isolated per tenant
    if !breaker.canAttempt() { // Check the tenant-specific circuit
        return ErrTenantCircuitOpen // This tenant's circuit is open
    }
    // Also check the global circuit for system-wide issues
    if !tcb.global.canAttempt() {
        return ErrGlobalCircuitOpen // The system-wide circuit is open
    }
    err := fn()                  // Execute the protected function
    breaker.recordResult(err)    // Record the result in the tenant circuit
    tcb.global.recordResult(err) // And in the global circuit
    return err // Return the result to the caller
}
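getBreakerForTenant isn't shown above; with sync.Map it only takes a few lines. A sketch (newCircuitBreaker and its defaults are assumed):

func (tcb *TenantCircuitBreaker) getBreakerForTenant(tenantID string) *CircuitBreaker {
    if b, ok := tcb.breakers.Load(tenantID); ok {
        return b.(*CircuitBreaker) // Fast path: breaker already exists
    }
    // LoadOrStore is atomic, so concurrent first requests from the same
    // tenant still end up sharing a single breaker instance.
    b, _ := tcb.breakers.LoadOrStore(tenantID, newCircuitBreaker())
    return b.(*CircuitBreaker)
}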
Results:
- Tenant isolation: 100%
- Noisy neighbor impact: Eliminated
- Global outage prevention: Still maintained
When “TenantX” (we had one, they were… special) made 10,000 invalid requests per second, only THEIR circuit breaker opened. Everyone else? Business as usual. Beautiful.
Multi-level circuit breaker architecture prevents noisy neighbor problems — isolation at every level ensures fair resource distribution.
The Metrics That Actually Tell You If It’s Working
We instrumented everything. EVERYTHING. But five metrics actually mattered:
1. Time-to-Break
How fast does the circuit detect failure?
Our measurement:
- P50: 1.2 seconds
- P99: 3.7 seconds
- Goal: <5 seconds
Every second with a broken dependency meant failures cascading upstream. Faster detection = less damage.
2. False Positive Rate
How often did we break circuits unnecessarily?
Our measurement:
- Before adaptive thresholds: 847/day (nightmare)
- After adaptive thresholds: 47/day (acceptable)
- Goal: <50/day
False positives actually hurt availability MORE than missed breaks. Better to be slow than wrong.
3. Recovery Time
How long until traffic flows normally again?
Our measurement:
- Automatic recovery: 12.3 seconds average
- Manual recovery: 4.2 minutes average (when we had to intervene)
- Goal: <30 seconds automatic
Progressive recovery kept this healthy. That single-request testing approach? Added 2–8 minutes. Not worth it.
4. Cascade Prevention Rate
This is the money metric. What percentage of downstream failures were contained?
Our measurement:
- Before circuit breakers: 23% contained (terrifying)
- After circuit breakers: 94% contained
- Goal: >90%
94%! That means 94 out of 100 dependency failures stopped at the circuit breaker instead of cascading through the entire system.
5. User Experience Preservation
Did users actually notice?
Our measurement:
- Zero-result pages: 97% reduction
- Error page views: 89% reduction
- Conversion rate impact: -3% (versus -47% without fallbacks)
Those fallback strategies? They preserved user experience. Most customers never even knew dependencies were failing.
The Real Production Numbers (18 Months Later)
After running circuit breakers in production for a year and a half:
Incidents prevented:
- Major cascades: 23
- Partial outages: 142
- Total incident reduction: 87%
Financial impact:
- Downtime prevented: 247 hours
- Revenue preserved: $8.4 million (still can’t believe this number)
- Support cost reduction: $340K/year
Engineering impact:
- Incident response time: 73% reduction
- On-call burden: 68% reduction
- Sleep quality: Priceless (no joke, people actually sleep now)
The circuit breakers paid for themselves 47 times over in the first year alone. 47 times!
Observability (Because Invisible Failures Are Still Failures)
Circuit breakers are invisible when they’re working correctly. Which is great for users but terrible for operators. We added comprehensive observability:
type CircuitMetrics struct {
    state            prometheus.Gauge     // Current circuit state (0-4)
    requests         prometheus.Counter   // Total requests attempted
    failures         prometheus.Counter   // Total failures recorded
    circuitOpens     prometheus.Counter   // How many times the circuit opened
    halfOpenAttempts prometheus.Counter   // Recovery attempts in half-open
    fallbacksUsed    prometheus.Counter   // Times we used a fallback strategy
    recoveryTime     prometheus.Histogram // Distribution of recovery times
}

func (cb *CircuitBreaker) recordMetrics() {
    cb.metrics.state.Set(float64(cb.state)) // Current state as a gauge value
    cb.metrics.recoveryTime.Observe(        // How long recovery took:
        time.Since(cb.lastOpenTime).Seconds()) // seconds since the circuit opened
}
Our Grafana dashboard shows:
- Real-time circuit state (by service, by tenant)
- Failure rate trending
- Recovery pattern analysis
- Fallback usage distribution
This observability caught problems BEFORE customers noticed. We’d see a circuit flapping between closed and half-open — that’s a sign of dependency instability. We could fix the root cause before a full outage.
When You Actually Need This
Not every system needs circuit breakers. Like… if you’re building a single-server blog, this is overkill. Here’s my decision framework:
Must Have Circuit Breakers:
- Your service depends on external APIs
- Downstream failures happen regularly (>1/month)
- Cascading failures are possible (microservices architecture)
- User experience during outages actually matters to your business
Nice to Have:
- Microservices architecture
- Multiple failure domains
- SLA commitments to customers
- Multi-tenant system
Skip If:
- Monolithic application with no external deps
- Failures are instantly fatal anyway (can’t recover gracefully)
- System complexity is already overwhelming (add this later)
- You have fewer than 1,000 requests/day (not worth the complexity)
The Anti-Patterns We Discovered (Painfully)
Anti-Pattern #1: Too Aggressive. Opening the circuit after just 3 failures in any timeframe. Result: constant false positives, availability tanks.
Anti-Pattern #2: Too Conservative. Never opening the circuit, just retrying forever. Result: cascades happen anyway, you've gained nothing.
Anti-Pattern #3: No Fallbacks. Opening the circuit but returning raw errors to users. Result: technically working but a terrible user experience.
Anti-Pattern #4: Silent Failures. The circuit opens but no alerts fire. Result: nobody knows until customers start complaining on Twitter.
Anti-Pattern #5: Shared State. One circuit breaker instance shared across all goroutines without proper locking. Result: race conditions, incorrect counts, chaos.
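For #5 the fix is mundane: every read or write of shared breaker state goes through the mutex. A sketch of a race-free recordSuccess(), assuming the struct from earlier:

func (cb *CircuitBreaker) recordSuccess() {
    cb.mu.Lock() // Serialize with canAttempt and recordResult
    defer cb.mu.Unlock()
    cb.failureCount = 0 // A success resets the consecutive-failure count
    cb.successCount++
    if cb.state == HalfOpen && cb.successCount >= cb.halfOpenMax {
        cb.state = Closed // Enough successful probes: fully recover
    }
}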
The Operational Reality Nobody Talks About
Circuit breakers add operational complexity. Let’s be honest about it:
New failure modes we encountered:
- Circuit stuck open after dependency recovered (had to add manual override)
- Fallback cache expiration during extended outage
- Half-open state memory leaks (we had one, it was subtle)
Debugging challenges:
- “Why did the circuit open?” (needed better logging)
- “Why won’t it close?” (usually stuck in half-open with failures)
- “Is the fallback data stale?” (added staleness metrics)
Maintenance overhead:
- 2–3 hours/month tuning thresholds
- Quarterly review of fallback strategies
- Weekly circuit breaker dashboard review
But you know what? This overhead is TINY compared to firefighting cascading failures at 3 AM on a Saturday. I’ll take predictable maintenance over chaotic incident response every single time.
Two Years Later
- System-wide outages: 94% reduction
- Mean time to recovery: 71% improvement
- Customer satisfaction: up 23 points
- Engineering confidence: "Much higher" (team survey — people actually said this)
- Estimated revenue protected: $14.7 million
The most unexpected benefit? Psychological safety. Before circuit breakers, deploying changes was absolutely terrifying. One bug in a dependency integration could take down the entire platform. With circuit breakers, engineers knew failures would be contained. Feature velocity increased 34% because fear of deployment decreased.
That’s huge. People stopped being afraid to ship.
The lesson I keep coming back to: resilient systems aren’t about preventing failures. They’re about limiting blast radius. Circuit breakers don’t stop dependencies from failing — they’re GOING to fail, that’s just reality. But circuit breakers stop those failures from destroying everything else.
When your payment processor crashes at 3:47 AM (and it will), your product catalog should keep working. Your login flow should keep working. Your marketing site should absolutely keep working. Circuit breakers make this possible.
Fail fast. Fail friendly. Fail isolated. That’s how you build systems that survive the chaos of production.
Enjoyed the read? Let’s stay connected!
- 🚀 Follow The Speed Engineer for more Rust, Go and high-performance engineering stories.
- 💡 Like this article? Follow for daily speed-engineering benchmarks and tactics.
- ⚡ Stay ahead in Rust and Go — follow for a fresh article every morning & night.
Your support means the world and helps me create more content you’ll love. ❤️