Dylan Dumont

Posted on Apr 27

Bulkhead vs Circuit Breaker: Choosing the Right Fault Isolation Strategy

#architecture #systems #patterns #distributed

Stop your entire system from collapsing because one microservice is choking.

What We're Building

We are designing a distributed system where dependencies inevitably fail. The goal is to contain each failure before it cascades into an outage. This article contrasts the Circuit Breaker pattern, which stops calling a dependency that keeps failing, with the Bulkhead pattern, which caps the resources any one subsystem can consume. We will sketch both strategies in Go using its built-in concurrency primitives. Distinguishing failing fast from resource isolation is what lets you pick the right tradeoff for your infrastructure.

Step 1 — Visualizing Cascading Failure

Cascading failure occurs when one service's overload consumes system-wide resources like threads or bandwidth. Understanding the flow is the first defense.

[Healthy Service]
      |
      v
[Overloaded Service] --> [System Threads]
      |                    ^
      |---------------------|
      |                    |
   [Cascade to DB]     [Thread Pool Starvation]

Without isolation, a spike in load to one dependency drains the shared pool, and healthy paths starve. The diagram shows why a single failing dependency must not be allowed to consume a shared resource such as a thread pool. The distinction matters because preventing resource starvation is a different problem from stopping repeated errors from propagating.
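The starvation effect can be reproduced in a few lines. The following is a minimal sketch, assuming a pool modeled as a buffered channel; the names `pool`, `tryAcquire`, and `healthyCallSucceeds` are illustrative and not part of any library.

```go
package main

import "fmt"

// pool is a counting semaphore backed by a buffered channel.
type pool chan struct{}

// tryAcquire takes a slot without blocking and reports whether it succeeded.
func (p pool) tryAcquire() bool {
	select {
	case p <- struct{}{}:
		return true
	default:
		return false
	}
}

func (p pool) release() { <-p }

// healthyCallSucceeds reports whether the healthy path can get a slot from
// healthyPool while stuck calls to a slow dependency hold every slot of
// slowPool. Passing the same pool twice models a shared pool.
func healthyCallSucceeds(slowPool, healthyPool pool) bool {
	// Simulate stuck requests: fill the slow dependency's pool and hold the slots.
	held := 0
	for slowPool.tryAcquire() {
		held++
	}
	defer func() {
		for i := 0; i < held; i++ {
			slowPool.release()
		}
	}()

	if !healthyPool.tryAcquire() {
		return false // starved: no slot left for the healthy path
	}
	healthyPool.release()
	return true
}

func main() {
	shared := make(pool, 2)
	fmt.Println("shared pool:", healthyCallSucceeds(shared, shared)) // shared pool: false

	fmt.Println("separate pools:", healthyCallSucceeds(make(pool, 2), make(pool, 2))) // separate pools: true
}
```

With one shared pool the healthy call is rejected; with separate pools it goes through even though the slow dependency is saturated.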

Step 2 — Implementing a Circuit Breaker

A Circuit Breaker detects repeated failures and opens the circuit to bypass the failing downstream service. In Go, we simulate this with state tracking and timeouts.

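import (
    "context"
    "errors"
    "sync"
    "time"
)

type State int

const (
    StateClosed State = iota
    StateOpen
    StateHalfOpen
)

type CircuitBreaker struct {
    mu               sync.Mutex
    failureThreshold uint
    resetTimeout     time.Duration
    state            State     // StateClosed, StateOpen, StateHalfOpen
    failures         uint      // consecutive failures
    lastTrip         time.Time // when the circuit last opened
}

// Execute runs fn unless the circuit is open. ctx is unused in this sketch; fn should honor its own deadline.
func (c *CircuitBreaker) Execute(ctx context.Context, fn func() error) error {
    c.mu.Lock()
    if c.state == StateOpen {
        if time.Since(c.lastTrip) < c.resetTimeout {
            c.mu.Unlock()
            return errors.New("circuit is open") // fail fast while the dependency recovers
        }
        c.state = StateHalfOpen // reset timeout elapsed: let a trial call through
    }
    c.mu.Unlock()
    err := fn() // run the call without holding the lock
    c.mu.Lock()
    defer c.mu.Unlock()
    if err != nil {
        c.failures++
        // A failed trial call reopens the circuit immediately.
        if c.state == StateHalfOpen || c.failures >= c.failureThreshold {
            c.state = StateOpen
            c.lastTrip = time.Now()
        }
        return err
    }
    c.failures = 0
    c.state = StateClosed
    return nil
}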

This implementation uses state tracking rather than a library dependency to demonstrate the core logic. Writing the struct by hand exposes the reset timeout and failure threshold mechanics. That matters because third-party breaker libraries often hide the timing logic you need for tuning (Go's standard library does not ship a circuit breaker).
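One way to make the timing mechanics observable while tuning is to inject the clock. The sketch below is a deliberately reduced, single-goroutine breaker, not a drop-in replacement for the struct above; the `now` field, the `breaker` type, and the `lifecycle` helper are assumptions made for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

var errOpen = errors.New("circuit is open")

// breaker is a single-goroutine sketch; concurrent use would need a mutex.
type breaker struct {
	threshold    int
	resetTimeout time.Duration
	now          func() time.Time // injectable clock (hypothetical, for tests)
	failures     int
	open         bool
	trippedAt    time.Time
}

func (b *breaker) call(fn func() error) error {
	if b.open && b.now().Sub(b.trippedAt) < b.resetTimeout {
		return errOpen // rejected without calling fn
	}
	// Either the circuit is closed, or the timeout elapsed and this is the trial call.
	if err := fn(); err != nil {
		b.failures++
		if b.open || b.failures >= b.threshold {
			b.open = true
			b.trippedAt = b.now()
		}
		return err
	}
	b.failures = 0
	b.open = false
	return nil
}

// lifecycle drives a breaker through trip, rejection, and recovery with a fake clock.
func lifecycle() []string {
	clock := time.Unix(0, 0)
	b := &breaker{
		threshold:    2,
		resetTimeout: 10 * time.Second,
		now:          func() time.Time { return clock },
	}
	fail := func() error { return errors.New("boom") }
	ok := func() error { return nil }

	var log []string
	record := func(err error) {
		if err != nil {
			log = append(log, err.Error())
			return
		}
		log = append(log, "ok")
	}

	record(b.call(fail))                // first failure
	record(b.call(fail))                // second failure trips the breaker
	record(b.call(ok))                  // rejected: still inside the reset timeout
	clock = clock.Add(11 * time.Second) // advance the fake clock past the timeout
	record(b.call(ok))                  // trial call succeeds and closes the circuit
	record(b.call(ok))
	return log
}

func main() {
	fmt.Println(lifecycle()) // [boom boom circuit is open ok ok]
}
```

Because the clock is a plain function, a test can step through the whole open and recover cycle without sleeping.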

Step 3 — Conceptualizing the Bulkhead

A Bulkhead pattern limits resource access per service using logical barriers, like thread pools or semaphores. It does not stop failures; it stops resource exhaustion from affecting unrelated paths.

[Main Pool]          [Pool A]          [Pool B]
      |                |                |
  Service 1       Service 2      Service 3 (External DB)

This diagram shows logical isolation where a spike in Service 3 cannot exhaust the Main Pool. Allocating separate execution pools or connection limits for specific dependencies is the core concept here. It matters because circuit breakers protect against errors, but bulkheads protect against resource exhaustion.

Step 4 — Implementing Bulkhead Isolation

In Go, we use a semaphore-based pool to limit concurrency per service group. We define specific limits per dependency group rather than a global limit.

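import "errors"

var ErrBulkheadFull = errors.New("bulkhead is full")

type Bulkhead struct {
    // One buffered channel per group; its capacity is that group's limit.
    slots map[string]chan struct{}
}

func NewBulkhead(maxConcurrentPerGroup map[string]uint) *Bulkhead {
    b := &Bulkhead{slots: make(map[string]chan struct{})}
    for key, limit := range maxConcurrentPerGroup {
        b.slots[key] = make(chan struct{}, limit)
    }
    return b
}

// Acquire takes a slot without blocking. Full or unknown groups are rejected.
func (b *Bulkhead) Acquire(key string) error {
    select {
    case b.slots[key] <- struct{}{}:
        return nil
    default:
        return ErrBulkheadFull
    }
}

func (b *Bulkhead) ExecuteWithBulkhead(key string, fn func() error) error {
    if err := b.Acquire(key); err != nil {
        return err
    }
    defer func() { <-b.slots[key] }() // release the slot when fn returns
    return fn()
}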

We use a map to track distinct semaphore limits keyed by the service identifier. Defining a limit per service rather than one global limit lets the system degrade partially instead of failing as a whole. This choice matters because a single global pool treats cheap and expensive dependencies alike, so one slow dependency can crowd out all the others.
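Rejecting immediately is one policy; the other is to wait for a slot, but only for a bounded time. The following sketch shows that variant for a single group's semaphore, assuming a buffered channel as the semaphore. The helper names `acquire` and `demo` and the 50 ms timeout are illustrative choices.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// acquire waits for a free slot in sem until ctx is cancelled or times out.
func acquire(ctx context.Context, sem chan struct{}) error {
	select {
	case sem <- struct{}{}:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// acquireWithin wraps acquire with a timeout so callers never wait forever.
func acquireWithin(sem chan struct{}, d time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), d)
	defer cancel()
	return acquire(ctx, sem)
}

// demo acquires the only slot, fails to get a second one, then succeeds
// again after the slot is released.
func demo() (first, second, third error) {
	sem := make(chan struct{}, 1) // limit of 1 for one hypothetical group

	first = acquireWithin(sem, 50*time.Millisecond)  // slot is free
	second = acquireWithin(sem, 50*time.Millisecond) // pool is full: times out
	<-sem                                            // release the slot
	third = acquireWithin(sem, 50*time.Millisecond)  // slot is free again
	return first, second, third
}

func main() {
	first, second, third := demo()
	fmt.Println(first)  // <nil>
	fmt.Println(second) // context deadline exceeded
	fmt.Println(third)  // <nil>
}
```

A bounded wait smooths over short bursts, at the cost of holding the caller's goroutine while it queues; the wait budget should stay well below the caller's own deadline.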

Step 5 — Combining Both Strategies

Production systems often require both patterns for different layers. You might use Circuit Breakers for external API calls and Bulkheads for internal worker pools.

[Request]
   |
   v
[Bulkhead Pool] --> [Circuit Breaker] --> [External API]
   |
   | [Circuit Breaker]
   | [Bulkhead Pool]
   v
[Internal DB]

Applying both keeps resource contention and error propagation from compounding each other under high load. You might handle connection limits in the Bulkhead and timeout and failure limits in the Circuit Breaker.
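The layering in the diagram can be expressed as two decorators around a call. This is a simplified sketch, assuming a bulkhead that rejects when full and a breaker without recovery or locking; `withBulkhead`, `withBreaker`, and `alwaysFail` are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
)

var (
	errBulkheadFull = errors.New("bulkhead full")
	errCircuitOpen  = errors.New("circuit open")
	errUpstream     = errors.New("upstream 503")
)

type call func() error

// withBulkhead rejects calls once cap(sem) calls are already in flight.
func withBulkhead(sem chan struct{}, next call) call {
	return func() error {
		select {
		case sem <- struct{}{}:
			defer func() { <-sem }()
			return next()
		default:
			return errBulkheadFull
		}
	}
}

// withBreaker opens after threshold consecutive failures. Recovery (the
// half-open state) and locking are omitted to keep the sketch short.
func withBreaker(threshold int, next call) call {
	failures := 0
	return func() error {
		if failures >= threshold {
			return errCircuitOpen
		}
		if err := next(); err != nil {
			failures++
			return err
		}
		failures = 0
		return nil
	}
}

// alwaysFail stands in for a dependency that is down.
func alwaysFail() error { return errUpstream }

func main() {
	sem := make(chan struct{}, 2)
	// Bulkhead on the outside, breaker next to the dependency.
	guarded := withBulkhead(sem, withBreaker(2, alwaysFail))
	for i := 0; i < 3; i++ {
		fmt.Println(guarded())
	}
	// Prints "upstream 503" twice, then "circuit open".
}
```

With the bulkhead outermost, even a breaker rejection briefly holds a slot. Swapping the order would let an open breaker reject before any slot is taken, which may suit dependencies with very small pools.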

Key Takeaways

Circuit Breakers: protect against a repeatedly failing dependency by rejecting calls once failures cross a threshold.
Bulkheads: protect against resource starvation by limiting concurrent execution per group.
Combine strategies: use bulkheads for internal resource limits and breakers for external network dependencies.
Monitor metrics: track both failure rates and semaphore wait times to validate effectiveness.

What's Next?

Implementing these patterns is only the beginning. Your next priority should be observability to detect threshold breaches before they impact users. Consider implementing metrics for failure rates and semaphore wait times to validate effectiveness. Finally, explore retry logic that is safe for idempotent operations to complement the breakers.
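As a starting point for those metrics, the sketch below counts calls and failures and accumulates the time spent waiting for a semaphore slot. It assumes Go 1.19 or later for the `atomic.Int64` type; the `metrics` struct and `observe` method are hypothetical, and a real system would export the values to its monitoring stack rather than print them.

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
	"time"
)

var errDemo = errors.New("demo failure")

// metrics holds in-process counters that are safe for concurrent use.
type metrics struct {
	calls     atomic.Int64
	failures  atomic.Int64
	waitNanos atomic.Int64 // total time spent waiting for a slot
}

// observe waits for a slot in sem, runs fn, and records the outcome.
func (m *metrics) observe(sem chan struct{}, fn func() error) error {
	start := time.Now()
	sem <- struct{}{} // blocks while the bulkhead is full
	m.waitNanos.Add(int64(time.Since(start)))
	defer func() { <-sem }()

	m.calls.Add(1)
	err := fn()
	if err != nil {
		m.failures.Add(1)
	}
	return err
}

// failureRate returns failures divided by calls, or 0 before any call.
func (m *metrics) failureRate() float64 {
	calls := m.calls.Load()
	if calls == 0 {
		return 0
	}
	return float64(m.failures.Load()) / float64(calls)
}

func main() {
	m := &metrics{}
	sem := make(chan struct{}, 2)
	for i := 0; i < 4; i++ {
		shouldFail := i == 3
		_ = m.observe(sem, func() error {
			if shouldFail {
				return errDemo
			}
			return nil
		})
	}
	fmt.Printf("failure rate: %.2f\n", m.failureRate()) // failure rate: 0.25
	fmt.Println("total wait:", time.Duration(m.waitNanos.Load()))
}
```

A rising wait total with a flat failure rate points at an undersized bulkhead, while the reverse points at the dependency itself.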

DEV Community