Rajkiran

Posted on Jun 12

System Design - 18. Fault Tolerance Patterns: Circuit Breakers, Bulkheads, and the Art of Failing Gracefully

#software #distributedsystems #systemdesign #microsoft

Covers: Circuit Breaker, Retry + Exponential Backoff + Jitter, Bulkhead, Timeout, Fallback, Redundancy

The Titanic's Bulkheads (And Why They Failed)

The RMS Titanic was designed with 16 watertight compartments separated by bulkheads. The idea: if the hull was breached, water would flood only the affected compartments, and the ship would stay afloat.

The fatal flaw: the bulkheads didn't extend high enough. Water flooding one compartment spilled over the top into the next, and the next, and the next. The isolation that was supposed to contain the damage didn't — because the walls were too short.

This is, almost too perfectly, the story of fault tolerance in distributed systems. The patterns exist. Teams implement them. But if implemented incompletely — bulkheads "too short" — a single failure cascades through the entire system anyway.

Today we cover the six patterns that, implemented correctly and together, are the difference between "one service had a bad day" and "the entire platform went down."

Why Failures Cascade: The Mechanism

Before the patterns, understand the failure mode they prevent. This is the cascading failure scenario from Day 2, now in full mechanical detail:

Step 1: Payment Service becomes slow (database under load, 5 seconds per call instead of 50ms)

Step 2: Order Service calls Payment Service, waits...
  Order Service has a thread pool of 100 threads
  Each call to Payment Service now holds a thread for 5 seconds (instead of 50ms)
  100x more threads are tied up per unit time

Step 3: Order Service's thread pool exhausts
  All 100 threads are blocked waiting on Payment Service
  New incoming requests to Order Service have no threads available
  Order Service starts rejecting/timing out ALL requests — 
  even ones that don't need Payment Service!

Step 4: Services calling Order Service experience the same problem
  Checkout Service calls Order Service → also times out
  Checkout Service's thread pool exhausts

Step 5: Cascade continues upward through the entire call graph
  The ENTIRE platform becomes unresponsive — 
  because ONE service (Payment) got slow.

The root cause: A slow dependency consumed a shared resource (threads) needed for unrelated operations. The fault tolerance patterns all attack this mechanism from different angles.
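To put numbers on the mechanism, here is a back-of-the-envelope sketch using the figures from the scenario above (100 threads, 50ms vs 5 seconds per call). The function name is made up for illustration, and the model assumes every request holds exactly one thread for the full duration of the downstream call.

```python
def max_throughput_rps(pool_size: int, latency_ms: int) -> float:
    """Upper bound on requests/second when each request holds one thread for latency_ms."""
    return pool_size * 1000 / latency_ms


healthy_rps = max_throughput_rps(pool_size=100, latency_ms=50)     # 2000.0
degraded_rps = max_throughput_rps(pool_size=100, latency_ms=5000)  # 20.0
print(healthy_rps, degraded_rps, healthy_rps / degraded_rps)
```

Under these assumptions the same pool that could serve 2000 requests/second can only serve 20 once Payment Service slows down. Any traffic above that rate queues up or gets rejected, which is Step 3 of the cascade.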

Pattern 1: Timeout — Never Wait Forever

The simplest, most fundamental pattern — and the one most commonly missing.

import requests

# WITHOUT timeout (dangerous default in many HTTP libraries, including requests)
response = requests.get("http://payment-service/charge")
# If payment-service hangs, this line waits FOREVER

# WITH timeout
response = requests.get("http://payment-service/charge", timeout=2.0)
# If the server sends nothing for 2 seconds, raises requests.exceptions.Timeout

Why this matters so much: Without a timeout, a hung dependency holds your thread indefinitely. With a timeout, the worst case is bounded — your thread is freed after 2 seconds, available for other work.

Choosing timeout values:

Too short: legitimate slow requests get cancelled unnecessarily
           (false failures under normal load spikes)

Too long:  threads tied up too long during real failures
           (cascading failure mechanism still triggers, just slower)

Rule of thumb: set timeout based on p99 latency of the dependency
  If p99 latency is 200ms → timeout at 500ms-1s
  Gives headroom for normal variance, fails fast for genuine hangs
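The rule of thumb above can be sketched as a small helper. The function names are hypothetical, and the 2.5x and 5x multipliers are simply read off the 200ms → 500ms-1s example, so treat them as a starting point rather than a standard.

```python
def p99_ms(samples_ms):
    """99th-percentile latency using the nearest-rank method."""
    if not samples_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples_ms)
    rank = (99 * len(ordered) + 99) // 100  # ceil(0.99 * n) in integer math
    return ordered[rank - 1]


def suggested_timeout_range_ms(samples_ms, low_factor=2.5, high_factor=5.0):
    """Return a (low, high) timeout range as multiples of the observed p99."""
    p99 = p99_ms(samples_ms)
    return p99 * low_factor, p99 * high_factor


# 98 fast calls, one at 200ms, and one 3-second outlier
samples = [50] * 98 + [200, 3000]
print(p99_ms(samples))                      # 200
print(suggested_timeout_range_ms(samples))  # (500.0, 1000.0)
```

Note that the single 3-second outlier does not move the p99, which is the reason to base the timeout on a percentile rather than on the maximum observed latency.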

Critical detail: Timeouts must be set at every layer — HTTP client, database driver, connection pool acquisition. A common bug: setting an HTTP timeout but the underlying TCP connection pool has no timeout, so threads still hang waiting for a connection from the pool.
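As an illustration of the pool-acquisition layer, here is a minimal sketch of a connection pool whose acquire call is bounded by a timeout. `BoundedPool` and `PoolTimeout` are hypothetical names, not part of any real driver; real clients usually expose an equivalent setting under their own name.

```python
import queue


class PoolTimeout(Exception):
    """Raised when no connection becomes free within the allowed wait."""


class BoundedPool:
    def __init__(self, connections):
        self._idle = queue.Queue()
        for conn in connections:
            self._idle.put(conn)

    def acquire(self, timeout_s):
        """Wait at most timeout_s for a free connection instead of blocking forever."""
        try:
            return self._idle.get(timeout=timeout_s)
        except queue.Empty:
            raise PoolTimeout(f"no connection available after {timeout_s}s") from None

    def release(self, conn):
        self._idle.put(conn)
```

With a one-connection pool, a second `acquire(timeout_s=0.05)` raises `PoolTimeout` after about 50ms while the first connection is still checked out, so the calling thread is freed instead of hanging behind a stuck request.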

Pattern 2: Retry with Exponential Backoff + Jitter

Transient failures (brief network blip, momentary overload) often succeed on retry. But naive retries can make things worse.

The Naive (Dangerous) Retry

import time
import requests

def call_with_retry(url):
    for attempt in range(5):
        try:
            return requests.get(url, timeout=1)
        except requests.RequestException:
            time.sleep(1)  # wait 1 second, retry
    raise Exception("All retries failed")

The problem: If Payment Service is overloaded and returning errors, and 1000 clients are all retrying every 1 second... you've just created a synchronized retry storm. Every client retries at the same intervals, hammering the already-struggling service in waves, preventing it from ever recovering.

Exponential Backoff

Increase the wait time between retries exponentially:

import time
import requests

def call_with_exponential_backoff(url, max_retries=5):
    for attempt in range(max_retries):
        try:
            return requests.get(url, timeout=1)
        except requests.RequestException:
            wait_time = 2 ** attempt  # 1s, 2s, 4s, 8s, 16s
            time.sleep(wait_time)
    raise Exception("All retries failed")

This gives the failing service progressively more breathing room. But there's still a problem.

Why Jitter Is Critical

Imagine 1000 clients all start their first retry at the same moment (because they all called at the same moment and all failed at the same moment). With pure exponential backoff:

All 1000 clients wait exactly 1s, then 2s, then 4s, 8s, 16s...
→ Still synchronized! All 1000 hit the service AGAIN 1s after the failure,
  then AGAIN 2s after that, and so on.
→ The "thundering herd" pattern from caching (Day 3), but for retries

Jitter adds randomness to break synchronization:

import random
import time
import requests

def call_with_backoff_and_jitter(url, max_retries=5):
    for attempt in range(max_retries):
        try:
            return requests.get(url, timeout=1)
        except requests.RequestException:
            base_wait = 2 ** attempt
            # Full jitter: sleep a random duration in [0, base_wait]
            # (the random value replaces the fixed wait, it is not added to it)
            wait_time = random.uniform(0, base_wait)
            time.sleep(wait_time)
    raise Exception("All retries failed")

# Client A waits: 0.3s, 1.8s, 3.1s, ...
# Client B waits: 0.9s, 0.2s, 2.7s, ...
# Client C waits: 0.1s, 0.4s, 3.9s, ...
# → Retries spread out over time, not synchronized
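A tiny simulation makes the difference visible. It models one retry round for clients that all failed at the same instant and counts how many retries land in the busiest 100ms window. The function name, the 100ms bucket size, and the fixed seed are all illustrative assumptions.

```python
import random
from collections import Counter


def peak_retries_per_bucket(num_clients, attempt, jitter, bucket_s=0.1, seed=42):
    """Largest number of retries that land in the same time bucket."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    base_wait = 2 ** attempt
    buckets = Counter()
    for _ in range(num_clients):
        wait = rng.uniform(0, base_wait) if jitter else base_wait
        buckets[int(wait / bucket_s)] += 1
    return max(buckets.values())


print(peak_retries_per_bucket(1000, attempt=0, jitter=False))  # 1000: everyone retries together
print(peak_retries_per_bucket(1000, attempt=0, jitter=True))   # far fewer per 100ms window
```

Without jitter the service sees all 1000 retries in a single window. With jitter the same 1000 retries are spread across the whole backoff interval, so the busiest window holds only a fraction of them.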

AWS's recommended "full jitter" formula:

wait_time = random.uniform(0, min(cap, base * (2 ** attempt)))
# cap = maximum wait time regardless of attempt number (e.g., 60s)
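Wrapped in functions, the formula looks like the sketch below. The helper names and the defaults (`base=1.0`, `cap=60.0`) are assumptions chosen to match the 60s example in the comment above.

```python
import random


def backoff_ceiling(attempt, base=1.0, cap=60.0):
    """Upper bound of the wait for a given attempt: exponential, but never above cap."""
    return min(cap, base * (2 ** attempt))


def full_jitter_delay(attempt, base=1.0, cap=60.0):
    """Full jitter: pick a uniformly random wait between 0 and the capped ceiling."""
    return random.uniform(0, backoff_ceiling(attempt, base, cap))


print([backoff_ceiling(a) for a in range(8)])
# [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 60.0, 60.0]
```

The cap matters once the retry count grows: without it, attempt 10 could wait up to 1024 seconds, which is usually longer than anyone wants a client to sit idle.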

The interview answer: "Exponential backoff prevents hammering a recovering service with the same frequency. Jitter prevents synchronized retry storms across many clients. You need both — backoff alone still produces thundering herds at scale."

What NOT to retry: most 4xx errors (client errors: retrying a malformed request won't fix it, though 429 Too Many Requests and 408 Request Timeout are the usual exceptions), and non-idempotent operations without an idempotency key (retrying a payment charge could double-charge; tie back to Day 5's Saga pattern).
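One way to encode the retry rules is a small decision function. This is a sketch: the function name is hypothetical, and the exact set of retryable status codes (including treating 408 and 429 as retryable 4xx responses) is an assumption that varies between teams and APIs.

```python
# Assumed policy: transient server-side errors plus the two 4xx codes that signal "try again later"
RETRYABLE_STATUS = {408, 429, 500, 502, 503, 504}
# HTTP methods defined as idempotent, so repeating them is safe by definition
IDEMPOTENT_METHODS = {"GET", "HEAD", "PUT", "DELETE", "OPTIONS"}


def should_retry(method, status_code, has_idempotency_key=False):
    """Retry only transient failures, and only when repeating the call is safe."""
    if status_code not in RETRYABLE_STATUS:
        return False  # e.g. 400 or 404: the same request will fail the same way
    return method.upper() in IDEMPOTENT_METHODS or has_idempotency_key


print(should_retry("GET", 503))                             # True
print(should_retry("GET", 400))                             # False
print(should_retry("POST", 503))                            # False
print(should_retry("POST", 503, has_idempotency_key=True))  # True
```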

Pattern 3: Circuit Breaker — Stop Calling What's Broken

If a dependency is consistently failing, continuing to call it — even with retries — wastes resources and prolongs the cascade. The Circuit Breaker pattern (named after electrical circuit breakers) stops calls entirely when a dependency is unhealthy.

The Three States

                    ┌─────────────────┐
        ┌──────────►│      CLOSED      │ (normal operation)
        │           │  Requests pass    │
        │           │  through normally │
        │           └─────────┬────────┘
        │                     │
        │      Failure rate exceeds threshold
        │      (e.g., 50% failures in 10s)
        │                     │
        │                     ▼
        │           ┌──────────────────┐
        │           │       OPEN        │ (failing fast)
        │           │  Requests fail    │
        │           │  IMMEDIATELY,     │
   Success          │  no call made     │
   threshold        └─────────┬────────┘
   reached                     │
        │             After timeout period
        │             (e.g., 30 seconds)
        │                     │
        │                     ▼
        │           ┌──────────────────┐
        └───────────┤    HALF-OPEN      │ (testing recovery)
                     │  Allow LIMITED    │
                     │  requests through │
                     │  to test if fixed │
                     └─────────┬────────┘
                                │
                       If test requests fail
                       → back to OPEN

CLOSED (normal): Requests flow through normally. The breaker monitors the failure rate.

OPEN (failing fast): Once the failure rate crosses a threshold, the breaker "trips." All subsequent requests fail immediately — without even attempting the network call. This is the key insight: failing fast and locally is far better than waiting for a timeout on every request to a known-broken service.

HALF-OPEN (testing recovery): After a cooldown period, the breaker allows a small number of test requests through. If they succeed, the breaker closes (back to normal). If they fail, it reopens (back to failing fast) and waits again.

Implementation Sketch

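import time

class CircuitOpenException(Exception):
    """Raised when the breaker is OPEN and the call is rejected without being attempted."""

class CircuitBreaker:
    # Note: a single-threaded sketch; a shared instance would need a lock
    def __init__(self, failure_threshold=5, recovery_timeout=30):
        self.failure_threshold = failure_threshold  # consecutive failures before tripping
        self.recovery_timeout = recovery_timeout    # seconds to stay OPEN before probing
        self.failure_count = 0
        self.state = "CLOSED"
        self.last_failure_time = None

    def call(self, func, *args, **kwargs):
        if self.state == "OPEN":
            if time.monotonic() - self.last_failure_time > self.recovery_timeout:
                self.state = "HALF_OPEN"  # cooldown elapsed: let a test request through
            else:
                raise CircuitOpenException("Circuit is OPEN, failing fast")

        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failure_count += 1
            self.last_failure_time = time.monotonic()
            # A failed probe in HALF_OPEN reopens at once; otherwise trip at the threshold
            if self.state == "HALF_OPEN" or self.failure_count >= self.failure_threshold:
                self.state = "OPEN"
            raise  # re-raise with the original traceback

        # Success: reset the count so only consecutive failures trip the breaker
        self.failure_count = 0
        self.state = "CLOSED"  # also confirms recovery when coming from HALF_OPEN
        return result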

Real implementation: Netflix's Hystrix (now in maintenance mode) pioneered this for microservices. Resilience4j is the modern Java successor. Most languages have equivalents (e.g., pybreaker for Python, gobreaker for Go).

Why this matters at scale: If Payment Service is down, and Order Service makes 1000 requests/second to it, without a circuit breaker that's 1000 timeouts/second — each holding a thread for the timeout duration. With a circuit breaker in OPEN state, those 1000 requests fail in microseconds instead — freeing threads immediately, and giving Payment Service room to recover without being hammered.
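The state diagram describes a rate-based trip condition (e.g., 50% failures in 10s), while the implementation sketch counts failures. Below is a minimal rate-based variant under stated assumptions: class and parameter names are hypothetical, a single probe decides the HALF_OPEN outcome, the clock is injectable for testing, and the class is not thread-safe.

```python
import time
from collections import deque


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the breaker is OPEN."""


class RateCircuitBreaker:
    def __init__(self, failure_rate=0.5, window_s=10.0, min_calls=10,
                 recovery_timeout_s=30.0, clock=time.monotonic):
        self.failure_rate = failure_rate
        self.window_s = window_s
        self.min_calls = min_calls  # avoid tripping on a tiny sample
        self.recovery_timeout_s = recovery_timeout_s
        self._clock = clock
        self._outcomes = deque()  # (timestamp, succeeded) pairs inside the window
        self._opened_at = 0.0
        self.state = "CLOSED"

    def call(self, func, *args, **kwargs):
        now = self._clock()
        if self.state == "OPEN":
            if now - self._opened_at < self.recovery_timeout_s:
                raise CircuitOpenError("circuit is open, failing fast")
            self.state = "HALF_OPEN"  # cooldown elapsed: allow one probe
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._record(now, succeeded=False)
            raise
        self._record(now, succeeded=True)
        return result

    def _record(self, now, succeeded):
        if self.state == "HALF_OPEN":
            # The probe alone decides: close on success, reopen on failure
            self._outcomes.clear()
            self.state = "CLOSED" if succeeded else "OPEN"
            self._opened_at = now
            return
        self._outcomes.append((now, succeeded))
        # Drop outcomes that have aged out of the sliding window
        while now - self._outcomes[0][0] > self.window_s:
            self._outcomes.popleft()
        failures = sum(1 for _, ok in self._outcomes if not ok)
        enough_data = len(self._outcomes) >= self.min_calls
        if enough_data and failures / len(self._outcomes) >= self.failure_rate:
            self.state = "OPEN"
            self._opened_at = now
```

The `min_calls` guard is a design choice worth noting: without it, a single failed request right after startup would count as a 100% failure rate and trip the breaker immediately.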

Pattern 4: Bulkhead — Isolate Failure Domains

Named after the bulkheads that divide a ship's hull into compartments, like the Titanic's. The idea: partition resources (thread pools, connection pools) per dependency, so one dependency's failure can't exhaust resources needed for others.

Without Bulkheads (Shared Thread Pool)

Order Service has ONE thread pool of 100 threads, shared by all calls:
  - Calls to Payment Service
  - Calls to Inventory Service
  - Calls to Shipping Service

Payment Service hangs → 80 of 100 threads get stuck waiting on Payment
→ Only 20 threads remain for Inventory and Shipping calls
→ Inventory and Shipping requests queue up, time out
→ Even though Inventory and Shipping are perfectly healthy!

With Bulkheads (Isolated Thread Pools)

Order Service has SEPARATE thread pools per dependency:
  - Payment Service pool:   20 threads
  - Inventory Service pool: 20 threads
  - Shipping Service pool:  20 threads
  - (60 threads total, but partitioned)

Payment Service hangs → all 20 Payment-pool threads get stuck
→ Inventory pool (20 threads) and Shipping pool (20 threads) 
  are COMPLETELY UNAFFECTED
→ Inventory and Shipping requests continue normally

The trade-off: You're "wasting" capacity — if Payment's pool is exhausted but Inventory's pool is idle, you can't dynamically borrow threads. But this rigidity is exactly the point: it guarantees failure containment at the cost of some efficiency.

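Resilience4j bulkhead configuration (this is its semaphore-based bulkhead, which caps concurrent calls per dependency rather than allocating a dedicated thread pool):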

resilience4j.bulkhead:
  instances:
    paymentService:
      maxConcurrentCalls: 20
      maxWaitDuration: 10ms
    inventoryService:
      maxConcurrentCalls: 20
      maxWaitDuration: 10ms
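For readers outside the Java ecosystem, here is a rough Python analogue of the same two settings. It is a sketch built on the standard library's semaphore, not a port of Resilience4j; the class and exception names are hypothetical.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when all slots for this dependency are busy."""


class Bulkhead:
    def __init__(self, max_concurrent_calls, max_wait_s=0.01):
        # One slot per allowed in-flight call to this dependency
        self._slots = threading.BoundedSemaphore(max_concurrent_calls)
        self._max_wait_s = max_wait_s

    def call(self, func, *args, **kwargs):
        # Wait briefly for a free slot, then reject instead of queueing indefinitely
        if not self._slots.acquire(timeout=self._max_wait_s):
            raise BulkheadFullError("bulkhead is full, rejecting call")
        try:
            return func(*args, **kwargs)
        finally:
            self._slots.release()


payment_bulkhead = Bulkhead(max_concurrent_calls=20)
inventory_bulkhead = Bulkhead(max_concurrent_calls=20)
```

Each dependency gets its own instance, so when all 20 Payment slots are stuck, the 21st Payment call is rejected after about 10ms while Inventory calls proceed through their own bulkhead untouched.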

Bulkhead vs Circuit Breaker — the distinction:

Bulkhead prevents resource exhaustion from spreading (isolation)
Circuit Breaker prevents wasted calls to a known-broken dependency (fail-fast)

They're complementary — bulkheads contain the blast radius, circuit breakers reduce wasted effort. Production systems use both together.

Pattern 5: Fallback — Degrade Gracefully

When a dependency is unavailable (circuit open, timeout, or error), what do you return to the user instead of an error?

def get_product_recommendations(user_id):
    try:
        return recommendation_service.get_personalized(user_id)
    except (CircuitOpenException, TimeoutError):
        # Fallback: return generic "trending" recommendations
        # instead of personalized ones
        return cache.get("trending_products")  # cached, always available

Fallback strategies, from best to worst degradation:

1. Cached/stale data
   "Here's your feed from 5 minutes ago" — better than nothing

2. Default/generic response
   "Here are trending products" instead of personalized recommendations

3. Reduced functionality
   "Search is temporarily unavailable, browse by category instead"

4. Queue for later
   "Your request is being processed" — async retry when service recovers

5. Honest error (last resort)
   "This feature is temporarily unavailable" — but the REST of the 
   page still works
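The ordering above can be sketched as a chain that tries each strategy in turn and only reaches the honest error when everything else has failed. The function name and the stub strategies are made up for illustration.

```python
def first_available(strategies, last_resort="This feature is temporarily unavailable"):
    """Try (name, callable) pairs in order; return (name, value) for the first that works."""
    for name, strategy in strategies:
        try:
            return name, strategy()
        except Exception:
            continue  # this level of service is unavailable, degrade one step further
    return "error", last_resort


def personalized():
    raise TimeoutError("recommendation service timed out")  # simulated outage


def cached_feed():
    return ["item-from-5-minutes-ago"]


def trending():
    return ["trending-1", "trending-2"]


source, items = first_available([
    ("personalized", personalized),
    ("cached", cached_feed),
    ("trending", trending),
])
print(source, items)  # cached ['item-from-5-minutes-ago']
```

Returning the name of the strategy that succeeded lets the caller log or measure how often each degradation level is actually served.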

The principle: partial degradation beats total failure. If your product page shows the product, price, and "Add to Cart" — but the "Customers also bought" section silently shows nothing (or cached trending items) because Recommendation Service is down — most users won't even notice. Compare that to the entire page returning a 500 error because one non-critical service failed.

Real example: Amazon's product pages are composed of dozens of independently-loaded widgets (price, reviews, recommendations, "frequently bought together"). Each widget fails independently with its own fallback. A Recommendation Service outage degrades one widget — the rest of the page works perfectly.
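The per-widget idea can be sketched in a few lines. This is an illustration only, not a description of how Amazon implements it; the function and widget names are hypothetical.

```python
def render_page(widgets):
    """widgets maps a name to (render_function, fallback_value).

    Each widget is rendered inside its own try/except, so one failure
    only replaces that widget's content with its fallback.
    """
    page = {}
    for name, (render, fallback) in widgets.items():
        try:
            page[name] = render()
        except Exception:
            page[name] = fallback
    return page


def broken_recommendations():
    raise ConnectionError("recommendation service is down")  # simulated outage


page = render_page({
    "price": (lambda: "$19.99", "Price unavailable"),
    "reviews": (lambda: ["Great product"], []),
    "recommendations": (broken_recommendations, []),
})
print(page)
# {'price': '$19.99', 'reviews': ['Great product'], 'recommendations': []}
```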

Pattern 6: Redundancy — Active-Active vs Active-Passive (Revisited)

From Day 1, but worth reinforcing in the fault tolerance context: redundancy is the foundation that makes the other patterns effective.

If there's only ONE instance of Payment Service:
  Circuit breaker trips → ALL payment requests fail
  (there's nothing else to fall back to)

If there are MULTIPLE instances across availability zones:
  Circuit breaker trips for the unhealthy instance
  Load balancer routes to healthy instances in other AZs
  Payment processing continues — degraded capacity, not total failure
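The multi-instance case can be sketched as client-side failover. In practice a load balancer with health checks usually does this work; the function below, the `unhealthy` set standing in for per-instance circuit breakers, and the instance names are all assumptions for illustration.

```python
class AllInstancesFailed(Exception):
    """Raised when no instance could serve the request."""


def call_with_failover(instances, unhealthy, send):
    """Try each instance in order, skipping those already marked unhealthy."""
    for instance in instances:
        if instance in unhealthy:
            continue  # this instance's breaker is open: skip it without calling
        try:
            return send(instance)
        except ConnectionError:
            unhealthy.add(instance)  # trip the breaker for this instance only
    raise AllInstancesFailed("no healthy instance available")


def send(instance):
    if instance == "payment-az-a":
        raise ConnectionError("az-a is unreachable")  # simulated zone outage
    return f"charged via {instance}"


unhealthy = set()
print(call_with_failover(["payment-az-a", "payment-az-b"], unhealthy, send))
# charged via payment-az-b
```

With a single instance in the list, the same outage raises `AllInstancesFailed`, which is the "nothing else to fall back to" case described above.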

Active-Active redundancy + Circuit Breakers + Bulkheads + Timeouts + Fallbacks together form a complete fault tolerance strategy. Remove any one, and the others are significantly weakened:

Without timeouts → circuit breakers can't detect failures fast enough
Without circuit breakers → retries continue hammering a dead service
Without bulkheads → one dependency's failure exhausts shared resources
Without fallbacks → circuit breaker "fails fast" just means failing faster, still an error to the user
Without redundancy → there's nothing to fail over to

Interview Scenario: "Design a Fault-Tolerant Payment Service Caller"

The complete answer, layering all patterns:

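import random
import time
from concurrent.futures import ThreadPoolExecutor

import requests

# CircuitBreaker and CircuitOpenException come from the Pattern 3 sketch

class PaymentServiceClient:
    def __init__(self, retry_queue):
        # Bulkhead: dedicated thread pool, isolated from other dependencies
        self.executor = ThreadPoolExecutor(max_workers=20)

        # Circuit breaker: stop calling if Payment Service is unhealthy
        self.circuit_breaker = CircuitBreaker(
            failure_threshold=5, 
            recovery_timeout=30
        )

        # Fallback queue: assumed to be any object with enqueue(job_name, payload)
        self.retry_queue = retry_queue

    def charge(self, user_id, amount, idempotency_key):
        try:
            return self.circuit_breaker.call(
                self._charge_in_bulkhead,
                user_id, amount, idempotency_key
            )
        except CircuitOpenException:
            # Fallback: queue for async retry, return "pending" to user
            self.retry_queue.enqueue("retry_payment", {
                "user_id": user_id, 
                "amount": amount, 
                "idempotency_key": idempotency_key
            })
            return {"status": "pending", "message": "Processing your payment"}

    def _charge_in_bulkhead(self, user_id, amount, idempotency_key):
        # Bulkhead: the HTTP work runs on the dedicated pool, so at most
        # 20 threads are ever blocked on Payment Service. Callers still wait
        # on result(); a production bulkhead would also bound that wait.
        future = self.executor.submit(
            self._charge_with_retry, user_id, amount, idempotency_key
        )
        return future.result()

    def _charge_with_retry(self, user_id, amount, idempotency_key):
        for attempt in range(3):
            try:
                # Timeout: never wait forever
                return requests.post(
                    "http://payment-service/charge",
                    json={"user_id": user_id, "amount": amount, 
                          "idempotency_key": idempotency_key},  # idempotent!
                    timeout=2.0
                )
            except (requests.Timeout, requests.ConnectionError):
                if attempt == 2:
                    raise  # out of retries: let the circuit breaker count the failure
                # Exponential backoff + jitter
                wait = random.uniform(0, min(10, 2 ** attempt))
                time.sleep(wait)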

This single code sample demonstrates: timeout, retry with backoff+jitter, idempotency (from Day 5), circuit breaker, bulkhead (separate executor), and fallback (queue for later). This is what "Top 1%" looks like in an interview — not naming the patterns, but composing them correctly together.

Key Takeaways

Cascading failures happen because a slow dependency consumes shared resources (threads) needed for unrelated work.
Timeout: never wait forever. Set based on p99 latency of the dependency, with headroom.
Retry with exponential backoff + jitter: backoff gives the dependency breathing room; jitter prevents synchronized retry storms across clients. Never retry non-idempotent operations without idempotency keys.
Circuit breaker: CLOSED → OPEN → HALF-OPEN. Fail fast locally instead of waiting for timeouts on a known-broken dependency.
Bulkhead: isolate thread/connection pools per dependency so one failure can't exhaust resources needed by others.
Fallback: degrade gracefully — cached data, generic defaults, reduced functionality — partial degradation beats total failure.
Redundancy (Active-Active) is the foundation — without something to fail over to, the other patterns just fail "faster," not "better."
All patterns work together. Removing any one significantly weakens the others.

You've now covered the entire microservices infrastructure layer: when and how to split a monolith (Topic 16), how services find each other (Topic 17), and how to keep one failing service from taking down everything else (Topic 18). This is the operational backbone of every production microservices system.

Next, we cover Security and Observability: OAuth2, JWT, the three pillars of observability (metrics, logs, traces), and rate limiting algorithms. The systems that protect your platform and tell you when something's wrong before your users do.

