DEV Community

架构师小白

Circuit Breaker Pattern: Building Resilient Distributed Systems

Circuit Breaker Pattern Deep Guide: Building Resilient Distributed Systems

In distributed systems, a single service failure can cascade and cause the entire system to collapse. The Circuit Breaker Pattern is a core architectural pattern designed specifically to solve this critical problem.

Why Do We Need a Circuit Breaker

Imagine a scenario where your microservice depends on an external payment gateway that normally responds within 100ms. But one day, the payment gateway experiences a failure and the response time jumps to 30 seconds. If you do not have proper protection:

  1. Resource Exhaustion: Requests pile up until the thread pools are exhausted
  2. Cascading Failure: With the payment service unavailable, the order service that depends on it crashes as well
  3. System Avalanche: The entire system collapses within minutes
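
A rough back-of-the-envelope calculation shows why a slow dependency is so damaging. The latencies below are the ones from the scenario above; the pool size of 200 threads is a hypothetical figure chosen only for illustration.

```python
# Assumption for illustration: a worker pool of 200 threads.
POOL_SIZE = 200

def max_throughput(pool_size: int, latency_seconds: float) -> float:
    """Upper bound on requests/second when each request holds one thread
    for the full duration of the downstream call."""
    return pool_size / latency_seconds

healthy = max_throughput(POOL_SIZE, 0.1)    # gateway answers in 100 ms
degraded = max_throughput(POOL_SIZE, 30.0)  # gateway answers in 30 s

print(f"healthy:  {healthy:.0f} req/s")
print(f"degraded: {degraded:.1f} req/s")
```

Under these assumptions the service's capacity drops by a factor of about 300, so any incoming load above a handful of requests per second queues up until the pool is exhausted.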

The core concept is similar to an electrical fuse: when anomalies are detected, the breaker trips quickly to stop the failure from spreading.


Three States of a Circuit Breaker

1. CLOSED (Normal Operation)

  • Normal state: calls are executed as usual
  • Failures are recorded; the breaker transitions to OPEN when the threshold is reached

2. OPEN (Tripped)

  • The service is treated as unavailable; calls fail fast without reaching it
  • Return an error or a fallback response
  • Transition to HALF_OPEN after the recovery timeout

3. HALF_OPEN (Testing)

  • Allow test requests
  • If success → CLOSED
  • If fail → OPEN
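
The three states and their transitions can be written down as a small lookup table. This is only a sketch of the state machine; the event names are illustrative labels, not part of any library.

```python
from enum import Enum

class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"

# Event names are hypothetical labels chosen for this sketch.
TRANSITIONS = {
    (State.CLOSED, "threshold_reached"): State.OPEN,
    (State.OPEN, "timeout_elapsed"): State.HALF_OPEN,
    (State.HALF_OPEN, "probe_succeeded"): State.CLOSED,
    (State.HALF_OPEN, "probe_failed"): State.OPEN,
}

def next_state(state: State, event: str) -> State:
    """Return the next state, or stay put if the event does not apply."""
    return TRANSITIONS.get((state, event), state)
```

Any (state, event) pair missing from the table leaves the state unchanged, which makes it easy to see that CLOSED can never jump straight to HALF_OPEN.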

Core Implementation

import time
import threading
from enum import Enum

class CircuitState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"

class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""

class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=60.0):
        self.failure_threshold = failure_threshold  # consecutive failures before tripping
        self.recovery_timeout = recovery_timeout    # seconds to stay OPEN before probing
        self._failure_count = 0
        self._last_failure_time = None
        self._state = CircuitState.CLOSED
        self._lock = threading.Lock()

    @property
    def state(self):
        with self._lock:
            # Lazily move OPEN -> HALF_OPEN once the recovery timeout has elapsed.
            if self._state == CircuitState.OPEN:
                # time.monotonic() is unaffected by system clock adjustments.
                if time.monotonic() - self._last_failure_time >= self.recovery_timeout:
                    self._state = CircuitState.HALF_OPEN
            return self._state

    def call(self, func, *args, **kwargs):
        if self.state == CircuitState.OPEN:
            # Fail fast without touching the downstream service.
            raise CircuitOpenError("Circuit is OPEN")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        with self._lock:
            self._failure_count = 0
            self._state = CircuitState.CLOSED

    def _on_failure(self):
        with self._lock:
            self._failure_count += 1
            self._last_failure_time = time.monotonic()
            # A failed test request in HALF_OPEN re-opens the circuit immediately.
            if (self._state == CircuitState.HALF_OPEN
                    or self._failure_count >= self.failure_threshold):
                self._state = CircuitState.OPEN
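
One thing the class above does not do is limit how many callers get through once it reaches HALF_OPEN. The following is a minimal sketch of one way to admit a single probe at a time. `ProbingBreaker`, its `CircuitOpenError`, and the injectable `clock` parameter are hypothetical names introduced for this example, and calls that were already in flight when the circuit tripped are not tracked separately.

```python
import threading
import time

class CircuitOpenError(Exception):
    """Raised when a call is rejected without reaching the dependency."""

class ProbingBreaker:
    def __init__(self, failure_threshold=3, recovery_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self._clock = clock             # injectable so tests do not need to sleep
        self._failures = 0
        self._opened_at = None          # None means the circuit is closed
        self._probe_in_flight = False
        self._lock = threading.Lock()

    def _admit(self):
        """Decide under the lock whether this call may proceed."""
        with self._lock:
            if self._opened_at is None:
                return                  # CLOSED: always admit
            if self._clock() - self._opened_at < self.recovery_timeout:
                raise CircuitOpenError("circuit is open")
            if self._probe_in_flight:
                raise CircuitOpenError("a probe is already in flight")
            self._probe_in_flight = True  # HALF_OPEN: let one probe through

    def call(self, func, *args, **kwargs):
        self._admit()
        try:
            result = func(*args, **kwargs)
        except Exception:
            with self._lock:
                self._probe_in_flight = False
                self._failures += 1
                if self._opened_at is not None or self._failures >= self.failure_threshold:
                    # Trip, or re-trip after a failed probe, and restart the timer.
                    self._opened_at = self._clock()
            raise
        with self._lock:
            self._probe_in_flight = False
            self._failures = 0
            self._opened_at = None      # a success closes the circuit
        return result
```

Passing a fake `clock` makes the timeout logic easy to exercise in unit tests without waiting for real time to pass.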

Practical Example

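# payment_gateway and payment_queue are assumed to be provided by the application.
payment_circuit = CircuitBreaker(failure_threshold=3, recovery_timeout=30.0)

def pay_order(order_id, amount):
    try:
        # Pass the amount through as well; it was dropped in the original call.
        return payment_circuit.call(payment_gateway.charge, order_id, amount)
    except Exception:
        # Fallback: queue for later retry
        payment_queue.enqueue({"order_id": order_id, "amount": amount})
        return {"status": "pending"}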

Integration With Other Patterns

1. Retry Pattern

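Retries absorb transient failures; the circuit breaker takes over when failures persist, and its HALF_OPEN test request acts as a controlled retry against the recovering service. When combining the two, exponential backoff between attempts gives better results than immediate retries.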
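
A minimal sketch of a retry helper that cooperates with a breaker might look like this. `retry_with_backoff` and `CircuitOpenError` are hypothetical names for this example, and the delays are arbitrary defaults rather than recommendations.

```python
import random
import time

class CircuitOpenError(Exception):
    """Hypothetical error type a breaker raises when it rejects a call."""

def retry_with_backoff(func, attempts=3, base_delay=0.1, max_delay=5.0, sleep=time.sleep):
    """Call func, retrying failures with exponential backoff and full jitter.

    A CircuitOpenError is never retried: the breaker has already decided
    the dependency is unhealthy, so retrying would only add load.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return func()
        except CircuitOpenError:
            raise
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, delay))  # jitter avoids synchronized retries
```

Wrapping the breaker call itself, for example `retry_with_backoff(lambda: payment_circuit.call(...))`, means every attempt is counted by the breaker, so repeated failures still trip it.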

2. Bulkhead Pattern

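The circuit breaker stops calls to a dependency that is already failing, while the bulkhead caps the resources (threads, connections) that each dependency may consume, so one slow component cannot starve the others.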
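
As a rough illustration, a bulkhead can be approximated with a semaphore that caps concurrent calls to one dependency. `Bulkhead` and `BulkheadFullError` are hypothetical names for this sketch; real implementations often use dedicated thread pools instead.

```python
import threading

class BulkheadFullError(Exception):
    """Raised when the compartment has no free slot."""

class Bulkhead:
    def __init__(self, max_concurrent):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, func, *args, **kwargs):
        # Non-blocking acquire: reject immediately instead of queueing callers.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError("no free slot")
        try:
            return func(*args, **kwargs)
        finally:
            self._slots.release()  # always free the slot, even on error
```

Giving each downstream dependency its own `Bulkhead` instance keeps a slow payment gateway from consuming the capacity needed for unrelated calls.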

3. Fallback Pattern

Return a degraded response, such as cached data, a default value, or a queued request, when the circuit is open.
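
One possible fallback, sketched below, is to serve the last successful result while calls are being rejected. `CachedFallback` and `CircuitOpenError` are hypothetical names, and the sketch keeps a single cached value regardless of the call arguments, which only suits calls whose result does not depend on them.

```python
class CircuitOpenError(Exception):
    """Hypothetical error a breaker raises when it rejects a call."""

class CachedFallback:
    """Serve the last successful result while the circuit is open."""

    _MISSING = object()  # sentinel so that None can be a valid cached result

    def __init__(self, func, default=None):
        self._func = func
        self._default = default
        self._last_good = self._MISSING

    def __call__(self, *args, **kwargs):
        try:
            result = self._func(*args, **kwargs)
        except CircuitOpenError:
            # Degraded path: stale data if we have it, otherwise the default.
            if self._last_good is self._MISSING:
                return self._default
            return self._last_good
        self._last_good = result
        return result
```

Stale data is acceptable for things like product listings or exchange rates; for payments, queueing the request as in the practical example above is the safer choice.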


Framework Support

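  • Java: Resilience4j (provides the @CircuitBreaker annotation for Spring Boot), Spring Cloud Circuit Breaker
  • Python: PyBreaker
  • Go: sony/gobreaker, afex/hystrix-go (Hystrix itself is a Java library from Netflix, now in maintenance mode)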

Best Practices

  1. Set thresholds above the failure rate you observe in normal operation, so routine errors do not trip the breaker
  2. Monitor key metrics: state, failure rate, response time
  3. Set a reasonable recovery timeout (30 s to 5 min)
  4. Always implement fallback handling
  5. Use distributed tracing for debugging
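
For the first two points, a failure rate over recent calls is usually a more useful signal than a raw failure count. Here is a minimal sketch of a count-based sliding window; `FailureRateWindow` and its parameters are hypothetical names, and the default sizes are arbitrary.

```python
from collections import deque

class FailureRateWindow:
    """Track the failure rate over the last `size` calls."""

    def __init__(self, size=20, min_calls=10):
        self._outcomes = deque(maxlen=size)  # True = failure, False = success
        self._min_calls = min_calls

    def record(self, failed: bool) -> None:
        self._outcomes.append(failed)

    @property
    def failure_rate(self) -> float:
        if not self._outcomes:
            return 0.0
        return sum(self._outcomes) / len(self._outcomes)

    def should_trip(self, threshold: float) -> bool:
        # Require a minimum sample so a single early failure cannot trip the circuit.
        return len(self._outcomes) >= self._min_calls and self.failure_rate >= threshold
```

The same `failure_rate` value can be exported as a metric, which gives the monitoring dashboard and the breaker a shared definition of "unhealthy".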

Summary

The Circuit Breaker pattern is a foundation for building resilient distributed systems. By tripping quickly when failures occur, it prevents cascading failures and enables a high-availability microservices architecture.
