Why Your Microservices Need Circuit Breakers (And How to Add Them)

#microservices #reliability #sre #devops

The Cascading Failure That Took Down Everything

Our payment service went down for 3 minutes. No big deal, right? Except every service that called payments kept retrying. The retry storms consumed all available connections. Within 10 minutes, all 12 services were down.

Three minutes of one service failing became 45 minutes of total outage.

Circuit breakers prevent this.

How Circuit Breakers Work

State Machine:

  CLOSED ──(failures exceed threshold)──→ OPEN
    ↑                                       │
    │                                       │
    └──(success)──← HALF-OPEN ←──(timeout)──┘

CLOSED:    Normal operation. Requests pass through.
           Count consecutive failures.

OPEN:      Requests fail immediately (fast failure).
           No traffic to the struggling service.
           Wait for timeout period.

HALF-OPEN: Allow test requests through.
           If enough succeed (success_threshold) → CLOSED
           If one fails → OPEN
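The same state machine can be written down as a lookup table, which makes it easy to check that every arrow in the diagram is accounted for. This is only a sketch: the `Event` names and the `next_state` helper are made up for illustration and are not part of the implementation that follows.

```python
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class Event(Enum):
    # Hypothetical event names, chosen to mirror the labels in the diagram.
    THRESHOLD_EXCEEDED = "threshold_exceeded"  # too many failures while CLOSED
    TIMEOUT_ELAPSED = "timeout_elapsed"        # cooldown finished while OPEN
    PROBE_SUCCEEDED = "probe_succeeded"        # test traffic worked
    PROBE_FAILED = "probe_failed"              # test traffic failed


# (current state, event) -> next state. One entry per arrow.
TRANSITIONS = {
    (State.CLOSED, Event.THRESHOLD_EXCEEDED): State.OPEN,
    (State.OPEN, Event.TIMEOUT_ELAPSED): State.HALF_OPEN,
    (State.HALF_OPEN, Event.PROBE_SUCCEEDED): State.CLOSED,
    (State.HALF_OPEN, Event.PROBE_FAILED): State.OPEN,
}


def next_state(state, event):
    """Return the next state; pairs not in the table leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)
```

Anything not in the table is a no-op, so a stray failure reported while the circuit is already OPEN changes nothing.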

Implementation in Python

import time
from enum import Enum
from threading import Lock

class CircuitState(Enum):
    CLOSED = "closed"
    OPEN = "open"
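    HALF_OPEN = "half_open"

class CircuitBreakerOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""

class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=30,
                 success_threshold=3):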
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.success_threshold = success_threshold
        self.state = CircuitState.CLOSED
        self.failure_count = 0
        self.success_count = 0
        self.last_failure_time = None
        self.lock = Lock()

    def call(self, func, *args, **kwargs):
        """Run func through the breaker; raise CircuitBreakerOpenError if rejected."""
        with self.lock:
            if self.state == CircuitState.OPEN:
                # time.monotonic() is not affected by wall-clock adjustments.
                elapsed = time.monotonic() - self.last_failure_time
                if elapsed > self.recovery_timeout:
                    # Cooldown is over: let test requests through.
                    self.state = CircuitState.HALF_OPEN
                    self.success_count = 0
                else:
                    remaining = self.recovery_timeout - elapsed
                    raise CircuitBreakerOpenError(
                        f"Circuit breaker is OPEN. Retry after "
                        f"{remaining:.0f}s"
                    )

        # The lock is released before calling func so that slow calls
        # do not serialize every caller.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        else:
            self._on_success()
            return result

    def _on_success(self):
        with self.lock:
            # Any success resets the consecutive-failure count.
            self.failure_count = 0
            if self.state == CircuitState.HALF_OPEN:
                self.success_count += 1
                if self.success_count >= self.success_threshold:
                    self.state = CircuitState.CLOSED

    def _on_failure(self):
        with self.lock:
            self.failure_count += 1
            self.last_failure_time = time.monotonic()
            # A single failure in HALF_OPEN reopens the circuit immediately.
            if (self.state == CircuitState.HALF_OPEN
                    or self.failure_count >= self.failure_threshold):
                self.state = CircuitState.OPEN

# Usage
# payment_service and queue_for_retry stand in for your own client and retry queue.
payment_breaker = CircuitBreaker(failure_threshold=5, recovery_timeout=30)

def process_payment(order):
    try:
        return payment_breaker.call(payment_service.charge, order)
    except CircuitBreakerOpenError:
        return queue_for_retry(order)  # Graceful degradation

What to Do When the Circuit Opens

The circuit breaker buys you time. Use it wisely:

# Fallback per service. Each handler is an application-specific function.
FALLBACK_STRATEGIES = {
    'payment': queue_for_retry,              # Retry later
    'recommendations': return_cached,        # Serve stale data
    'analytics': drop_silently,              # Non-critical, skip
    'auth': allow_with_cached_token,         # Cached auth
    'search': return_popular_results,        # Fallback results
}

def handle_open_circuit(service_name, request):
    fallback = FALLBACK_STRATEGIES.get(service_name, return_error)
    return fallback(request)
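To make the "serve stale data" strategy concrete, here is a minimal sketch of what a cached fallback might look like. `StaleCache` and `fetch_with_stale_fallback` are hypothetical names, and the 5-minute default max age is an arbitrary assumption. In practice `fetch` would be a call wrapped by the breaker, so an open circuit lands in the `except` branch like any other failure.

```python
import time


class StaleCache:
    """Remembers the last good response per key so it can be served later."""

    def __init__(self, max_age_seconds=300, clock=time.monotonic):
        self.max_age_seconds = max_age_seconds
        self.clock = clock
        self._entries = {}  # key -> (stored_at, value)

    def put(self, key, value):
        self._entries[key] = (self.clock(), value)

    def get(self, key):
        """Return the cached value, or None if missing or older than max age."""
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self.clock() - stored_at > self.max_age_seconds:
            return None
        return value


def fetch_with_stale_fallback(key, fetch, cache, default=None):
    """Call fetch(key); on any failure serve the cached copy or the default."""
    try:
        value = fetch(key)
    except Exception:
        # Covers real failures and the breaker's rejection alike.
        cached = cache.get(key)
        return cached if cached is not None else default
    cache.put(key, value)
    return value
```

The max age matters: recommendations from five minutes ago are fine, but for data such as prices or permissions a much shorter limit (or no stale serving at all) is probably the safer choice.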

Monitoring Circuit Breakers

circuit_breaker_metrics:
  - name: circuit_breaker_state
    type: gauge
    labels: [service, target]
    # 0=closed, 1=open, 2=half_open

  - name: circuit_breaker_failures_total
    type: counter
    labels: [service, target]

  - name: circuit_breaker_rejected_total
    type: counter
    labels: [service, target]
    # Requests rejected while circuit is open

alerts:
  - alert: CircuitBreakerOpen
    expr: circuit_breaker_state == 1
    for: 1m
    severity: warning
    message: "Circuit breaker for {{ $labels.target }} is OPEN"
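The metrics above have to be emitted from somewhere. Below is a rough sketch of the bookkeeping, using a plain in-memory recorder as a stand-in for whatever metrics client you actually use. `BreakerMetrics` and its method names are assumptions for illustration. Only the metric names, the labels, and the 0/1/2 state encoding come from the definitions above.

```python
from collections import Counter

# Numeric encoding from the metric definition: 0=closed, 1=open, 2=half_open.
STATE_VALUES = {"closed": 0, "open": 1, "half_open": 2}


class BreakerMetrics:
    """In-memory stand-in for a metrics client, keyed by (service, target)."""

    def __init__(self):
        self.state = {}                  # gauge: circuit_breaker_state
        self.failures_total = Counter()  # counter: circuit_breaker_failures_total
        self.rejected_total = Counter()  # counter: circuit_breaker_rejected_total

    def set_state(self, service, target, state_name):
        """Record the current state as its numeric gauge value."""
        self.state[(service, target)] = STATE_VALUES[state_name]

    def record_failure(self, service, target):
        self.failures_total[(service, target)] += 1

    def record_rejection(self, service, target):
        self.rejected_total[(service, target)] += 1
```

In the `CircuitBreaker` class, the natural hook points would be every assignment to `self.state` for `set_state`, `_on_failure` for `record_failure`, and the spot where `CircuitBreakerOpenError` is raised for `record_rejection`.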

The Configuration That Matters

circuit_breakers:
  payment-service:
    failure_threshold: 5
    recovery_timeout: 30s
    success_threshold: 3
    timeout_per_request: 5s

  search-service:
    failure_threshold: 10   # More tolerant
    recovery_timeout: 15s   # Recover faster
    success_threshold: 2
    timeout_per_request: 2s

  auth-service:
    failure_threshold: 3    # Less tolerant (critical)
    recovery_timeout: 10s   # Recover very fast
    success_threshold: 1
    timeout_per_request: 1s
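One way to turn this configuration into typed settings is sketched below. It assumes the YAML has already been parsed into a dict by a YAML library of your choice, and `parse_seconds`, `BreakerSettings`, and `load_settings` are hypothetical helpers. Note that the `CircuitBreaker` class shown earlier has no `timeout_per_request` parameter, so that value would have to be enforced by the HTTP client or whatever makes the call.

```python
from dataclasses import dataclass


def parse_seconds(value):
    """Convert a duration such as '30s' (or a bare number) into seconds."""
    text = str(value).strip()
    if text.endswith("s"):
        text = text[:-1]
    return float(text)


@dataclass(frozen=True)
class BreakerSettings:
    failure_threshold: int
    recovery_timeout: float     # seconds
    success_threshold: int
    timeout_per_request: float  # seconds, enforced by the caller


def load_settings(raw):
    """Build {service_name: BreakerSettings} from the parsed config mapping."""
    return {
        name: BreakerSettings(
            failure_threshold=int(cfg["failure_threshold"]),
            recovery_timeout=parse_seconds(cfg["recovery_timeout"]),
            success_threshold=int(cfg["success_threshold"]),
            timeout_per_request=parse_seconds(cfg["timeout_per_request"]),
        )
        for name, cfg in raw["circuit_breakers"].items()
    }
```

Each `BreakerSettings` can then feed one breaker, for example `CircuitBreaker(s.failure_threshold, s.recovery_timeout, s.success_threshold)`.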