
Samson Tanimawo

Why Your Microservices Need Circuit Breakers (And How to Add Them)

The Cascading Failure That Took Down Everything

Our payment service went down for 3 minutes. No big deal, right? Except every service that called payments kept retrying. The retry storms consumed all available connections. Within 10 minutes, all 12 services were down.

Three minutes of failure in one service became 45 minutes of total outage.

Circuit breakers prevent this.

How Circuit Breakers Work

State Machine:

  CLOSED ──(failures exceed threshold)──→ OPEN
    ↑                                       │
    │                                       │
    └──(success)──← HALF-OPEN ←──(timeout)──┘

CLOSED:    Normal operation. Requests pass through.
           Count consecutive failures.

OPEN:      Requests fail immediately (fast failure).
           No traffic to the struggling service.
           Wait for timeout period.

HALF-OPEN: Allow trial requests through.
           If enough of them succeed (success_threshold) → CLOSED
           If any fails → OPEN
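
The arrows in the diagram above can also be written down as a lookup table. This is only a sketch: the event names are labels invented here for each arrow, and the table ignores how failures and successes are counted.

```python
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


# One entry per arrow in the diagram. Event names are illustrative labels.
TRANSITIONS = {
    (State.CLOSED, "failures_exceed_threshold"): State.OPEN,
    (State.OPEN, "timeout"): State.HALF_OPEN,
    (State.HALF_OPEN, "success"): State.CLOSED,
    (State.HALF_OPEN, "failure"): State.OPEN,
}


def next_state(state, event):
    """Return the state after `event`; pairs with no arrow leave it unchanged."""
    return TRANSITIONS.get((state, event), state)
```

Any (state, event) pair without an arrow keeps the current state, which is why a success while CLOSED changes nothing.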

Implementation in Python

import time
from enum import Enum
from threading import Lock


class CircuitState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitBreakerOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=30,
                 success_threshold=3):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.recovery_timeout = recovery_timeout    # seconds to stay open before probing
        self.success_threshold = success_threshold  # half-open successes needed to close
        self.state = CircuitState.CLOSED
        self.failure_count = 0
        self.success_count = 0
        self.last_failure_time = None
        self.lock = Lock()

    def call(self, func, *args, **kwargs):
        with self.lock:
            if self.state == CircuitState.OPEN:
                # monotonic() is unaffected by wall-clock adjustments
                elapsed = time.monotonic() - self.last_failure_time
                if elapsed > self.recovery_timeout:
                    self.state = CircuitState.HALF_OPEN
                    self.success_count = 0
                else:
                    remaining = self.recovery_timeout - elapsed
                    raise CircuitBreakerOpenError(
                        f"Circuit breaker is OPEN. Retry after {remaining:.1f}s"
                    )

        # The lock is released here so a slow call does not block other callers.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        with self.lock:
            self.failure_count = 0
            if self.state == CircuitState.HALF_OPEN:
                self.success_count += 1
                if self.success_count >= self.success_threshold:
                    self.state = CircuitState.CLOSED

    def _on_failure(self):
        with self.lock:
            self.failure_count += 1
            self.last_failure_time = time.monotonic()
            # Any failure while half-open reopens the circuit immediately.
            if (self.state == CircuitState.HALF_OPEN
                    or self.failure_count >= self.failure_threshold):
                self.state = CircuitState.OPEN

# Usage (payment_service and queue_for_retry are provided by your application)
payment_breaker = CircuitBreaker(failure_threshold=5, recovery_timeout=30)

def process_payment(order):
    try:
        return payment_breaker.call(payment_service.charge, order)
    except CircuitBreakerOpenError:
        return queue_for_retry(order)  # Graceful degradation
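
To see the state changes without waiting 30 seconds, it can help to drive a breaker with a fake clock. The sketch below is a deliberately simplified, single-threaded variant written for this demo. `TinyBreaker`, `BreakerOpen`, and `run_demo` are hypothetical names, and it closes after a single successful probe rather than counting up to a `success_threshold`.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class BreakerOpen(Exception):
    """Raised when a call is rejected without reaching the target."""


class TinyBreaker:
    """Single-threaded breaker with an injectable clock (illustration only)."""

    def __init__(self, failure_threshold=2, recovery_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.clock = clock
        self.state = State.CLOSED
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.state is State.OPEN:
            if self.clock() - self.opened_at < self.recovery_timeout:
                raise BreakerOpen("rejected while OPEN")
            self.state = State.HALF_OPEN  # timeout elapsed: allow a probe
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if (self.state is State.HALF_OPEN
                    or self.failures >= self.failure_threshold):
                self.state = State.OPEN
                self.opened_at = self.clock()
            raise
        self.failures = 0
        self.state = State.CLOSED
        return result


def run_demo():
    now = [0.0]       # fake clock that the demo advances by hand
    healthy = [False]
    calls = [0]
    breaker = TinyBreaker(failure_threshold=2, recovery_timeout=30.0,
                          clock=lambda: now[0])

    def charge():
        calls[0] += 1
        if not healthy[0]:
            raise ConnectionError("payments unavailable")
        return "charged"

    trace = []
    for _ in range(2):                  # two failures trip the breaker
        try:
            breaker.call(charge)
        except ConnectionError:
            pass
    trace.append(breaker.state)

    try:
        breaker.call(charge)            # rejected; charge() is never invoked
    except BreakerOpen:
        pass
    trace.append(calls[0])

    now[0] += 31                        # let the recovery timeout pass
    healthy[0] = True
    trace.append(breaker.call(charge))  # the probe succeeds
    trace.append(breaker.state)
    return trace
```

`run_demo()` returns the observed sequence: the breaker is OPEN after two failures, the rejected call never reaches `charge()` (the call count stays at 2), and the first successful probe after the timeout closes the circuit again.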

What to Do When the Circuit Opens

The circuit breaker buys you time. Use it wisely:

def handle_open_circuit(service_name, request):
    strategies = {
        'payment': lambda r: queue_for_retry(r),           # Retry later
        'recommendations': lambda r: return_cached(r),      # Serve stale data
        'analytics': lambda r: drop_silently(r),            # Non-critical, skip
        'auth': lambda r: allow_with_cached_token(r),       # Cached auth
        'search': lambda r: return_popular_results(r),      # Fallback results
    }
    return strategies.get(service_name, lambda r: return_error(r))(request)
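
As one concrete case, the "serve stale data" strategy could look roughly like the sketch below. It assumes the breaker signals an open circuit with an exception. `CircuitOpen`, `with_cached_fallback`, and `fetch_recommendations` are hypothetical names used only for this example, and the cache is a plain dict.

```python
class CircuitOpen(Exception):
    """Stand-in for the error a breaker raises while it is open."""


def with_cached_fallback(fetch, cache):
    """Wrap fetch(key) so the last good value is served when the circuit is open."""
    def wrapper(key):
        try:
            value = fetch(key)
        except CircuitOpen:
            if key in cache:
                return cache[key], "stale"
            raise                  # nothing cached yet: surface the error
        cache[key] = value         # remember the latest good response
        return value, "fresh"
    return wrapper


def demo():
    circuit_open = [False]

    def fetch_recommendations(user_id):
        if circuit_open[0]:
            raise CircuitOpen("recommendations circuit is open")
        return [f"item-for-{user_id}"]

    get_recs = with_cached_fallback(fetch_recommendations, cache={})
    first = get_recs("u1")         # normal path, fills the cache
    circuit_open[0] = True
    second = get_recs("u1")        # circuit open, served from the cache
    return first, second
```

Returning a "fresh" or "stale" marker alongside the value lets the caller decide whether to tell the user the data may be out of date.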

Monitoring Circuit Breakers

circuit_breaker_metrics:
  - name: circuit_breaker_state
    type: gauge
    labels: [service, target]
    # 0=closed, 1=open, 2=half_open

  - name: circuit_breaker_failures_total
    type: counter
    labels: [service, target]

  - name: circuit_breaker_rejected_total
    type: counter
    labels: [service, target]
    # Requests rejected while circuit is open

# Prometheus alerting rule
groups:
  - name: circuit_breakers
    rules:
      - alert: CircuitBreakerOpen
        expr: circuit_breaker_state == 1
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Circuit breaker for {{ $labels.target }} is OPEN"
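
The breaker has to emit these metrics from somewhere. The sketch below uses an in-memory recorder as a stand-in for a real metrics client, so it runs with the standard library alone. `BreakerMetrics` and the `checkout` service label are hypothetical. The metric names and the 0/1/2 state encoding follow the list above.

```python
from collections import Counter

# Same encoding as the gauge comment: 0=closed, 1=open, 2=half_open
STATE_VALUES = {"closed": 0, "open": 1, "half_open": 2}


class BreakerMetrics:
    """In-memory stand-in for a metrics client (hypothetical helper)."""

    def __init__(self):
        self.gauges = {}           # (metric, service, target) -> current value
        self.counters = Counter()  # (metric, service, target) -> running total

    def set_state(self, service, target, state):
        key = ("circuit_breaker_state", service, target)
        self.gauges[key] = STATE_VALUES[state]

    def record_failure(self, service, target):
        self.counters[("circuit_breaker_failures_total", service, target)] += 1

    def record_rejection(self, service, target):
        self.counters[("circuit_breaker_rejected_total", service, target)] += 1


def demo():
    metrics = BreakerMetrics()
    metrics.record_failure("checkout", "payment")
    metrics.record_failure("checkout", "payment")
    metrics.set_state("checkout", "payment", "open")   # breaker tripped
    metrics.record_rejection("checkout", "payment")    # fast-failed request
    return metrics
```

In a real service, `set_state` would be called on every state transition, `record_failure` where the breaker counts a failed call, and `record_rejection` where it raises the open-circuit error.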

The Configuration That Matters

circuit_breakers:
  payment-service:
    failure_threshold: 5
    recovery_timeout: 30s
    success_threshold: 3
    timeout_per_request: 5s

  search-service:
    failure_threshold: 10   # More tolerant
    recovery_timeout: 15s   # Recover faster
    success_threshold: 2
    timeout_per_request: 2s

  auth-service:
    failure_threshold: 3    # Less tolerant (critical)
    recovery_timeout: 10s   # Recover very fast
    success_threshold: 1
    timeout_per_request: 1s

Critical services get lower failure thresholds (they trip sooner) and shorter recovery timeouts (they are probed again sooner).
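
A config like this still has to be turned into typed values before it reaches a breaker. The sketch below assumes the YAML has already been parsed into a dict (shown inline as `RAW`) and that durations only ever use an `s` suffix. `BreakerSettings`, `parse_seconds`, and `load_settings` are hypothetical helpers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BreakerSettings:
    failure_threshold: int
    recovery_timeout: float      # seconds
    success_threshold: int
    timeout_per_request: float   # seconds


def parse_seconds(value):
    """Convert '30s' or a bare number into seconds (only the 's' suffix is handled)."""
    text = str(value).strip()
    if text.endswith("s"):
        text = text[:-1]
    return float(text)


def load_settings(raw):
    """Build one BreakerSettings per service from the parsed config dict."""
    return {
        name: BreakerSettings(
            failure_threshold=int(cfg["failure_threshold"]),
            recovery_timeout=parse_seconds(cfg["recovery_timeout"]),
            success_threshold=int(cfg["success_threshold"]),
            timeout_per_request=parse_seconds(cfg["timeout_per_request"]),
        )
        for name, cfg in raw.items()
    }


# Two of the services from the config above, as a YAML loader would return them.
RAW = {
    "payment-service": {"failure_threshold": 5, "recovery_timeout": "30s",
                        "success_threshold": 3, "timeout_per_request": "5s"},
    "auth-service": {"failure_threshold": 3, "recovery_timeout": "10s",
                     "success_threshold": 1, "timeout_per_request": "1s"},
}
SETTINGS = load_settings(RAW)
```

The first three fields map directly onto the `CircuitBreaker` constructor arguments. `timeout_per_request` is not handled by the breaker class shown earlier and would need to be enforced by whatever client makes the call.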

If you want AI-powered circuit breaker tuning and cascading failure prevention, check out what we're building at Nova AI Ops.


Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com
