DEV Community

Samson Tanimawo

Why Your Microservices Need Circuit Breakers (And How to Add Them)

The Cascading Failure That Took Down Everything

Our payment service went down for 3 minutes. No big deal, right? Except every service that called payments kept retrying. The retry storms consumed all available connections. Within 10 minutes, all 12 services were down.

3 minutes of one service failing became 45 minutes of total outage.

Circuit breakers prevent this.

How Circuit Breakers Work

State Machine:

CLOSED ──(failures exceed threshold)──→ OPEN
  ↑                                       │
  │                                       │
  └──(success)──← HALF-OPEN ←──(timeout)──┘

CLOSED: Normal operation. Requests pass through.
        Count consecutive failures.

OPEN: Requests fail immediately (fast failure).
      No traffic to the struggling service.
      Wait for the recovery timeout.

HALF-OPEN: Allow test requests through.
           If enough succeed → CLOSED
           If any fails → OPEN
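The state machine above can also be read as a transition table. The following is a minimal sketch for illustration only: the event names (`threshold_exceeded`, `timeout_elapsed`, `probe_succeeded`, `probe_failed`) and the `next_state` helper are hypothetical, and "probe succeeded" is treated as a single event regardless of how many successes a real breaker requires before closing.

```python
from enum import Enum

class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"

# (current state, event) -> next state. Event names are illustrative assumptions.
TRANSITIONS = {
    (State.CLOSED, "threshold_exceeded"): State.OPEN,
    (State.OPEN, "timeout_elapsed"): State.HALF_OPEN,
    (State.HALF_OPEN, "probe_succeeded"): State.CLOSED,
    (State.HALF_OPEN, "probe_failed"): State.OPEN,
}

def next_state(state, event):
    """Return the next state, or stay put if the event does not apply."""
    return TRANSITIONS.get((state, event), state)

# Walk through one failure-and-recovery cycle.
state = State.CLOSED
for event in ["threshold_exceeded", "timeout_elapsed", "probe_succeeded"]:
    state = next_state(state, event)
```

Any pair not listed in the table leaves the state unchanged, which is why a success reported while the circuit is OPEN has no effect.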

Implementation in Python

import time
from enum import Enum
from threading import Lock

class CircuitState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"

class CircuitBreakerOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""

class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=30,
                 success_threshold=3):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.recovery_timeout = recovery_timeout    # seconds to stay open before probing
        self.success_threshold = success_threshold  # half-open successes before closing
        self.state = CircuitState.CLOSED
        self.failure_count = 0
        self.success_count = 0
        self.last_failure_time = None
        self.lock = Lock()

    def call(self, func, *args, **kwargs):
        # Hold the lock only for the state check, never while func runs.
        with self.lock:
            if self.state == CircuitState.OPEN:
                # monotonic() is unaffected by wall-clock adjustments.
                if time.monotonic() - self.last_failure_time > self.recovery_timeout:
                    self.state = CircuitState.HALF_OPEN
                    self.success_count = 0
                else:
                    raise CircuitBreakerOpenError(
                        f"Circuit breaker is OPEN. Retry after "
                        f"{self.recovery_timeout}s"
                    )

        try:
            result = func(*args, **kwargs)
            self._on_success()
            return result
        except Exception:
            self._on_failure()
            raise

    def _on_success(self):
        with self.lock:
            if self.state == CircuitState.HALF_OPEN:
                self.success_count += 1
                if self.success_count >= self.success_threshold:
                    self.state = CircuitState.CLOSED
            # Any success resets the consecutive-failure count.
            self.failure_count = 0

    def _on_failure(self):
        with self.lock:
            self.failure_count += 1
            self.last_failure_time = time.monotonic()
            # A single failure while half-open re-opens the circuit.
            if (self.state == CircuitState.HALF_OPEN
                    or self.failure_count >= self.failure_threshold):
                self.state = CircuitState.OPEN

# Usage (payment_service and queue_for_retry are assumed to be defined elsewhere)
payment_breaker = CircuitBreaker(failure_threshold=5, recovery_timeout=30)

def process_payment(order):
    try:
        return payment_breaker.call(payment_service.charge, order)
    except CircuitBreakerOpenError:
        return queue_for_retry(order)  # Graceful degradation

What to Do When the Circuit Opens

The circuit breaker buys you time. Use it wisely:

def handle_open_circuit(service_name, request):
    # One fallback per service; the helper functions are assumed to exist elsewhere.
    strategies = {
        'payment': queue_for_retry,            # Retry later
        'recommendations': return_cached,      # Serve stale data
        'analytics': drop_silently,            # Non-critical, skip
        'auth': allow_with_cached_token,       # Cached auth
        'search': return_popular_results,      # Fallback results
    }
    return strategies.get(service_name, return_error)(request)
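One of the strategies above, serving stale data, could look roughly like this. It is a sketch under assumptions: `CircuitOpenError`, the two fetch functions, and the in-memory `_cache` dict are hypothetical stand-ins, not part of any real service or library.

```python
class CircuitOpenError(Exception):
    """Hypothetical stand-in for the error a breaker raises when it is open."""

_cache = {}  # last good response per user; a plain dict, for illustration only

def get_recommendations(user_id, fetch):
    """Call fetch(user_id); if the circuit is open, fall back to the last cached result."""
    try:
        result = fetch(user_id)
    except CircuitOpenError:
        # Serve stale data if there is any, otherwise an empty list.
        return _cache.get(user_id, [])
    _cache[user_id] = result
    return result

def healthy_fetch(user_id):
    return ["item-1", "item-2"]

def broken_fetch(user_id):
    raise CircuitOpenError("recommendations circuit is open")

fresh = get_recommendations("u1", healthy_fetch)  # populates the cache
stale = get_recommendations("u1", broken_fetch)   # served from the cache
empty = get_recommendations("u2", broken_fetch)   # nothing cached yet
```

Only the open-circuit error triggers the fallback here; other exceptions still propagate, so real bugs are not hidden behind stale responses.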

Monitoring Circuit Breakers

circuit_breaker_metrics:
  - name: circuit_breaker_state
    type: gauge
    labels: [service, target]
    # 0=closed, 1=open, 2=half_open

  - name: circuit_breaker_failures_total
    type: counter
    labels: [service, target]

  - name: circuit_breaker_rejected_total
    type: counter
    labels: [service, target]
    # Requests rejected while circuit is open

alerts:
  - alert: CircuitBreakerOpen
    expr: circuit_breaker_state == 1
    for: 1m
    severity: warning
    message: "Circuit breaker for {{ $labels.target }} is OPEN"
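A rough sketch of how a breaker might feed these three metrics, using only the standard library. The `BreakerMetrics` class, its method names, and the `checkout` service label are hypothetical; a real service would hand these values to its metrics client instead of keeping them in memory.

```python
from collections import Counter

# Gauge encoding from the metric definition above.
STATE_VALUES = {"closed": 0, "open": 1, "half_open": 2}

class BreakerMetrics:
    """In-memory stand-in for a metrics client (hypothetical, for illustration)."""

    def __init__(self):
        self.state = {}            # (service, target) -> gauge value
        self.failures = Counter()  # (service, target) -> failed calls
        self.rejected = Counter()  # (service, target) -> calls rejected while open

    def set_state(self, service, target, state_name):
        self.state[(service, target)] = STATE_VALUES[state_name]

    def record_failure(self, service, target):
        self.failures[(service, target)] += 1

    def record_rejection(self, service, target):
        self.rejected[(service, target)] += 1

# Simulate five failed calls, the breaker opening, and one rejected call.
metrics = BreakerMetrics()
for _ in range(5):
    metrics.record_failure("checkout", "payment")
metrics.set_state("checkout", "payment", "open")
metrics.record_rejection("checkout", "payment")
```

Keeping failures and rejections in separate counters makes it possible to tell "the target is failing" apart from "the breaker is shielding the target".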

The Configuration That Matters

circuit_breakers:
  payment-service:
    failure_threshold: 5
    recovery_timeout: 30s
    success_threshold: 3
    timeout_per_request: 5s

  search-service:
    failure_threshold: 10   # More tolerant
    recovery_timeout: 15s   # Recover faster
    success_threshold: 2
    timeout_per_request: 2s

  auth-service:
    failure_threshold: 3    # Less tolerant (critical)
    recovery_timeout: 10s   # Recover very fast
    success_threshold: 1
    timeout_per_request: 1s

Critical services get lower failure thresholds (they trip sooner) and shorter recovery timeouts (they are probed again sooner).
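To show how such a config could be turned into breaker settings, here is a small sketch. It assumes the YAML has already been loaded into a plain dict (the loading step is omitted), and `parse_seconds` and `breaker_kwargs` are hypothetical helpers that only understand the `s` suffix used above.

```python
# Part of the config above as a plain dict (assumed already parsed from YAML).
CONFIG = {
    "payment-service": {
        "failure_threshold": 5,
        "recovery_timeout": "30s",
        "success_threshold": 3,
        "timeout_per_request": "5s",
    },
    "auth-service": {
        "failure_threshold": 3,
        "recovery_timeout": "10s",
        "success_threshold": 1,
        "timeout_per_request": "1s",
    },
}

def parse_seconds(value):
    """Convert a duration such as '30s' to a float number of seconds."""
    if isinstance(value, (int, float)):
        return float(value)
    if not value.endswith("s"):
        raise ValueError(f"unsupported duration: {value!r}")
    return float(value[:-1])

def breaker_kwargs(service):
    """Build keyword arguments for a breaker constructor from one config entry."""
    cfg = CONFIG[service]
    return {
        "failure_threshold": cfg["failure_threshold"],
        "recovery_timeout": parse_seconds(cfg["recovery_timeout"]),
        "success_threshold": cfg["success_threshold"],
    }

payment_kwargs = breaker_kwargs("payment-service")
auth_kwargs = breaker_kwargs("auth-service")
```

`timeout_per_request` is deliberately left out of the result: the `CircuitBreaker` class shown earlier has no such parameter, so that value would have to be applied by whatever client makes the call.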

If you want AI-powered circuit breaker tuning and cascading failure prevention, check out what we're building at Nova AI Ops.


Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com
