The Cascading Failure That Took Down Everything
Our payment service went down for 3 minutes. No big deal, right? Except every service that called payments kept retrying. The retry storms consumed all available connections. Within 10 minutes, all 12 services were down.
3 minutes of one service failing became 45 minutes of total outage.
Circuit breakers prevent this.
How Circuit Breakers Work
State Machine:

  CLOSED ──(failures exceed threshold)──→ OPEN
    ↑                                       │
    │                                       │
    └──(success)──← HALF-OPEN ←──(timeout)──┘

CLOSED:    Normal operation. Requests pass through.
           Failures are counted.
OPEN:      Requests fail immediately (fast failure).
           No traffic reaches the struggling service.
           Wait for the recovery timeout to elapse.
HALF-OPEN: Allow trial requests through.
           If enough succeed → CLOSED
           If one fails → OPEN
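The state machine can also be written down as a small transition table, which makes it easy to see that only four transitions exist. This is a minimal sketch: the event names (`threshold_exceeded`, `timeout_elapsed`, `trial_succeeded`, `trial_failed`) are made up for illustration and are not used by the implementation that follows.

```python
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


# (current state, event) -> next state. Event names are hypothetical.
TRANSITIONS = {
    (State.CLOSED, "threshold_exceeded"): State.OPEN,
    (State.OPEN, "timeout_elapsed"): State.HALF_OPEN,
    (State.HALF_OPEN, "trial_succeeded"): State.CLOSED,
    (State.HALF_OPEN, "trial_failed"): State.OPEN,
}


def next_state(state, event):
    """Return the next state, or the current one if the event does not apply."""
    return TRANSITIONS.get((state, event), state)
```

Any pair not in the table leaves the state unchanged, so a timeout while CLOSED, for instance, does nothing.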
Implementation in Python
import time
from enum import Enum
from threading import Lock


class CircuitBreakerOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=30,
                 success_threshold=3):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.recovery_timeout = recovery_timeout    # seconds to stay open
        self.success_threshold = success_threshold  # trial successes needed to close
        self.state = CircuitState.CLOSED
        self.failure_count = 0
        self.success_count = 0
        self.last_failure_time = None
        self.lock = Lock()

    def call(self, func, *args, **kwargs):
        with self.lock:
            if self.state == CircuitState.OPEN:
                # monotonic() is unaffected by wall-clock adjustments
                elapsed = time.monotonic() - self.last_failure_time
                if elapsed > self.recovery_timeout:
                    self.state = CircuitState.HALF_OPEN
                    self.success_count = 0
                else:
                    raise CircuitBreakerOpenError(
                        f"Circuit breaker is OPEN. Retry in "
                        f"{self.recovery_timeout - elapsed:.0f}s"
                    )

        # Run the call outside the lock: Lock is not reentrant, and a slow
        # call must not block other callers.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        with self.lock:
            if self.state == CircuitState.HALF_OPEN:
                self.success_count += 1
                if self.success_count >= self.success_threshold:
                    self.state = CircuitState.CLOSED
            # Any success resets the consecutive-failure count
            self.failure_count = 0

    def _on_failure(self):
        with self.lock:
            self.last_failure_time = time.monotonic()
            if self.state == CircuitState.HALF_OPEN:
                # A failed trial request reopens the circuit immediately
                self.state = CircuitState.OPEN
                return
            self.failure_count += 1
            if self.failure_count >= self.failure_threshold:
                self.state = CircuitState.OPEN
# Usage (payment_service and queue_for_retry come from your application)
payment_breaker = CircuitBreaker(failure_threshold=5, recovery_timeout=30)


def process_payment(order):
    try:
        return payment_breaker.call(payment_service.charge, order)
    except CircuitBreakerOpenError:
        return queue_for_retry(order)  # Graceful degradation
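One caveat: the breaker above trips on consecutive failures, because any success resets the count. A service that fails every other call would never open it. If you would rather trip on failure rate, one option is a rolling window of recent outcomes. The following is a sketch under that assumption, not part of the breaker above; `RollingFailureRate`, `window_size`, `min_calls`, and `max_rate` are hypothetical names, and the class is not thread-safe on its own.

```python
from collections import deque


class RollingFailureRate:
    """Tracks the failure rate over the most recent `window_size` calls."""

    def __init__(self, window_size=20, min_calls=10):
        # deque with maxlen drops the oldest outcome automatically
        self.outcomes = deque(maxlen=window_size)  # True means the call failed
        self.min_calls = min_calls

    def record(self, failed):
        self.outcomes.append(bool(failed))

    def rate(self):
        if not self.outcomes:
            return 0.0
        return sum(self.outcomes) / len(self.outcomes)

    def should_trip(self, max_rate=0.5):
        # Require a minimum sample so a single early failure cannot trip it
        return len(self.outcomes) >= self.min_calls and self.rate() >= max_rate


window = RollingFailureRate(window_size=10, min_calls=4)
for failed in [True, False, True, False]:  # alternating failures
    window.record(failed)
print(window.rate(), window.should_trip())
```

To use it inside the breaker, you would call `record()` from `_on_success` and `_on_failure` while holding the lock, and open the circuit when `should_trip()` returns true.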
What to Do When the Circuit Opens
The circuit breaker buys you time. Use it wisely:
def handle_open_circuit(service_name, request):
    # Each handler is an application-defined fallback for one dependency
    strategies = {
        'payment': queue_for_retry,            # Retry later
        'recommendations': return_cached,      # Serve stale data
        'analytics': drop_silently,            # Non-critical, skip
        'auth': allow_with_cached_token,       # Cached auth
        'search': return_popular_results,      # Fallback results
    }
    # Unknown services fall back to returning an error
    fallback = strategies.get(service_name, return_error)
    return fallback(request)
Monitoring Circuit Breakers
circuit_breaker_metrics:
  - name: circuit_breaker_state
    type: gauge
    labels: [service, target]
    # 0=closed, 1=open, 2=half_open
  - name: circuit_breaker_failures_total
    type: counter
    labels: [service, target]
  - name: circuit_breaker_rejected_total
    type: counter
    labels: [service, target]
    # Requests rejected while circuit is open

alerts:
  - alert: CircuitBreakerOpen
    expr: circuit_breaker_state == 1
    for: 1m
    severity: warning
    message: "Circuit breaker for {{ $labels.target }} is OPEN"
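One way to feed these metrics is to set the gauge on every state change and bump a counter on each failure or rejection. Below is a minimal in-memory sketch, assuming the 0/1/2 state encoding from the metric definition. `BreakerMetrics` and its methods are hypothetical stand-ins for whatever metrics client you use, and `checkout` is an invented service name.

```python
from collections import Counter

# Gauge encoding from the metric definition: 0=closed, 1=open, 2=half_open
STATE_VALUES = {"closed": 0, "open": 1, "half_open": 2}


class BreakerMetrics:
    """In-memory stand-in for a metrics client (hypothetical)."""

    def __init__(self):
        self.state = {}            # (service, target) -> gauge value
        self.failures = Counter()  # (service, target) -> failed calls
        self.rejected = Counter()  # (service, target) -> rejected calls

    def set_state(self, service, target, state_name):
        self.state[(service, target)] = STATE_VALUES[state_name]

    def record_failure(self, service, target):
        self.failures[(service, target)] += 1

    def record_rejection(self, service, target):
        self.rejected[(service, target)] += 1


metrics = BreakerMetrics()
metrics.record_failure("checkout", "payment")     # a call to payment failed
metrics.set_state("checkout", "payment", "open")  # breaker tripped
metrics.record_rejection("checkout", "payment")   # next call failed fast
```

In the breaker, `set_state` would be called wherever `self.state` is assigned, and `record_rejection` just before `CircuitBreakerOpenError` is raised.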
The Configuration That Matters
circuit_breakers:
  payment-service:
    failure_threshold: 5
    recovery_timeout: 30s
    success_threshold: 3
    timeout_per_request: 5s
  search-service:
    failure_threshold: 10    # More tolerant
    recovery_timeout: 15s    # Recover faster
    success_threshold: 2
    timeout_per_request: 2s
  auth-service:
    failure_threshold: 3     # Less tolerant (critical)
    recovery_timeout: 10s    # Recover very fast
    success_threshold: 1
    timeout_per_request: 1s
Critical services get lower thresholds (less tolerance) and faster recovery.
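To apply a configuration like this, the duration strings have to become numbers before they reach the breaker. Here is a sketch, assuming the file has already been parsed into a dict (for example by a YAML library). `BreakerConfig` and `parse_seconds` are hypothetical names, and only the `s` suffix is handled.

```python
from dataclasses import dataclass


def parse_seconds(value):
    """Convert '30s' (or a bare number) to float seconds. Only 's' is supported."""
    text = str(value).strip()
    if text.endswith("s"):
        text = text[:-1]
    return float(text)  # raises ValueError on anything else, e.g. '5m'


@dataclass(frozen=True)
class BreakerConfig:
    failure_threshold: int
    recovery_timeout: float
    success_threshold: int
    timeout_per_request: float

    @classmethod
    def from_dict(cls, raw):
        return cls(
            failure_threshold=int(raw["failure_threshold"]),
            recovery_timeout=parse_seconds(raw["recovery_timeout"]),
            success_threshold=int(raw["success_threshold"]),
            timeout_per_request=parse_seconds(raw["timeout_per_request"]),
        )


# Mirrors the payment-service entry from the configuration
raw_config = {
    "payment-service": {
        "failure_threshold": 5,
        "recovery_timeout": "30s",
        "success_threshold": 3,
        "timeout_per_request": "5s",
    },
}
configs = {name: BreakerConfig.from_dict(raw) for name, raw in raw_config.items()}
```

Note that `timeout_per_request` is not enforced by the `CircuitBreaker` class shown earlier. It has to be passed to whatever makes the call, such as an HTTP client, so that a hung request surfaces as a failure the breaker can count.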
If you want AI-powered circuit breaker tuning and cascading failure prevention, check out what we're building at Nova AI Ops.
Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com