## The Problem: Cascade Failures
In a microservices architecture, services call each other. When Service B is slow or unavailable, Service A waits. Its threads pile up. Its connection pool exhausts. Now Service A is also unavailable. Anything depending on Service A fails. You have a cascade failure — a single sick service takes down the entire system.
The **Circuit Breaker** pattern (popularized by Michael Nygard in *Release It!*) solves this by wrapping remote calls in a state machine that automatically stops sending requests to a failing service, giving it time to recover.
The three states:
- **Closed** — normal operation. Requests flow through. Failures are counted.
- **Open** — the circuit has tripped. Requests immediately fail fast (no network call). A timer starts.
- **Half-Open** — timer expired. One probe request is sent. Success closes the circuit; failure re-opens it.
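The transitions in the list above can be sketched as a small lookup table. This is only an illustration: `State`, `TRANSITIONS`, `next_state`, and the event names are hypothetical labels chosen for this sketch, not part of any library.

```python
from enum import Enum, auto


class State(Enum):
    CLOSED = auto()
    OPEN = auto()
    HALF_OPEN = auto()


# (current state, event) -> next state. Event names are hypothetical labels.
TRANSITIONS = {
    (State.CLOSED, "failure_threshold_reached"): State.OPEN,
    (State.OPEN, "timer_expired"): State.HALF_OPEN,
    (State.HALF_OPEN, "probe_succeeded"): State.CLOSED,
    (State.HALF_OPEN, "probe_failed"): State.OPEN,
}


def next_state(current: State, event: str) -> State:
    """Return the next state, or stay in the current one if the event does not apply."""
    return TRANSITIONS.get((current, event), current)
```

Any event that does not appear in the table leaves the state unchanged, which mirrors the fact that a closed circuit ignores timer events and an open circuit never sees probe results.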
## Python Implementation from Scratch
Understanding the implementation makes you a better consumer of libraries like `pybreaker` or `tenacity`.
```python
import threading
import time
from enum import Enum, auto
from functools import wraps
from typing import Any, Callable, TypeVar

import httpx

T = TypeVar("T")


class CircuitState(Enum):
    CLOSED = auto()
    OPEN = auto()
    HALF_OPEN = auto()


class CircuitBreakerError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 5,
        recovery_timeout: float = 30.0,
        expected_exception: type[Exception] = Exception,
    ):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.expected_exception = expected_exception
        self._state = CircuitState.CLOSED
        self._failure_count = 0
        self._last_failure_time: float | None = None
        self._lock = threading.Lock()

    @property
    def state(self) -> CircuitState:
        with self._lock:
            # Once the recovery timeout has elapsed, move OPEN -> HALF_OPEN.
            if self._state == CircuitState.OPEN:
                if (
                    self._last_failure_time is not None
                    and time.monotonic() - self._last_failure_time >= self.recovery_timeout
                ):
                    self._state = CircuitState.HALF_OPEN
            return self._state

    def call(self, func: Callable[..., T], *args: Any, **kwargs: Any) -> T:
        # Fail fast while OPEN. Note: this simple version lets every caller
        # through while HALF_OPEN, not just a single probe request.
        if self.state == CircuitState.OPEN:
            raise CircuitBreakerError(
                f"Circuit is OPEN: calls to {func.__name__} are blocked"
            )
        try:
            result = func(*args, **kwargs)
        except self.expected_exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self) -> None:
        with self._lock:
            self._failure_count = 0
            self._state = CircuitState.CLOSED

    def _on_failure(self) -> None:
        with self._lock:
            self._failure_count += 1
            self._last_failure_time = time.monotonic()
            # The count is not reset when the circuit opens, so a failed
            # HALF_OPEN probe re-opens the circuit immediately.
            if self._failure_count >= self.failure_threshold:
                self._state = CircuitState.OPEN

    def __call__(self, func: Callable[..., T]) -> Callable[..., T]:
        """Use as a decorator."""
        @wraps(func)
        def wrapper(*args: Any, **kwargs: Any) -> T:
            return self.call(func, *args, **kwargs)
        return wrapper


# Usage as a decorator
payment_breaker = CircuitBreaker(failure_threshold=3, recovery_timeout=60.0)


@payment_breaker
def charge_card(amount: float, card_id: str) -> dict:
    # Calls a placeholder payment provider endpoint
    response = httpx.post(
        "https://api.payments.example.com/charge",
        json={"amount": amount, "card": card_id},
        timeout=5.0,
    )
    response.raise_for_status()
    return response.json()
```
## Using pybreaker (Production Library)
```shell
pip install pybreaker
```
```python
import logging

import httpx
import pybreaker

# Listeners let you hook breaker events into your own code, e.g. logging.
breaker = pybreaker.CircuitBreaker(
    fail_max=5,            # open after 5 consecutive failures
    reset_timeout=30,      # try half-open after 30s
    exclude=[ValueError],  # don't count these as failures
    listeners=[pybreaker.CircuitBreakerListener()],  # base listener: a no-op
)


class BreakerListener(pybreaker.CircuitBreakerListener):
    def state_change(self, cb, old_state, new_state):
        logging.warning(
            "Circuit '%s': %s -> %s", cb.name, old_state.name, new_state.name
        )


inventory_breaker = pybreaker.CircuitBreaker(
    fail_max=3,
    reset_timeout=15,
    name="inventory-service",
    listeners=[BreakerListener()],
)


@inventory_breaker
def get_inventory(product_id: str) -> dict:
    response = httpx.get(
        f"http://inventory-service/products/{product_id}", timeout=3.0
    )
    response.raise_for_status()
    return response.json()


# Handling the open state gracefully
def get_product_details(product_id: str) -> dict:
    try:
        inventory = get_inventory(product_id)
    except pybreaker.CircuitBreakerError:
        # Fallback: return cached or degraded response
        inventory = {"status": "unknown", "quantity": -1, "cached": True}
    return inventory
```
## Circuit Breaker + Retry: The Right Combination
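Retries and circuit breakers are complementary but must be combined carefully. A retry loop that ignores the breaker's state defeats the purpose: you're still hammering a sick service. Wrap the breaker-protected call in the retry, and stop retrying as soon as the circuit opens.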
```python
import logging

import httpx
import pybreaker
from tenacity import (
    retry,
    retry_if_not_exception_type,
    stop_after_attempt,
    wait_exponential,
)

service_breaker = pybreaker.CircuitBreaker(fail_max=5, reset_timeout=30)


def get_cached_response() -> dict:
    """Hypothetical placeholder: return the last known-good response."""
    return {"cached": True}


# Retry with exponential backoff, but NOT when the circuit is open
@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=1, max=10),
    retry=retry_if_not_exception_type(pybreaker.CircuitBreakerError),
    reraise=True,
)
@service_breaker
def call_external_service(payload: dict) -> dict:
    response = httpx.post("https://external.api/process", json=payload, timeout=5.0)
    response.raise_for_status()
    return response.json()


# Caller handles both retry exhaustion and open circuit
try:
    result = call_external_service({"data": "value"})
except pybreaker.CircuitBreakerError:
    # Circuit is open: serve from cache or return a degraded response
    result = get_cached_response()
except httpx.HTTPError as e:
    # All 3 attempts exhausted: log and handle. httpx.HTTPError covers both
    # error status codes and transport errors such as timeouts.
    logging.error("Service call failed after retries: %s", e)
```
## Node.js / TypeScript Implementation
```typescript
type CircuitState = "CLOSED" | "OPEN" | "HALF_OPEN";

interface CircuitBreakerOptions {
  failureThreshold?: number;
  recoveryTimeout?: number; // ms
}

class CircuitBreaker {
  private state: CircuitState = "CLOSED";
  private failureCount = 0;
  private lastFailureTime?: number;

  constructor(private opts: CircuitBreakerOptions = {}) {
    this.opts.failureThreshold ??= 5;
    this.opts.recoveryTimeout ??= 30_000;
  }

  // Moves OPEN -> HALF_OPEN once the recovery timeout has elapsed.
  private get currentState(): CircuitState {
    if (this.state === "OPEN") {
      const elapsed = Date.now() - (this.lastFailureTime ?? 0);
      if (elapsed >= this.opts.recoveryTimeout!) {
        this.state = "HALF_OPEN";
      }
    }
    return this.state;
  }

  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (this.currentState === "OPEN") {
      throw new Error("Circuit is OPEN: request blocked");
    }
    try {
      const result = await fn();
      this.onSuccess();
      return result;
    } catch (err) {
      this.onFailure();
      throw err;
    }
  }

  private onSuccess(): void {
    this.failureCount = 0;
    this.state = "CLOSED";
  }

  private onFailure(): void {
    this.failureCount++;
    this.lastFailureTime = Date.now();
    if (this.failureCount >= this.opts.failureThreshold!) {
      this.state = "OPEN";
    }
  }
}

// Usage
const emailBreaker = new CircuitBreaker({ failureThreshold: 3, recoveryTimeout: 15_000 });

async function sendEmail(to: string, subject: string): Promise<void> {
  await emailBreaker.call(async () => {
    const res = await fetch("https://api.emailprovider.com/send", {
      method: "POST",
      body: JSON.stringify({ to, subject }),
    });
    if (!res.ok) throw new Error(`Email API error: ${res.status}`);
  });
}
```
## Monitoring Circuit Breaker State
Expose circuit state as a metric — an open circuit is a signal that requires immediate attention:
```python
import pybreaker
from prometheus_client import Counter, Gauge

circuit_state_gauge = Gauge(
    "circuit_breaker_state",
    "Circuit breaker state (0=closed, 1=open, 2=half_open)",
    ["service_name"],
)
circuit_open_total = Counter(
    "circuit_breaker_opened_total",
    "Number of times circuit breaker opened",
    ["service_name"],
)


class ObservableCircuitBreaker(pybreaker.CircuitBreakerListener):
    """Listener that mirrors breaker state changes into Prometheus metrics."""

    def __init__(self, service_name: str):
        self.service_name = service_name

    def state_change(self, cb, old_state, new_state):
        state_map = {"closed": 0, "open": 1, "half-open": 2}
        circuit_state_gauge.labels(self.service_name).set(
            state_map.get(new_state.name.lower(), 0)
        )
        if new_state.name.lower() == "open":
            circuit_open_total.labels(self.service_name).inc()
```
## Common Pitfalls
### Shared circuit breaker for different failure modes
Create one circuit breaker per downstream dependency, not one global breaker. A slow payment service should not prevent calls to your user service.
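One way to enforce this is a small registry keyed by dependency name. The sketch below is illustrative only: `SimpleBreaker` is a deliberately minimal stand-in for a real breaker, and `BreakerRegistry` and the service names are hypothetical.

```python
import threading
from dataclasses import dataclass


@dataclass
class SimpleBreaker:
    """Minimal stand-in for a real circuit breaker (illustration only)."""
    name: str
    failure_threshold: int = 5
    failures: int = 0

    @property
    def is_open(self) -> bool:
        return self.failures >= self.failure_threshold

    def record_failure(self) -> None:
        self.failures += 1


class BreakerRegistry:
    """Hands out exactly one breaker per downstream dependency name."""

    def __init__(self) -> None:
        self._breakers: dict[str, SimpleBreaker] = {}
        self._lock = threading.Lock()

    def get(self, dependency: str, failure_threshold: int = 5) -> SimpleBreaker:
        with self._lock:
            if dependency not in self._breakers:
                self._breakers[dependency] = SimpleBreaker(dependency, failure_threshold)
            return self._breakers[dependency]


registry = BreakerRegistry()
payments = registry.get("payment-service", failure_threshold=3)
users = registry.get("user-service")

# Three payment failures trip only the payment breaker.
for _ in range(3):
    payments.record_failure()
```

After the loop, `payments.is_open` is true while `users.is_open` stays false, so calls to the user service keep flowing.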
### Counting timeouts but not 5xx errors
Configure your breaker to count both connection timeouts and HTTP 5xx responses as failures. A service returning 503s is as broken as one that times out.
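A minimal sketch of such a classification rule follows. The function name `counts_as_failure` is hypothetical, and treating 4xx responses as "not a breaker failure" is an assumption of this sketch (they usually indicate a bad request rather than a sick dependency); adjust it to your own services.

```python
from typing import Optional


def counts_as_failure(
    status_code: Optional[int] = None,
    error: Optional[BaseException] = None,
) -> bool:
    """Decide whether a call outcome should increment the breaker's failure count."""
    # Timeouts and connection-level errors: the dependency did not answer.
    if isinstance(error, (TimeoutError, ConnectionError)):
        return True
    # HTTP 5xx: the dependency answered, but it is failing on its side.
    if status_code is not None and 500 <= status_code <= 599:
        return True
    # Everything else (2xx, 3xx, 4xx) is treated as a healthy dependency here.
    return False
```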
### No fallback behavior
An open circuit should trigger a graceful degradation strategy: serve stale cached data, return a default response, or queue the request for later. Never just surface the `CircuitBreakerError` to the end user.
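The stale-cache and default-response strategies can be sketched as a small wrapper. Everything here is hypothetical and for illustration: `CircuitOpenError` stands in for whatever "circuit is open" exception your breaker library raises, and `with_fallback` and the in-memory `_stale_cache` are names invented for this sketch.

```python
from typing import Callable


class CircuitOpenError(Exception):
    """Stand-in for a library's 'circuit is open' exception."""


# Last known-good response per key (in-memory, illustration only).
_stale_cache: dict[str, dict] = {}


def with_fallback(key: str, fetch: Callable[[], dict], default: dict) -> dict:
    """Try the live call; on an open circuit serve stale data, else a default."""
    try:
        fresh = fetch()
    except CircuitOpenError:
        cached = _stale_cache.get(key)
        if cached is not None:
            return {**cached, "stale": True}
        return {**default, "stale": True}
    _stale_cache[key] = fresh  # remember the last good response
    return {**fresh, "stale": False}


def healthy() -> dict:
    return {"quantity": 7}


def blocked() -> dict:
    raise CircuitOpenError("inventory circuit is open")


first = with_fallback("sku-1", healthy, default={"quantity": 0})
second = with_fallback("sku-1", blocked, default={"quantity": 0})
```

The `stale` flag lets callers (or the UI) tell a degraded answer from a live one instead of silently presenting old data as current.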
### Recovery timeout too short
If your recovery timeout (30 seconds in the examples above) is shorter than the time it takes a service to restart and become healthy, the first half-open probe will fail and the circuit will immediately re-open. A reasonable starting point is 2–5x your service's typical startup time.
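As a quick worked example of that rule of thumb, the hypothetical helper below turns a measured startup time into a starting range for the recovery timeout. The 20-second startup time is an assumed figure for illustration.

```python
def recovery_timeout_range(startup_seconds: float) -> tuple[float, float]:
    """Return the (low, high) starting range: 2x to 5x the typical startup time."""
    if startup_seconds <= 0:
        raise ValueError("startup_seconds must be positive")
    return 2 * startup_seconds, 5 * startup_seconds


# Assumed example: a service that takes about 20 seconds to become healthy.
low, high = recovery_timeout_range(20.0)
```

For that service the range is 40 to 100 seconds, so a 30-second recovery timeout would probe before the service is likely to be ready.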
## Summary
- The circuit breaker's three states — **Closed, Open, Half-Open** — prevent cascade failures and give downstream services time to recover
- Use **pybreaker** in Python and build a thin class in Node.js/TypeScript
- Combine with **exponential backoff retries**, but skip retries when the circuit is open
- Always implement a **fallback** — stale cache, default value, or queuing — for when the circuit trips
- **Expose state as a metric** — an open circuit at 3am should page someone
Originally published at aiforeverthing.com