Covers: Circuit Breaker, Retry + Exponential Backoff + Jitter, Bulkhead, Timeout, Fallback, Redundancy
The Titanic's Bulkheads (And Why They Failed)
The RMS Titanic was designed with 16 watertight compartments — bulkheads. The idea: if the hull was breached, water would flood only the affected compartments, and the ship would stay afloat.
The fatal flaw: the bulkheads didn't extend high enough. Water flooding one compartment spilled over the top into the next, and the next, and the next. The isolation that was supposed to contain the damage didn't — because the walls were too short.
This is, almost too perfectly, the story of fault tolerance in distributed systems. The patterns exist. Teams implement them. But if implemented incompletely — bulkheads "too short" — a single failure cascades through the entire system anyway.
Today we cover the six patterns that, implemented correctly and together, are the difference between "one service had a bad day" and "the entire platform went down."
Why Failures Cascade: The Mechanism
Before the patterns, understand the failure mode they prevent. This is the cascading failure scenario from Day 2, now in full mechanical detail:
Step 1: Payment Service becomes slow (database under load, 5 seconds per call instead of 50ms)
Step 2: Order Service calls Payment Service, waits...
Order Service has a thread pool of 100 threads
Each call to Payment Service now holds a thread for 5 seconds (instead of 50ms)
Each request holds its thread 100x longer, so the same traffic needs 100x more threads
Step 3: Order Service's thread pool exhausts
All 100 threads are blocked waiting on Payment Service
New incoming requests to Order Service have no threads available
Order Service starts rejecting/timing out ALL requests —
even ones that don't need Payment Service!
Step 4: Services calling Order Service experience the same problem
Checkout Service calls Order Service → also times out
Checkout Service's thread pool exhausts
Step 5: Cascade continues upward through the entire call graph
The ENTIRE platform becomes unresponsive —
because ONE service (Payment) got slow.
The root cause: A slow dependency consumed a shared resource (threads) needed for unrelated operations. The fault tolerance patterns all attack this mechanism from different angles.
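The arithmetic behind Steps 1 to 3 can be sketched with Little's Law (requests in flight = arrival rate x time per request). This is a back-of-the-envelope model, not a capacity planner; the function names and the 500 requests/second figure are assumptions chosen for illustration.

```python
def max_throughput(threads: int, latency_s: float) -> float:
    """Requests/second a pool can sustain when each request holds a thread for latency_s."""
    return threads / latency_s


def threads_needed(requests_per_s: float, latency_s: float) -> float:
    """Little's Law: requests in flight = arrival rate x time each request takes."""
    return requests_per_s * latency_s


# Order Service's 100-thread pool, before and after Payment Service slows down
print(max_throughput(threads=100, latency_s=0.050))  # healthy: 50ms per call
print(max_throughput(threads=100, latency_s=5.0))    # degraded: 5s per call

# Threads required to keep serving an assumed 500 requests/second
print(threads_needed(requests_per_s=500, latency_s=0.050))
print(threads_needed(requests_per_s=500, latency_s=5.0))
```

The pool's sustainable throughput drops by the same factor of 100 that the latency rose by, which is why the pool in Step 3 exhausts even though traffic never changed.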
Pattern 1: Timeout — Never Wait Forever
The simplest, most fundamental pattern — and the one most commonly missing.
import requests
# WITHOUT timeout (requests, like many HTTP libraries, sets none by default)
response = requests.get("http://payment-service/charge")
# If payment-service hangs, this line waits FOREVER
# WITH timeout
response = requests.get("http://payment-service/charge", timeout=2.0)
# If the server sends nothing for 2 seconds, raises requests.exceptions.Timeout
Why this matters so much: Without a timeout, a hung dependency holds your thread indefinitely. With a timeout, the worst case is bounded — your thread is freed after 2 seconds, available for other work.
Choosing timeout values:
Too short: legitimate slow requests get cancelled unnecessarily
(false failures under normal load spikes)
Too long: threads tied up too long during real failures
(cascading failure mechanism still triggers, just slower)
Rule of thumb: set timeout based on p99 latency of the dependency
If p99 latency is 200ms → timeout at 500ms-1s
Gives headroom for normal variance, fails fast for genuine hangs
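One way to turn the rule of thumb into numbers is to compute p99 from recorded latencies and multiply. In this sketch the 2.5x and 5x factors are simply what the 200ms → 500ms-1s example implies, and `suggest_timeout` is a hypothetical helper, not a library function.

```python
import statistics


def suggest_timeout(latencies_ms, low_factor=2.5, high_factor=5.0):
    """Return (p99, low, high) in ms: a timeout range derived from the observed p99."""
    # quantiles(n=100) returns 99 cut points; index 98 is the 99th percentile
    p99 = statistics.quantiles(latencies_ms, n=100)[98]
    return p99, p99 * low_factor, p99 * high_factor


# Illustrative samples: mostly fast calls with a slow tail
samples = [40] * 900 + [120] * 90 + [200] * 10
p99, low, high = suggest_timeout(samples)
print(f"p99={p99:.0f}ms, timeout between {low:.0f}ms and {high:.0f}ms")
```

Treat the result as a starting point and revisit it when the dependency's latency profile changes.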
Critical detail: Timeouts must be set at every layer — HTTP client, database driver, connection pool acquisition. A common bug: setting an HTTP timeout but the underlying TCP connection pool has no timeout, so threads still hang waiting for a connection from the pool.
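The pool-acquisition case is the one that tends to get missed, so here is a minimal sketch of it. `ConnectionPool` is a toy class written for this example (real drivers and HTTP clients expose their own equivalent setting); the point is only that waiting for a free connection gets a bound too. On the HTTP side, `requests` accepts a `(connect, read)` tuple such as `timeout=(0.5, 2.0)` to bound the two phases separately.

```python
import queue


class ConnectionPool:
    """Toy fixed-size pool whose acquire() never blocks longer than acquire_timeout_s."""

    def __init__(self, connections, acquire_timeout_s):
        self._free = queue.Queue()
        for conn in connections:
            self._free.put(conn)
        self._acquire_timeout_s = acquire_timeout_s

    def acquire(self):
        try:
            # Bounded wait: give up instead of parking the thread indefinitely
            return self._free.get(timeout=self._acquire_timeout_s)
        except queue.Empty:
            raise TimeoutError("no connection available within acquire timeout") from None

    def release(self, conn):
        self._free.put(conn)


pool = ConnectionPool(["conn-1"], acquire_timeout_s=0.05)
held = pool.acquire()       # takes the only connection
try:
    pool.acquire()          # pool is empty: fails after 50ms instead of hanging
except TimeoutError as exc:
    print("acquire failed fast:", exc)
pool.release(held)
```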
Pattern 2: Retry with Exponential Backoff + Jitter
Transient failures (brief network blip, momentary overload) often succeed on retry. But naive retries can make things worse.
The Naive (Dangerous) Retry
import time

import requests
from requests.exceptions import RequestException

def call_with_retry(url):
    for attempt in range(5):
        try:
            return requests.get(url, timeout=1)
        except RequestException:
            time.sleep(1)  # wait 1 second, retry
    raise Exception("All retries failed")
The problem: If Payment Service is overloaded and returning errors, and 1000 clients are all retrying every 1 second... you've just created a synchronized retry storm. Every client retries at the same intervals, hammering the already-struggling service in waves, preventing it from ever recovering.
Exponential Backoff
Increase the wait time between retries exponentially:
def call_with_exponential_backoff(url, max_retries=5):
    for attempt in range(max_retries):
        try:
            return requests.get(url, timeout=1)
        except RequestException:
            wait_time = 2 ** attempt  # 1s, 2s, 4s, 8s, 16s
            time.sleep(wait_time)
    raise Exception("All retries failed")
This gives the failing service progressively more breathing room. But there's still a problem.
Why Jitter Is Critical
Imagine 1000 clients all fail at the same moment (because they all called at the same moment). With pure exponential backoff:
All 1000 clients wait exactly: 1s, 2s, 4s, 8s, 16s...
→ Still synchronized! All 1000 hit the service AGAIN after the 1s wait,
then AGAIN after the 2s wait, etc.
→ The "thundering herd" pattern from caching (Day 3), but for retries
Jitter adds randomness to break synchronization:
import random
def call_with_backoff_and_jitter(url, max_retries=5):
    for attempt in range(max_retries):
        try:
            return requests.get(url, timeout=1)
        except RequestException:
            base_wait = 2 ** attempt
            # Sleep a random time between 0 and the backoff ceiling
            jitter = random.uniform(0, base_wait)
            time.sleep(jitter)
    raise Exception("All retries failed")
# Example waits (ceilings are 1s, 2s, 4s, ...):
# Client A waits: 0.3s, 1.8s, 3.1s, ...
# Client B waits: 0.9s, 0.2s, 2.7s, ...
# Client C waits: 0.1s, 1.4s, 3.3s, ...
# → Retries spread out over time, not synchronized
AWS's recommended "full jitter" formula:
wait_time = random.uniform(0, min(cap, base * (2 ** attempt)))
# cap = maximum wait time regardless of attempt number (e.g., 60s)
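To see what jitter buys, here is a small simulation sketch: 1000 clients all fail at t=0, and we count how many retries land in the busiest 100ms slice of time. The helper names, the 100ms bucket size, and the fixed seed are assumptions made for the example; no real requests are sent.

```python
import random
from collections import Counter


def backoff_wait(attempt, base=1.0, cap=60.0, rng=None):
    """Wait before retry number `attempt`; full jitter when an rng is supplied."""
    ceiling = min(cap, base * (2 ** attempt))
    if rng is None:
        return ceiling              # plain exponential backoff
    return rng.uniform(0, ceiling)  # full jitter: anywhere in [0, ceiling]


def peak_retries_per_bucket(num_clients, attempts, jitter, bucket_s=0.1, seed=0):
    """Simulate clients that all fail at t=0; return the retry count of the busiest bucket."""
    rng = random.Random(seed) if jitter else None
    buckets = Counter()
    for _ in range(num_clients):
        t = 0.0
        for attempt in range(attempts):
            t += backoff_wait(attempt, rng=rng)
            buckets[int(t / bucket_s)] += 1
    return max(buckets.values())


print(peak_retries_per_bucket(1000, 4, jitter=False))  # 1000: every client lands in the same bucket
print(peak_retries_per_bucket(1000, 4, jitter=True))   # far lower: retries are spread out
```

Without jitter the service sees the full herd in a single instant at every retry; with jitter the same retries arrive as a trickle it has a chance of absorbing.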
The interview answer: "Exponential backoff prevents hammering a recovering service with the same frequency. Jitter prevents synchronized retry storms across many clients. You need both — backoff alone still produces thundering herds at scale."
What NOT to retry: most 4xx errors (client errors: retrying a malformed request won't fix it; 408 and 429 are the usual exceptions), and non-idempotent operations without an idempotency key (retrying a payment charge could double-charge; tie back to Day 5's Saga pattern).
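That rule can be written down as a small decision function. This is one possible policy, not a standard: `should_retry` is a hypothetical helper, and treating 408 and 429 as retryable 4xx codes and every 5xx as retryable are assumptions you may want to tighten for your own services.

```python
# HTTP methods defined as idempotent: repeating them has the same effect as sending once
IDEMPOTENT_METHODS = {"GET", "HEAD", "PUT", "DELETE", "OPTIONS", "TRACE"}
# Assumed exceptions to "never retry 4xx": request timeout and too many requests
RETRYABLE_4XX = {408, 429}


def should_retry(method, status_code, has_idempotency_key=False):
    """Decide whether a failed HTTP call is safe and worthwhile to retry."""
    safe_to_repeat = method.upper() in IDEMPOTENT_METHODS or has_idempotency_key
    if not safe_to_repeat:
        return False  # e.g. a POST charge without a key could double-charge
    if 500 <= status_code <= 599:
        return True   # server-side failure: may be transient
    return status_code in RETRYABLE_4XX


print(should_retry("GET", 503))                             # True
print(should_retry("GET", 400))                             # False
print(should_retry("POST", 503))                            # False
print(should_retry("POST", 503, has_idempotency_key=True))  # True
```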
Pattern 3: Circuit Breaker — Stop Calling What's Broken
If a dependency is consistently failing, continuing to call it — even with retries — wastes resources and prolongs the cascade. The Circuit Breaker pattern (named after electrical circuit breakers) stops calls entirely when a dependency is unhealthy.
The Three States
                ┌──────────────────┐
    ┌──────────►│      CLOSED      │  (normal operation)
    │           │  Requests pass   │
    │           │ through normally │
    │           └────────┬─────────┘
    │                    │
    │     Failure rate exceeds threshold
    │       (e.g., 50% failures in 10s)
    │                    │
    │                    ▼
    │           ┌──────────────────┐
    │           │       OPEN       │  (failing fast)
    │           │  Requests fail   │
    │           │   IMMEDIATELY,   │
 Success        │   no call made   │
threshold       └────────┬─────────┘
 reached                 │
    │          After timeout period
    │           (e.g., 30 seconds)
    │                    │
    │                    ▼
    │           ┌──────────────────┐
    └───────────┤    HALF-OPEN     │  (testing recovery)
                │  Allow LIMITED   │
                │ requests through │
                │ to test if fixed │
                └────────┬─────────┘
                         │
               If test requests fail
                  → back to OPEN
CLOSED (normal): Requests flow through normally. The breaker monitors the failure rate.
OPEN (failing fast): Once the failure rate crosses a threshold, the breaker "trips." All subsequent requests fail immediately — without even attempting the network call. This is the key insight: failing fast and locally is far better than waiting for a timeout on every request to a known-broken service.
HALF-OPEN (testing recovery): After a cooldown period, the breaker allows a small number of test requests through. If they succeed, the breaker closes (back to normal). If they fail, it reopens (back to failing fast) and waits again.
Implementation Sketch
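import time

class CircuitOpenException(Exception):
    """Raised when a call is rejected because the circuit is OPEN."""

class CircuitBreaker:
    # Not thread-safe: guard state changes with a lock if shared across threads
    def __init__(self, failure_threshold=5, recovery_timeout=30):
        self.failure_threshold = failure_threshold  # consecutive failures before tripping
        self.recovery_timeout = recovery_timeout    # seconds to stay OPEN before probing
        self.failure_count = 0
        self.state = "CLOSED"
        self.last_failure_time = None

    def call(self, func, *args, **kwargs):
        if self.state == "OPEN":
            if time.monotonic() - self.last_failure_time > self.recovery_timeout:
                self.state = "HALF_OPEN"  # cooldown elapsed: let a test call through
            else:
                raise CircuitOpenException("Circuit is OPEN: failing fast")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failure_count += 1
            self.last_failure_time = time.monotonic()
            # A failed test call in HALF_OPEN reopens the circuit immediately
            if self.state == "HALF_OPEN" or self.failure_count >= self.failure_threshold:
                self.state = "OPEN"
            raise
        # Success: clear the failure streak; a HALF_OPEN success confirms recovery
        self.failure_count = 0
        self.state = "CLOSED"
        return result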
Real implementation: Netflix's Hystrix (now in maintenance mode) pioneered this for microservices. Resilience4j is the modern Java successor. Most languages have equivalents (e.g., pybreaker for Python, gobreaker for Go).
Why this matters at scale: If Payment Service is down, and Order Service makes 1000 requests/second to it, without a circuit breaker that's 1000 timeouts/second — each holding a thread for the timeout duration. With a circuit breaker in OPEN state, those 1000 requests fail in microseconds instead — freeing threads immediately, and giving Payment Service room to recover without being hammered.
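Note that the state diagram trips on a failure rate ("50% failures in 10s"), while the implementation sketch counts failures. A rate-based trip condition could look roughly like the following. `FailureRateWindow` and its `min_calls` guard are assumptions for this example (the guard keeps one failed call out of two from tripping the breaker), and the injectable `clock` exists only to make the class easy to test.

```python
import time
from collections import deque


class FailureRateWindow:
    """Tracks call outcomes over a sliding time window and reports when to trip."""

    def __init__(self, window_s=10.0, threshold=0.5, min_calls=10, clock=time.monotonic):
        self.window_s = window_s
        self.threshold = threshold
        self.min_calls = min_calls
        self._clock = clock
        self._events = deque()  # (timestamp, succeeded) pairs, oldest first

    def record(self, succeeded):
        now = self._clock()
        self._events.append((now, succeeded))
        self._evict(now)

    def _evict(self, now):
        # Drop outcomes that have aged out of the window
        while self._events and now - self._events[0][0] > self.window_s:
            self._events.popleft()

    def should_trip(self):
        self._evict(self._clock())
        total = len(self._events)
        if total < self.min_calls:
            return False  # too little traffic to judge
        failures = sum(1 for _, ok in self._events if not ok)
        return failures / total >= self.threshold


window = FailureRateWindow(window_s=10, threshold=0.5, min_calls=4)
for outcome in (True, False, False, True):
    window.record(outcome)
print(window.should_trip())  # True: 2 failures out of 4 calls reaches the 50% threshold
```

A breaker would call `record()` after every attempt and move to OPEN when `should_trip()` returns True.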
Pattern 4: Bulkhead — Isolate Failure Domains
Named after the watertight compartments of a ship's hull, like the Titanic's. The idea: partition resources (thread pools, connection pools) per dependency, so one dependency's failure can't exhaust resources needed for others.
Without Bulkheads (Shared Thread Pool)
Order Service has ONE thread pool of 100 threads, shared by all calls:
- Calls to Payment Service
- Calls to Inventory Service
- Calls to Shipping Service
Payment Service hangs → 80 of 100 threads get stuck waiting on Payment
→ Only 20 threads remain for Inventory and Shipping calls
→ Inventory and Shipping requests queue up, time out
→ Even though Inventory and Shipping are perfectly healthy!
With Bulkheads (Isolated Thread Pools)
Order Service has SEPARATE thread pools per dependency:
- Payment Service pool: 20 threads
- Inventory Service pool: 20 threads
- Shipping Service pool: 20 threads
- (60 threads total, but partitioned)
Payment Service hangs → all 20 Payment-pool threads get stuck
→ Inventory pool (20 threads) and Shipping pool (20 threads)
are COMPLETELY UNAFFECTED
→ Inventory and Shipping requests continue normally
The trade-off: You're "wasting" capacity — if Payment's pool is exhausted but Inventory's pool is idle, you can't dynamically borrow threads. But this rigidity is exactly the point: it guarantees failure containment at the cost of some efficiency.
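In Python, a lightweight way to get the same isolation is a semaphore per dependency. Note that this sketch swaps the separate thread pools described above for concurrency limits: it caps how many calls to each dependency may be in flight rather than giving each its own threads. `Bulkhead` and `BulkheadFullError` are names invented for this example.

```python
import threading
from contextlib import contextmanager


class BulkheadFullError(Exception):
    """Raised when a dependency's concurrency limit is already used up."""


class Bulkhead:
    def __init__(self, name, max_concurrent_calls, max_wait_s=0.01):
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent_calls)
        self._max_wait_s = max_wait_s

    @contextmanager
    def slot(self):
        # Wait briefly for a free slot, then reject instead of queueing forever
        if not self._slots.acquire(timeout=self._max_wait_s):
            raise BulkheadFullError(f"{self.name} bulkhead is full")
        try:
            yield
        finally:
            self._slots.release()


payment = Bulkhead("payment", max_concurrent_calls=2)
inventory = Bulkhead("inventory", max_concurrent_calls=2)

with payment.slot(), payment.slot():      # both payment slots are stuck
    try:
        with payment.slot():
            pass
    except BulkheadFullError as exc:
        print("rejected:", exc)           # a third payment call is turned away quickly
    with inventory.slot():
        print("inventory call proceeds")  # unaffected by the payment backlog
```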
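Resilience4j bulkhead configuration (Spring Boot YAML; this is the semaphore-based bulkhead, which caps concurrent calls per dependency):
resilience4j.bulkhead:
  instances:
    paymentService:
      maxConcurrentCalls: 20
      maxWaitDuration: 10ms
    inventoryService:
      maxConcurrentCalls: 20
      maxWaitDuration: 10ms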
Bulkhead vs Circuit Breaker — the distinction:
- Bulkhead prevents resource exhaustion from spreading (isolation)
- Circuit Breaker prevents wasted calls to a known-broken dependency (fail-fast)
They're complementary — bulkheads contain the blast radius, circuit breakers reduce wasted effort. Production systems use both together.
Pattern 5: Fallback — Degrade Gracefully
When a dependency is unavailable (circuit open, timeout, or error), what do you return to the user instead of an error?
def get_product_recommendations(user_id):
    try:
        return recommendation_service.get_personalized(user_id)
    except (CircuitOpenException, TimeoutError):
        # Fallback: return generic "trending" recommendations
        # instead of personalized ones
        return cache.get("trending_products")  # cached, always available
Fallback strategies, from best to worst degradation:
1. Cached/stale data
"Here's your feed from 5 minutes ago" — better than nothing
2. Default/generic response
"Here are trending products" instead of personalized recommendations
3. Reduced functionality
"Search is temporarily unavailable, browse by category instead"
4. Queue for later
"Your request is being processed" — async retry when service recovers
5. Honest error (last resort)
"This feature is temporarily unavailable" — but the REST of the
page still works
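The ordering above can be expressed as a chain that tries each strategy in turn and stops at the first one that works. This is a sketch under simple assumptions: `first_available` is a hypothetical helper, every strategy is a zero-argument callable, and any exception means "move on to the next option".

```python
def first_available(strategies):
    """Try (name, callable) pairs in order; return (name, result) from the first success."""
    errors = []
    for name, strategy in strategies:
        try:
            return name, strategy()
        except Exception as exc:  # any failure means: degrade one more step
            errors.append((name, exc))
    raise RuntimeError(f"all fallbacks failed: {errors}")


def personalized():
    raise TimeoutError("recommendation service timed out")


stale_cache = {}  # empty on purpose: simulates a cache miss


def cached():
    return stale_cache["user-42"]  # raises KeyError on a miss


def trending():
    return ["kettle", "notebook", "headphones"]


source, items = first_available(
    [("live", personalized), ("cached", cached), ("trending", trending)]
)
print(source, items)  # trending ['kettle', 'notebook', 'headphones']
```

Returning the strategy name alongside the result makes it easy to log or count how often each level of degradation is actually served.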
The principle: partial degradation beats total failure. If your product page shows the product, price, and "Add to Cart" — but the "Customers also bought" section silently shows nothing (or cached trending items) because Recommendation Service is down — most users won't even notice. Compare that to the entire page returning a 500 error because one non-critical service failed.
Real example: Amazon's product pages are composed of dozens of independently-loaded widgets (price, reviews, recommendations, "frequently bought together"). Each widget fails independently with its own fallback. A Recommendation Service outage degrades one widget — the rest of the page works perfectly.
Pattern 6: Redundancy — Active-Active vs Active-Passive (Revisited)
From Day 1, but worth reinforcing in the fault tolerance context: redundancy is the foundation that makes the other patterns effective.
If there's only ONE instance of Payment Service:
Circuit breaker trips → ALL payment requests fail
(there's nothing else to fall back to)
If there are MULTIPLE instances across availability zones:
Circuit breaker trips for the unhealthy instance
Load balancer routes to healthy instances in other AZs
Payment processing continues — degraded capacity, not total failure
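In the scenario above the load balancer does the rerouting, but the same logic can be sketched on the client side: skip instances whose breaker is open and move to the next one on a connection failure. `call_with_failover`, the instance names, and the `is_healthy` callback are all assumptions made for this example.

```python
def call_with_failover(instances, send, is_healthy):
    """Try each healthy instance in order; return the first successful response."""
    last_error = None
    for instance in instances:
        if not is_healthy(instance):
            continue  # breaker open for this instance: skip it without calling
        try:
            return send(instance)
        except ConnectionError as exc:
            last_error = exc  # remember the failure and try the next instance
    raise RuntimeError("no healthy instance could serve the request") from last_error


def send(instance):
    # Stand-in for a real network call: one availability zone is unreachable
    if instance == "payment-az1":
        raise ConnectionError("az1 unreachable")
    return f"charged via {instance}"


breaker_open = {"payment-az2"}  # instances currently marked unhealthy

result = call_with_failover(
    ["payment-az1", "payment-az2", "payment-az3"],
    send,
    is_healthy=lambda instance: instance not in breaker_open,
)
print(result)  # charged via payment-az3
```

With a single instance in the list, the function can only raise, which is the "nothing else to fall back to" case.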
Active-Active redundancy + Circuit Breakers + Bulkheads + Timeouts + Fallbacks together form a complete fault tolerance strategy. Remove any one, and the others are significantly weakened:
- Without timeouts → circuit breakers can't detect failures fast enough
- Without circuit breakers → retries continue hammering a dead service
- Without bulkheads → one dependency's failure exhausts shared resources
- Without fallbacks → circuit breaker "fails fast" just means failing faster, still an error to the user
- Without redundancy → there's nothing to fail over to
Interview Scenario: "Design a Fault-Tolerant Payment Service Caller"
The complete answer, layering all patterns:
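import random
import time
from concurrent.futures import ThreadPoolExecutor

import requests

# CircuitBreaker and CircuitOpenException come from the Pattern 3 sketch;
# `queue` stands for the application's job queue, defined elsewhere.

class PaymentServiceClient:
    def __init__(self):
        # Bulkhead: dedicated thread pool, isolated from other dependencies
        self.executor = ThreadPoolExecutor(max_workers=20)
        # Circuit breaker: stop calling if Payment Service is unhealthy
        self.circuit_breaker = CircuitBreaker(
            failure_threshold=5,
            recovery_timeout=30
        )

    def charge(self, user_id, amount, idempotency_key):
        try:
            # The breaker check runs in the caller's thread, so an OPEN
            # circuit fails fast without touching the payment pool
            return self.circuit_breaker.call(
                self._charge_in_pool,
                user_id, amount, idempotency_key
            )
        except CircuitOpenException:
            # Fallback: queue for async retry, return "pending" to user
            queue.enqueue("retry_payment", {
                "user_id": user_id,
                "amount": amount,
                "idempotency_key": idempotency_key
            })
            return {"status": "pending", "message": "Processing your payment"}

    def _charge_in_pool(self, user_id, amount, idempotency_key):
        # Bulkhead: at most 20 payment calls run at once, on their own threads.
        # A production version would also bound the queue and the wait.
        future = self.executor.submit(
            self._charge_with_retry, user_id, amount, idempotency_key
        )
        return future.result()  # re-raises any exception from the worker

    def _charge_with_retry(self, user_id, amount, idempotency_key):
        for attempt in range(3):
            try:
                # Timeout: never wait forever
                return requests.post(
                    "http://payment-service/charge",
                    json={"user_id": user_id, "amount": amount,
                          "idempotency_key": idempotency_key},  # idempotent!
                    timeout=2.0
                )
            except (requests.Timeout, requests.ConnectionError):
                if attempt == 2:
                    raise  # out of retries: the circuit breaker counts this failure
                # Exponential backoff + jitter
                wait = random.uniform(0, min(10, 2 ** attempt))
                time.sleep(wait)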
This single code sample demonstrates: timeout, retry with backoff+jitter, idempotency (from Day 5), circuit breaker, bulkhead (separate executor), and fallback (queue for later). This is what "Top 1%" looks like in an interview — not naming the patterns, but composing them correctly together.
Key Takeaways
- Cascading failures happen because a slow dependency consumes shared resources (threads) needed for unrelated work.
- Timeout: never wait forever. Set based on p99 latency of the dependency, with headroom.
- Retry with exponential backoff + jitter: backoff gives the dependency breathing room; jitter prevents synchronized retry storms across clients. Never retry non-idempotent operations without idempotency keys.
- Circuit breaker: CLOSED → OPEN → HALF-OPEN. Fail fast locally instead of waiting for timeouts on a known-broken dependency.
- Bulkhead: isolate thread/connection pools per dependency so one failure can't exhaust resources needed by others.
- Fallback: degrade gracefully — cached data, generic defaults, reduced functionality — partial degradation beats total failure.
- Redundancy (Active-Active) is the foundation — without something to fail over to, the other patterns just fail "faster," not "better."
- All patterns work together. Removing any one significantly weakens the others.
You've now covered the entire microservices infrastructure layer: when and how to split a monolith (Topic 16), how services find each other (Topic 17), and how to keep one failing service from taking down everything else (Topic 18). This is the operational backbone of every production microservices system.
Next we cover Security and Observability: OAuth2, JWT, the three pillars of observability (metrics, logs, traces), and rate limiting algorithms. These are the systems that protect your platform and tell you when something's wrong before your users do.
Tags: system-design fault-tolerance microservices resilience backend distributed-systems interview-prep