Vincent Tommi

Posted on Sep 16

Understanding the Circuit Breaker Pattern in Distributed Systems (Day 52 of System Design)

#webdev #programming #productivity #python

In a distributed system, you never know how or when things might go wrong. Network glitches, component failures, or even a rogue router can wreak havoc. As a software engineer, it’s your job to keep these systems resilient and alive. Enter the Circuit Breaker Pattern—a design pattern that helps prevent cascading failures and keeps your services running smoothly.

In this article, we’ll dive into what the Circuit Breaker Pattern is, why it’s critical for microservices, and how it works with a practical use case. Let’s get started!

What is a Circuit Breaker?

If your house runs on electricity, you’re probably familiar with a circuit breaker. It’s an electrical switch that automatically cuts off power to protect your circuits from damage due to overloads (like too many appliances drawing power on one circuit) or short circuits. Its job? Stop the current flow when something goes wrong to protect your wiring and appliances.

The Circuit Breaker Pattern in software engineering works in a similar way. It’s designed to halt request-and-response processes when a service fails, preventing your system from spiraling into chaos. Let’s explore how.

What is the Circuit Breaker Pattern?

The Circuit Breaker Pattern stops a service call when it detects that the service is failing, much like its electrical namesake. Here’s how it works in a nutshell:

A consumer sends requests to multiple services, but one service is down due to technical issues.
Without a circuit breaker, the consumer keeps sending requests to the failed service, wasting resources and degrading performance.
The Circuit Breaker Pattern introduces a proxy that acts as a barrier between the consumer and the service.
When failures exceed a threshold, the circuit breaker trips, blocking further requests for a set time.
During this timeout, requests to the failed service are rejected immediately.
After the timeout, the circuit breaker allows a few test requests. If they succeed, it resumes normal operation; if they fail, the timeout restarts.

This pattern prevents resource exhaustion and ensures a better user experience by failing fast.
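The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: the class name `CircuitBreaker`, the `CircuitOpenError` exception, and the default values are assumptions made for the example. It counts consecutive failures, lets a single successful trial call close the circuit again, and is not thread-safe.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without contacting the service."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before tripping
        self.reset_timeout = reset_timeout          # seconds to block requests once tripped
        self.clock = clock                          # injectable so tests can control time
        self.failures = 0
        self.opened_at = None                       # None means the breaker is not tripped

    @property
    def state(self):
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.reset_timeout:
            return "half-open"                      # timeout elapsed, trial calls allowed
        return "open"

    def call(self, func, *args, **kwargs):
        state = self.state
        if state == "open":
            # Fail fast: the protected service is never contacted.
            raise CircuitOpenError("service unavailable, failing fast")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures, (re)starts the timeout.
            if state == "half-open" or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Success: resume normal operation.
        self.failures = 0
        self.opened_at = None
        return result
```

A consumer would wrap each remote call, for example `breaker.call(fetch_leave, employee_id)` with a hypothetical `fetch_leave` function, and treat `CircuitOpenError` as a signal to return an error or a fallback immediately.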

The Main Use Case: Employee Management System

To illustrate the Circuit Breaker Pattern, let’s use a microservices-based employee management system for a fictional company, Mercantile Finance. This system includes four services:

Service 1: Fetches personal information.
Service 2: Retrieves leave information.
Service 3: Provides employee performance data.
Service 4: Handles allocation information.

These services are called using an aggregator pattern, where a proxy coordinates requests to multiple backend services. If one service fails, the entire system could suffer—unless we use a circuit breaker.
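To make the aggregator idea concrete, here is a small sketch of how such a proxy might fan out to the backend services. The function names and the returned data are hypothetical stand-ins, and returning `None` for a failed service is one possible design choice rather than something the pattern requires.

```python
def aggregate_employee(employee_id, services):
    """Call every backend service and collect the results.

    A service that raises contributes None to the profile and an entry in
    the errors dict, so one failure does not abort the whole response.
    """
    profile, errors = {}, {}
    for name, fetch in services.items():
        try:
            profile[name] = fetch(employee_id)
        except Exception as exc:
            profile[name] = None
            errors[name] = str(exc)
    return profile, errors


# Hypothetical stand-ins for two of the four backend services.
def fetch_personal(employee_id):
    return {"id": employee_id, "name": "Example Employee"}


def fetch_leave(employee_id):
    raise TimeoutError("leave service did not respond")


services = {"personal": fetch_personal, "leave": fetch_leave}
profile, errors = aggregate_employee(42, services)
print(profile)
print(errors)
```

Note that this sketch still waits for every call to finish or fail, so a slow service would stall the whole aggregated response.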

Why Availability Matters in Microservices ⏰

Availability is critical in microservices because downtime can add up quickly. Let’s say Mercantile Finance promises 99.999% uptime (a.k.a. "five nines"). Here’s how that translates:

Calculation:
24 hours/day × 365 days/year = 8,760 hours/year.
8,760 hours × 60 = 525,600 minutes/year.
99.999% uptime allows 0.001% downtime.
525,600 × 0.001% = 5.256 minutes of downtime per year.

For a monolithic system, about 5.26 minutes of downtime per year is manageable. But in a microservices architecture with, say, 100 services, that can add up to 525.6 minutes, or about 8.76 hours, of downtime per year in the worst case, where each service fails independently and the outages never overlap. 😱 This is why protecting services with patterns like the Circuit Breaker is essential.
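The arithmetic is easy to check in Python. The sketch below assumes a 365-day year and, for the multi-service case, that a request needs all 100 services to be up at once. The function name `downtime_minutes` is made up for this example.

```python
MINUTES_PER_YEAR = 24 * 365 * 60  # 525,600 minutes


def downtime_minutes(availability, minutes=MINUTES_PER_YEAR):
    """Minutes per year a system with the given availability may be down."""
    return minutes * (1 - availability)


per_service = downtime_minutes(0.99999)

# Worst case: 100 services whose outages never overlap.
worst_case_hours = per_service * 100 / 60

# Alternative view: all 100 independent services must be up at the same time,
# so their availabilities multiply.
combined_availability = 0.99999 ** 100
combined_hours = downtime_minutes(combined_availability) / 60

print(round(per_service, 3))       # about 5.256 minutes
print(round(worst_case_hours, 2))  # about 8.76 hours
print(round(combined_hours, 2))    # slightly below the worst case
```

Either way, five nines per service ends up close to three nines for the system as a whole.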

What Causes Services to Break?

Let’s explore two common failure scenarios in microservices and how they can cripple your system, using diagrams for clarity.

Use Case 1: Thread Starvation

Imagine a web server handling requests for five services. When a request arrives, the server allocates a thread to call the service. If one service is slow or fails, threads wait, tying up resources. For a high-demand service, more threads are allocated, leading to a queue of blocked requests.

diagram showing threads waiting for a slow service, causing a queue buildup.

If most threads are occupied by the failing service, incoming requests queue up, overwhelming the system. Even if the service recovers, the queued requests flood it, potentially causing another failure.
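A rough way to see how quickly a slow dependency eats a thread pool is Little's law: the average number of busy threads equals the arrival rate multiplied by the time each request holds a thread. The numbers below (a pool of 200 threads, 50 requests per second, and a service that degrades from 0.1 to 10 seconds per call) are illustrative assumptions, not measurements.

```python
POOL_SIZE = 200  # assumed size of the web server's thread pool


def busy_threads(requests_per_second, seconds_per_request):
    """Little's law: average threads in use = arrival rate x time per request."""
    return requests_per_second * seconds_per_request


healthy = busy_threads(50, 0.1)  # the service answers in 100 ms
degraded = busy_threads(50, 10)  # the service hangs for 10 s per call

for label, needed in (("healthy", healthy), ("degraded", degraded)):
    status = "fits in the pool" if needed <= POOL_SIZE else "pool exhausted, requests queue"
    print(f"{label}: about {needed:.0f} threads needed, {status}")
```

With these assumed numbers the healthy service needs only a handful of threads, while the degraded one needs more than the pool can supply, so every other service sharing that pool starts queuing too.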

Use Case 2: Cascading Failures

Consider a chain of services: A → B → C → D. If Service D fails to respond, the failure propagates up the chain, causing a cascading failure.

diagram showing Service D’s failure causing Services C, B, and A to wait, leading to a cascading failure.

These scenarios highlight why we need a mechanism to detect and isolate failures quickly.

How the Circuit Breaker Pattern Saves the Day

The Circuit Breaker Pattern wraps service calls in a circuit breaker object that monitors for failures. It has three states:

Closed: Normal operation; requests pass through to the service.
Open: Too many failures detected; requests are blocked and return errors immediately.
Half-Open: After a timeout, a few test requests are allowed. If they succeed, the circuit returns to Closed; if they fail, it goes back to Open and the timeout restarts.

diagram showing the Circuit Breaker’s state transitions (Closed, Open, Half-Open).
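The same state transitions can be written down as a small lookup table. The event names here are invented for the example; real libraries tend to fold these events into counters and timers rather than exposing them directly.

```python
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half-open"


class Event(Enum):
    THRESHOLD_EXCEEDED = "threshold exceeded"  # too many failures while Closed
    TIMEOUT_ELAPSED = "timeout elapsed"        # the Open period is over
    TRIAL_SUCCEEDED = "trial succeeded"        # test requests worked
    TRIAL_FAILED = "trial failed"              # test requests failed


TRANSITIONS = {
    (State.CLOSED, Event.THRESHOLD_EXCEEDED): State.OPEN,
    (State.OPEN, Event.TIMEOUT_ELAPSED): State.HALF_OPEN,
    (State.HALF_OPEN, Event.TRIAL_SUCCEEDED): State.CLOSED,
    (State.HALF_OPEN, Event.TRIAL_FAILED): State.OPEN,
}


def next_state(state, event):
    """Return the new state; events that do not apply leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)


state = State.CLOSED
for event in (Event.THRESHOLD_EXCEEDED, Event.TIMEOUT_ELAPSED, Event.TRIAL_SUCCEEDED):
    state = next_state(state, event)
    print(event.value, "->", state.value)
```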

In our employee management system:

Suppose Service A (the personal information service, listed above as Service 1) should respond within 200ms.
- 0–100ms: Normal operation.
- 100–200ms: Risky, but acceptable.
- >200ms: Failure; the circuit breaker trips.
If 75% of requests exceed 150ms, the circuit breaker detects a slow service.
If requests exceed 200ms, the proxy marks Service A as unresponsive and trips the circuit to Open.
Requests to Service A fail immediately with an error, preventing resource exhaustion.
In the background, the circuit breaker sends periodic ping requests to check if Service A recovers.
If response times return to normal, the circuit moves to Half-Open, allowing limited test requests. If successful, it resets to Closed.
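One way to express these latency rules in code is a sliding window of recent response times. The thresholds come from the example above; the window size, the class name `LatencyMonitor`, and the choice to wait for a full window before judging are assumptions made for this sketch.

```python
from collections import deque

NORMAL_MS = 100    # up to here: normal operation
SLOW_MS = 150      # requests above this count toward "slow service" detection
FAILURE_MS = 200   # above this: treated as a failure
SLOW_RATIO = 0.75  # share of slow requests that flags the service as slow


def classify(latency_ms):
    """Map a single response time onto the bands described above."""
    if latency_ms > FAILURE_MS:
        return "failure"
    if latency_ms > NORMAL_MS:
        return "risky"
    return "normal"


class LatencyMonitor:
    def __init__(self, window_size=20):
        # Only the most recent window_size samples are kept.
        self.samples = deque(maxlen=window_size)

    def record(self, latency_ms):
        self.samples.append(latency_ms)

    def is_slow(self):
        """True when at least 75% of a full window exceeded 150 ms."""
        if len(self.samples) < self.samples.maxlen:
            return False  # not enough data yet
        slow = sum(1 for ms in self.samples if ms > SLOW_MS)
        return slow / len(self.samples) >= SLOW_RATIO


monitor = LatencyMonitor(window_size=4)
for ms in (160, 170, 180, 90):
    monitor.record(ms)
print(classify(250), monitor.is_slow())
```

A breaker could use `classify` to count failures and `is_slow` as an early warning before the hard 200ms limit is reached.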

Why Not Just Call the Service Directly?

You might wonder, "Why not let requests hit the failing service and time out naturally?" Here’s why:

If each request waits for a 30-second timeout, all incoming requests queue up, consuming resources.
The Circuit Breaker Pattern avoids this by failing fast when a service exceeds its failure threshold, returning an error to the consumer immediately.
This prevents queues from forming and ensures the system remains responsive.

When Service A recovers, the circuit breaker closes again and traffic resumes, serving new requests without processing a backlog. This approach sacrifices a few requests to save the entire system from crashing.

Why Failing Fast is Better for Users

From a user’s perspective, waiting ages for a response is frustrating. The Circuit Breaker Pattern prioritizes a quick response—even if it’s an error—over keeping users hanging. By isolating failures, it prevents cascading issues and ensures the system recovers quickly.

Wrapping Up

The Circuit Breaker Pattern is a lifesaver in distributed systems, especially for microservices architectures. By monitoring service health, failing fast, and preventing resource exhaustion, it keeps your system resilient and your users happy.

DEV Community
