DEV Community

Manoir Yantai

Circuit Breakers: The Unsung Heroes of Resilient Microservices

When you’re running multiple services in production, failures are unavoidable. A downstream service might spike in latency, return 500s, or disappear entirely. Without protection, a single fault can cascade across your system, tying up threads, exhausting connection pools, and eventually taking down dependent services. This is where circuit breakers shine: they let your service degrade gracefully instead of amplifying the failure.

You’ve probably used timeouts and retries, but those alone aren’t enough. Retries can make an overload worse by sending even more traffic to a struggling service, and a timeout still ties up a thread or connection until it fires. A circuit breaker monitors failures, and when they cross a threshold, it short-circuits the call and fails immediately so the caller can return a predefined fallback. This stops your service from burning resources on doomed requests and lets the downstream service recover under reduced load.

The state machine is simple: closed (normal operation), open (rejecting requests), and half-open (probing for recovery). In closed state, every call is passed through; failures increment a counter. If the failure ratio exceeds your threshold (e.g., 50% of the last 10 calls), it trips to open. In open state, calls fail fast without reaching the remote service. After a configurable timeout, it moves to half-open and allows a few probes—if they succeed, it resets to closed; if not, it goes back to open.
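
To make the three states concrete, here is a minimal sketch of that state machine using only the standard library. Everything in it (the `Breaker` type, `NewBreaker`, `Call`, `ErrOpen`) is a hypothetical name chosen for illustration, not the API of any library. It also simplifies two things: a single successful probe closes the breaker, and the number of concurrent probes in half-open is not capped.

```go
package main

import (
	"errors"
	"sync"
	"time"
)

// State is one of the three breaker states.
type State int

const (
	Closed State = iota
	Open
	HalfOpen
)

// ErrOpen is returned when a call is rejected without running.
var ErrOpen = errors.New("circuit breaker is open")

// Breaker is an illustrative sketch, not a production implementation.
type Breaker struct {
	mu        sync.Mutex
	state     State
	results   []bool        // recent outcomes in closed state; true = failure
	window    int           // how many recent calls to consider, e.g. 10
	threshold float64       // failure ratio that trips the breaker, e.g. 0.5
	cooldown  time.Duration // how long to stay open before probing
	openedAt  time.Time
}

func NewBreaker(window int, threshold float64, cooldown time.Duration) *Breaker {
	return &Breaker{window: window, threshold: threshold, cooldown: cooldown}
}

// State reports the current state.
func (b *Breaker) State() State {
	b.mu.Lock()
	defer b.mu.Unlock()
	return b.state
}

// Call runs fn unless the breaker is open and still cooling down.
func (b *Breaker) Call(fn func() error) error {
	b.mu.Lock()
	if b.state == Open {
		if time.Since(b.openedAt) < b.cooldown {
			b.mu.Unlock()
			return ErrOpen // fail fast: fn is never invoked
		}
		b.state = HalfOpen // cooldown elapsed: let a probe through
	}
	b.mu.Unlock()

	err := fn()

	b.mu.Lock()
	defer b.mu.Unlock()
	b.record(err != nil)
	return err
}

// record updates the state machine; the caller must hold b.mu.
func (b *Breaker) record(failed bool) {
	switch b.state {
	case HalfOpen:
		if failed {
			b.trip() // probe failed: back to open
		} else {
			b.state = Closed // probe succeeded: resume normal operation
			b.results = nil
		}
	case Closed:
		b.results = append(b.results, failed)
		if len(b.results) > b.window {
			b.results = b.results[1:] // keep only the last `window` outcomes
		}
		if len(b.results) < b.window {
			return // not enough data yet
		}
		failures := 0
		for _, f := range b.results {
			if f {
				failures++
			}
		}
		if float64(failures)/float64(len(b.results)) > b.threshold {
			b.trip()
		}
	}
}

// trip moves the breaker to open and starts the cooldown.
func (b *Breaker) trip() {
	b.state = Open
	b.openedAt = time.Now()
	b.results = nil
}
```

A caller would wrap each remote call as `err := b.Call(func() error { ... })` and treat `ErrOpen` as the signal to serve a fallback. A real library handles more edge cases, which is a good reason to reach for one rather than maintain this yourself.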

Implementing this isn’t rocket science. Libraries like gobreaker in Go or resilience4j in Java abstract the boilerplate. Here’s a concise example in Go:

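import (
 "fmt"
 "io"
 "net/http"
 "time"

 "github.com/sony/gobreaker"
)

var cb *gobreaker.CircuitBreaker

func init() {
 cb = gobreaker.NewCircuitBreaker(gobreaker.Settings{
  Name:        "user-svc",
  MaxRequests: 3,                // requests allowed through while half-open
  Interval:    30 * time.Second, // closed-state counts are cleared on this period
  Timeout:     10 * time.Second, // time spent open before moving to half-open
  ReadyToTrip: func(c gobreaker.Counts) bool {
   // Trip once at least 5 requests were seen and more than half of them failed.
   return c.Requests >= 5 && float64(c.TotalFailures)/float64(c.Requests) > 0.5
  },
 })
}

func FetchUser(id string) (string, error) {
 result, err := cb.Execute(func() (interface{}, error) {
  resp, err := http.Get("http://user-service/" + id)
  if err != nil {
   return nil, err
  }
  defer resp.Body.Close()
  // Count server-side errors as failures; 4xx responses pass through as successes.
  if resp.StatusCode >= 500 {
   return nil, fmt.Errorf("upstream error: %d", resp.StatusCode)
  }
  body, err := io.ReadAll(resp.Body)
  if err != nil {
   return nil, err
  }
  return string(body), nil
 })
 if err != nil {
  return "", err // caller can choose fallback
 }
 return result.(string), nil
}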

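This snippet counts requests and failures in 30-second windows while the breaker is closed. Once at least 5 requests have been recorded in the current window and more than 50% of them have failed, the breaker opens for 10 seconds. During that time, Execute returns an error immediately without calling the remote service, preserving your resources. After the timeout, the breaker goes half-open and lets up to 3 requests through to verify recovery.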
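
The trip condition is easy to misread, so here is the same arithmetic pulled out into a standalone function. `readyToTrip` is a hypothetical helper that takes plain integers instead of `gobreaker.Counts`; the thresholds are copied from the snippet.

```go
package main

// readyToTrip mirrors the ReadyToTrip predicate from the example:
// at least 5 requests, and strictly more than half of them failed.
func readyToTrip(requests, totalFailures uint32) bool {
	return requests >= 5 && float64(totalFailures)/float64(requests) > 0.5
}
```

With these settings, 4 failures out of 4 requests do not trip the breaker (too few requests), 5 out of 10 do not either (exactly 50% is not above the threshold), while 3 out of 5 or 6 out of 10 do.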

Now, don’t stop at the basic implementation. Combine circuit breakers with other resilience patterns. Use a bulkhead to cap the number of concurrent calls (threads or goroutines) allowed per dependency so one misbehaving service doesn’t exhaust your entire pool. Pair the breaker with retries only for transient errors (e.g., 429 or 503), cap the number of attempts, and keep the individual retry attempts out of the breaker’s failure count to avoid premature trips.
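
As a rough illustration of both ideas, the sketch below uses a buffered channel as a semaphore for the bulkhead and a small loop for capped retries. All names (`Bulkhead`, `Retry`, `StatusError`, `ErrBulkheadFull`) are hypothetical, the backoff is a fixed delay for brevity, and treating only 429 and 503 as transient is an assumption taken from the examples in the text.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// ErrBulkheadFull is returned when the concurrency limit is reached.
var ErrBulkheadFull = errors.New("bulkhead full")

// Bulkhead caps in-flight calls to one dependency.
type Bulkhead struct {
	slots chan struct{} // buffered channel used as a counting semaphore
}

func NewBulkhead(limit int) *Bulkhead {
	return &Bulkhead{slots: make(chan struct{}, limit)}
}

// Do runs fn if a slot is free and rejects immediately otherwise.
func (b *Bulkhead) Do(fn func() error) error {
	select {
	case b.slots <- struct{}{}:
		defer func() { <-b.slots }() // release the slot when fn returns
		return fn()
	default:
		return ErrBulkheadFull // reject instead of queueing
	}
}

// StatusError carries an HTTP status code from the upstream.
type StatusError struct {
	Code int
}

func (e *StatusError) Error() string {
	return fmt.Sprintf("upstream status %d", e.Code)
}

// isTransient reports whether err is worth retrying (assumed: 429 or 503).
func isTransient(err error) bool {
	var se *StatusError
	return errors.As(err, &se) && (se.Code == 429 || se.Code == 503)
}

// Retry calls fn up to maxAttempts times, retrying only transient errors.
func Retry(maxAttempts int, backoff time.Duration, fn func() error) error {
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		err = fn()
		if err == nil || !isTransient(err) {
			return err // success, or an error that retrying will not fix
		}
		if attempt < maxAttempts {
			time.Sleep(backoff)
		}
	}
	return err // attempts exhausted: surface the last transient error
}
```

One way to keep retry attempts out of the breaker’s count is to call `Retry` inside the function you hand to the breaker, so the breaker records a single outcome per logical operation rather than one per attempt.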

Monitoring is critical. Every state change should emit logs and metrics. Track trip rates, operation latency, and fallback invocations. If your breaker trips too often, your threshold might be too low, or the upstream is genuinely broken. Use separate settings per dependency: a breaker guarding a critical user service may warrant a higher failure threshold before it trips than one guarding a logging endpoint, which you can afford to cut off early.
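
A minimal sketch of what such instrumentation could look like is below. `BreakerMetrics` and its methods are hypothetical names, the counters live in memory rather than in a real metrics system, and states are passed as plain strings for simplicity. gobreaker’s `Settings` has an `OnStateChange` callback that could forward to something like this, but the wiring is left out so the example depends only on the standard library.

```go
package main

import (
	"log"
	"sync"
)

// BreakerMetrics counts trips and fallback invocations per breaker name.
type BreakerMetrics struct {
	mu        sync.Mutex
	trips     map[string]int // transitions into the "open" state
	fallbacks map[string]int // times a fallback was served
}

func NewBreakerMetrics() *BreakerMetrics {
	return &BreakerMetrics{
		trips:     make(map[string]int),
		fallbacks: make(map[string]int),
	}
}

// OnStateChange logs every transition and counts the ones that open the breaker.
func (m *BreakerMetrics) OnStateChange(name, from, to string) {
	m.mu.Lock()
	if to == "open" {
		m.trips[name]++
	}
	m.mu.Unlock()
	log.Printf("circuit breaker %q: %s -> %s", name, from, to)
}

// RecordFallback counts one fallback response for the named breaker.
func (m *BreakerMetrics) RecordFallback(name string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.fallbacks[name]++
}

// Trips returns how many times the named breaker has opened.
func (m *BreakerMetrics) Trips(name string) int {
	m.mu.Lock()
	defer m.mu.Unlock()
	return m.trips[name]
}

// Fallbacks returns how many fallbacks were served for the named breaker.
func (m *BreakerMetrics) Fallbacks(name string) int {
	m.mu.Lock()
	defer m.mu.Unlock()
	return m.fallbacks[name]
}
```

Keeping the counters keyed by breaker name makes it straightforward to compare dependencies and spot the one whose threshold needs tuning.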

Beware of common pitfalls. Don’t
