Circuit breakers and retry patterns: building resilient distributed systems
In a distributed system, every remote call can fail. Networks partition, services crash, and dependencies degrade. Retries and circuit breakers are your first line of defense against these failures. Used correctly, they make your system resilient. Used incorrectly, they make things worse.
Retries handle transient failures: network timeouts, connection resets, temporary unavailability. Exponential backoff with jitter is the standard approach. Double the wait time between each retry up to a maximum, and add random jitter so that clients which failed at the same moment do not all retry at the same moment. A typical schedule before jitter: 100ms, 200ms, 400ms, 800ms, 1.6s, 3.2s, capped at 10s.
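A minimal sketch of that schedule in Python. The function name `backoff_delays` and its parameters are made up for illustration, and the jitter strategy shown (a uniform draw between zero and the current ceiling, often called "full jitter") is one common choice, not the only one.

```python
import random


def backoff_delays(base=0.1, factor=2.0, cap=10.0, attempts=8, jitter=True, rng=random):
    """Yield the wait in seconds before each retry.

    base, factor, cap and attempts are illustrative defaults that mirror
    the 100ms ... 10s schedule described above.
    """
    for attempt in range(attempts):
        # Exponential growth, clamped to the maximum wait.
        ceiling = min(cap, base * factor ** attempt)
        # Full jitter: pick a random wait in [0, ceiling] so retries spread out.
        yield rng.uniform(0, ceiling) if jitter else ceiling
```

With `jitter=False` the generator walks the doubling schedule and stops growing once it reaches the cap. With jitter on, each wait is a random value no larger than the corresponding ceiling.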
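Not all errors should be retried. Most HTTP 4xx errors mean the request itself is wrong and should fail fast; the exceptions are 408 Request Timeout and 429 Too Many Requests, which can succeed on a later attempt. HTTP 5xx errors and network timeouts are usually worth retrying, as long as repeating the operation is safe. Know the difference and configure your retry policy accordingly. Retrying the same 400 Bad Request will never succeed and only wastes resources.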
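The rule can be written down as a small predicate. This is a sketch: `is_retryable` and `RETRYABLE_4XX` are hypothetical names, and a real policy may need more cases than the ones listed in the text.

```python
# 4xx codes that can succeed on a later attempt, per the rule above.
RETRYABLE_4XX = {408, 429}


def is_retryable(status=None, network_error=False):
    """Return True if a failed call is worth retrying.

    status is the HTTP status code, or None if no response arrived.
    network_error marks timeouts and connection resets.
    """
    if network_error:
        return True
    if status is None:
        return False
    if status in RETRYABLE_4XX:
        return True
    # Server-side errors are treated as transient; other codes are not retried.
    return 500 <= status <= 599
```

For example, `is_retryable(400)` is False, while `is_retryable(429)`, `is_retryable(503)` and `is_retryable(network_error=True)` are all True.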
Set a maximum retry count and a deadline. A request that has been retried 5 times over 30 seconds is unlikely to succeed with more retries. Fail fast and let the upstream client handle the failure. Better to fail quickly than to accumulate retries that increase system load.
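One way to enforce both limits in a single loop is sketched below. The names `call_with_retries` and `RetryBudgetExceeded` are invented for this example, the fixed `delays` tuple stands in for whatever backoff schedule you use, and catching every `Exception` is a simplification: real code would retry only errors it knows to be transient.

```python
import time


class RetryBudgetExceeded(Exception):
    """Raised when the attempt count or the deadline is exhausted."""


def call_with_retries(operation, max_attempts=5, deadline_s=30.0,
                      delays=(0.1, 0.2, 0.4, 0.8),
                      sleep=time.sleep, clock=time.monotonic):
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    start = clock()
    last_error = None
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception as exc:  # simplification: treats every error as transient
            last_error = exc
        if attempt == max_attempts - 1:
            break  # attempt budget spent
        # Reuse the last delay once the schedule runs out.
        wait = delays[min(attempt, len(delays) - 1)]
        if clock() - start + wait > deadline_s:
            break  # the next attempt would start after the deadline
        sleep(wait)
    raise RetryBudgetExceeded(f"gave up after {attempt + 1} attempts") from last_error
```

The `sleep` and `clock` parameters exist only so the loop can be exercised without real waiting. The caller sees a single `RetryBudgetExceeded` with the last underlying error attached as its cause, which is what lets the upstream client decide what to do next.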
The circuit breaker pattern prevents retries from overwhelming a failing service. When the failure rate exceeds a threshold, the circuit opens and subsequent calls fail immediately without attempting the remote call. After a timeout, the circuit transitions to half-open and allows a few test requests. If those succeed, the circuit closes and normal traffic resumes. If they fail, it opens again and the timeout restarts.
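The three states can be sketched as a small class. Everything here is illustrative: the names `CircuitBreaker` and `CircuitOpenError`, the sliding window of recent calls used to estimate the failure rate, and the default values are assumptions, and the class is not thread-safe.

```python
import time
from collections import deque


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_rate=0.5, window=10, open_timeout_s=30.0,
                 half_open_probes=2, clock=time.monotonic):
        self.failure_rate = failure_rate
        self.window = window
        self.open_timeout_s = open_timeout_s
        self.half_open_probes = half_open_probes
        self.clock = clock
        self.state = "closed"
        self.results = deque(maxlen=window)  # True marks a failed call
        self.opened_at = 0.0
        self.probe_successes = 0

    def call(self, operation):
        if self.state == "open":
            if self.clock() - self.opened_at < self.open_timeout_s:
                raise CircuitOpenError("circuit is open")  # fail fast, no remote call
            self.state = "half_open"  # timeout elapsed: let test requests through
            self.probe_successes = 0
        try:
            result = operation()
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_failure(self):
        if self.state == "half_open":
            self._open()  # a failed probe reopens the circuit
            return
        self.results.append(True)
        window_full = len(self.results) == self.window
        if window_full and sum(self.results) / self.window > self.failure_rate:
            self._open()

    def _on_success(self):
        if self.state == "half_open":
            self.probe_successes += 1
            if self.probe_successes >= self.half_open_probes:
                self.state = "closed"  # enough probes succeeded
                self.results.clear()
            return
        self.results.append(False)

    def _open(self):
        self.state = "open"
        self.opened_at = self.clock()
        self.results.clear()
```

Wrapping a remote call then looks like `breaker.call(lambda: fetch_profile(user_id))`, where `fetch_profile` is a placeholder for your own client function. This sketch only judges the failure rate once the window is full, so a short burst of errors right after startup does not trip it. Whether that is the right trade-off depends on your traffic.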
Choose appropriate thresholds for your circuit breaker. A service that occasionally returns 503 during deployments needs different thresholds than a service that's failing due to an outage. Monitor circuit breaker state and alert when circuits open.
Implement retries and circuit breakers at every layer. The calling service retries, the API gateway retries, and the client retries. But use different timeouts at each layer so retries don't stack. The client should time out and retry at a coarser granularity than internal services.
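To see why the timeouts need to differ, it helps to do the arithmetic. The sketch below uses two hypothetical helpers, `worst_case_seconds` and `worst_case_calls`, and assumes every attempt runs until it times out. The numbers are examples, not recommendations.

```python
import math


def worst_case_seconds(per_attempt_timeout_s, attempts, backoff_s):
    """Longest time one layer can spend: every attempt times out,
    plus the waits between consecutive attempts."""
    waits = sum(backoff_s[: attempts - 1])
    return per_attempt_timeout_s * attempts + waits


def worst_case_calls(attempts_per_layer):
    """Calls reaching the innermost dependency if every layer uses all
    of its attempts (the attempt counts multiply across layers)."""
    return math.prod(attempts_per_layer)


# Inner service: 1s timeout, 3 attempts, waits of 0.1s and 0.2s between them.
inner_budget_s = worst_case_seconds(1.0, 3, [0.1, 0.2])
# The gateway's per-attempt timeout should be at least the inner budget;
# otherwise the gateway gives up and retries while the inner service is
# still working through its own retries.
gateway_timeout_s = 4.0
assert gateway_timeout_s >= inner_budget_s
```

The second helper shows the stacking problem: `worst_case_calls([3, 3, 3])` is 27, so three layers with three attempts each can turn one user request into 27 calls against a dependency that is already struggling.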
-
Rizwan Saleem | https://rizwansaleem.co