By definition, fault tolerance is the ability of a system to continue operating despite failures in one or more of its components. This is highly relevant for distributed applications, and achieving it requires deliberate action. In this article, I'll cover some strategies you can follow at the application level to handle failures better overall.
When failures can happen
The main challenge in a distributed system is the communication between nodes, since it is exposed to network, software and hardware failures. Since we cannot handle hardware faults directly, we must ensure our whole system is designed to tolerate software and network failures at its edge points. In practice, the question is not whether these components will fail, but when.
Application-Level Failure Handling
All of our strategies share the same objective: avoid overwhelming already-failing systems and causing node drops in our environment.
Idempotency
Given that we are exposed to network failures, we must ensure our system is idempotent when we can't rely on HTTP verbs alone. This can be achieved with different approaches, the most common being idempotency keys and unique constraints in the database. The example below shows how an idempotency key can be used when processing a POST request in Go. The idea is simple: if the key has already been processed, the request produces no further side effects.
// Mimic of the request handler
func HandleRequest(w http.ResponseWriter, r *http.Request) {
	idempotencyKey := r.Header.Get("Idempotency-Key")
	if len(idempotencyKey) == 0 {
		slog.Error("error processing the request",
			slog.String("err", "no Idempotency-Key header provided"),
		)
		w.WriteHeader(http.StatusBadRequest)
		return
	}

	wasProcessed, err := checkKeyAlreadyProcessed(idempotencyKey)
	if err != nil {
		slog.Error("error checking if key was processed", slog.String("err", err.Error()))
		w.WriteHeader(http.StatusInternalServerError)
		return
	}

	if !wasProcessed {
		err = process(idempotencyKey)
		if err != nil {
			slog.Error("error processing the request",
				slog.String("key", idempotencyKey),
				slog.String("err", err.Error()),
			)
			w.WriteHeader(http.StatusInternalServerError)
			return
		}
	}

	// Either processed now or already processed before: respond with success
	w.WriteHeader(http.StatusCreated)
}
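The helpers checkKeyAlreadyProcessed and process are intentionally left out above. Here is a minimal sketch of how they could be backed by a database unique constraint, assuming a package-level *sql.DB, a Postgres-style driver and a hypothetical processed_keys table (none of these names come from the original handler):

// Assumption: db is a *sql.DB initialized elsewhere in the package.
var db *sql.DB

// checkKeyAlreadyProcessed looks the key up in a table that has a UNIQUE constraint on it.
func checkKeyAlreadyProcessed(key string) (bool, error) {
	var exists bool
	err := db.QueryRow(
		"SELECT EXISTS(SELECT 1 FROM processed_keys WHERE idempotency_key = $1)", key,
	).Scan(&exists)
	if err != nil {
		return false, fmt.Errorf("checking idempotency key: %w", err)
	}
	return exists, nil
}

// process inserts the key and performs the side effects in one transaction,
// so a duplicate key violates the UNIQUE constraint and the whole operation rolls back.
func process(key string) error {
	tx, err := db.Begin()
	if err != nil {
		return err
	}
	defer tx.Rollback()

	if _, err := tx.Exec(
		"INSERT INTO processed_keys (idempotency_key, processed_at) VALUES ($1, NOW())", key,
	); err != nil {
		return fmt.Errorf("inserting idempotency key: %w", err)
	}
	// ... perform the actual side effects here ...
	return tx.Commit()
}

You could also skip the upfront SELECT entirely and rely on the INSERT failing with a unique-constraint error; both approaches give the same guarantee.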
Timeout
This is the most basic tool we can use to avoid blocking our process and piling up the request queue. ALWAYS define a timeout on outbound calls, considering the business rules of your domain.
func MakeRequest() error {
	client := &http.Client{Timeout: 1 * time.Minute}

	request, err := http.NewRequest("POST", "http://testing.com", nil)
	if err != nil {
		return fmt.Errorf("error creating the request: %v", err)
	}

	// Adding the idempotency key header
	request.Header.Add("Idempotency-Key", uuid.NewString())

	response, err := client.Do(request)
	if err != nil {
		return fmt.Errorf("error processing the request: %v", err)
	}
	defer response.Body.Close()

	// rest of the code...
	slog.Info("request completed", slog.String("status", response.Status))
	return nil
}
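The client-level timeout above applies to every request made through that client. When different calls need different deadlines, a per-request timeout can be set with a context. This is a minimal sketch of that variant; the 5-second deadline and the example URL are illustrative values, not requirements:

func MakeRequestWithDeadline() error {
	// Upper bound for any request made through this client
	client := &http.Client{Timeout: 1 * time.Minute}

	// Per-request deadline: this particular call must finish within 5 seconds,
	// regardless of the more generous client-level timeout.
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	request, err := http.NewRequestWithContext(ctx, http.MethodPost, "http://testing.com", nil)
	if err != nil {
		return fmt.Errorf("error creating the request: %v", err)
	}
	request.Header.Add("Idempotency-Key", uuid.NewString())

	response, err := client.Do(request)
	if err != nil {
		return fmt.Errorf("error processing the request: %v", err)
	}
	defer response.Body.Close()

	slog.Info("request completed", slog.String("status", response.Status))
	return nil
}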
Retries (Exponential Backoff + Jitter)
To build on the timeouts and idempotency keys, instead of returning an error to the client on the first failed attempt, we can handle the failure at the application level and retry. This approach increases latency when things are failing, but it is often the best choice, since the user has to interact less with the UI in a failure scenario. Our MakeRequest function will now be enhanced with:
- Exponential backoff: instead of retrying with the same delay after each failure, the interval is increased exponentially. It doesn't make sense to retry immediately after an error without some kind of delay, and this choice is pretty useful to avoid overwhelming a system that is already in a faulty state.
- Jitter: a random amount is added to the delay so that concurrent clients don't retry at exactly the same intervals, spreading the calls to the other application over time.
func MakeRequest() error {
	const MAX_RETRIES = 3
	const BASE_DELAY = 100 * time.Millisecond
	const MAX_JITTER_MS = 100

	// Create a new random source
	r := rand.New(rand.NewSource(time.Now().UnixNano()))
	client := &http.Client{Timeout: 1 * time.Minute}

	// The same idempotency key must be reused across retries,
	// otherwise the server can't detect the duplicates.
	idempotencyKey := uuid.NewString()

	// We'll store the last error to return if all retries fail
	var lastErr error
	var response *http.Response

	for i := 0; i <= MAX_RETRIES; i++ {
		request, err := http.NewRequest("POST", "http://testing.com", nil)
		if err != nil {
			// This is a non-retryable error
			return fmt.Errorf("error creating the request: %v", err)
		}
		request.Header.Add("Idempotency-Key", idempotencyKey)

		response, err = client.Do(request)
		lastErr = err

		// Success = no network error AND a non-server-error (non-5xx) status.
		// 4xx errors are client errors and typically not retryable.
		if err == nil && response.StatusCode < 500 {
			slog.Info("Request successful", "status", response.Status)
			response.Body.Close()
			return nil
		}

		// If we're here, it was either a network error (err != nil)
		// or a server error (response.StatusCode >= 500).
		if err == nil {
			response.Body.Close()
		}

		// Don't sleep if this was the last attempt
		if i == MAX_RETRIES {
			break
		}

		// Exponential backoff, with base 2^n
		backoff := BASE_DELAY * time.Duration(math.Pow(2, float64(i)))
		// Jitter: random duration between 0 and MAX_JITTER_MS
		jitter := time.Duration(r.Intn(MAX_JITTER_MS)) * time.Millisecond
		// Total sleep duration
		sleepDuration := backoff + jitter

		// err may be nil when the failure was a 5xx status
		errMsg := "server error"
		if err != nil {
			errMsg = err.Error()
		}
		slog.Warn("Request failed, retrying",
			slog.Int("attempt", i+1),
			slog.String("sleep_duration", sleepDuration.String()),
			slog.String("error", errMsg),
		)

		// Wait before the next attempt
		time.Sleep(sleepDuration)
	}

	// If the loop finishes, all retries have failed
	if lastErr != nil {
		return fmt.Errorf("all retries failed, last network error: %v", lastErr)
	}
	// Handle case where the last attempt was a 5xx error
	return fmt.Errorf("all retries failed, last status: %s", response.Status)
}
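With these constants the function makes up to four attempts, sleeping roughly 100-200 ms, 200-300 ms and 400-500 ms between them (the base delay doubles each time, plus up to 100 ms of jitter) before giving up and returning the last error.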
Circuit Breaker
The Circuit Breaker is a robust design pattern to handle failing nodes in the architecture. It introduces three states that describe whether calls to a node should be attempted: closed, open or half-open.
- Closed: calls to the node are sent.
- Open: calls to the node won't be sent.
- Half-open: after a defined period in the open state, the application moves to half-open and attempts a single call. If this call fails, the breaker goes back to open; if it succeeds, it goes to closed.
Following this pattern, our simple HTTP request becomes even more robust.
First, let's define our circuit breaker struct to handle the logic:
package cb

import (
	"errors"
	"fmt"
	"log/slog"
	"math"
	"math/rand"
	"net/http"
	"sync"
	"time"

	"github.com/google/uuid"
)

// Define the states for the circuit breaker
type State int

const (
	StateClosed State = iota
	StateOpen
	StateHalfOpen
)

// Sentinel error
var ErrCircuitOpen = errors.New("circuit breaker is open")

type CircuitBreaker struct {
	mu          sync.Mutex
	state       State
	failures    int
	maxFailures int
	openSince   time.Time
	openTimeout time.Duration
}

// NewCircuitBreaker creates a new circuit breaker with its thresholds
func NewCircuitBreaker(maxFailures int, openTimeout time.Duration) *CircuitBreaker {
	return &CircuitBreaker{
		state:       StateClosed,
		maxFailures: maxFailures,
		openTimeout: openTimeout,
	}
}
Now let's add the logic to handle the state changes:
// CheckBeforeRequest checks if a request is allowed to proceed
func (cb *CircuitBreaker) CheckBeforeRequest() error {
	cb.mu.Lock()
	defer cb.mu.Unlock()

	switch cb.state {
	case StateClosed:
		// Always allowed in a closed state
		return nil
	case StateOpen:
		// Check if the open timeout has elapsed
		if time.Since(cb.openSince) > cb.openTimeout {
			// Timeout exceeded -> Half-Open
			slog.Warn("Circuit Breaker: Open -> Half-Open")
			cb.state = StateHalfOpen
			return nil // Allow one test request to go through
		}
		// Still open
		return ErrCircuitOpen
	case StateHalfOpen:
		// The circuit is already in a Half-Open state and a test request
		// is in flight. Reject all other concurrent requests
		return ErrCircuitOpen
	}
	return nil
}
// OnSuccess notifies the breaker of a successful call
func (cb *CircuitBreaker) OnSuccess() {
	cb.mu.Lock()
	defer cb.mu.Unlock()

	switch cb.state {
	case StateHalfOpen:
		// Test request succeeded -> close circuit
		slog.Info("Circuit Breaker: Half-Open -> Closed")
		cb.state = StateClosed
		cb.failures = 0
	case StateClosed:
		// Reset consecutive failures
		cb.failures = 0
	}
}
// OnFailure notifies the breaker of a failed call
func (cb *CircuitBreaker) OnFailure() {
	cb.mu.Lock()
	defer cb.mu.Unlock()

	switch cb.state {
	case StateHalfOpen:
		// The test request failed -> go into the open state again
		slog.Error("Circuit Breaker: Half-Open -> Open (test failed)")
		cb.state = StateOpen
		cb.openSince = time.Now() // Reset the open timer
	case StateClosed:
		cb.failures++
		slog.Warn("Circuit Breaker: Failure recorded", "count", cb.failures)
		// Check if we've reached the threshold
		if cb.failures >= cb.maxFailures {
			slog.Error("Circuit Breaker: Closed -> Open (threshold reached)")
			cb.state = StateOpen
			cb.openSince = time.Now()
		}
	}
}
Now we need to update our request handler to use the circuit breaker:
// Handler will hold our client and the circuit breaker for this service
type Handler struct {
	client *http.Client
	cb     *CircuitBreaker
}

// NewHandler creates a new handler
func NewHandler(cb *CircuitBreaker) *Handler {
	return &Handler{
		client: &http.Client{Timeout: 1 * time.Minute},
		cb:     cb,
	}
}

func (h *Handler) MakeRequest() error {
	if err := h.cb.CheckBeforeRequest(); err != nil {
		// Circuit is Open or Half-Open -> fail fast
		slog.Error("Request blocked by circuit breaker",
			slog.String("error", err.Error()),
		)
		return err
	}

	err := h.attemptRequestWithRetry()
	if err != nil {
		// The operation failed after all retries
		h.cb.OnFailure()
		return err
	}

	// The operation succeeded
	h.cb.OnSuccess()
	return nil
}
// attemptRequestWithRetry encapsulates the retry logic
func (h *Handler) attemptRequestWithRetry() error {
	const MAX_RETRIES = 3
	const BASE_DELAY = 100 * time.Millisecond
	const MAX_JITTER_MS = 100

	// One idempotency key shared by every attempt
	idempotencyKey := uuid.NewString()
	r := rand.New(rand.NewSource(time.Now().UnixNano()))

	var lastErr error
	var response *http.Response

	for i := 0; i <= MAX_RETRIES; i++ {
		request, err := http.NewRequest("POST", "http://testing.com", nil)
		if err != nil {
			return fmt.Errorf("error creating the request: %v", err)
		}
		request.Header.Add("Idempotency-Key", idempotencyKey)

		// Use the handler's client
		response, err = h.client.Do(request)
		lastErr = err

		if err == nil && response.StatusCode < 500 {
			slog.Info("Request successful", "status", response.Status)
			if response.Body != nil {
				response.Body.Close()
			}
			return nil
		}
		if err == nil && response.Body != nil {
			// 5xx response: drop the body before retrying
			response.Body.Close()
		}

		if i == MAX_RETRIES {
			break
		}

		// Calculate backoff and jitter
		backoff := BASE_DELAY * time.Duration(math.Pow(2, float64(i)))
		jitter := time.Duration(r.Intn(MAX_JITTER_MS)) * time.Millisecond
		sleepDuration := backoff + jitter

		// Safely create the error message (err is nil on a 5xx status)
		errMsg := "server error"
		if err != nil {
			errMsg = err.Error()
		}
		slog.Warn("Request failed, retrying",
			slog.Int("attempt", i+1),
			slog.String("sleep_duration", sleepDuration.String()),
			slog.String("error", errMsg),
		)

		time.Sleep(sleepDuration)
	}

	// All retries failed
	if lastErr != nil {
		return fmt.Errorf("all retries failed, last network error: %v", lastErr)
	}
	return fmt.Errorf("all retries failed, last status: %s", response.Status)
}
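Putting it all together, a caller creates one breaker per downstream service and routes every call through the handler. Here is a minimal usage sketch; the failure threshold of 3, the 10-second open timeout and the import path of the cb package are assumptions for illustration:

package main

import (
	"errors"
	"log/slog"
	"time"

	"yourmodule/cb" // assumption: wherever the cb package above lives in your module
)

func main() {
	// Open the circuit after 3 consecutive failures, probe again after 10 seconds
	breaker := cb.NewCircuitBreaker(3, 10*time.Second)
	handler := cb.NewHandler(breaker)

	for i := 0; i < 10; i++ {
		if err := handler.MakeRequest(); err != nil {
			if errors.Is(err, cb.ErrCircuitOpen) {
				// Fail fast: no call was made to the downstream service
				slog.Warn("skipping call, circuit is open")
			} else {
				slog.Error("request failed", slog.String("error", err.Error()))
			}
		}
		time.Sleep(1 * time.Second)
	}
}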
Conclusion
You always need to consider how your system will handle failures when interacting with external services. Weigh the tradeoffs of each strategy, although the benefits usually make a fault-tolerant application worth the effort. Please leave a like and a comment about this topic.