Igor Grieder

Fault Tolerance

By definition, fault tolerance is the ability of a system to continue operating despite failures in one or more of its components. This is highly relevant for distributed applications, and achieving it requires deliberate design choices. In this article, I'll cover some strategies that can be followed at the application level to handle failures better overall.

When failures can happen

The main challenge in a distributed system is the communication between nodes, since they're exposed to network, software and hardware failures. Since we cannot handle hardware faults directly, we must ensure the whole system is designed to tolerate software and network failures at its edge points. In fact, failures are so common that it's easier to ask when components don't fail.

Application-Level Failure Handling

All of our strategies share the same objective: to avoid overwhelming failing systems and causing node drops in our environment.

Idempotency

Given that we are exposed to network failures, we must ensure our system is idempotent, especially when we can't rely on naturally idempotent HTTP verbs. This can be achieved with different approaches, the most common ones being idempotency keys and unique constraints in the database. The example below covers the usage of an idempotency key to process a POST request in Go. The idea is to produce no side effects if the provided key has already been processed.

// Mimic of the request handler
func HandleRequest(w http.ResponseWriter, r *http.Request) {
    idempotencyKey := r.Header.Get("Idempotency-Key")

    if len(idempotencyKey) == 0 {
        slog.Error("error processing the request",
            slog.String("key", idempotencyKey),
            slog.String("err", "no idempotencyKey provided"),
        )

        w.WriteHeader(http.StatusBadRequest)
        return
    }

    wasProcessed, err := checkKeyAlreadyProcessed(idempotencyKey)
    if err != nil {
        slog.Error("error checking if key was processed", slog.String("err", err.Error()))

        w.WriteHeader(http.StatusInternalServerError)
        return
    }

    if !wasProcessed {
        err = process(idempotencyKey)

        if err != nil {
            slog.Error("error processing the request",
                slog.String("key", idempotencyKey),
                slog.String("err", err.Error()),
            )

            w.WriteHeader(http.StatusInternalServerError)
            return
        }
    }

    // Either just processed or already processed before: respond successfully
    w.WriteHeader(http.StatusCreated)
}
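
The handler above relies on two helpers, checkKeyAlreadyProcessed and process, that are left out. Below is a minimal, hypothetical sketch of what they could look like using the second approach mentioned earlier: a unique constraint in the database. It assumes a Postgres table called processed_requests whose primary key is the idempotency key, a db *sql.DB opened elsewhere, and the "database/sql" and "fmt" imports.

// Hypothetical helpers backed by a unique constraint, assuming the table:
//   CREATE TABLE processed_requests (idempotency_key TEXT PRIMARY KEY);
var db *sql.DB // assumed to be opened elsewhere

func checkKeyAlreadyProcessed(key string) (bool, error) {
    var exists bool
    err := db.QueryRow(
        "SELECT EXISTS (SELECT 1 FROM processed_requests WHERE idempotency_key = $1)",
        key,
    ).Scan(&exists)
    if err != nil {
        return false, fmt.Errorf("error checking idempotency key: %w", err)
    }
    return exists, nil
}

func process(key string) error {
    // The unique constraint is the real safety net: if two concurrent requests
    // carry the same key, only one insert can succeed, so at most one of them
    // performs the side effect.
    if _, err := db.Exec(
        "INSERT INTO processed_requests (idempotency_key) VALUES ($1)", key,
    ); err != nil {
        return fmt.Errorf("error recording idempotency key: %w", err)
    }

    // ... perform the actual side effect / business logic here ...
    return nil
}

In a real system you would record the key and perform the side effect inside the same transaction, so a crash between the two steps doesn't leave the key marked as processed without the work actually done.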

Timeout

This is the most basic tool we can use to avoid blocking our process and piling up pending requests. Always define timeouts on outgoing calls, taking the business rules of your domain into account.

func MakeRequest() error {
    client := &http.Client{Timeout: 1 * time.Minute}
    request, err := http.NewRequest("POST", "http://testing.com", nil)
    if err != nil {
        return fmt.Errorf("error creating the request %v", err)
    }

    // Adding the idempotency key header
    request.Header.Add("Idempotency-Key", uuid.NewString())

    response, err := client.Do(request)
    if err != nil {
        return fmt.Errorf("error processing the request %v", err)
    }
    defer response.Body.Close()

    // rest of the code...
    slog.Info("request completed", slog.String("status", response.Status))
    return nil
}
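
The client-wide timeout above applies to every request made through that client. When different calls need different budgets, a per-request deadline can be set through the context instead. The sketch below is only an illustration: the 5-second budget is an assumption, and it reuses the same imports as the snippet above plus "context".

func MakeRequestWithDeadline(ctx context.Context) error {
    // Assumed budget for this specific call; tune it to your domain
    ctx, cancel := context.WithTimeout(ctx, 5*time.Second)
    defer cancel()

    request, err := http.NewRequestWithContext(ctx, http.MethodPost, "http://testing.com", nil)
    if err != nil {
        return fmt.Errorf("error creating the request %v", err)
    }
    request.Header.Add("Idempotency-Key", uuid.NewString())

    // The request is cancelled automatically once the deadline expires
    response, err := http.DefaultClient.Do(request)
    if err != nil {
        return fmt.Errorf("error processing the request %v", err)
    }
    defer response.Body.Close()

    slog.Info("request completed", slog.String("status", response.Status))
    return nil
}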

Retries (Exponential Backoff + Jitter)

Building on the timeouts and idempotency keys, we can choose to handle failures at the application level instead of returning an error to the client on the first failed attempt. This approach increases overall latency in failing states, but it is often the best choice, since the user will have to interact less with the UI in a failure scenario. Our MakeRequest function will now be enhanced with:

  • Exponential Backoff: instead of retrying with the same delay after each failure, the interval is increased exponentially. It doesn't make sense to retry immediately after an error without some kind of delay, and the growing interval helps avoid overwhelming a system that is already in a faulty state.
  • Jitter: a random jitter is added to the delay so that concurrent retries don't land at exactly the same intervals, spreading the calls to the other application over time.

func MakeRequest() error {
    const MAX_RETRIES = 3
    const BASE_DELAY = 100 * time.Millisecond
    const MAX_JITTER_MS = 100

    // Create a new random source
    r := rand.New(rand.NewSource(time.Now().UnixNano()))

    client := &http.Client{Timeout: 1 * time.Minute}

    // We'll store the last error to return if all retries fail
    var lastErr error
    var response *http.Response

    // Generate the idempotency key once so every retry reuses the same key
    idempotencyKey := uuid.NewString()

    for i := 0; i <= MAX_RETRIES; i++ {
        request, err := http.NewRequest("POST", "http://testing.com", nil)
        if err != nil {
            // This is a non-retryable error
            return fmt.Errorf("error creating the request %v", err)
        }
        request.Header.Add("Idempotency-Key", idempotencyKey)

        response, err = client.Do(request)
        lastErr = err

        // Success = no network error AND a non-server-error (non-5xx) status.
        // 4xx errors are client errors and typically not retryable.
        if err == nil && response.StatusCode < 500 {
            slog.Info("Request successful", "status", response.Status)
            response.Body.Close()
            return nil
        }

        // If we're here, it was either a network error (err != nil)
        // or a server error (response.StatusCode >= 500).
        // Don't sleep if this was the last attempt
        if i == MAX_RETRIES {
            break
        }

        // Exponential backoff, with base 2^n
        backoff := BASE_DELAY * time.Duration(math.Pow(2, float64(i)))

        // Jitter: random duration between 0 and MAX_JITTER_MS
        jitter := time.Duration(r.Intn(MAX_JITTER_MS)) * time.Millisecond

        // Total sleep duration
        sleepDuration := backoff + jitter

        // err can be nil here (5xx case), so build the log message safely
        errMsg := "server error"
        if err != nil {
            errMsg = err.Error()
        }

        slog.Warn("Request failed, retrying",
            slog.Int("attempt", i+1),
            slog.String("sleep_duration", sleepDuration.String()),
            slog.String("error", errMsg),
        )

        // Wait before the next attempt
        time.Sleep(sleepDuration)
    }

    // If the loop finishes, all retries have failed
    if lastErr != nil {
        return fmt.Errorf("all retries failed, last network error: %v", lastErr)
    }

    // Handle case where the last attempt was a 5xx error
    return fmt.Errorf("all retries failed, last status: %s", response.Status)
}

Circuit Breaker

The Circuit Breaker is a robust design pattern for handling failing nodes in the
architecture. It introduces three states that describe the health of a downstream
dependency: closed, open, or half-open.

  • Closed: calls to the node are sent.
  • Open: calls to the node won't be sent.
  • Half-open: after a defined period in the open state, the application moves into the half-open state and allows a single test call. If that call fails, the circuit goes back to open; if it succeeds, it goes to closed.

Following this pattern, our simple HTTP request becomes even more robust.
First, let's define the circuit breaker struct that will hold the logic:

package cb

import (
    "errors"
    "fmt"
    "log/slog"
    "math"
    "math/rand"
    "net/http"
    "sync"
    "time"

    "github.com/google/uuid"
)

// Define the states for the circuit breaker
type State int

const (
    StateClosed State = iota
    StateOpen
    StateHalfOpen
)

// Sentinel error
var ErrCircuitOpen = errors.New("circuit breaker is open")

type CircuitBreaker struct {
    mu          sync.Mutex
    state       State
    failures    int
    maxFailures int
    openSince   time.Time
    openTimeout time.Duration
}

// NewCircuitBreaker creates a new circuit breaker with its thresholds
func NewCircuitBreaker(maxFailures int, openTimeout time.Duration) *CircuitBreaker {
    return &CircuitBreaker{
        state:       StateClosed,
        maxFailures: maxFailures,
        openTimeout: openTimeout,
    }
}

Now let's add the logic to handle the state transitions:

// Checks if a request is allowed to proceed
func (cb *CircuitBreaker) CheckBeforeRequest() error {
    cb.mu.Lock()
    defer cb.mu.Unlock()

    switch cb.state {
    case StateClosed:
        // Always allowed in a closed state
        return nil

    case StateOpen:
        // Check if the open timeout has elapsed
        if time.Since(cb.openSince) > cb.openTimeout {
            // Timeout exceeded -> Half-Open
            slog.Warn("Circuit Breaker: Open -> Half-Open")
            cb.state = StateHalfOpen
            return nil // Allow one test request to go through
        }

        // Still open
        return ErrCircuitOpen

    case StateHalfOpen:
        // The circuit is already in a Half-Open state
        // a test request is in flight. Reject all other concurrent requests
        return ErrCircuitOpen
    }
    return nil
}

// OnSuccess notifies the breaker of a successful call
func (cb *CircuitBreaker) OnSuccess() {
    cb.mu.Lock()
    defer cb.mu.Unlock()

    switch cb.state {
    case StateHalfOpen:
        // Test request succeeded -> close circuit
        slog.Info("Circuit Breaker: Half-Open -> Closed")
        cb.state = StateClosed
        cb.failures = 0

    case StateClosed:
        // Reset consecutive failures
        cb.failures = 0
    }
}

// OnFailure notifies the breaker of a failed call
func (cb *CircuitBreaker) OnFailure() {
    cb.mu.Lock()
    defer cb.mu.Unlock()

    switch cb.state {
    case StateHalfOpen:
        // The test request failed -> go into open state again
        slog.Error("Circuit Breaker: Half-Open -> Open (test failed)")
        cb.state = StateOpen
        cb.openSince = time.Now() // Reset the open timer

    case StateClosed:
        cb.failures++
        slog.Warn("Circuit Breaker: Failure recorded", "count", cb.failures)

        // Check if we've reached the threshold
        if cb.failures >= cb.maxFailures {
            slog.Error("Circuit Breaker: Closed -> Open (threshold reached)")
            cb.state = StateOpen
            cb.openSince = time.Now()
        }
    }
}

Now we need to update our request handler to use the circuit breaker:

// Handler will hold our client and the circuit breaker for this service
type Handler struct {
    client *http.Client
    cb     *CircuitBreaker
}

// NewHandler creates a new handler
func NewHandler(cb *CircuitBreaker) *Handler {
    return &Handler{
        client: &http.Client{Timeout: 1 * time.Minute},
        cb:     cb,
    }
}

func (h *Handler) MakeRequest() error {
    if err := h.cb.CheckBeforeRequest(); err != nil {
        // Circuit is Open or Half-Open -> fail fast
        slog.Error("Request blocked by circuit breaker",
            slog.String("error", err.Error()),
        )
        return err
    }

    err := h.attemptRequestWithRetry()
    if err != nil {
        // The operation failed after all retries
        h.cb.OnFailure()
        return err
    }

    // The operation succeeded
    h.cb.OnSuccess()
    return nil
}

// Function with retry encapsulated
func (h *Handler) attemptRequestWithRetry() error {
    const MAX_RETRIES = 3
    const BASE_DELAY = 100 * time.Millisecond
    const MAX_JITTER_MS = 100
    idempotencyKey := uuid.NewString()

    r := rand.New(rand.NewSource(time.Now().UnixNano()))

    var lastErr error
    var response *http.Response

    for i := 0; i <= MAX_RETRIES; i++ {
        request, err := http.NewRequest("POST", "http://testing.com", nil)
        if err != nil {
            return fmt.Errorf("error creating the request %v", err)
        }
        request.Header.Add("Idempotency-Key", idempotencyKey)

        // Use the handler's client
        response, err = h.client.Do(request)
        lastErr = err

        if err == nil && response.StatusCode < 500 {
            slog.Info("Request successful", "status", response.Status)
            if response.Body != nil {
                response.Body.Close()
            }
            return nil
        }

        if i == MAX_RETRIES {
            break
        }

        // Calculate backoff and jitter
        backoff := BASE_DELAY * time.Duration(math.Pow(2, float64(i)))
        jitter := time.Duration(r.Intn(MAX_JITTER_MS)) * time.Millisecond
        sleepDuration := backoff + jitter

        // Safely create error message
        errMsg := "server error"
        if err != nil {
            errMsg = err.Error()
        }

        slog.Warn("Request failed, retrying",
            slog.Int("attempt", i+1),
            slog.String("sleep_duration", sleepDuration.String()),
            slog.String("error", errMsg),
        )

        time.Sleep(sleepDuration)
    }

    // All retries failed
    if lastErr != nil {
        return fmt.Errorf("all retries failed, last network error: %v", lastErr)
    }

    return fmt.Errorf("all retries failed, last status: %s", response.Status)
}
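
To tie everything together, here is a hypothetical wiring of the pieces above. The thresholds (five consecutive failures, a 30-second open period) and the module path of the cb package are assumptions; pick values that match your traffic and failure patterns.

package main

import (
    "log/slog"
    "time"

    "example.com/myapp/cb" // assumed module path for the package above
)

func main() {
    // Open the circuit after 5 consecutive failures, probe again after 30s
    breaker := cb.NewCircuitBreaker(5, 30*time.Second)
    handler := cb.NewHandler(breaker)

    if err := handler.MakeRequest(); err != nil {
        slog.Error("request failed", slog.String("err", err.Error()))
    }
}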

Conclusion

Always consider how your system will handle failures when interacting with external services. Remember to weigh the tradeoffs of each strategy, even though, given the benefits, making the application fault tolerant is usually worth it. Please leave a like and a comment about this topic.
