Distributed systems fail. Networks partition, services go down, and databases become unavailable. The question isn't whether your Go microservices will encounter errors—it's how gracefully they'll handle them when they do.
Traditional error handling approaches that work fine for monolithic applications fall apart in distributed environments. A single failed database connection can cascade through multiple services, turning a minor hiccup into a system-wide outage. That's where advanced error handling patterns become critical for building resilient microservices.
This guide covers the essential patterns every Go backend developer needs to know for handling errors in distributed systems, from circuit breakers to graceful degradation strategies.
The Problem with Basic Error Handling in Distributed Systems
Go's explicit error handling is one of its strengths, but basic patterns like this become problematic in distributed systems:
func GetUserProfile(userID string) (*User, error) {
    user, err := userService.GetUser(userID)
    if err != nil {
        return nil, err
    }

    profile, err := profileService.GetProfile(userID)
    if err != nil {
        return nil, err
    }

    return &User{...}, nil
}
This approach has several issues in a microservices context:
- Error propagation without context: Errors bubble up unfiltered, potentially exposing internal architecture details
- No retry logic: Temporary network issues cause immediate failures
- Cascade failures: One service failure brings down dependent services
- Poor observability: No way to trace errors across service boundaries
According to security best practices research, letting errors bubble up unfiltered is particularly dangerous in distributed architectures, as it can expose file paths, library versions, IP addresses, and schema details to unauthorized actors.
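One way to avoid that leakage is to log the detailed error server-side and hand clients only a generic message. A minimal sketch of the idea (the `sanitize` helper and the error value are illustrative, not from any specific library):

```go
package main

import (
	"errors"
	"fmt"
)

// Detailed internal error: useful in logs, dangerous in API responses.
var errDB = errors.New("dial tcp 10.0.3.7:5432: connect: connection refused")

// sanitize returns a client-safe message; the full error stays server-side.
func sanitize(err error) string {
	if err == nil {
		return ""
	}
	return "service temporarily unavailable"
}

func main() {
	fmt.Println("log:", errDB)              // full detail for operators
	fmt.Println("client:", sanitize(errDB)) // generic message for callers
}
```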
Error Wrapping and Context Propagation
The first step toward resilient error handling is adding proper context to errors. Go's errors package provides powerful wrapping capabilities:
package main

import (
    "context"
    "fmt"
)

type ServiceError struct {
    Service   string
    Operation string
    TraceID   string
    Err       error
}

func (e *ServiceError) Error() string {
    return fmt.Sprintf("service=%s operation=%s trace_id=%s: %v",
        e.Service, e.Operation, e.TraceID, e.Err)
}

func (e *ServiceError) Unwrap() error {
    return e.Err
}

func GetUserProfile(ctx context.Context, userID string) (*User, error) {
    traceID := getTraceID(ctx)

    user, err := userService.GetUser(ctx, userID)
    if err != nil {
        return nil, &ServiceError{
            Service:   "user-service",
            Operation: "GetUser",
            TraceID:   traceID,
            Err:       fmt.Errorf("failed to get user %s: %w", userID, err),
        }
    }

    profile, err := profileService.GetProfile(ctx, userID)
    if err != nil {
        return nil, &ServiceError{
            Service:   "profile-service",
            Operation: "GetProfile",
            TraceID:   traceID,
            Err:       fmt.Errorf("failed to get profile %s: %w", userID, err),
        }
    }

    return &User{...}, nil
}
As highlighted in practical error handling guides, using trace IDs in distributed systems is crucial for linking errors from the same request across service boundaries.
Circuit Breaker Pattern
Circuit breakers prevent cascade failures by stopping requests to failing services temporarily. Here's a robust implementation:
package circuitbreaker

import (
    "errors"
    "sync"
    "time"
)

type State int

const (
    StateClosed State = iota
    StateOpen
    StateHalfOpen
)

// Settings configures a CircuitBreaker (referenced by NewCircuitBreaker below).
type Settings struct {
    MaxRequests   uint32
    Interval      time.Duration
    Timeout       time.Duration
    ReadyToTrip   func(counts Counts) bool
    OnStateChange func(name string, from State, to State)
}

type CircuitBreaker struct {
    maxRequests   uint32
    interval      time.Duration
    timeout       time.Duration
    readyToTrip   func(counts Counts) bool
    onStateChange func(name string, from State, to State)

    mutex      sync.Mutex
    state      State
    generation uint64
    counts     Counts
    expiry     time.Time
}

type Counts struct {
    Requests             uint32
    TotalSuccesses       uint32
    TotalFailures        uint32
    ConsecutiveSuccesses uint32
    ConsecutiveFailures  uint32
}

func NewCircuitBreaker(settings Settings) *CircuitBreaker {
    cb := &CircuitBreaker{
        maxRequests:   settings.MaxRequests,
        interval:      settings.Interval,
        timeout:       settings.Timeout,
        readyToTrip:   settings.ReadyToTrip,
        onStateChange: settings.OnStateChange,
    }
    cb.toNewGeneration(time.Now())
    return cb
}
func (cb *CircuitBreaker) Execute(req func() (interface{}, error)) (interface{}, error) {
    generation, err := cb.beforeRequest()
    if err != nil {
        return nil, err
    }

    defer func() {
        if e := recover(); e != nil {
            cb.afterRequest(generation, false)
            panic(e)
        }
    }()

    result, err := req()
    cb.afterRequest(generation, err == nil)
    return result, err
}

func (cb *CircuitBreaker) beforeRequest() (uint64, error) {
    cb.mutex.Lock()
    defer cb.mutex.Unlock()

    now := time.Now()
    state, generation := cb.currentState(now)

    if state == StateOpen {
        return generation, errors.New("circuit breaker is open")
    } else if state == StateHalfOpen && cb.counts.Requests >= cb.maxRequests {
        return generation, errors.New("too many requests")
    }

    cb.counts.Requests++
    return generation, nil
}

func (cb *CircuitBreaker) afterRequest(before uint64, success bool) {
    cb.mutex.Lock()
    defer cb.mutex.Unlock()

    now := time.Now()
    state, generation := cb.currentState(now)
    if generation != before {
        return
    }

    if success {
        cb.onSuccess(state, now)
    } else {
        cb.onFailure(state, now)
    }
}

// currentState resolves expired windows before reporting the state.
func (cb *CircuitBreaker) currentState(now time.Time) (State, uint64) {
    switch cb.state {
    case StateClosed:
        if !cb.expiry.IsZero() && cb.expiry.Before(now) {
            cb.toNewGeneration(now)
        }
    case StateOpen:
        if cb.expiry.Before(now) {
            cb.setState(StateHalfOpen, now)
        }
    }
    return cb.state, cb.generation
}

func (cb *CircuitBreaker) setState(state State, now time.Time) {
    if cb.state == state {
        return
    }
    prev := cb.state
    cb.state = state
    cb.toNewGeneration(now)
    if cb.onStateChange != nil {
        cb.onStateChange("circuit-breaker", prev, state)
    }
}

// toNewGeneration resets counters and schedules the next window expiry.
func (cb *CircuitBreaker) toNewGeneration(now time.Time) {
    cb.generation++
    cb.counts = Counts{}
    switch cb.state {
    case StateClosed:
        if cb.interval == 0 {
            cb.expiry = time.Time{}
        } else {
            cb.expiry = now.Add(cb.interval)
        }
    case StateOpen:
        cb.expiry = now.Add(cb.timeout)
    default: // StateHalfOpen
        cb.expiry = time.Time{}
    }
}

func (cb *CircuitBreaker) onSuccess(state State, now time.Time) {
    cb.counts.TotalSuccesses++
    cb.counts.ConsecutiveSuccesses++
    cb.counts.ConsecutiveFailures = 0
    if state == StateHalfOpen && cb.counts.ConsecutiveSuccesses >= cb.maxRequests {
        cb.setState(StateClosed, now)
    }
}

func (cb *CircuitBreaker) onFailure(state State, now time.Time) {
    cb.counts.TotalFailures++
    cb.counts.ConsecutiveFailures++
    cb.counts.ConsecutiveSuccesses = 0
    switch state {
    case StateClosed:
        if cb.readyToTrip != nil && cb.readyToTrip(cb.counts) {
            cb.setState(StateOpen, now)
        }
    case StateHalfOpen:
        cb.setState(StateOpen, now)
    }
}
Use the circuit breaker to wrap service calls:
func (s *UserService) GetUser(ctx context.Context, userID string) (*User, error) {
    result, err := s.circuitBreaker.Execute(func() (interface{}, error) {
        return s.client.GetUser(ctx, userID)
    })
    if err != nil {
        return nil, fmt.Errorf("circuit breaker: %w", err)
    }
    return result.(*User), nil
}
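The result.(*User) assertion works, but every caller repeats the interface{} round-trip. On Go 1.18+, a small generic helper can centralize it. This sketch assumes only the Execute signature shown above; the Executor interface and passthrough type are illustrative stand-ins:

```go
package main

import "fmt"

// Executor matches the circuit breaker's Execute method shown above.
type Executor interface {
	Execute(req func() (interface{}, error)) (interface{}, error)
}

// Do adapts Execute so callers get a typed result instead of interface{}.
func Do[T any](e Executor, fn func() (T, error)) (T, error) {
	res, err := e.Execute(func() (interface{}, error) {
		return fn()
	})
	if err != nil {
		var zero T
		return zero, err
	}
	return res.(T), nil
}

// passthrough is a stand-in Executor used only for this demonstration.
type passthrough struct{}

func (passthrough) Execute(req func() (interface{}, error)) (interface{}, error) {
	return req()
}

func main() {
	n, err := Do(passthrough{}, func() (int, error) { return 42, nil })
	fmt.Println(n, err)
}
```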
Retry Mechanisms with Exponential Backoff
Network programming research shows that implementing proper retry mechanisms helps make applications more resilient and reliable. Here's a sophisticated retry implementation:
package retry

import (
    "context"
    "errors"
    "fmt"
    "math"
    "math/rand"
    "net"
    "time"
)

type Config struct {
    MaxAttempts int
    BaseDelay   time.Duration
    MaxDelay    time.Duration
    Multiplier  float64
    Jitter      bool
    RetryIf     func(error) bool
}

func DefaultConfig() Config {
    return Config{
        MaxAttempts: 3,
        BaseDelay:   100 * time.Millisecond,
        MaxDelay:    30 * time.Second,
        Multiplier:  2.0,
        Jitter:      true,
        RetryIf:     IsRetryableError,
    }
}

func IsRetryableError(err error) bool {
    // net.Error is an interface, so match it by value, not by pointer
    var netErr net.Error
    if errors.As(err, &netErr) && netErr.Timeout() {
        return true
    }
    // Add more retryable error types
    return false
}
func Do(ctx context.Context, config Config, fn func() error) error {
    var lastErr error

    for attempt := 0; attempt < config.MaxAttempts; attempt++ {
        if attempt > 0 {
            delay := calculateDelay(config, attempt)
            select {
            case <-time.After(delay):
            case <-ctx.Done():
                return ctx.Err()
            }
        }

        err := fn()
        if err == nil {
            return nil
        }

        lastErr = err
        if !config.RetryIf(err) {
            return err
        }

        if attempt == config.MaxAttempts-1 {
            break
        }
    }

    return fmt.Errorf("retry failed after %d attempts: %w", config.MaxAttempts, lastErr)
}

func calculateDelay(config Config, attempt int) time.Duration {
    delay := float64(config.BaseDelay) * math.Pow(config.Multiplier, float64(attempt))
    if delay > float64(config.MaxDelay) {
        delay = float64(config.MaxDelay)
    }
    if config.Jitter {
        // Add ±25% jitter
        jitter := delay * 0.25
        delay += (rand.Float64()*2 - 1) * jitter
    }
    return time.Duration(delay)
}
Integrate retry logic with service calls:
func (s *UserService) GetUserWithRetry(ctx context.Context, userID string) (*User, error) {
    var user *User
    err := retry.Do(ctx, retry.DefaultConfig(), func() error {
        var err error
        user, err = s.client.GetUser(ctx, userID)
        return err
    })
    return user, err
}
Graceful Degradation Patterns
When services fail, graceful degradation allows your system to continue operating with reduced functionality:
type UserProfileService struct {
    userService    UserService
    profileService ProfileService
    cacheService   CacheService
    circuitBreaker *CircuitBreaker
}
func (s *UserProfileService) GetUserProfile(ctx context.Context, userID string) (*UserProfile, error) {
    profile := &UserProfile{UserID: userID}
    var errs []error // named errs to avoid shadowing the errors package

    // Try to get user data with fallback to cache
    user, err := s.getUserWithFallback(ctx, userID)
    if err != nil {
        errs = append(errs, fmt.Errorf("user service: %w", err))
        // Continue with minimal profile
        profile.Name = "Unknown User"
    } else {
        profile.Name = user.Name
        profile.Email = user.Email
    }

    // Try to get profile data with graceful degradation
    profileData, err := s.getProfileWithDegradation(ctx, userID)
    if err != nil {
        errs = append(errs, fmt.Errorf("profile service: %w", err))
        // Set defaults for missing profile data
        profile.Preferences = getDefaultPreferences()
    } else {
        profile.Preferences = profileData.Preferences
        profile.Settings = profileData.Settings
    }

    // Return partial success if at least one service responded
    if len(errs) < 2 {
        if len(errs) > 0 {
            // Log the degraded response but don't fail the request
            logDegradedService(ctx, userID, errs)
        }
        return profile, nil
    }

    // Complete failure: neither service returned data
    return nil, fmt.Errorf("unable to build user profile: %v", errs)
}
func (s *UserProfileService) getUserWithFallback(ctx context.Context, userID string) (*User, error) {
    // Try primary service first
    user, err := s.userService.GetUser(ctx, userID)
    if err == nil {
        return user, nil
    }

    // Check if circuit breaker is open or service is down
    if isServiceUnavailable(err) {
        // Try cache as fallback
        cached, cacheErr := s.cacheService.GetUser(ctx, userID)
        if cacheErr == nil {
            return cached, nil
        }
    }

    return nil, err
}

func (s *UserProfileService) getProfileWithDegradation(ctx context.Context, userID string) (*Profile, error) {
    // Set shorter timeout for non-critical data
    degradedCtx, cancel := context.WithTimeout(ctx, 1*time.Second)
    defer cancel()

    profile, err := s.profileService.GetProfile(degradedCtx, userID)
    if err != nil {
        // Don't fail hard on profile service issues
        return nil, fmt.Errorf("profile unavailable (degraded): %w", err)
    }
    return profile, nil
}
Error Observability and Monitoring
Proper error tracking is crucial for distributed systems. Implement structured error logging with metrics:
package monitoring

import (
    "context"
    "fmt"
    "log/slog"
    "time"
)

type ErrorTracker struct {
    logger  *slog.Logger
    metrics MetricsCollector
}

type ErrorMetadata struct {
    Service   string
    Operation string
    ErrorType string
    TraceID   string
    UserID    string
    Duration  time.Duration
    Retryable bool
}

func (et *ErrorTracker) TrackError(ctx context.Context, err error, metadata ErrorMetadata) {
    // Structured logging
    et.logger.ErrorContext(ctx, "service error",
        slog.String("service", metadata.Service),
        slog.String("operation", metadata.Operation),
        slog.String("error_type", metadata.ErrorType),
        slog.String("trace_id", metadata.TraceID),
        slog.String("user_id", metadata.UserID),
        slog.Duration("duration", metadata.Duration),
        slog.Bool("retryable", metadata.Retryable),
        slog.String("error", err.Error()),
    )

    // Metrics collection
    et.metrics.IncrementCounter("errors_total", map[string]string{
        "service":    metadata.Service,
        "operation":  metadata.Operation,
        "error_type": metadata.ErrorType,
        "retryable":  fmt.Sprintf("%t", metadata.Retryable),
    })

    et.metrics.RecordDuration("error_duration", metadata.Duration, map[string]string{
        "service":   metadata.Service,
        "operation": metadata.Operation,
    })
}

func (et *ErrorTracker) TrackRecovery(ctx context.Context, metadata ErrorMetadata) {
    et.logger.InfoContext(ctx, "service recovered",
        slog.String("service", metadata.Service),
        slog.String("operation", metadata.Operation),
        slog.String("trace_id", metadata.TraceID),
    )

    et.metrics.IncrementCounter("recoveries_total", map[string]string{
        "service":   metadata.Service,
        "operation": metadata.Operation,
    })
}
Testing Error Scenarios
Test your error handling patterns thoroughly:
func TestCircuitBreakerFailure(t *testing.T) {
    failingService := &MockUserService{
        shouldFail: true,
    }

    cb := circuitbreaker.NewCircuitBreaker(circuitbreaker.Settings{
        MaxRequests: 3,
        Interval:    time.Second,
        Timeout:     time.Second,
        ReadyToTrip: func(counts circuitbreaker.Counts) bool {
            return counts.ConsecutiveFailures >= 2
        },
    })

    service := &UserService{
        client:         failingService,
        circuitBreaker: cb,
    }

    // First two requests should fail and trip the circuit
    for i := 0; i < 2; i++ {
        _, err := service.GetUser(context.Background(), "user123")
        assert.Error(t, err)
    }

    // Third request should fail immediately due to open circuit
    _, err := service.GetUser(context.Background(), "user123")
    assert.Error(t, err)
    assert.Contains(t, err.Error(), "circuit breaker is open")
}
func TestGracefulDegradation(t *testing.T) {
    tests := []struct {
        name           string
        userServiceErr error
        profileErr     error
        expectedName   string
        shouldSucceed  bool
    }{
        {
            name:          "both services working",
            expectedName:  "John Doe",
            shouldSucceed: true,
        },
        {
            name:          "profile service down",
            profileErr:    errors.New("service unavailable"),
            expectedName:  "John Doe",
            shouldSucceed: true,
        },
        {
            name:           "user service down",
            userServiceErr: errors.New("service unavailable"),
            expectedName:   "Unknown User",
            shouldSucceed:  true,
        },
        {
            name:           "both services down",
            userServiceErr: errors.New("service unavailable"),
            profileErr:     errors.New("service unavailable"),
            shouldSucceed:  false,
        },
    }

    for _, tt := range tests {
        t.Run(tt.name, func(t *testing.T) {
            // Test implementation
        })
    }
}
Performance Considerations
Error handling patterns add overhead, so monitor their performance impact:
type PerformanceAwareRetry struct {
    config    retry.Config
    metrics   MetricsCollector
    threshold time.Duration
}

func (par *PerformanceAwareRetry) Do(ctx context.Context, fn func() error) error {
    start := time.Now()
    err := retry.Do(ctx, par.config, fn)
    duration := time.Since(start)

    par.metrics.RecordDuration("retry_duration", duration, map[string]string{
        "success": fmt.Sprintf("%t", err == nil),
    })

    // Alert if retries are taking too long
    if duration > par.threshold {
        par.metrics.IncrementCounter("slow_retries", nil)
    }

    return err
}
Best Practices for Go Microservices
Based on debugging research for distributed systems, knowing where an error originated and how it propagated through your code is invaluable. Follow these practices:
- Always wrap errors with context: Include service name, operation, and trace IDs
- Implement circuit breakers for external dependencies: Prevent cascade failures
- Use exponential backoff with jitter: Avoid thundering herd problems
- Design for graceful degradation: Identify which features are essential vs. nice-to-have
- Monitor error rates and patterns: Set up alerts for unusual error spikes
- Test failure scenarios: Include chaos engineering in your testing strategy
- Sanitize errors before exposing them: Never leak internal details to external clients
The combination of these patterns creates a resilient microservices architecture that can handle the inevitable failures in distributed systems while maintaining good user experience and system stability.
Remember that error handling isn't just about preventing crashes—it's about building systems that fail gracefully and recover quickly. In the world of Go backend development and microservices architecture, these patterns are essential tools for creating production-ready systems that can withstand the challenges of distributed computing.