Hossein Esmati

Posted on Jun 26 • Originally published at nova-globen.se

Resilience in .NET: Building Fault-Tolerant Applications - Circuit Breakers


This article is part of the Comprehensive Guide to Microservices Architecture in .NET Core, Cloud and Azure series.

Resilience and Circuit Breakers

Further reading: "Implement the Circuit Breaker pattern"

Resilience in software architecture refers to an application's ability to recover from failures and continue functioning. In microservices, this often means gracefully handling network issues, service outages, or slow responses.

The Circuit Breaker pattern is a resilience strategy that prevents an application from repeatedly trying operations likely to fail. It "breaks" the circuit after a threshold of failures, temporarily halting requests to give the system time to recover. This protects services from overload and improves overall stability.

Together, resilience techniques like retries and circuit breakers help build robust, fault-tolerant systems that thrive in unpredictable environments.
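To make the closed, open, and half-open behavior concrete, here is a minimal sketch of a circuit breaker written against the base class library only. `SimpleCircuitBreaker`, its consecutive-failure threshold, and the injectable clock are hypothetical names chosen for illustration; this is not how Polly implements the pattern, and the sketch is not thread-safe.

```csharp
using System;

// Illustrative only: a simplified, single-threaded circuit breaker.
public enum CircuitState { Closed, Open, HalfOpen }

public sealed class SimpleCircuitBreaker
{
    private readonly int _failureThreshold;
    private readonly TimeSpan _breakDuration;
    private readonly Func<DateTimeOffset> _clock;
    private int _consecutiveFailures;
    private DateTimeOffset _openedAt;

    public CircuitState State { get; private set; } = CircuitState.Closed;

    public SimpleCircuitBreaker(
        int failureThreshold,
        TimeSpan breakDuration,
        Func<DateTimeOffset> clock)
    {
        _failureThreshold = failureThreshold;
        _breakDuration = breakDuration;
        _clock = clock; // injected so the break duration can be tested without waiting
    }

    public T Execute<T>(Func<T> operation)
    {
        if (State == CircuitState.Open)
        {
            // While the break is active, fail fast without calling the dependency.
            if (_clock() - _openedAt < _breakDuration)
                throw new InvalidOperationException("Circuit is open; call rejected.");

            // Break elapsed: let one trial call through.
            State = CircuitState.HalfOpen;
        }

        try
        {
            var result = operation();
            _consecutiveFailures = 0;
            State = CircuitState.Closed;
            return result;
        }
        catch
        {
            _consecutiveFailures++;

            // A failed trial call, or too many failures in a row, opens the circuit.
            if (State == CircuitState.HalfOpen || _consecutiveFailures >= _failureThreshold)
            {
                State = CircuitState.Open;
                _openedAt = _clock();
            }

            throw;
        }
    }
}
```

In production code the library-provided strategies shown later in this article are the better choice; the sketch only shows why a broken circuit protects the downstream service: rejected calls never reach it.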

Centralized Resilience Policies

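Microsoft.Extensions.Resilience and Microsoft.Extensions.Http.Resilience, first released alongside .NET 8 and fully usable on .NET 9, let you define resilience policies in one place: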

Read about it here: Build resilient HTTP apps: Key development patterns

Standard Resilience Handler (.NET 9):

// Install packages
// Microsoft.Extensions.Http.Resilience
// Microsoft.Extensions.Resilience

// Configure resilience pipeline with standard handler
services.AddHttpClient<IOrderServiceClient, OrderServiceClient>(client =>
{
    client.BaseAddress = new Uri("http://order-service");
})
.AddStandardResilienceHandler(options =>
{
    // Configure retry with exponential backoff
    options.Retry = new HttpRetryStrategyOptions
    {
        MaxRetryAttempts = 3,
        Delay = TimeSpan.FromSeconds(1),
        BackoffType = DelayBackoffType.Exponential,
        UseJitter = true,
        ShouldHandle = new PredicateBuilder<HttpResponseMessage>()
            .Handle<HttpRequestException>()
            .HandleResult(response => 
                response.StatusCode >= HttpStatusCode.InternalServerError ||
                response.StatusCode == HttpStatusCode.RequestTimeout)
    };

    // Configure circuit breaker
    options.CircuitBreaker = new HttpCircuitBreakerStrategyOptions
    {
        FailureRatio = 0.5,
        SamplingDuration = TimeSpan.FromSeconds(30),
        MinimumThroughput = 10,
        BreakDuration = TimeSpan.FromSeconds(30),
        ShouldHandle = new PredicateBuilder<HttpResponseMessage>()
            .Handle<HttpRequestException>()
            .HandleResult(response => 
                response.StatusCode >= HttpStatusCode.InternalServerError)
    };

    // Configure timeout
    options.AttemptTimeout = new HttpTimeoutStrategyOptions
    {
        Timeout = TimeSpan.FromSeconds(3)
    };

    options.TotalRequestTimeout = new HttpTimeoutStrategyOptions
    {
        Timeout = TimeSpan.FromSeconds(10)
    };
});
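To see what the retry settings above mean in wall-clock terms, the following sketch computes the nominal delays for exponential backoff. It assumes the delay before retry n (zero-based) is the base delay multiplied by 2^n and capped at a maximum; the jitter that `UseJitter = true` adds is deliberately not modelled, and `BackoffSchedule` is a hypothetical helper, not part of any library.

```csharp
using System;
using System.Collections.Generic;

public static class BackoffSchedule
{
    // Returns the nominal (un-jittered) wait before each retry attempt.
    public static IReadOnlyList<TimeSpan> Exponential(
        TimeSpan baseDelay,
        int maxRetryAttempts,
        TimeSpan maxDelay)
    {
        var delays = new List<TimeSpan>(maxRetryAttempts);

        for (var attempt = 0; attempt < maxRetryAttempts; attempt++)
        {
            // Double the delay on every attempt, but never exceed the cap.
            var seconds = baseDelay.TotalSeconds * Math.Pow(2, attempt);
            delays.Add(TimeSpan.FromSeconds(Math.Min(seconds, maxDelay.TotalSeconds)));
        }

        return delays;
    }
}
```

Under these assumptions, a 1 second base delay and three retries give waits of roughly 1, 2 and 4 seconds, or 7 seconds in total. Combined with a 3 second attempt timeout, that can exceed the 10 second total request timeout configured above, so the last retry may never run. It is worth checking that the timeouts and the retry budget agree with each other.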

Custom Resilience Pipeline with .NET 9 Enhancements:

services.AddResiliencePipeline("my-pipeline", builder =>
{
    // Add retry with exponential backoff and jitter
    builder.AddRetry(new RetryStrategyOptions
    {
        MaxRetryAttempts = 3,
        Delay = TimeSpan.FromSeconds(1),
        BackoffType = DelayBackoffType.Exponential,
        UseJitter = true,
        MaxDelay = TimeSpan.FromSeconds(30),
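        ShouldHandle = new PredicateBuilder()
            .Handle<HttpRequestException>()
            // Polly's timeout strategy throws TimeoutRejectedException (Polly.Timeout), not TimeoutException
            .Handle<TimeoutRejectedException>(),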
        OnRetry = args =>
        {
            Console.WriteLine($"Retry attempt {args.AttemptNumber} after {args.RetryDelay}");
            return ValueTask.CompletedTask;
        }
    });

    // Add circuit breaker with state change notifications
    builder.AddCircuitBreaker(new CircuitBreakerStrategyOptions
    {
        FailureRatio = 0.5,
        SamplingDuration = TimeSpan.FromSeconds(30),
        MinimumThroughput = 10,
        BreakDuration = TimeSpan.FromSeconds(30),
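        ShouldHandle = new PredicateBuilder()
            .Handle<HttpRequestException>()
            // Polly's timeout strategy throws TimeoutRejectedException (Polly.Timeout), not TimeoutException
            .Handle<TimeoutRejectedException>(),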
        OnOpened = args =>
        {
            Console.WriteLine($"Circuit breaker opened at {DateTime.UtcNow}");
            return ValueTask.CompletedTask;
        },
        OnClosed = args =>
        {
            Console.WriteLine($"Circuit breaker closed at {DateTime.UtcNow}");
            return ValueTask.CompletedTask;
        },
        OnHalfOpened = args =>
        {
            Console.WriteLine($"Circuit breaker half-opened at {DateTime.UtcNow}");
            return ValueTask.CompletedTask;
        }
    });

    // Add timeout strategy
    builder.AddTimeout(new TimeoutStrategyOptions
    {
        Timeout = TimeSpan.FromSeconds(10),
        OnTimeout = args =>
        {
            Console.WriteLine($"Operation timed out after {args.Timeout}");
            return ValueTask.CompletedTask;
        }
    });

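    // Hedging (parallel requests, useful for idempotent reads) is only available on
    // typed pipelines: AddHedging extends ResiliencePipelineBuilder<TResult> and takes
    // HedgingStrategyOptions<TResult>, so it cannot be added to this non-generic builder.
    // Register a separate typed pipeline, for example
    // services.AddResiliencePipeline<string, HttpResponseMessage>("my-hedged-pipeline", ...),
    // if hedged attempts are needed.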
});

// Usage with dependency injection
public class OrderService
{
    private readonly ResiliencePipeline _pipeline;
    private readonly HttpClient _httpClient;

    public OrderService(
        ResiliencePipelineProvider<string> pipelineProvider,
        HttpClient httpClient)
    {
        _pipeline = pipelineProvider.GetPipeline("my-pipeline");
        _httpClient = httpClient;
    }

    public async Task<Order> GetOrderAsync(Guid id, CancellationToken cancellationToken = default)
    {
        return await _pipeline.ExecuteAsync(async ct =>
        {
            // Dispose the response so the underlying connection is released promptly
            using var response = await _httpClient.GetAsync($"/api/orders/{id}", ct);
            response.EnsureSuccessStatusCode();
            return await response.Content.ReadFromJsonAsync<Order>(cancellationToken: ct) 
                ?? throw new InvalidOperationException("Failed to deserialize order");
        }, cancellationToken);
    }
}
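The circuit breaker options used above (`FailureRatio`, `MinimumThroughput`, `SamplingDuration`) describe a ratio-based rule rather than a simple failure count. The sketch below is a simplified reading of that rule for a single sampling window: the circuit opens only when enough calls were observed and the share of failures reached the configured ratio. `FailureRatioBreaker` is a hypothetical helper, and the real strategy tracks a rolling window rather than two counters.

```csharp
public static class FailureRatioBreaker
{
    // Decide whether a breaker should open for one sampling window.
    public static bool ShouldOpen(
        int failures,
        int totalCalls,
        double failureRatio,
        int minimumThroughput)
    {
        // Too few samples: a handful of failures on a quiet service is not conclusive.
        if (totalCalls < minimumThroughput)
            return false;

        return (double)failures / totalCalls >= failureRatio;
    }
}
```

With `FailureRatio = 0.5` and `MinimumThroughput = 10`, four failures out of eight calls do not open the circuit (too little traffic), while five failures out of ten do.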

Polly v8 Integration:

Microsoft.Extensions.Resilience (and Microsoft.Extensions.Http.Resilience) are built on top of Polly v8 and provide first-class dependency injection, options pattern integration, built-in telemetry, and ready-made pipelines (hedging, timeouts, retries, circuit-breaker) for .NET applications.

Microsoft deprecated the older HttpClient integration packages (Microsoft.Extensions.Http.Polly and Polly.Extensions.Http) in favor of the new Microsoft.Extensions.Http.Resilience package. That is an integration change; Polly itself is not obsolete.

Use Microsoft.Extensions.Http.Resilience for HttpClient scenarios and the broader Microsoft.Extensions.Resilience package when you want opinionated, preconfigured pipelines. Use Polly v8 directly if you need custom strategies or want to apply resilience outside those integrations. (Microsoft for Developers: "Building resilient cloud services with .NET 8")

Legacy Polly Example (for reference):

// Older Polly approach - consider migrating to Microsoft.Extensions.Resilience
services.AddHttpClient<IOrderServiceClient, OrderServiceClient>()
    .AddPolicyHandler(GetRetryPolicy())
    .AddPolicyHandler(GetCircuitBreakerPolicy());

static IAsyncPolicy<HttpResponseMessage> GetRetryPolicy()
{
    return HttpPolicyExtensions
        .HandleTransientHttpError()
        .OrResult(msg => msg.StatusCode == System.Net.HttpStatusCode.TooManyRequests)
        .WaitAndRetryAsync(
            retryCount: 3,
            sleepDurationProvider: retryAttempt => 
                TimeSpan.FromSeconds(Math.Pow(2, retryAttempt)),
            onRetry: (outcome, timespan, retryCount, context) =>
            {
                Log.Warning($"Retry {retryCount} after {timespan}");
            });
}

static IAsyncPolicy<HttpResponseMessage> GetCircuitBreakerPolicy()
{
    return HttpPolicyExtensions
        .HandleTransientHttpError()
        .CircuitBreakerAsync(
            handledEventsAllowedBeforeBreaking: 5,
            durationOfBreak: TimeSpan.FromSeconds(30),
            onBreak: (outcome, duration) =>
            {
                Log.Warning($"Circuit breaker opened for {duration}");
            },
            onReset: () =>
            {
                Log.Information("Circuit breaker reset");
            });
}

Azure Resilience Features

Azure Service Bus with Built-in Resilience:

// Azure Service Bus client with automatic retries
var clientOptions = new ServiceBusClientOptions
{
    RetryOptions = new ServiceBusRetryOptions
    {
        Mode = ServiceBusRetryMode.Exponential,
        MaxRetries = 3,
        Delay = TimeSpan.FromSeconds(1),
        MaxDelay = TimeSpan.FromSeconds(30),
        TryTimeout = TimeSpan.FromSeconds(10)
    }
};

var client = new ServiceBusClient(connectionString, clientOptions);

// Combine with resilience pipeline for additional protection
services.AddResiliencePipeline("servicebus-pipeline", builder =>
{
    builder.AddCircuitBreaker(new CircuitBreakerStrategyOptions
    {
        FailureRatio = 0.5,
        MinimumThroughput = 10,
        BreakDuration = TimeSpan.FromSeconds(30)
    });
});

Azure Storage with Resilience:

// Configure blob client with retry policy
var blobClientOptions = new BlobClientOptions
{
    Retry =
    {
        Mode = RetryMode.Exponential,
        MaxRetries = 3,
        Delay = TimeSpan.FromSeconds(2),
        MaxDelay = TimeSpan.FromSeconds(30),
        NetworkTimeout = TimeSpan.FromSeconds(100)
    }
};

var blobServiceClient = new BlobServiceClient(connectionString, blobClientOptions);

Azure App Configuration for Dynamic Feature Flags:

// Configure Azure App Configuration with resilience
builder.Configuration.AddAzureAppConfiguration(options =>
{
    options.Connect(connectionString)
        .ConfigureRefresh(refresh =>
        {
            refresh.Register("Settings:Sentinel", refreshAll: true)
                .SetCacheExpiration(TimeSpan.FromSeconds(30));
        })
        .UseFeatureFlags(featureFlagOptions =>
        {
            featureFlagOptions.CacheExpirationInterval = TimeSpan.FromSeconds(30);
        });
});

// Add Azure App Configuration middleware
app.UseAzureAppConfiguration();

Graceful Degradation

Graceful degradation is the ability to continue functioning (possibly with reduced capabilities) when parts of the system fail.

1. Fallback Responses with Multiple Tiers:

public class ProductService
{
    private readonly IProductRepository _repository;
    private readonly IDistributedCache _cache;
    private readonly IMemoryCache _memoryCache;
    private readonly ILogger<ProductService> _logger;
    private readonly ResiliencePipeline _pipeline;

    public ProductService(
        IProductRepository repository,
        IDistributedCache cache,
        IMemoryCache memoryCache,
        ILogger<ProductService> logger,
        ResiliencePipelineProvider<string> pipelineProvider)
    {
        _repository = repository;
        _cache = cache;
        _memoryCache = memoryCache;
        _logger = logger;
        _pipeline = pipelineProvider.GetPipeline("product-service");
    }

    public async Task<Product> GetProductAsync(Guid id, CancellationToken cancellationToken = default)
    {
        // Try memory cache first (fastest)
        if (_memoryCache.TryGetValue($"product:{id}", out Product? cachedProduct))
        {
            return cachedProduct!;
        }

        try
        {
            // Try primary data source with resilience
            var product = await _pipeline.ExecuteAsync(
                async ct => await _repository.GetByIdAsync(id, ct),
                cancellationToken);

            // Cache successful result
            _memoryCache.Set($"product:{id}", product, TimeSpan.FromMinutes(5));
            await _cache.SetStringAsync(
                $"product:{id}", 
                JsonSerializer.Serialize(product),
                new DistributedCacheEntryOptions 
                { 
                    AbsoluteExpirationRelativeToNow = TimeSpan.FromHours(1) 
                },
                cancellationToken);

            return product;
        }
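        // Let caller-initiated cancellation propagate instead of returning a degraded product
        catch (Exception ex) when (ex is not OperationCanceledException)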
        {
            _logger.LogWarning(ex, "Primary data source failed for product {ProductId}, trying distributed cache", id);

            try
            {
                // Fallback to distributed cache
                var cached = await _cache.GetStringAsync($"product:{id}", cancellationToken);
                if (cached != null)
                {
                    var product = JsonSerializer.Deserialize<Product>(cached);
                    if (product != null)
                    {
                        // Update memory cache
                        _memoryCache.Set($"product:{id}", product, TimeSpan.FromMinutes(5));
                        return product;
                    }
                }
            }
            catch (Exception cacheEx)
            {
                _logger.LogError(cacheEx, "Distributed cache fallback failed for product {ProductId}", id);
            }

            // Return degraded response
            _logger.LogWarning("All data sources failed for product {ProductId}, returning degraded response", id);
            return new Product
            {
                Id = id,
                Name = "Product information temporarily unavailable",
                Description = "We're experiencing technical difficulties. Please try again later.",
                IsAvailable = false,
                Price = 0
            };
        }
    }
}
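The tiered lookup above can also be expressed as a small reusable helper. The sketch below is one possible shape, assuming each tier is a delegate that returns `null` on a miss and throws on failure; `TieredFallback` and its parameter names are hypothetical, and a real implementation would log each failed tier.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public static class TieredFallback
{
    // Tries each tier in order and returns the first non-null value.
    // If every tier misses or fails, the degraded value is returned instead.
    public static async Task<T> GetAsync<T>(
        IReadOnlyList<Func<CancellationToken, Task<T?>>> tiers,
        Func<T> degraded,
        CancellationToken cancellationToken = default) where T : class
    {
        foreach (var tier in tiers)
        {
            try
            {
                var value = await tier(cancellationToken);
                if (value is not null)
                    return value;
            }
            catch (Exception ex) when (ex is not OperationCanceledException)
            {
                // Fall through to the next tier; cancellation is not swallowed.
            }
        }

        return degraded();
    }
}
```

Keeping cancellation out of the fallback path matters: a caller that gave up should not receive a placeholder product as if it were a valid answer.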

2. Feature Management with Azure App Configuration:

Load Shedding and Rate Limiting

Load shedding is the practice of deliberately dropping requests when the system is overloaded to maintain stability for accepted requests.

Applications using rate limiting and load shedding should be carefully load tested and reviewed before deploying.

Enhanced Rate Limiting in .NET 9:

// Rate limiting middleware in ASP.NET Core 9
// https://learn.microsoft.com/en-us/aspnet/core/performance/rate-limit?view=aspnetcore-9.0

services.AddRateLimiter(options =>
{
    // Global rate limiter with user-specific limits
    options.GlobalLimiter = PartitionedRateLimiter.Create<HttpContext, string>(context =>
    {
        var userId = context.User.Identity?.Name ?? 
                     context.Connection.RemoteIpAddress?.ToString() ?? 
                     "anonymous";

        // Premium users get higher limits
        var isPremium = context.User.HasClaim("tier", "premium");
        var permitLimit = isPremium ? 200 : 100;

        return RateLimitPartition.GetFixedWindowLimiter(
            partitionKey: userId,
            factory: partition => new FixedWindowRateLimiterOptions
            {
                PermitLimit = permitLimit,
                Window = TimeSpan.FromMinutes(1),
                QueueProcessingOrder = QueueProcessingOrder.OldestFirst,
                QueueLimit = 10
            });
    });

    // Sliding window for API endpoints
    options.AddPolicy("api-sliding", context =>
    {
        return RateLimitPartition.GetSlidingWindowLimiter(
            partitionKey: context.Connection.RemoteIpAddress?.ToString() ?? "unknown",
            factory: partition => new SlidingWindowRateLimiterOptions
            {
                PermitLimit = 100,
                Window = TimeSpan.FromMinutes(1),
                SegmentsPerWindow = 6,
                QueueProcessingOrder = QueueProcessingOrder.OldestFirst,
                QueueLimit = 5
            });
    });

    // Token bucket for burst handling
    options.AddPolicy("token-bucket", context =>
    {
        return RateLimitPartition.GetTokenBucketLimiter(
            partitionKey: context.User.Identity?.Name ?? "anonymous",
            factory: partition => new TokenBucketRateLimiterOptions
            {
                TokenLimit = 100,
                ReplenishmentPeriod = TimeSpan.FromSeconds(10),
                TokensPerPeriod = 10,
                QueueProcessingOrder = QueueProcessingOrder.OldestFirst,
                QueueLimit = 10
            });
    });

    // Concurrency limiter for resource-intensive operations
    options.AddConcurrencyLimiter("concurrent-requests", limiterOptions =>
    {
        limiterOptions.PermitLimit = 50;
        limiterOptions.QueueLimit = 25;
        limiterOptions.QueueProcessingOrder = QueueProcessingOrder.OldestFirst;
    });

    options.OnRejected = async (context, token) =>
    {
        context.HttpContext.Response.StatusCode = StatusCodes.Status429TooManyRequests;

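        if (context.Lease.TryGetMetadata(MetadataName.RetryAfter, out var retryAfter))
        {
            // Retry-After expects whole seconds; format them culture-invariantly
            context.HttpContext.Response.Headers.RetryAfter =
                ((int)retryAfter.TotalSeconds).ToString(System.Globalization.NumberFormatInfo.InvariantInfo);
        }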

        await context.HttpContext.Response.WriteAsJsonAsync(
            new { error = "Too many requests. Please try again later." }, 
            token);
    };
});

app.UseRateLimiter();

// Per-endpoint rate limiting using a named policy
[EnableRateLimiting("api-sliding")]
[HttpGet("orders")]
public async Task<IActionResult> GetOrders()
{
    // Implementation
    return Ok();
}

// Disable rate limiting for specific endpoints
[DisableRateLimiting]
[HttpGet("health")]
public IActionResult Health()
{
    return Ok();
}
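The token bucket policy above is the least intuitive of the four limiters, so here is a minimal sketch of the underlying idea: the bucket starts full, each request takes one token, and tokens are added back at a fixed rate up to the limit. `SimpleTokenBucket` is a hypothetical, single-threaded illustration with an injectable clock; it is not the `TokenBucketRateLimiter` from System.Threading.RateLimiting and has no queueing.

```csharp
using System;

public sealed class SimpleTokenBucket
{
    private readonly int _tokenLimit;
    private readonly int _tokensPerPeriod;
    private readonly TimeSpan _period;
    private readonly Func<DateTimeOffset> _clock;
    private long _tokens;
    private DateTimeOffset _lastRefill;

    public SimpleTokenBucket(
        int tokenLimit,
        int tokensPerPeriod,
        TimeSpan period,
        Func<DateTimeOffset> clock)
    {
        if (period <= TimeSpan.Zero)
            throw new ArgumentOutOfRangeException(nameof(period));

        _tokenLimit = tokenLimit;
        _tokensPerPeriod = tokensPerPeriod;
        _period = period;
        _clock = clock;
        _tokens = tokenLimit; // start full so an initial burst is allowed
        _lastRefill = clock();
    }

    // Returns true if the request may proceed, false if it should be rejected.
    public bool TryAcquire()
    {
        Refill();

        if (_tokens == 0)
            return false;

        _tokens--;
        return true;
    }

    private void Refill()
    {
        // Only whole elapsed periods add tokens.
        var elapsedPeriods = (_clock() - _lastRefill).Ticks / _period.Ticks;
        if (elapsedPeriods <= 0)
            return;

        _tokens = Math.Min(_tokenLimit, _tokens + elapsedPeriods * _tokensPerPeriod);
        _lastRefill += TimeSpan.FromTicks(elapsedPeriods * _period.Ticks);
    }
}
```

With `TokenLimit = 100`, `TokensPerPeriod = 10` and a 10 second period, a client can burst up to 100 requests and then sustain about one request per second, which is the trade-off the "token-bucket" policy above encodes.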

Advanced Load Shedding with Metrics:

// AddConcurrencyLimiter on a resilience pipeline requires the Polly.RateLimiting package
services.AddResiliencePipeline("api-handler", builder =>
{
    builder.AddConcurrencyLimiter(new ConcurrencyLimiterOptions
    {
        PermitLimit = 100,
        QueueLimit = 50,
        QueueProcessingOrder = QueueProcessingOrder.OldestFirst
    });
});

// Advanced load shedding middleware with health metrics
public class AdaptiveLoadSheddingMiddleware
{
    private readonly RequestDelegate _next;
    private readonly SemaphoreSlim _semaphore;
    private readonly ILogger<AdaptiveLoadSheddingMiddleware> _logger;
    private readonly IMeterFactory _meterFactory;
    private readonly Counter<long> _rejectedRequestsCounter;
    private readonly Counter<long> _acceptedRequestsCounter;
    private int _currentLoad = 0;

    public AdaptiveLoadSheddingMiddleware(
        RequestDelegate next, 
        ILogger<AdaptiveLoadSheddingMiddleware> logger,
        IMeterFactory meterFactory)
    {
        _next = next;
        _semaphore = new SemaphoreSlim(100, 100);
        _logger = logger;
        _meterFactory = meterFactory;

        var meter = _meterFactory.Create("LoadShedding");
        _rejectedRequestsCounter = meter.CreateCounter<long>("requests.rejected");
        _acceptedRequestsCounter = meter.CreateCounter<long>("requests.accepted");
    }

    public async Task InvokeAsync(HttpContext context)
    {
        var currentLoad = Interlocked.Increment(ref _currentLoad);

        try
        {
            // Check if we're at capacity
            if (!await _semaphore.WaitAsync(TimeSpan.Zero))
            {
                _logger.LogWarning("Server overloaded (load: {CurrentLoad}), shedding request", currentLoad);
                _rejectedRequestsCounter.Add(1, new KeyValuePair<string, object?>("reason", "capacity"));

                context.Response.StatusCode = StatusCodes.Status503ServiceUnavailable;
                context.Response.Headers.RetryAfter = "30";
                await context.Response.WriteAsJsonAsync(new 
                { 
                    error = "Service temporarily unavailable",
                    retryAfter = 30
                });
                return;
            }

            _acceptedRequestsCounter.Add(1);

            try
            {
                await _next(context);
            }
            finally
            {
                _semaphore.Release();
            }
        }
        finally
        {
            Interlocked.Decrement(ref _currentLoad);
        }
    }
}

Priority-Based Load Shedding with Quality of Service:

public enum RequestPriority
{
    Critical = 0,
    High = 1,
    Medium = 2,
    Low = 3
}

public class PriorityLoadSheddingMiddleware
{
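    private readonly RequestDelegate _next;
    private readonly ILogger<PriorityLoadSheddingMiddleware> _logger;
    private int _currentLoad = 0;
    private const int MaxLoad = 100;

    // Conventional middleware constructor: ASP.NET Core injects both dependencies
    public PriorityLoadSheddingMiddleware(
        RequestDelegate next,
        ILogger<PriorityLoadSheddingMiddleware> logger)
    {
        _next = next;
        _logger = logger;
    }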

    public async Task InvokeAsync(HttpContext context)
    {
        var priority = GetRequestPriority(context);
        var currentLoad = Interlocked.Increment(ref _currentLoad);
        var loadPercentage = (double)currentLoad / MaxLoad;

        try
        {
            // Shed requests based on load and priority
            if (ShouldShedRequest(loadPercentage, priority))
            {
                _logger.LogWarning(
                    "Shedding {Priority} priority request at {LoadPercentage:P0} load", 
                    priority, 
                    loadPercentage);

                context.Response.StatusCode = StatusCodes.Status503ServiceUnavailable;
                context.Response.Headers.RetryAfter = GetRetryAfterSeconds(priority).ToString();

                await context.Response.WriteAsJsonAsync(new
                {
                    error = "Service busy",
                    priority = priority.ToString(),
                    retryAfter = GetRetryAfterSeconds(priority)
                });
                return;
            }

            await _next(context);
        }
        finally
        {
            Interlocked.Decrement(ref _currentLoad);
        }
    }

    private bool ShouldShedRequest(double loadPercentage, RequestPriority priority)
    {
        return priority switch
        {
            RequestPriority.Low => loadPercentage > 0.6,
            RequestPriority.Medium => loadPercentage > 0.8,
            RequestPriority.High => loadPercentage > 0.95,
            RequestPriority.Critical => false, // Never shed critical requests
            _ => loadPercentage > 0.8
        };
    }

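    private RequestPriority GetRequestPriority(HttpContext context)
    {
        // Priority from header. Honor this only behind a trusted gateway; otherwise any
        // caller could mark its own requests as Critical and bypass load shedding.
        if (context.Request.Headers.TryGetValue("X-Priority", out var priorityHeader) &&
            Enum.TryParse<RequestPriority>(priorityHeader, true, out var priority) &&
            Enum.IsDefined(priority)) // reject numeric values outside the enum, e.g. "42"
        {
            return priority;
        }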

        // Priority based on endpoint
        if (context.Request.Path.StartsWithSegments("/api/critical"))
            return RequestPriority.Critical;
        if (context.Request.Path.StartsWithSegments("/api/admin"))
            return RequestPriority.High;
        if (context.Request.Path.StartsWithSegments("/api/reports"))
            return RequestPriority.Low;

        // Priority based on user tier
        if (context.User.HasClaim("tier", "premium"))
            return RequestPriority.High;

        return RequestPriority.Medium;
    }

    private int GetRetryAfterSeconds(RequestPriority priority)
    {
        return priority switch
        {
            RequestPriority.Low => 60,
            RequestPriority.Medium => 30,
            RequestPriority.High => 10,
            RequestPriority.Critical => 5,
            _ => 30
        };
    }
}

Health Checks (Liveness and Readiness)

Further reading: "Health checks in ASP.NET Core"

What are Health Checks?

Liveness: Is the application running? (Should it be restarted?)
Readiness: Can the application serve traffic? (Should it receive requests?)
Startup: Has the application finished initializing? (Is it ready to start accepting health checks?)

Enhanced Health Checks in .NET 9:

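// Add health checks with tags and dependencies.
// AddSqlServer, AddRedis, AddUrlGroup and the Azure checks come from the community
// AspNetCore.HealthChecks.* packages; parameter names can differ between package versions.
services.AddHealthChecks()
    // Tag the check so the "/health/live" predicate (Tags.Contains("self")) actually selects it
    .AddCheck("self", () => HealthCheckResult.Healthy("Application is running"), tags: new[] { "self" })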
    .AddSqlServer(
        connectionString: builder.Configuration.GetConnectionString("DefaultConnection")!,
        name: "database",
        failureStatus: HealthStatus.Degraded,
        tags: new[] { "db", "sql", "ready" },
        timeout: TimeSpan.FromSeconds(3))
    .AddAzureServiceBusTopic(
        connectionString: builder.Configuration["ServiceBus:ConnectionString"]!,
        topicName: "order-events",
        name: "servicebus",
        failureStatus: HealthStatus.Degraded,
        tags: new[] { "messaging", "ready" })
    .AddRedis(
        redisConnectionString: builder.Configuration["Redis:ConnectionString"]!,
        name: "redis-cache",
        failureStatus: HealthStatus.Degraded,
        tags: new[] { "cache", "ready" })
    .AddUrlGroup(
        uri: new Uri("http://customer-service/health/ready"),
        name: "customer-service",
        failureStatus: HealthStatus.Degraded,
        tags: new[] { "dependencies", "ready" },
        timeout: TimeSpan.FromSeconds(3))
    .AddAzureBlobStorage(
        connectionString: builder.Configuration["Storage:ConnectionString"]!,
        name: "blob-storage",
        failureStatus: HealthStatus.Degraded,
        tags: new[] { "storage", "ready" })
    .AddCheck<StartupHealthCheck>(
        name: "startup",
        failureStatus: HealthStatus.Unhealthy,
        tags: new[] { "startup" });

// Configure health check endpoints with improved responses
app.MapHealthChecks("/health/live", new HealthCheckOptions
{
    Predicate = check => check.Tags.Contains("self"),
    AllowCachingResponses = false,
    ResponseWriter = UIResponseWriter.WriteHealthCheckUIResponse
});

app.MapHealthChecks("/health/ready", new HealthCheckOptions
{
    Predicate = check => check.Tags.Contains("ready"),
    AllowCachingResponses = false,
    ResponseWriter = UIResponseWriter.WriteHealthCheckUIResponse
});

app.MapHealthChecks("/health/startup", new HealthCheckOptions
{
    Predicate = check => check.Tags.Contains("startup"),
    AllowCachingResponses = false,
    ResponseWriter = UIResponseWriter.WriteHealthCheckUIResponse
});

// Detailed health check with custom response
app.MapHealthChecks("/health", new HealthCheckOptions
{
    ResponseWriter = async (context, report) =>
    {
        context.Response.ContentType = "application/json";

        var response = new
        {
            status = report.Status.ToString(),
            timestamp = DateTime.UtcNow,
            checks = report.Entries.Select(x => new
            {
                name = x.Key,
                status = x.Value.Status.ToString(),
                duration = x.Value.Duration.TotalMilliseconds,
                description = x.Value.Description,
                data = x.Value.Data,
                exception = x.Value.Exception?.Message,
                tags = x.Value.Tags
            }),
            totalDuration = report.TotalDuration.TotalMilliseconds
        };

        await context.Response.WriteAsJsonAsync(response);
    }
});
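One detail of the configuration above is easy to miss: how individual results combine into the HTTP status a probe sees. The sketch below models the default behavior with local stand-in types. `ProbeStatus` mirrors the ordering of ASP.NET Core's `HealthStatus` enum and `ProbeMath` is a hypothetical helper; the mapping reflects the middleware defaults (Healthy and Degraded return 200, Unhealthy returns 503) unless `ResultStatusCodes` is overridden.

```csharp
using System.Collections.Generic;
using System.Linq;

// Stand-in for HealthStatus: lower values are worse.
public enum ProbeStatus { Unhealthy = 0, Degraded = 1, Healthy = 2 }

public static class ProbeMath
{
    // The overall status of a report is the worst individual status.
    // A report with no checks counts as healthy.
    public static ProbeStatus Aggregate(IEnumerable<ProbeStatus> checks) =>
        checks.DefaultIfEmpty(ProbeStatus.Healthy).Min();

    // Default HTTP status codes returned by the health check endpoint.
    public static int ToHttpStatusCode(ProbeStatus status) =>
        status == ProbeStatus.Unhealthy ? 503 : 200;
}
```

A consequence for the registrations above: checks registered with `failureStatus: HealthStatus.Degraded` still produce HTTP 200 when they fail, so an orchestrator keeps routing traffic to the instance. If a failed database check should take the pod out of rotation, either register it with `HealthStatus.Unhealthy` or map Degraded to 503 through `ResultStatusCodes` on the readiness endpoint.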

Custom Health Checks:

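// Startup health check - ensures app is fully initialized.
// Register it as a singleton (services.AddSingleton<StartupHealthCheck>()) so that the
// instance on which SetReady() is called is the same one the health endpoint evaluates.
public class StartupHealthCheck : IHealthCheck
{
    private volatile bool _isReady = false;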

    public void SetReady() => _isReady = true;

    public Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context,
        CancellationToken cancellationToken = default)
    {
        if (_isReady)
        {
            return Task.FromResult(HealthCheckResult.Healthy("Application startup complete"));
        }

        return Task.FromResult(HealthCheckResult.Unhealthy("Application is still starting up"));
    }
}

// Azure Service Bus health check with detailed diagnostics
public class ServiceBusHealthCheck : IHealthCheck
{
    private readonly ServiceBusClient _client;
    private readonly string _queueName;
    private readonly ILogger<ServiceBusHealthCheck> _logger;

    public ServiceBusHealthCheck(
        ServiceBusClient client, 
        string queueName,
        ILogger<ServiceBusHealthCheck> logger)
    {
        _client = client;
        _queueName = queueName;
        _logger = logger;
    }

    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context,
        CancellationToken cancellationToken = default)
    {
        try
        {
            // Dispose the receiver after the check so each probe does not leak an AMQP link
            await using var receiver = _client.CreateReceiver(_queueName);

            // Try to peek a message (non-destructive check)
            await receiver.PeekMessageAsync(cancellationToken: cancellationToken);

            return HealthCheckResult.Healthy(
                "Service Bus is accessible",
                data: new Dictionary<string, object>
                {
                    ["queue"] = _queueName,
                    ["fullyQualifiedNamespace"] = _client.FullyQualifiedNamespace
                });
        }
        catch (ServiceBusException ex) when (ex.Reason == ServiceBusFailureReason.ServiceTimeout)
        {
            _logger.LogWarning(ex, "Service Bus health check timed out");
            return HealthCheckResult.Degraded(
                "Service Bus is slow to respond",
                exception: ex,
                data: new Dictionary<string, object>
                {
                    ["queue"] = _queueName,
                    ["reason"] = ex.Reason.ToString()
                });
        }
        catch (Exception ex)
        {
            _logger.LogError(ex, "Service Bus health check failed");
            return HealthCheckResult.Unhealthy(
                "Service Bus is not accessible",
                exception: ex,
                data: new Dictionary<string, object>
                {
                    ["queue"] = _queueName
                });
        }
    }
}

// Memory usage health check
public class MemoryHealthCheck : IHealthCheck
{
    private readonly long _threshold;

    public MemoryHealthCheck(long thresholdInBytes = 1024L * 1024L * 1024L) // 1 GB default
    {
        _threshold = thresholdInBytes;
    }

    public Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context,
        CancellationToken cancellationToken = default)
    {
        var allocated = GC.GetTotalMemory(forceFullCollection: false);
        var data = new Dictionary<string, object>
        {
            ["allocated"] = allocated,
            ["threshold"] = _threshold,
            ["gen0Collections"] = GC.CollectionCount(0),
            ["gen1Collections"] = GC.CollectionCount(1),
            ["gen2Collections"] = GC.CollectionCount(2)
        };

        var status = allocated < _threshold 
            ? HealthStatus.Healthy 
            : HealthStatus.Degraded;

        var description = $"Memory usage: {allocated / 1024.0 / 1024.0:N2} MB";

        return Task.FromResult(new HealthCheckResult(
            status,
            description,
            data: data));
    }
}

Kubernetes Integration with .NET 9:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-service
  namespace: production
spec:
  replicas: 3
  selector:
    matchLabels:
      app: order-service
  template:
    metadata:
      labels:
        app: order-service
        version: v1.0
    spec:
      containers:
      - name: api
        image: order-service:latest
        ports:
        - containerPort: 8080
          name: http
        env:
        - name: ASPNETCORE_ENVIRONMENT
          value: "Production"
        - name: ASPNETCORE_URLS
          value: "http://+:8080"
        resources:
          requests:
            memory: "256Mi"
            cpu: "250m"
          limits:
            memory: "512Mi"
            cpu: "500m"
        startupProbe:
          httpGet:
            path: /health/startup
            port: 8080
            scheme: HTTP
          initialDelaySeconds: 0
          periodSeconds: 10
          timeoutSeconds: 3
          failureThreshold: 30
        livenessProbe:
          httpGet:
            path: /health/live
            port: 8080
            scheme: HTTP
          initialDelaySeconds: 30
          periodSeconds: 10
          timeoutSeconds: 5
          failureThreshold: 3
        readinessProbe:
          httpGet:
            path: /health/ready
            port: 8080
            scheme: HTTP
          initialDelaySeconds: 10
          periodSeconds: 5
          timeoutSeconds: 3
          failureThreshold: 3
          successThreshold: 1
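
The probe settings above translate into concrete time budgets: a container can fail a probe for roughly `periodSeconds × failureThreshold` before Kubernetes acts on it. A minimal sketch of that arithmetic, assuming a hypothetical `ProbeBudget` helper (the type and its names are illustrative and not part of Kubernetes or .NET):

```csharp
using System;

// Hypothetical helper for illustration; not part of any Kubernetes or .NET API.
public readonly record struct ProbeBudget(int PeriodSeconds, int FailureThreshold)
{
    // Approximate span of consecutive failures tolerated before the probe is considered failed.
    public TimeSpan FailureWindow => TimeSpan.FromSeconds(PeriodSeconds * FailureThreshold);
}

public static class ProbeBudgetDemo
{
    // Values copied from the Deployment manifest above.
    public static readonly ProbeBudget Startup = new(PeriodSeconds: 10, FailureThreshold: 30);
    public static readonly ProbeBudget Liveness = new(PeriodSeconds: 10, FailureThreshold: 3);
    public static readonly ProbeBudget Readiness = new(PeriodSeconds: 5, FailureThreshold: 3);

    public static void Print()
    {
        Console.WriteLine($"Startup:   {Startup.FailureWindow.TotalSeconds} s");
        Console.WriteLine($"Liveness:  {Liveness.FailureWindow.TotalSeconds} s");
        Console.WriteLine($"Readiness: {Readiness.FailureWindow.TotalSeconds} s");
    }
}
```

With these values the application gets up to about 300 seconds to start, is restarted after about 30 seconds of failed liveness checks, and is taken out of rotation after about 15 seconds of failed readiness checks.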

Azure Container Apps Health Checks:

apiVersion: apps/v1alpha1
kind: ContainerApp
metadata:
  name: order-service
spec:
  configuration:
    ingress:
      external: true
      targetPort: 8080
  template:
    containers:
    - name: order-service
      image: myregistry.azurecr.io/order-service:latest
      resources:
        cpu: 0.5
        memory: 1Gi
      probes:
      - type: Liveness
        httpGet:
          path: /health/live
          port: 8080
        initialDelaySeconds: 30
        periodSeconds: 10
        failureThreshold: 3
      - type: Readiness
        httpGet:
          path: /health/ready
          port: 8080
        initialDelaySeconds: 10
        periodSeconds: 5
        failureThreshold: 3
      - type: Startup
        httpGet:
          path: /health/startup
          port: 8080
        initialDelaySeconds: 0
        periodSeconds: 10
        failureThreshold: 30

Graceful Shutdown

Enhanced Graceful Shutdown in .NET 9:

public class Program
{
    public static async Task Main(string[] args)
    {
        var builder = WebApplication.CreateBuilder(args);

        // Configure graceful shutdown timeout
        builder.Host.ConfigureHostOptions(options =>
        {
            options.ShutdownTimeout = TimeSpan.FromSeconds(30);
        });

        // Cap Kestrel connections so there is a bounded amount of in-flight work to drain at shutdown
        builder.WebHost.ConfigureKestrel(options =>
        {
            options.AddServerHeader = false;
            options.Limits.MaxConcurrentConnections = 100;
            options.Limits.MaxConcurrentUpgradedConnections = 100;
        });

        var app = builder.Build();

        // Register shutdown handlers
        var lifetime = app.Services.GetRequiredService<IHostApplicationLifetime>();

        lifetime.ApplicationStopping.Register(() =>
        {
            var logger = app.Services.GetRequiredService<ILogger<Program>>();
            logger.LogInformation("Application is stopping - no new requests will be accepted");

            // Signal health checks that we're shutting down
            var startupCheck = app.Services.GetService<StartupHealthCheck>();
            // Mark as not ready to stop receiving traffic
        });

        lifetime.ApplicationStopped.Register(() =>
        {
            var logger = app.Services.GetRequiredService<ILogger<Program>>();
            logger.LogInformation("Application stopped gracefully");
        });

        await app.RunAsync();
    }
}

// Background service with graceful shutdown
// Requires: using System.Threading.Channels;
public class OrderProcessor : BackgroundService
{
    private readonly ILogger<OrderProcessor> _logger;
    private readonly IServiceScopeFactory _scopeFactory;
    private readonly Channel<OrderMessage> _channel;

    public OrderProcessor(
        ILogger<OrderProcessor> logger,
        IServiceScopeFactory scopeFactory)
    {
        _logger = logger;
        _scopeFactory = scopeFactory;
        _channel = Channel.CreateUnbounded<OrderMessage>();
    }

    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        _logger.LogInformation("Order processor starting");

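        try
        {
            await foreach (var message in _channel.Reader.ReadAllAsync(stoppingToken))
            {
                try
                {
                    await ProcessOrderAsync(message, stoppingToken);
                }
                catch (OperationCanceledException) when (stoppingToken.IsCancellationRequested)
                {
                    _logger.LogInformation("Processing canceled by shutdown");
                    break;
                }
                catch (Exception ex)
                {
                    _logger.LogError(ex, "Error processing order {OrderId}", message.OrderId);
                }
            }
        }
        catch (OperationCanceledException) when (stoppingToken.IsCancellationRequested)
        {
            // ReadAllAsync throws if the token is canceled while waiting for the next message
            _logger.LogInformation("Shutdown requested while waiting for messages");
        }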

        _logger.LogInformation("Order processor stopped");
    }

    public override async Task StopAsync(CancellationToken cancellationToken)
    {
        _logger.LogInformation("Stopping order processor gracefully");

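        // Signal no more messages; TryComplete is safe even if the writer was already completed
        _channel.Writer.TryComplete();

        // Wait (with timeout) for queued messages to be consumed. Reader.Completion fires
        // once the channel is empty, so the last message may still be in flight at that point.
        var completionTask = _channel.Reader.Completion;
        var timeoutTask = Task.Delay(TimeSpan.FromSeconds(25), cancellationToken);

        await Task.WhenAny(completionTask, timeoutTask);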

        if (!completionTask.IsCompleted)
        {
            _logger.LogWarning("Order processor shutdown timed out with pending work");
        }

        await base.StopAsync(cancellationToken);
    }

    private async Task ProcessOrderAsync(OrderMessage message, CancellationToken cancellationToken)
    {
        using var scope = _scopeFactory.CreateScope();
        var orderService = scope.ServiceProvider.GetRequiredService<IOrderService>();
        await orderService.ProcessAsync(message, cancellationToken);
    }
}
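
The difference between completing the channel writer and canceling the token is what makes the shutdown above graceful. A minimal sketch of both behaviors, using only `System.Threading.Channels`; the `ChannelDrainDemo` class and its method names are hypothetical and exist only for this illustration:

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Channels;
using System.Threading.Tasks;

public static class ChannelDrainDemo
{
    // Completing the writer lets the reader loop drain what is queued, then exit normally.
    public static async Task<List<int>> DrainAfterCompleteAsync()
    {
        var channel = Channel.CreateUnbounded<int>();
        for (var i = 1; i <= 3; i++)
        {
            channel.Writer.TryWrite(i);
        }
        channel.Writer.Complete();

        var processed = new List<int>();
        await foreach (var item in channel.Reader.ReadAllAsync())
        {
            processed.Add(item);
        }
        return processed;
    }

    // Canceling the token instead makes the enumeration throw OperationCanceledException.
    public static async Task<bool> CancelWhileWaitingThrowsAsync()
    {
        var channel = Channel.CreateUnbounded<int>();
        using var cts = new CancellationTokenSource();
        cts.Cancel();
        try
        {
            await foreach (var ignored in channel.Reader.ReadAllAsync(cts.Token))
            {
            }
            return false;
        }
        catch (OperationCanceledException)
        {
            return true;
        }
    }
}
```

Completing the writer first and canceling only as a last resort gives queued work a chance to finish before the host's shutdown timeout expires.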

Graceful Shutdown with Azure Service Bus:

// Requires: using Azure.Messaging.ServiceBus;
// Named ServiceBusWorker so it does not shadow Azure.Messaging.ServiceBus.ServiceBusProcessor
public class ServiceBusWorker : BackgroundService
{
    private readonly ServiceBusClient _client;
    private readonly ServiceBusProcessor _processor;
    private readonly ILogger<ServiceBusWorker> _logger;

    public ServiceBusWorker(
        ServiceBusClient client,
        ILogger<ServiceBusWorker> logger,
        IConfiguration configuration)
    {
        _client = client;
        _logger = logger;

        var options = new ServiceBusProcessorOptions
        {
            MaxConcurrentCalls = 10,
            AutoCompleteMessages = false,
            MaxAutoLockRenewalDuration = TimeSpan.FromMinutes(5)
        };

        _processor = _client.CreateProcessor(
            configuration["ServiceBus:QueueName"]!,
            options);

        _processor.ProcessMessageAsync += ProcessMessageAsync;
        _processor.ProcessErrorAsync += ProcessErrorAsync;
    }

    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        _logger.LogInformation("Starting Service Bus processor");
        await _processor.StartProcessingAsync(stoppingToken);

        // Keep running until cancellation is requested
        try
        {
            await Task.Delay(Timeout.Infinite, stoppingToken);
        }
        catch (OperationCanceledException)
        {
            _logger.LogInformation("Shutdown requested");
        }
    }

    public override async Task StopAsync(CancellationToken cancellationToken)
    {
        _logger.LogInformation("Stopping Service Bus processor gracefully");

        // Stop receiving new messages. StopProcessingAsync returns only after
        // in-flight message handlers have finished, so no extra delay is needed.
        await _processor.StopProcessingAsync(cancellationToken);

        await base.StopAsync(cancellationToken);

        _logger.LogInformation("Service Bus processor stopped");
    }

    private async Task ProcessMessageAsync(ProcessMessageEventArgs args)
    {
        try
        {
            _logger.LogInformation("Processing message {MessageId}", args.Message.MessageId);

            // Process the message
            await ProcessBusinessLogicAsync(args.Message, args.CancellationToken);

            // Complete the message
            await args.CompleteMessageAsync(args.Message, args.CancellationToken);
        }
        catch (Exception ex)
        {
            _logger.LogError(ex, "Error processing message {MessageId}", args.Message.MessageId);

            // Abandon message for retry
            await args.AbandonMessageAsync(args.Message);
        }
    }

    private Task ProcessErrorAsync(ProcessErrorEventArgs args)
    {
        _logger.LogError(
            args.Exception,
            "Error in Service Bus processor: {ErrorSource}",
            args.ErrorSource);
        return Task.CompletedTask;
    }

    private async Task ProcessBusinessLogicAsync(
        ServiceBusReceivedMessage message,
        CancellationToken cancellationToken)
    {
        // Business logic here
        await Task.Delay(100, cancellationToken);
    }
}
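
The `ApplicationStopping` handler in the graceful shutdown example leaves the "mark as not ready" step as a comment. One way to fill it in is a small thread-safe flag that the readiness health check consults. This is a minimal sketch, assuming a hypothetical `ReadinessGate` class; a `CancellationTokenSource` stands in for `IHostApplicationLifetime.ApplicationStopping` so the example runs on its own:

```csharp
using System;
using System.Threading;

// Hypothetical helper: a thread-safe flag that a readiness health check can consult.
public sealed class ReadinessGate
{
    private int _ready = 1; // 1 = accepting traffic, 0 = draining

    public bool IsReady => Volatile.Read(ref _ready) == 1;

    public void MarkNotReady() => Interlocked.Exchange(ref _ready, 0);

    // Flip the flag as soon as the given token fires,
    // for example IHostApplicationLifetime.ApplicationStopping.
    public CancellationTokenRegistration BindTo(CancellationToken stopping) =>
        stopping.Register(MarkNotReady);
}

public static class ReadinessGateDemo
{
    public static (bool Before, bool After) Run()
    {
        // Stands in for ApplicationStopping in this self-contained sketch.
        using var stopping = new CancellationTokenSource();
        var gate = new ReadinessGate();
        using var registration = gate.BindTo(stopping.Token);

        var before = gate.IsReady;
        stopping.Cancel(); // simulate the host beginning shutdown
        return (before, gate.IsReady);
    }
}
```

In an application, the gate would be registered as a singleton and the readiness check would return an unhealthy result while `IsReady` is false, so the orchestrator stops routing traffic before in-flight work is drained.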

Observability and Telemetry

OpenTelemetry Integration for Resilience:

// Configure OpenTelemetry with resilience metrics
builder.Services.AddOpenTelemetry()
    .WithMetrics(metrics =>
    {
        metrics
            .AddMeter("Polly") // Polly v8 publishes its resilience instruments on the "Polly" meter
            .AddMeter("Microsoft.Extensions.Resilience")
            .AddMeter("Microsoft.Extensions.Http.Resilience")
            .AddAspNetCoreInstrumentation()
            .AddHttpClientInstrumentation()
            .AddRuntimeInstrumentation()
            .AddPrometheusExporter();
    })
    .WithTracing(tracing =>
    {
        tracing
            .AddAspNetCoreInstrumentation()
            .AddHttpClientInstrumentation()
            .AddSource("Microsoft.Extensions.Resilience")
            .AddAzureMonitorTraceExporter(options =>
            {
                options.ConnectionString = builder.Configuration["ApplicationInsights:ConnectionString"];
            });
    });

// Configure telemetry for resilience pipelines
services.AddResiliencePipeline("monitored-pipeline", (builder, context) =>
{
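    // Polly's TelemetryOptions carries the logger factory plus optional listeners and enrichers;
    // metrics are published on Polly's own meter rather than through a meter factory set here.
    var telemetryOptions = new TelemetryOptions
    {
        LoggerFactory = context.ServiceProvider.GetRequiredService<ILoggerFactory>()
    };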

    builder
        .AddRetry(new RetryStrategyOptions
        {
            MaxRetryAttempts = 3,
            BackoffType = DelayBackoffType.Exponential,
            UseJitter = true
        })
        .AddCircuitBreaker(new CircuitBreakerStrategyOptions
        {
            FailureRatio = 0.5,
            MinimumThroughput = 10,
            BreakDuration = TimeSpan.FromSeconds(30)
        })
        .ConfigureTelemetry(telemetryOptions);
});
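
Application-specific resilience signals can be published the same way the built-in ones are, through `System.Diagnostics.Metrics`. The sketch below counts circuit breaker state transitions and reads them back with a `MeterListener`. The meter name `OrderService.Resilience`, the instrument name, and the `CircuitStateMetrics` class are assumptions made for this example, not names emitted by Polly or Microsoft.Extensions.Resilience:

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics.Metrics;

// Hypothetical meter and instrument names, chosen for this sketch only.
public sealed class CircuitStateMetrics : IDisposable
{
    public const string MeterName = "OrderService.Resilience";

    private readonly Meter _meter = new(MeterName);
    private readonly Counter<long> _transitions;

    public CircuitStateMetrics() =>
        _transitions = _meter.CreateCounter<long>("circuit_breaker.transitions");

    // Record one state change, tagged with the new state ("open", "half-open", "closed").
    public void RecordTransition(string newState) =>
        _transitions.Add(1, new KeyValuePair<string, object?>("state", newState));

    public void Dispose() => _meter.Dispose();
}

public static class CircuitStateMetricsDemo
{
    // Observes the counter in-process with MeterListener and sums the "open" transitions.
    public static long CountOpenTransitions()
    {
        long opened = 0;
        using var listener = new MeterListener();
        listener.InstrumentPublished = (instrument, l) =>
        {
            if (instrument.Meter.Name == CircuitStateMetrics.MeterName)
            {
                l.EnableMeasurementEvents(instrument);
            }
        };
        listener.SetMeasurementEventCallback<long>((instrument, value, tags, state) =>
        {
            foreach (var tag in tags)
            {
                if (tag.Key == "state" && (tag.Value as string) == "open")
                {
                    opened += value;
                }
            }
        });
        listener.Start();

        using var metrics = new CircuitStateMetrics();
        metrics.RecordTransition("open");
        metrics.RecordTransition("half-open");
        metrics.RecordTransition("open");
        return opened;
    }
}
```

To export such a counter alongside the built-in metrics, the meter name would be added to the OpenTelemetry configuration with another `AddMeter` call.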

Best Practices Summary

Combine Multiple Resilience Patterns: Use retries, circuit breakers, timeouts, and bulkheads together for comprehensive protection.
Configure Appropriate Timeouts: Set realistic timeouts based on your SLAs and use hierarchical timeouts (per-attempt and total).
Implement Health Checks: Use startup, liveness, and readiness probes to enable orchestrators to manage your application effectively.
Monitor and Alert: Instrument your resilience strategies with metrics and logging to understand failure patterns.
Test Failure Scenarios: Use chaos engineering principles to test your resilience strategies under failure conditions.
Graceful Degradation: Ensure your application can provide reduced functionality when dependencies fail.
Load Shedding: Protect your system by rejecting requests when overloaded rather than failing completely.
Azure-Native Features: Leverage built-in resilience features in Azure services like Service Bus, Storage, and App Configuration.
Circuit Breaker State Management: Monitor circuit breaker states and alert on prolonged open states.
Documentation: Document your resilience strategies and failure modes so operations teams understand system behavior during incidents.
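
The load shedding item above can be sketched with a non-blocking semaphore: work is admitted while a slot is free and rejected immediately otherwise, instead of queueing. This is an illustration under stated assumptions, with a hypothetical `LoadShedder` class; in ASP.NET Core the built-in rate limiting middleware is likely the better starting point:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper: admits at most maxConcurrent operations and rejects the rest immediately.
public sealed class LoadShedder : IDisposable
{
    private readonly SemaphoreSlim _slots;

    public LoadShedder(int maxConcurrent) =>
        _slots = new SemaphoreSlim(maxConcurrent, maxConcurrent);

    // Returns false without running the work when no slot is free
    // (an HTTP caller would typically map this to a 503 response).
    public async Task<bool> TryRunAsync(Func<Task> work)
    {
        if (!_slots.Wait(0))
        {
            return false; // non-blocking acquire: shed instead of queueing
        }

        try
        {
            await work();
            return true;
        }
        finally
        {
            _slots.Release();
        }
    }

    public void Dispose() => _slots.Dispose();
}

public static class LoadShedderDemo
{
    public static async Task<(bool First, bool Second, bool Third)> RunAsync()
    {
        using var shedder = new LoadShedder(maxConcurrent: 1);
        var gate = new TaskCompletionSource();

        var first = shedder.TryRunAsync(() => gate.Task);                 // occupies the only slot
        var second = await shedder.TryRunAsync(() => Task.CompletedTask); // rejected while busy
        gate.SetResult();                                                 // let the first call finish
        var firstResult = await first;
        var third = await shedder.TryRunAsync(() => Task.CompletedTask);  // slot is free again
        return (firstResult, second, third);
    }
}
```

Rejecting early keeps latency predictable for the requests that are admitted, at the cost of failing some requests fast during overload.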
