DEV Community

Klein Houzin
Klein Houzin

Posted on

🚀 Building Truly Resilient SQL Server Applications in .NET

Introduction

Modern enterprise systems rarely fail because of code — they fail because of transient infrastructure conditions.

Whether you’re running on Azure SQL, SQL Server in Kubernetes, or on-prem clusters, you will inevitably face temporary failures: network drops, throttling, deadlocks, DNS hiccups, or resource contention.
These are not permanent faults — they are part of the system’s natural breathing pattern.

The key is not to panic, but to build resilience.

In .NET and Entity Framework Core, this resilience is achieved through Execution Strategies and configurable retry logic.
But here’s the catch: the default retry logic doesn’t always cover every transient condition you’ll encounter in production.

That’s where extending the transient error numbers becomes crucial.


The Hidden Edge: errorNumbersToAdd

When you configure SQL resiliency in .NET, one of the most powerful yet overlooked parameters is:

errorNumbersToAdd
Enter fullscreen mode Exit fullscreen mode

This parameter allows you to extend the list of SQL error codes that are treated as transient — meaning, errors that are safe to retry automatically.

Out of the box, EF Core’s SqlAzureExecutionStrategy already knows how to handle the common suspects (e.g., 40613, 40197, 40501, etc.).
But in the field, architects often discover additional transient patterns specific to their workloads — deadlocks, resource throttling, or connection resets — that don’t get retried by default.

If your retry logic doesn’t recognize them, it fails unnecessarily — producing false negatives, failed transactions, and unneeded alerts.


The Architecture of Resilience

A well-designed retry mechanism must be:

  • Deterministic — only retry on known transient conditions.
  • Bounded — never retry endlessly.
  • Idempotent — safe to re-execute operations.
  • Transparent — log every retry attempt for observability.

Extending errorNumbersToAdd strengthens that architecture by covering real-world transient faults that your database layer may throw.


The Battle-Tested Transient Error Codes

Here’s a curated list — drawn from real enterprise production systems — of SQL Server and Azure SQL error codes that should absolutely be treated as transient.

These are the codes you’ll want in your retry configuration:

new[]
{
    // Network & Connection Failures
    20,     // Fatal error in current connection
    53,     // Cannot open a connection to SQL Server
    64,     // Network name deleted
    10053,  // Network connection aborted
    10054,  // Connection reset by peer
    10060,  // Network timeout
    11001,  // DNS lookup failure
    17197,  // SSL / encryption issue under load

    // Resource or Throttling
    40197,  // Azure SQL internal transient error
    40501,  // Throttling (Too many requests)
    40549,  // Session terminated due to transaction timeout
    40613,  // Database unavailable
    40615,  // Database failover in progress
    10928,  // Resource limit reached
    10929,  // Resource throttling
    10936,  // Resource governor limit

    // Concurrency
    1205,   // Deadlock victim

    // Login / Session Availability
    233,    // Connection initialization failed
    18401,  // Login failed: database unavailable
    18456,  // Login failed for user (transient in Azure)
}
Enter fullscreen mode Exit fullscreen mode

These numbers cover the real surface of transient volatility — from network oscillations to throttling and failover events.


How to Apply Them in EF Core

In your Program.cs or Startup.cs, configure your DbContext like this:

builder.Services.AddDbContext<AppDbContext>(options =>
    options.UseSqlServer(
        connectionString,
        sqlOptions =>
        {
            sqlOptions.ExecutionStrategy(deps =>
                new SqlAzureExecutionStrategy(
                    dependencies: deps,
                    maxRetryCount: 7,
                    maxRetryDelay: TimeSpan.FromSeconds(10),
                    errorNumbersToAdd: new[]
                    {
                        20, 53, 64, 10053, 10054, 10060, 11001, 17197,
                        40197, 40501, 40549, 40613, 40615,
                        10928, 10929, 10936,
                        1205, 233, 18401, 18456
                    }
                )
            );
        }));
Enter fullscreen mode Exit fullscreen mode

Now, when your application encounters these errors, EF Core automatically retries the operation before surfacing an exception.
That’s production-grade resilience with minimal code.


Advanced Option — Unified Retry Policy

Architects managing both EF and raw ADO.NET calls can centralize retry configuration:

public static class SqlRetryPolicy
{
    public static readonly int[] AdditionalTransientErrors =
    {
        20, 53, 64, 10053, 10054, 10060, 11001, 17197,
        40197, 40501, 40549, 40613, 40615,
        10928, 10929, 10936,
        1205, 233, 18401, 18456
    };

    public static SqlRetryLogicBaseProvider CreateProvider()
    {
        var options = new SqlRetryLogicOption
        {
            NumberOfTries = 7,
            DeltaTime = TimeSpan.FromSeconds(2),
            MaxTimeInterval = TimeSpan.FromSeconds(10),
            AdditionalErrorNumbers = new HashSet<int>(AdditionalTransientErrors)
        };

        return SqlConfigurableRetryFactory.CreateFixedRetryProvider(options);
    }
}
Enter fullscreen mode Exit fullscreen mode

Then apply it to both ADO.NET connections and EF contexts — a single source of truth for retry logic across your stack.


Key Lessons for Architects

  • Know your environment. Azure SQL behaves differently under throttling than on-prem SQL Server. Observe which errors are transient in your setup.
  • Log intelligently. Retrying hides transient faults, but you still need visibility. Instrument retry events through structured logging.
  • Avoid over-retrying. Retrying too aggressively under heavy load can worsen the problem. Respect backoff intervals.
  • Protect idempotency. Only retry operations that can safely be re-executed. Avoid side-effect–heavy business logic in the same transaction scope.

Conclusion

Resilience isn’t just about writing “retry code.” It’s about understanding the system’s failure modes and engineering your software to breathe with them.

By extending errorNumbersToAdd with the right SQL error codes, you move from reactive error handling to architectural fault tolerance.
Your systems stop failing for the wrong reasons — and start recovering automatically from the right ones.

That’s what separates good systems from great architectures.

Top comments (0)