Introduction
Modern enterprise systems rarely fail because of code — they fail because of transient infrastructure conditions.
Whether you’re running on Azure SQL, SQL Server in Kubernetes, or on-prem clusters, you will inevitably face temporary failures: network drops, throttling, deadlocks, DNS hiccups, or resource contention.
These are not permanent faults — they are part of the system’s natural breathing pattern.
The key is not to panic, but to build resilience.
In .NET and Entity Framework Core, this resilience is achieved through Execution Strategies and configurable retry logic.
But here’s the catch: the default retry logic doesn’t always cover every transient condition you’ll encounter in production.
That’s where extending the transient error numbers becomes crucial.
The Hidden Edge: errorNumbersToAdd
When you configure SQL resiliency in .NET, one of the most powerful yet overlooked parameters is:
errorNumbersToAdd
This parameter allows you to extend the list of SQL error codes that are treated as transient — meaning, errors that are safe to retry automatically.
Out of the box, EF Core’s SqlAzureExecutionStrategy
already knows how to handle the common suspects (e.g., 40613
, 40197
, 40501
, etc.).
But in the field, architects often discover additional transient patterns specific to their workloads — deadlocks, resource throttling, or connection resets — that don’t get retried by default.
If your retry logic doesn’t recognize them, it fails unnecessarily — producing false negatives, failed transactions, and unneeded alerts.
The Architecture of Resilience
A well-designed retry mechanism must be:
- Deterministic — only retry on known transient conditions.
- Bounded — never retry endlessly.
- Idempotent — safe to re-execute operations.
- Transparent — log every retry attempt for observability.
Extending errorNumbersToAdd
strengthens that architecture by covering real-world transient faults that your database layer may throw.
The Battle-Tested Transient Error Codes
Here’s a curated list — drawn from real enterprise production systems — of SQL Server and Azure SQL error codes that should absolutely be treated as transient.
These are the codes you’ll want in your retry configuration:
new[]
{
// Network & Connection Failures
20, // Fatal error in current connection
53, // Cannot open a connection to SQL Server
64, // Network name deleted
10053, // Network connection aborted
10054, // Connection reset by peer
10060, // Network timeout
11001, // DNS lookup failure
17197, // SSL / encryption issue under load
// Resource or Throttling
40197, // Azure SQL internal transient error
40501, // Throttling (Too many requests)
40549, // Session terminated due to transaction timeout
40613, // Database unavailable
40615, // Database failover in progress
10928, // Resource limit reached
10929, // Resource throttling
10936, // Resource governor limit
// Concurrency
1205, // Deadlock victim
// Login / Session Availability
233, // Connection initialization failed
18401, // Login failed: database unavailable
18456, // Login failed for user (transient in Azure)
}
These numbers cover the real surface of transient volatility — from network oscillations to throttling and failover events.
How to Apply Them in EF Core
In your Program.cs
or Startup.cs
, configure your DbContext like this:
builder.Services.AddDbContext<AppDbContext>(options =>
options.UseSqlServer(
connectionString,
sqlOptions =>
{
sqlOptions.ExecutionStrategy(deps =>
new SqlAzureExecutionStrategy(
dependencies: deps,
maxRetryCount: 7,
maxRetryDelay: TimeSpan.FromSeconds(10),
errorNumbersToAdd: new[]
{
20, 53, 64, 10053, 10054, 10060, 11001, 17197,
40197, 40501, 40549, 40613, 40615,
10928, 10929, 10936,
1205, 233, 18401, 18456
}
)
);
}));
Now, when your application encounters these errors, EF Core automatically retries the operation before surfacing an exception.
That’s production-grade resilience with minimal code.
Advanced Option — Unified Retry Policy
Architects managing both EF and raw ADO.NET calls can centralize retry configuration:
public static class SqlRetryPolicy
{
public static readonly int[] AdditionalTransientErrors =
{
20, 53, 64, 10053, 10054, 10060, 11001, 17197,
40197, 40501, 40549, 40613, 40615,
10928, 10929, 10936,
1205, 233, 18401, 18456
};
public static SqlRetryLogicBaseProvider CreateProvider()
{
var options = new SqlRetryLogicOption
{
NumberOfTries = 7,
DeltaTime = TimeSpan.FromSeconds(2),
MaxTimeInterval = TimeSpan.FromSeconds(10),
AdditionalErrorNumbers = new HashSet<int>(AdditionalTransientErrors)
};
return SqlConfigurableRetryFactory.CreateFixedRetryProvider(options);
}
}
Then apply it to both ADO.NET connections and EF contexts — a single source of truth for retry logic across your stack.
Key Lessons for Architects
- Know your environment. Azure SQL behaves differently under throttling than on-prem SQL Server. Observe which errors are transient in your setup.
- Log intelligently. Retrying hides transient faults, but you still need visibility. Instrument retry events through structured logging.
- Avoid over-retrying. Retrying too aggressively under heavy load can worsen the problem. Respect backoff intervals.
- Protect idempotency. Only retry operations that can safely be re-executed. Avoid side-effect–heavy business logic in the same transaction scope.
Conclusion
Resilience isn’t just about writing “retry code.” It’s about understanding the system’s failure modes and engineering your software to breathe with them.
By extending errorNumbersToAdd
with the right SQL error codes, you move from reactive error handling to architectural fault tolerance.
Your systems stop failing for the wrong reasons — and start recovering automatically from the right ones.
That’s what separates good systems from great architectures.
Top comments (0)