Real-World Error Handling in Distributed Systems

Saber Amani

Practical Error Handling in Distributed Systems: What Actually Works

Distributed systems look elegant in architecture diagrams, but error handling is where theory collides with reality. Once you introduce multiple services, cloud functions, queues, retries, and frontends, errors stop being simple exceptions and start becoming workflows of their own. If you have ever had a Lambda fail without surfacing an error, an Azure Function swallow an exception, or a .NET API return a meaningless 500, you already know this pain well.

I have dealt with these issues in production systems across .NET backends, cloud-native workloads, and React frontends. What follows is not a list of best practices pulled from a book. It is a collection of patterns that actually helped, along with mistakes that cost real time, real money, and more than a few late nights.

Why Just Throwing Exceptions Breaks Down Fast

On a single machine, throwing an exception feels reasonable. The stack trace is there, the debugger catches it, and you fix the issue. In distributed systems, that mental model breaks almost immediately.

Once a request crosses process boundaries, context starts disappearing. By the time an exception reaches an API gateway or message broker, the original cause is often gone. Async calls, background queues, and retries further blur the picture, leaving you with a failure that is technically visible but practically useless.

Retries introduce another class of problems. Retrying blindly can turn a small transient issue into a cascading failure. A short database hiccup suddenly becomes a flood of repeated requests that overload the system even further.

Cloud platforms add their own complications. Background jobs, serverless functions, and orchestrators frequently report success while quietly logging errors somewhere nobody is watching. From the platform’s point of view, the job completed. From your point of view, critical logic never ran.

The final casualty is the user. They see a vague error message, support teams cannot trace what happened, and engineers are left digging through logs late at night with no clear starting point.

Patterns I Have Used and Learned From

Returning Structured Error Objects Instead of Bare Status Codes

HTTP status codes alone are not enough in real systems. They tell you something went wrong, but not what or why. Clients need structured information that can be logged, displayed, and correlated across services.

Here is a simplified example from a real .NET API:

public IActionResult GetUser(int id)
{
    var user = _userService.GetUser(id);
    if (user == null)
    {
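        // Return a structured body with the 404 so clients and logs
        // get a machine-readable code, a message, and a correlation id.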
        return NotFound(new ErrorResponse
        {
            Code = "USER_NOT_FOUND",
            Message = $"User with id {id} does not exist.",
            CorrelationId = HttpContext.TraceIdentifier
        });
    }

    return Ok(user);
}

This approach pays off quickly. Error codes allow frontends and other services to react consistently. Clear messages help users and support teams. Correlation IDs make it possible to trace a failure across logs, queues, and downstream calls.
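
The ErrorResponse type itself is not shown above, so as a rough sketch, it might be nothing more than a small contract class with the three properties the controller sets:

public class ErrorResponse
{
    // Machine-readable code that frontends and services can switch on.
    public string Code { get; set; }

    // Human-readable explanation safe to show to users and support teams.
    public string Message { get; set; }

    // Identifier that ties this response to logs, traces, and downstream calls.
    public string CorrelationId { get; set; }
}

Serialized as JSON, this gives every consumer the same shape to log and react to, which is the whole point of a shared error contract.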

Idempotency Is a Requirement, Not an Optimization

In distributed systems, retries are unavoidable. Network calls fail. Timeouts happen. Messages get re-delivered. If your system cannot safely handle duplicate requests, you will eventually see data corruption or duplicated side effects.

For APIs, requiring an Idempotency-Key header is one of the simplest safeguards. On the backend, that key must be checked and stored so repeated requests do not re-run the same operation.
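
A rough sketch of that check in an ASP.NET Core action follows; the idempotency store and payment service are placeholders for whatever storage and business logic you already have:

[HttpPost]
public async Task<IActionResult> CreatePayment([FromBody] PaymentRequest request)
{
    // Require a client-supplied key that uniquely identifies the logical operation.
    if (!Request.Headers.TryGetValue("Idempotency-Key", out var key))
    {
        return BadRequest(new ErrorResponse
        {
            Code = "IDEMPOTENCY_KEY_MISSING",
            Message = "An Idempotency-Key header is required for this operation.",
            CorrelationId = HttpContext.TraceIdentifier
        });
    }

    // If this key was already processed, return the stored result instead of charging again.
    var existing = await _idempotencyStore.GetResultAsync(key);
    if (existing != null)
        return Ok(existing);

    var result = await _paymentService.ChargeAsync(request);
    await _idempotencyStore.SaveResultAsync(key, result);
    return Ok(result);
}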

For background jobs and message consumers, storing processed message identifiers in a fast store such as Redis, DynamoDB, or a database table with a unique constraint prevents duplicate processing. Skipping this step is how you end up charging customers twice or sending duplicate emails in production.
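
For a consumer, the unique constraint can do the deduplication for you. A sketch, assuming Dapper and PostgreSQL-style ON CONFLICT; swap in whatever data access layer and SQL dialect you actually use:

public async Task HandleAsync(OrderMessage message)
{
    // ProcessedMessages has a unique index on MessageId, so a duplicate insert affects zero rows.
    var inserted = await _connection.ExecuteAsync(
        "INSERT INTO ProcessedMessages (MessageId, ProcessedAt) VALUES (@MessageId, @Now) " +
        "ON CONFLICT (MessageId) DO NOTHING",
        new { MessageId = message.Id, Now = DateTime.UtcNow });

    if (inserted == 0)
    {
        _logger.LogInformation("Skipping duplicate message {MessageId}", message.Id);
        return;
    }

    await _orderService.ProcessAsync(message);
}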

Logging and Observability Must Be Intentional

Early in my career, I kept logging to a minimum because verbose logs felt noisy. In distributed systems, under-logging is a far bigger problem than over-logging. Without context, logs are nearly useless.

Structured logs make a massive difference. Logging in a machine-readable format allows you to query by correlation ID, user ID, or operation name. Including context such as environment, request identifiers, and key input values turns logs into a diagnostic tool rather than a last resort.
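
With Microsoft.Extensions.Logging, for instance, message templates and scopes keep those fields as structured properties rather than string fragments; the field names below are just an illustration:

using (_logger.BeginScope(new Dictionary<string, object>
{
    ["CorrelationId"] = correlationId,
    ["Environment"] = _environment.EnvironmentName
}))
{
    // {OrderId} and {Amount} become queryable properties in any structured log backend.
    _logger.LogError(ex,
        "Payment failed for order {OrderId} with amount {Amount}",
        order.Id, order.Amount);
}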

Alerts matter just as much as logs. Logging an error that nobody sees is equivalent to ignoring it. Alerting on patterns such as repeated failures, growing queue backlogs, or unusual spikes gives you time to react before users notice.

Do Not Let the Cloud Hide Your Failures

Cloud platforms optimize for availability, not visibility. Errors often end up buried in dashboards that nobody checks unless something is already broken.

What consistently worked for me was being explicit. Throw exceptions when something truly fails and configure retries with backoff at the orchestrator level. For critical paths, send errors to a shared alerting channel where humans will actually see them.
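
When the platform does not manage the retry for you, a library such as Polly makes the backoff explicit. A minimal sketch; the retry count, delays, and downstream client are placeholders:

var retryPolicy = Policy
    .Handle<HttpRequestException>()
    .WaitAndRetryAsync(
        retryCount: 3,
        sleepDurationProvider: attempt => TimeSpan.FromSeconds(Math.Pow(2, attempt)),
        onRetry: (exception, delay, attempt, _) =>
            _logger.LogWarning(exception, "Retry {Attempt} after {Delay}", attempt, delay));

// If all retries fail, the exception surfaces so the orchestrator and alerting can see it.
await retryPolicy.ExecuteAsync(() => _downstreamClient.SendAsync(request));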

For batch jobs or background processing, writing failures to a dedicated table or queue creates a paper trail. It allows you to inspect, replay, or manually resolve failed items without guessing what went wrong.
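
In a batch job, that paper trail can be as simple as catching per-item failures and recording them; the table and column names here are made up for illustration:

foreach (var item in batch)
{
    try
    {
        await ProcessAsync(item);
    }
    catch (Exception ex)
    {
        // Record the failure so the item can be inspected or replayed later,
        // instead of letting one bad item abort the whole batch silently.
        await _connection.ExecuteAsync(
            "INSERT INTO FailedItems (ItemId, Error, FailedAt) VALUES (@ItemId, @Error, @Now)",
            new { ItemId = item.Id, Error = ex.ToString(), Now = DateTime.UtcNow });

        _logger.LogError(ex, "Item {ItemId} failed and was recorded for replay", item.Id);
    }
}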

React Frontends Need Real Error Handling Too

Frontend error handling is often treated as cosmetic, but users experience errors first through the UI. Poor error handling creates confusion and destroys trust.

In several React applications, I have seen errors hidden entirely because no error boundaries were in place. In others, users were shown meaningless messages that suggested they did something wrong when the problem was clearly on the backend.

What helped was making errors intentional. Backend error messages were surfaced carefully without leaking internals. Error boundaries caught unexpected failures and displayed something actionable. Retry behavior was explicit so users knew whether trying again made sense or if support needed to be contacted.

Edge Cases That Caused Real Pain

Partial failures are unavoidable in distributed workflows. In saga-style processes, one service can succeed while another fails. Rollbacks are often impossible, so compensating actions and clear logging become essential.
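
In code, that usually means pairing each step with an explicit compensating action. A simplified sketch with invented service names:

// The payment already succeeded, so a database-style rollback is impossible.
// If shipping fails, run the compensating action and log both sides.
var payment = await _payments.ChargeAsync(order);
try
{
    await _shipping.ScheduleAsync(order);
}
catch (Exception ex)
{
    _logger.LogError(ex,
        "Shipping failed for order {OrderId}; refunding payment {PaymentId}",
        order.Id, payment.Id);

    await _payments.RefundAsync(payment.Id);
    throw;
}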

Environment drift is another silent killer. Development, staging, and production often behave differently due to configuration mismatches. Testing error scenarios across environments is tedious but necessary.

AI integrations introduce their own risks. Large language models can time out, return malformed responses, or behave unpredictably. Wrapping these calls with timeouts, circuit breakers, and strict response validation prevents them from becoming a new source of instability.
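
A sketch of that wrapping, again leaning on Polly for the circuit breaker; the model client and the response validator are placeholders, not a specific SDK:

var breaker = Policy
    .Handle<Exception>()
    .CircuitBreakerAsync(exceptionsAllowedBeforeBreaking: 5, durationOfBreak: TimeSpan.FromSeconds(30));

// Cap each individual call so a hanging model request cannot stall the pipeline.
using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(10));

var raw = await breaker.ExecuteAsync(ct => _modelClient.CompleteAsync(prompt, ct), cts.Token);

// Never trust the shape of the response: validate before anything downstream uses it.
if (!ResponseValidator.TryParse(raw, out var parsed))
    throw new InvalidOperationException("Model returned an unexpected response format.");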

What I Would Change If Starting Again

I would define a shared error contract across services from the beginning. Retrofitting this later is painful and error-prone.

I would treat every network call as a potential failure, even internal ones. Assuming reliability is how systems fail unexpectedly.

Most importantly, I would invest in log correlation and searchability before the first production incident. It is much harder to add observability after users are already affected.

Practical Takeaways

Design error responses with clear codes, messages, and correlation identifiers instead of relying on raw exceptions.

Make all side-effecting APIs and background jobs idempotent or accept that duplicate processing will happen.

Log errors with context, not just stack traces, and alert on meaningful patterns.

Assume cloud platforms will hide failures unless you make them visible.

In React applications, surface errors honestly and clearly instead of masking them behind generic messages.

How do you handle partial failures and retries in your own distributed systems? What patterns saved you during incidents, and what approaches failed under pressure? I would love to hear your war stories or disagreements.

If you want a fuller C# error response template or a more complete idempotency example than the sketches above, let me know and I can share what has worked for me.
