Retries in Distributed Systems: My Observations

#life #distributedsystems #resilience #architecture

Since I started working with distributed systems, I've learned to assume that something will always go wrong, rather than expecting everything to work perfectly "the first time." Especially when managing supply chain integrations or financial transactions in a production ERP, operations inevitably get interrupted due to network outages, server overload, or temporary database locks. In such moments, "retry mechanisms" cease to be just a technical detail and become a philosophy that directly impacts the system's overall resilience and even my own mental well-being.

Like life itself, distributed systems are full of uncertainties. Each microservice, each network hop, each database call can be a source of failure on its own. Therefore, when we initiate an operation, we cannot guarantee its success. In this post, I will share my observations from twenty years of experience, explaining how we handle these inevitable failures, which retry approaches we use, and what these technical details have taught me about life.

Introduction: The Inevitability of Retries in Distributed Systems

There's a scenario I've encountered many times: one service calls another, and the call fails because of momentary network congestion. If this call involves, for example, updating the status of a production order, simply accepting the failure can lead to serious business disruptions. While it might seem sufficient at first to catch and log the error with a simple try-catch block, this only postpones the problem or creates a bigger crisis.

⚠️ The Cost of Wrong Assumptions

Assuming systems will always work perfectly, especially under heavy load or in complex integrations, can lead to major disasters. Treating transient errors as permanent causes unnecessary manual interventions and business loss.

On one client project in particular, I ran into this repeatedly in the ERP's integrations with external APIs. A call to a third-party payment provider that failed because of a one-second delay would leave the order pending, which hurt customer satisfaction. I realized that we needed a retry strategy that acknowledges the transient errors inherent in the system's nature. Simply put, the "wait and retry" approach for an operation that fails on the first attempt is often the most practical way to get things back on track.

import requests
import time

def make_api_call_simple(url, data, retries=3):
    last_error = None
    for i in range(retries):
        try:
            response = requests.post(url, json=data, timeout=5)
            response.raise_for_status()  # Raises HTTPError on 4xx/5xx responses
            print(f"Attempt {i+1}: Successful.")
            return response.json()
        except requests.exceptions.RequestException as e:
            last_error = e
            print(f"Attempt {i+1} failed: {e}")
            if i < retries - 1:
                time.sleep(1)  # Fixed wait between attempts
    # Chain the last error so the root cause stays visible in the traceback
    raise Exception(f"All retries failed: {url}") from last_error

# Example usage
try:
    result = make_api_call_simple("http://example.com/api/order", {"item": "widget"})
    print("API call completed successfully:", result)
except Exception as e:
    print("API call failed:", e)

Even this simple example shows how much more resilient a system becomes when it does not give up after a single attempt. However, this is not enough; there are more sophisticated approaches.

Beyond Simple Retries: Delayed Approaches

A fixed time.sleep(1) between attempts often falls short in heavily loaded systems or when a resource is truly overloaded. In fact, it can make the situation worse. When all clients retry at the same time, the result is a problem known as the "thundering herd," which risks the complete collapse of an already struggling service. That is why the concepts of "backoff" and "jitter" are vital in retry strategies.

In my production ERP, I saw this clearly, especially during peak reporting periods when database calls were made. When hundreds of users simultaneously pulled complex reports, the database could temporarily lock up or queries would slow down. If every failed call retried immediately, the database would become completely unusable. The solution was to implement exponential backoff.

Exponential backoff increases the waiting time exponentially after each failed attempt (e.g., 1 second, 2 seconds, 4 seconds, 8 seconds...). This gives the service time to recover. Jitter adds a random amount to this delay, preventing all clients from retrying at the exact same moment. In the backend of my own product, I actively use this combination for calls to external APIs, and I see the system running much more robustly.

import requests
import time
import random

def make_api_call_with_backoff(url, data, retries=5, initial_delay=0.5, max_delay=30):
    for i in range(retries):
        try:
            response = requests.post(url, json=data, timeout=10)
            response.raise_for_status()
            print(f"Attempt {i+1}: Successful.")
            return response.json()
        except requests.exceptions.RequestException as e:
            print(f"Attempt {i+1} failed: {e}")
            if i < retries - 1:
                delay = min(initial_delay * (2 ** i), max_delay)
                jitter = random.uniform(0, delay * 0.1) # Up to 10% jitter
                total_delay = delay + jitter
                print(f"  Waiting for {total_delay:.2f} seconds...")
                time.sleep(total_delay)
    raise Exception(f"All retries failed: {url}")

# Example usage (assuming an endpoint that returns HTTP 500 errors here)
# result = make_api_call_with_backoff("http://example.com/api/failing_endpoint", {"param": "value"})

This approach has also taught me a lesson in life: while persistence is good in some situations, sometimes stopping and waiting, calming the situation, and then retrying intelligently yields more efficient results. Patience is a critical value, both in systems and in human relationships.

The Importance and Side Effects of Idempotence

One of the biggest issues we overlook or don't think enough about when designing retry mechanisms is the concept of "idempotence." An idempotent operation has the same effect on the system even if called multiple times. That is, there's no difference in outcome between performing an operation once and performing it ten times. For example, adding 100 units to a user's balance is not idempotent (the balance increases with each attempt), but setting a user's balance to 100 units or creating a money transfer record with a specific transaction_id can be idempotent.

During my time working on an internal banking platform, I saw firsthand how critical idempotence was for money transfer operations. If a transfer operation failed due to a network error and the client retried, we would attempt to create a new record with the same transaction_id. Since the database had a unique constraint for this transaction_id, the second attempt would automatically fail, preventing catastrophic scenarios like double billing.

-- Table structure for an idempotent transfer record in PostgreSQL
CREATE TABLE transfers (
    id SERIAL PRIMARY KEY,
    transaction_id UUID UNIQUE NOT NULL, -- This is important!
    from_account_id INT NOT NULL,
    to_account_id INT NOT NULL,
    amount DECIMAL(18, 2) NOT NULL,
    status VARCHAR(50) NOT NULL DEFAULT 'pending',
    created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP
);

-- Inserting a transfer record (will error if transaction_id already exists)
INSERT INTO transfers (transaction_id, from_account_id, to_account_id, amount)
VALUES ('a1b2c3d4-e5f6-7890-1234-567890abcdef', 101, 202, 500.00);
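
The unique constraint only rejects the duplicate; the caller still has to decide what a rejected retry means. Here is a minimal sketch of that handling. It uses Python's built-in sqlite3 instead of PostgreSQL so that it runs on its own, and the function name record_transfer and the simplified table are assumptions for illustration, not the banking platform's actual code:

```python
import sqlite3

def record_transfer(conn, transaction_id, from_account, to_account, amount):
    """Insert a transfer once; a retry with the same transaction_id changes nothing.

    Returns True if a new row was created, False if it already existed.
    """
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO transfers "
                "(transaction_id, from_account_id, to_account_id, amount) "
                "VALUES (?, ?, ?, ?)",
                (transaction_id, from_account, to_account, amount),
            )
        return True
    except sqlite3.IntegrityError:
        # In this sketch the only expected integrity failure is the unique
        # constraint on transaction_id: an earlier attempt already succeeded.
        return False

# Hypothetical in-memory table mirroring the key columns of the schema above
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transfers ("
    " id INTEGER PRIMARY KEY,"
    " transaction_id TEXT UNIQUE NOT NULL,"
    " from_account_id INTEGER NOT NULL,"
    " to_account_id INTEGER NOT NULL,"
    " amount NUMERIC NOT NULL)"
)
```

Calling record_transfer twice with the same transaction_id leaves exactly one row in the table, so the client can retry after a network error without risking a second transfer.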

Retry strategies for non-idempotent operations are much more complex. In such cases, approaches like the transactional outbox pattern or event sourcing are typically needed to reliably track the operation's status and retry only unconfirmed operations. In my own product, when designing mechanisms to prevent users from triggering a specific action multiple times, I always keep this principle in mind. Experiencing data inconsistencies due to accidental retries can turn the debugging process into a nightmare. This teaches me that in life, too, we must think carefully about whether our actions are reversible. Some mistakes will yield the same bad outcome no matter how many times they are retried.
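
To make the outbox idea a bit more concrete, here is a rough sketch under simplifying assumptions: sqlite3 stands in for the real database, and the tables (orders, outbox) and functions (place_order, relay_outbox) are hypothetical names chosen for illustration. The business row and the outgoing message are written in one local transaction, and a separate relay step retries only the messages that have not been confirmed as sent:

```python
import json
import sqlite3

def place_order(conn, order_id, payload):
    """Write the order and its outgoing message in one local transaction."""
    body = json.dumps(payload)
    with conn:  # both inserts commit together or not at all
        conn.execute("INSERT INTO orders (id, payload) VALUES (?, ?)",
                     (order_id, body))
        conn.execute("INSERT INTO outbox (order_id, payload) VALUES (?, ?)",
                     (order_id, body))

def relay_outbox(conn, send):
    """Deliver unsent messages in order; return how many were delivered."""
    rows = conn.execute(
        "SELECT id, payload FROM outbox WHERE sent = 0 ORDER BY id"
    ).fetchall()
    delivered = 0
    for row_id, payload in rows:
        try:
            send(json.loads(payload))
        except Exception:
            break  # leave this row and later ones for the next relay run
        with conn:
            conn.execute("UPDATE outbox SET sent = 1 WHERE id = ?", (row_id,))
        delivered += 1
    return delivered

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, payload TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE outbox ("
    " id INTEGER PRIMARY KEY AUTOINCREMENT,"
    " order_id TEXT NOT NULL,"
    " payload TEXT NOT NULL,"
    " sent INTEGER NOT NULL DEFAULT 0)"
)
```

Note that this gives at-least-once delivery: if send succeeds but marking the row as sent fails, the message goes out again on the next run, so the receiver still needs to deduplicate, for example with a unique key like the transaction_id above.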

Protecting the System with Circuit Breakers and Rate Limiters

While making a system resilient with retries, another important point is knowing "when to stop." If a service is completely down or overloaded, sending hundreds or even thousands of retry requests to it will only worsen the situation. This is where "Circuit Breaker" and "Rate Limiter" patterns come into play.

Last year, in a manufacturing company's ERP, the queues in our system became bloated when notifications to a logistics company's API started failing continuously. The reason was that the logistics company's API was temporarily out of service. Our system, even with backoff, kept making failed calls. This situation prevented the backlog in our system from resolving, even after the logistics company's API came back online.

ℹ️ How a Circuit Breaker Works

A circuit breaker stops calls once a certain error rate is reached. In the 'Closed' state, it lets calls through. When the error rate crosses the threshold, it transitions to the 'Open' state and fails all calls immediately without contacting the service. After a certain period, it transitions to the 'Half-Open' state and allows a few trial calls. If these calls succeed, it returns to the 'Closed' state; if they fail, it goes back to 'Open.'

To resolve this scenario, we implemented the Circuit Breaker pattern in our external API calls. If more than 50% of calls failed within a specific time frame (e.g., 60 seconds), the circuit breaker would go into the "open" state, and all subsequent calls would be rejected directly. This prevented wasted resources in our own system and also prevented the remote service from being overloaded further. After a while (e.g., 5 minutes), the circuit breaker would transition to the "half-open" state, allowing a few test calls. If these tests were successful, the circuit would close again.
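
The state machine described above can be sketched roughly as follows. This is a simplified, single-threaded illustration and not the code we ran in production: the class and parameter names are my own assumptions, min_calls is an extra guard I added so that one or two samples cannot trip the breaker, and the half-open state here allows a single trial call rather than a few.

```python
import time
from collections import deque

class CircuitOpenError(Exception):
    """Raised when a call is rejected without reaching the remote service."""

class CircuitBreaker:
    def __init__(self, failure_ratio=0.5, window_seconds=60, min_calls=4,
                 open_seconds=300, clock=time.monotonic):
        self.failure_ratio = failure_ratio
        self.window_seconds = window_seconds
        self.min_calls = min_calls        # do not trip on very few samples
        self.open_seconds = open_seconds
        self.clock = clock                # injectable so tests can fake time
        self.state = "closed"
        self.opened_at = None
        self.results = deque()            # (timestamp, succeeded) pairs

    def call(self, func, *args, **kwargs):
        now = self.clock()
        if self.state == "open":
            if now - self.opened_at < self.open_seconds:
                raise CircuitOpenError("circuit is open; call rejected")
            self.state = "half-open"      # cool-down over: allow a trial call
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._record(now, False)
            raise
        self._record(now, True)
        return result

    def _record(self, now, succeeded):
        if self.state == "half-open":
            # The trial call decides: success closes, failure re-opens
            self.results.clear()
            if succeeded:
                self.state = "closed"
            else:
                self._open(now)
            return
        self.results.append((now, succeeded))
        # Drop results that have fallen out of the time window
        while self.results and now - self.results[0][0] > self.window_seconds:
            self.results.popleft()
        failures = sum(1 for _, ok in self.results if not ok)
        total = len(self.results)
        if total >= self.min_calls and failures / total > self.failure_ratio:
            self._open(now)

    def _open(self, now):
        self.state = "open"
        self.opened_at = now
        self.results.clear()
```

A caller would wrap each outbound request, for example breaker.call(make_api_call_with_backoff, url, data), and treat CircuitOpenError as a signal to queue the work for later instead of retrying right away.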

Rate limiting, on the other hand, limits the number of calls that can be made to a service within a specific period. I use rate limiting for API calls to third-party AI models in my own product's AI integrations. This helps me keep costs under control and comply with API providers' usage policies. These two patterns remind me that in life, too, we must know our own limits, sometimes rest, and not push others too hard.
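
On the rate-limiting side, one common way to cap calls is a token bucket. The sketch below is an illustration of that general technique, not the exact limiter in my product; the class name, parameters, and the numbers in the usage note are assumptions:

```python
import time

class TokenBucket:
    """Allow bursts up to `capacity`, refilling at `rate_per_second` tokens."""

    def __init__(self, rate_per_second, capacity, clock=time.monotonic):
        self.rate = rate_per_second
        self.capacity = capacity
        self.clock = clock                # injectable so tests can fake time
        self.tokens = float(capacity)     # start with a full bucket
        self.updated_at = clock()

    def try_acquire(self, tokens=1):
        """Return True and spend tokens if the call may proceed, else False."""
        now = self.clock()
        elapsed = now - self.updated_at
        # Refill for the time that has passed, never above capacity
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated_at = now
        if self.tokens >= tokens:
            self.tokens -= tokens
            return True
        return False
```

With TokenBucket(rate_per_second=2, capacity=3), for example, three calls can go out back to back, and after that roughly two per second are let through. When try_acquire returns False, the caller can wait or queue the request instead of sending it.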

The Human Factor and Monitoring: When to Intervene?

No matter how good retry mechanisms and protective patterns are, systems are never 100% autonomous. The human factor, meaning intervention by me and my team, sometimes becomes unavoidable. This is where robust monitoring and alerting systems come into play. Knowing when a system is running "well," when it's running "okay," and when it's in a "disaster" state is crucial for making the right decisions at the right time.

Once, due to a WAL bloat problem in our PostgreSQL database, disk usage increased unexpectedly. We were seeing occasional FATAL errors in the journald logs, but thanks to pgbouncer and application-level retry mechanisms, users weren't directly experiencing an outage. However, when disk usage reached 98% at 03:14 AM, an automatic alarm went off. I woke up at 03:20 AM and intervened. If we had relied solely on retry mechanisms without monitoring, the system would likely have crashed completely by morning.

# An example from journalctl output
May 19 03:14:23 server-prod systemd[1]: postgresql@14-main.service: Main process exited, code=exited, status=1/FAILURE
May 19 03:14:23 server-prod systemd[1]: postgresql@14-main.service: Failed with result 'exit-code'.
May 19 03:14:24 server-prod systemd[1]: postgresql@14-main.service: Service hold-off time over, scheduling restart.
May 19 03:14:24 server-prod systemd[1]: postgresql@14-main.service: State 'auto-restart' is still active.

Events like these taught me that I need to monitor not only technical systems but also my own workload and mental health. Just like a system, a person can handle things up to a certain stress threshold, but beyond a certain point, intervention or rest is necessary. Automatic retries absorb the initial shock while giving me time to find and fix the root cause of the actual problem. This, as I mentioned in the [related: my experience solving system performance issues] post, is a continuous cycle of observation and learning. Monitoring is not just a tool; it is the language the system uses to talk to us.

Conclusion: The Art of Resilience in Life and Systems

The observations I've made on retries in distributed systems have taught me important lessons, not just in technical matters but in many areas of life. Accepting that not everything will be perfect on the first try, viewing failures as learning opportunities, and showing flexibility make both systems and our own lives more resilient.

When designing complex workflows in a production ERP, setting up a network infrastructure, or developing my own products, I've always operated with these principles:

Be Prepared for the Unexpected: Always consider the possibility that things can go wrong.
Be Patient: Instead of giving up immediately, retry with the right delay.
Retry Smartly: Use approaches like exponential backoff and jitter to wait for the right moment without exhausting the system or yourself.
Know Your Limits: Protect your own capacity and the capacity of other systems with circuit breakers and rate limiters.
Observe and Intervene: Detect problems early with monitoring and take an active role when necessary.

These approaches can guide us not only in lines of code or network configurations but also when facing challenges in our daily lives. When we fail in a project, have a disagreement in a relationship, or struggle to learn a new skill, this "retry" philosophy reminds us to be resilient, not to give up, and to get a little better with each attempt. Ultimately, both systems and people develop by learning from mistakes and showing flexibility. In my next post, [related: complex database optimizations], I will explain how we achieve this flexibility at the database layer.