4 Smart Ways to Manage Retries in Side Projects

#life #productdevelopment #personalprojects

Introduction: The Hidden Teachers of My Own Projects

Side projects, undertaken on my own initiative, have become the greatest learning grounds in my career. These are the moments when I can step away from the processes required by large corporate projects and engage directly with technology. However, this freedom comes with its own set of challenges. One of the most significant is the ability to systematically manage retries, especially in error situations. Over the past decade, I've developed numerous side projects, from my personal financial calculator to a mobile spam blocker. Through the experiences gained, I've tried to establish four fundamental strategies for learning from mistakes and moving forward. In this post, I'll explain these strategies and how I apply them with real-life examples.

These side projects are not just pieces of code for me; they are also a platform for continuous development. While I gain corporate experience working on production ERP systems or internal banking platforms, these personal projects allow me to gain different perspectives and adopt experimental approaches. For instance, the optimizations I make in my personal financial calculators running on my own VPS can sometimes shed light on database performance issues in main projects. In this article, I will share my thoughts on how to manage retries more intelligently, as taught to me by these "hidden teachers."

1. Automate Error Logging: Journald and Logging Strategies

One of the biggest problems I encountered in side projects was the lack of sufficient data to understand why errors were occurring. Especially in one-off scripts or transient services, knowing what was happening in the system at the moment an error occurred is critical. My Linux systems have been running systemd for a long time, so this is where I started leaning on journald.

Journald doesn't just collect logs; it also manages logging levels and rate limits. For my own side projects, I generally follow this strategy: for critical errors, I log with high detail to syslog or a dedicated file, while keeping routine operational logs more concise. This saves disk space while ensuring I have all the necessary information when an error occurs. For example, I had a one-time data extraction script. This script would occasionally receive HTTP 500 Internal Server Error while fetching data from a specific API. Initially, I tried to debug by adding print() statements within the script to understand the error. However, this was slow, and it was difficult to precisely capture the moment the error occurred.
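One way to keep critical and routine logs apart without a logging library is sketched below. It assumes the script runs as a systemd service with its stdout/stderr connected to the journal: in that setup, systemd reads a leading `<N>` on each output line as a syslog priority (see sd-daemon(3)), provided SyslogLevelPrefix= is left at its default. The helper names are hypothetical, not part of any library.

```javascript
// Syslog priorities understood by journald (see sd-daemon(3)).
const PRIORITY = { error: 3, warning: 4, info: 6, debug: 7 };

// Build one journal line: "<N>" prefix plus a single-line message.
// Hypothetical helper; newlines are escaped so one event stays one entry.
function formatJournalLine(level, message, context = {}) {
  const priority = PRIORITY[level] ?? PRIORITY.info;
  const details = Object.keys(context).length > 0 ? ` ${JSON.stringify(context)}` : "";
  return `<${priority}>${message}${details}`.replace(/\n/g, "\\n");
}

// Errors carry full context; routine messages stay short.
function logError(message, context) {
  console.error(formatJournalLine("error", message, context));
}

function logInfo(message) {
  console.log(formatJournalLine("info", message));
}
```

With this in place, a call such as logError("API request failed", { status: 500 }) ends up in the journal at error priority, and journalctl -p err shows only those entries.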

The rate limiting feature offered by journald also comes into play here. Especially in high-traffic services, logging the same error repeatedly can fill up disk space and make logs unreadable. By setting LogRateLimitIntervalSec= and LogRateLimitBurst= in my systemd unit files (the global equivalents in journald.conf are RateLimitIntervalSec= and RateLimitBurst=), I can limit the number of log messages a service emits within a given time frame. This proved useful when analyzing 429 Too Many Requests errors in an API gateway that handled high traffic. The first step in understanding the root cause of errors is to automate how they are recorded. That helps not only with detecting an error but also with finding its root cause faster.
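To make the interval and burst semantics concrete, here is a minimal in-process sketch of the same idea: at most `burst` messages pass per `intervalMs` window, the rest are dropped and counted, and a summary is written when the next window opens. This only imitates the behavior inside the application; createRateLimitedLogger and all of its options are hypothetical names.

```javascript
// In-process analogue of interval/burst log rate limiting.
// All names and defaults here are illustrative assumptions.
function createRateLimitedLogger({
  intervalMs = 30000,
  burst = 5,
  write = console.log,
  now = Date.now, // injectable clock, mainly for testing
} = {}) {
  let windowStart = now();
  let count = 0;
  let suppressed = 0;

  return function log(message) {
    const t = now();
    if (t - windowStart >= intervalMs) {
      // New window: report what was dropped, then reset the counters.
      if (suppressed > 0) {
        write(`Suppressed ${suppressed} messages`);
      }
      windowStart = t;
      count = 0;
      suppressed = 0;
    }
    if (count < burst) {
      count += 1;
      write(message);
      return true;
    }
    suppressed += 1;
    return false;
  };
}
```

The return value tells the caller whether the message was actually written, which can be handy when deciding whether to attach extra detail.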

ℹ️ Logging and Error Analysis

In my own projects, I follow logs in real time with journalctl -f for critical services, and use journalctl -xe to jump to the most recent entries with explanatory catalog text when I am chasing a specific error. This keeps me from overlooking potential issues.

2. Retry Mechanisms: Ideal Wait Time and Exponential Backoff

When an error occurs, one of the first things that comes to mind is "wait a bit and try again." However, the duration of this "bit" is critical for system stability and user experience. In situations like network errors or temporary service outages, it's important to follow a smart waiting strategy rather than retrying immediately. The concept of "exponential backoff" comes into play here.

In my side projects, I frequently use this mechanism, especially in code that accesses external APIs or services. For instance, in my personal financial calculator project, I occasionally experienced network connection issues when connecting to a data provider's API. When I received an error on the first attempt, instead of retrying immediately, I would wait 1 second and try again. If I still received an error, I would double the waiting time to 2 seconds, then 4 seconds, 8 seconds, and so on. The biggest advantage of this strategy is that it doesn't overload the target service while giving a temporary issue a reasonable amount of time to resolve.

Another point I pay attention to when implementing this strategy is the maximum number of retries and the total waiting time. I don't want to end up in an infinite loop. Generally, 3 to 5 retries are sufficient. If I still receive an error after these attempts, I take it as a sign that the issue is no longer temporary and requires manual investigation. Another important parameter is "jitter": adding a small random amount to the calculated wait instead of using the exact value. This keeps multiple clients from retrying at the same moment, which avoids the "thundering herd" problem.

💡 The Importance of Jitter

Adding jitter prevents multiple clients from retrying a service simultaneously, thus avoiding sudden load spikes on the service. In my own projects, I create this effect by adding a random duration of 10-20% to the calculated waiting time.
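The waiting times described above can be sketched as a small pure function. This is only an illustration under stated assumptions: the names backoffDelay and backoffSchedule, the 30-second cap, and the injectable random source are mine, chosen so the total wait stays bounded and the result is easy to check.

```javascript
// Delay before retry number `attempt` (0-based):
// baseMs * 2^attempt, capped at maxMs, plus 10-20% jitter on top.
// `random` is injectable so the function can be tested deterministically.
function backoffDelay(attempt, { baseMs = 1000, maxMs = 30000, random = Math.random } = {}) {
  const exponential = Math.min(baseMs * 2 ** attempt, maxMs);
  const jitter = exponential * (0.1 + 0.1 * random());
  return Math.round(exponential + jitter);
}

// Full schedule for a bounded number of retries,
// useful for checking the total waiting time up front.
function backoffSchedule(maxRetries, options) {
  return Array.from({ length: maxRetries }, (_, attempt) => backoffDelay(attempt, options));
}
```

With the defaults and four retries, the base waits are 1, 2, 4, and 8 seconds (15 seconds in total), and the jitter adds between 10% and 20% to each of them.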

Here's an example of how I implemented this strategy with the standard fetch API in one of my Astro projects:

async function fetchDataWithRetry(url, maxRetries = 3, initialDelay = 1000) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      const response = await fetch(url);
      if (response.ok) {
        // Return data if successful
        return await response.json();
      }
      const httpError = new Error(`HTTP error! status: ${response.status}`);
      // Only server errors (5xx) are retried; client errors fail immediately
      httpError.retryable = response.status >= 500 && response.status < 600;
      throw httpError;
    } catch (error) {
      console.error(`Attempt ${attempt + 1} failed: ${error.message}`);
      // Give up on non-retryable errors and on the last attempt.
      // Network errors thrown by fetch have no `retryable` flag and are retried.
      if (error.retryable === false || attempt === maxRetries - 1) {
        throw error;
      }
    }
    // Exponential backoff: initialDelay, then 2x, 4x, ... plus 10-20% jitter.
    // The jitter is computed per wait so it does not accumulate across attempts.
    const backoff = initialDelay * 2 ** attempt;
    const jitter = backoff * (0.1 + Math.random() * 0.1);
    console.warn(`Retrying in ${Math.round(backoff + jitter)}ms...`);
    await new Promise(resolve => setTimeout(resolve, backoff + jitter));
  }
}

// Usage example:
// fetchDataWithRetry('https://api.example.com/data')
//   .then(data => console.log('Data received:', data))
//   .catch(error => console.error('Failed to fetch data after multiple retries:', error));

This code snippet includes exponential backoff and jitter mechanisms to handle potential network issues when fetching data from an external service. Such simple yet effective strategies ensure my side projects run more stably.

3. State Management and Idempotency: The Side Effects of Retries

One of the most important aspects to consider when implementing retry mechanisms is whether the operation being performed is "idempotent." Idempotency means that applying an operation multiple times has the same effect as applying it once. If your operation is not idempotent, retries can lead to unexpected and undesirable side effects.

For example, consider a function that sends an email to a user. If this function is not idempotent and is retried due to a network error, it might send the same email to the user twice. This is a highly annoying situation from a user experience perspective. In my own projects, especially for sensitive operations like financial transactions or data updates, I use various methods to ensure idempotency.

One method is to use a unique transaction ID for each operation. When I send a request, I send this ID along with it. The server checks the ID to see whether the operation has been performed before and, if so, does not process it again. For instance, in my Android spam blocker app, when sending a command to block a number, I would generate a UUID for each blocking request and send it to the server. If the server had seen this UUID before, it would not process the command to block the same number again. This increased the app's stability and prevented situations like double blocking.
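The server-side check can be sketched roughly as follows. This is not the actual code of the spam blocker; createBlockService, handleBlockRequest, and the in-memory Map are hypothetical stand-ins, and a real server would persist the seen IDs (for example in a table with a unique constraint) so they survive restarts.

```javascript
// Hypothetical server-side handler that remembers processed request IDs.
function createBlockService() {
  const processed = new Map(); // requestId -> result of the first execution
  const blockLog = [];         // one entry per block command actually executed

  function handleBlockRequest({ requestId, phoneNumber }) {
    if (processed.has(requestId)) {
      // A retry of a request we already handled: return the stored result, do no work.
      return { ...processed.get(requestId), replayed: true };
    }
    blockLog.push(phoneNumber);
    const result = { status: "blocked", phoneNumber };
    processed.set(requestId, result);
    return { ...result, replayed: false };
  }

  return { handleBlockRequest, blockLog };
}
```

The important detail on the client side is that the ID is generated once per logical operation and reused on every retry; generating a fresh ID per attempt would defeat the check.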

Another approach is to make the operation itself idempotent. For example, instead of incrementing a value, set it directly to a specific number (a SET-style operation); repeating the assignment changes nothing. Similarly, instead of deleting an item outright, you can mark it as "deleted," which is safe to repeat. Approaches like these allow retries to be performed safely. In PostgreSQL, INSERT ... ON CONFLICT DO NOTHING and INSERT ... ON CONFLICT DO UPDATE (commonly called an upsert) help in the same way.
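The difference is easy to see in a few lines. The functions below are illustrative only (hypothetical names, plain objects and a Map instead of a database); insertIfAbsent is meant to resemble the spirit of ON CONFLICT DO NOTHING, not to reproduce its exact semantics.

```javascript
// Not idempotent: every retry changes the result again.
function incrementBalance(account, amount) {
  return { ...account, balance: account.balance + amount };
}

// Idempotent: repeating the call leaves the same state as calling it once.
function setBalance(account, balance) {
  return { ...account, balance };
}

// Idempotent insert: a row with an existing key is left untouched.
function insertIfAbsent(table, key, row) {
  if (!table.has(key)) {
    table.set(key, row);
  }
  return table;
}
```

If a retry accidentally applies incrementBalance twice, the balance is wrong; applying setBalance or insertIfAbsent twice with the same arguments leaves the data exactly as after the first call.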

⚠️ The Danger of Retrying Without Idempotency

Retrying a non-idempotent operation can lead to data inconsistencies, duplicate records, or serious issues like spamming users. Always consider whether your operation is idempotent.

In this context, the "transactional outbox" pattern is also very useful. When saving an operation to the database, we also write the message to be sent for this operation into an "outbox" table, in the same transaction. A separate service then reads messages from this outbox and sends them to external services. This way, if the database operation succeeds but the message does not reach the external service, the message can be resent. Because a message may therefore be delivered more than once, the receiver should handle it idempotently. This is a simpler and more reliable alternative to two-phase commit.
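A minimal sketch of the outbox flow is shown below. It uses an in-memory object as a stand-in for a database, so the "transaction" is only simulated; in a real system both writes in saveOrderWithOutbox would sit inside one database transaction. All names (createDb, saveOrderWithOutbox, relayOutbox, the order_created message) are hypothetical.

```javascript
// In-memory stand-in for a database with an orders table and an outbox table.
function createDb() {
  return { orders: [], outbox: [] };
}

// Save the business record and its outgoing message together.
// With a real database, these two writes belong in the same transaction.
function saveOrderWithOutbox(db, order) {
  const orderRow = { ...order };
  const outboxRow = {
    id: `msg-${order.id}`,
    payload: { type: "order_created", orderId: order.id },
    sentAt: null, // null means "not delivered yet"
  };
  db.orders.push(orderRow);
  db.outbox.push(outboxRow);
}

// Relay: try to deliver every unsent message; mark it sent only after success.
async function relayOutbox(db, send) {
  let delivered = 0;
  for (const message of db.outbox.filter(m => m.sentAt === null)) {
    try {
      await send(message); // may throw; the message then stays in the outbox
      message.sentAt = Date.now();
      delivered += 1;
    } catch (error) {
      console.warn(`Delivery of ${message.id} failed, will retry: ${error.message}`);
    }
  }
  return delivered;
}
```

Running relayOutbox on a timer gives at-least-once delivery: a message that fails stays unsent and is picked up on the next run, which is why the message carries a stable id the receiver can use to ignore duplicates.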

4. Create a Learning Loop: Learning from Errors and Documentation

One of the most valuable aspects of my side projects is the ability to learn from mistakes and make those lessons permanent. Simply fixing an error is not enough; understanding why it happened, taking the necessary steps to prevent similar errors in the future, and organizing this information makes a big difference in the long run. This is, in a way, performing my own personal "post-mortem" analysis.

For me, the first step in this learning loop is to set up detailed logging and retry mechanisms, as mentioned above. The next step is to analyze these logs and error situations. In my own side projects, I often keep an "Error Log" in a simple Markdown file or a Notion page. In this log, I record the problem I encountered, when and where the error occurred, the steps I took to resolve it, and most importantly, the lessons I learned from it.

For example, when the WAL (Write-Ahead Logging) files of PostgreSQL started bloating in my personal financial calculator projects, I documented the situation in detail. I noted that the size of the pg_wal directory suddenly reached 10GB, filling up disk space. I recorded that I adjusted the wal_keep_size parameter and tuned autovacuum settings to resolve the issue. These notes will guide me if I encounter a similar problem in the future. Such documentation is beneficial not only for me but also for anyone I might share the project with one day.

💡 The Importance of Documentation

Put your lessons learned from errors into writing. This will not only help you resolve the current problem but also help you identify and prevent similar issues faster in the future. The "Error Log" I keep in my own projects is one of my most valuable resources.

To complete this learning loop, I sometimes make improvements to my code. For instance, if an error consistently requires manual intervention, I try to make a small script or code change to automate this situation. This not only solves the current problem but also increases the overall reliability of the project. In my own side projects, this approach has allowed me to build more robust and less maintenance-intensive systems over time.

Conclusion: Errors Are the Fuel for Growth

Side projects offer the freedom to build and experiment on our own. However, encountering errors during these experiments is inevitable. What matters is how we deal with them. Systematic error logging, smart retry mechanisms, ensuring idempotency, and creating a loop for learning from errors make this process more efficient and educational. The experiences I've gained in my own projects have made me better prepared for the challenges I face not only in my personal projects but also in corporate projects. Errors are the fuel for growth, as long as we manage them correctly.

The strategies I've described in this article are approaches I've developed based on my personal experiences. Every project may have its own unique dynamics, but the fundamental principles generally remain the same. I hope this information helps you move forward with more solid steps in your own side projects. Remember, every error is a harbinger of a better next step.