Kshitij (kd)

Posted on Sep 27, 2023 • Edited on Oct 7, 2023

Resilient Systems using Go: Retry Mechanism

#distributedsystems #go #programming

Introduction

In this post, we will discuss the retry mechanism used to make systems more resilient and build a simple implementation in Go. The idea is to develop the mechanism step by step, starting from a few abstract requirements.

Background

There was a time when running multiple instances of a monolith was enough to serve users. Applications today are far more complex: they move a lot of information and communicate with many other applications to provide users with a service. With so many moving parts, it becomes more important to make sure your application doesn't break when interacting with third-party services. It's better to let the user know that the request cannot be processed than to make them wait.

That's why we strategically place timeouts. In Go, we do that using the context package. Once a call times out, the application's logic decides whether to try again.

We also don't want to retry too many times in a short period of time. The third-party API may reject all requests once their number crosses a threshold, or even block your application's IP address from making any requests.
It is a good idea to retry for a bounded amount of time, at pre-defined intervals.
The most common way to do it is to retry every n seconds, where n is configured by the user. This value can be chosen based on the rate-limiter threshold of the third-party API.

One can also apply exponential backoff. If there is an error, the next request is made after n seconds, the one after that waits 2n seconds, then 4n, and so on.

But do you want the retry mechanism to kick in when the application itself makes a bad request?
Or to keep sending requests when the third-party API has told the system that the service will be unavailable for some time?

Design

So our retry package should support these configurations:

Max number of retries
Standard duration between each retry
User-defined backoffs (exponential or custom)
Bad errors: when one of these errors occurs, we stop retrying
Retry errors: a list of errors to retry on; if an error outside the list occurs, we stop retrying

One cannot have both bad errors and retry errors enabled at the same time. Similarly, if custom or exponential intervals are given, the variable that sets the maximum number of retries should be ignored, since the number of intervals already determines how many retries happen.

Implementation

For the user to run a function through our retry package, it has to adhere to a fixed signature: a function that takes no arguments and returns only an error.
This can be easily done using closures.
Let's say our retry package has a method called Run:
Run(fn func() error)

And we want to call Run on our own function, ThirdPartyCall(input string) (string, error).

So a call would look like this:



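obj := retry.New()
err := obj.Run(func() error {
    resp, err := ThirdPartyCall("input String")
    if err != nil {
        return err
    }
    // code logic that uses resp goes here
    _ = resp // placeholder so the snippet compiles
    return nil
})
if err != nil {
    // every attempt failed; handle the error
}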
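Because the function passed to Run can only return an error, the response has to leave the closure some other way. One option is to assign it to a variable declared outside the closure. In the sketch below, run and flakyCall are hypothetical stand-ins for the package's Run method and for ThirdPartyCall.

```go
package main

import (
	"errors"
	"fmt"
)

// run is a hypothetical stand-in for Run: it calls fn up to attempts
// times and returns nil on the first success, else the last error.
func run(attempts int, fn func() error) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = fn(); err == nil {
			return nil
		}
	}
	return err
}

// flakyCall fails on its first two calls, then succeeds.
func flakyCall(calls *int, input string) (string, error) {
	*calls++
	if *calls < 3 {
		return "", errors.New("temporarily unavailable")
	}
	return "echo: " + input, nil
}

// fetch captures the response in resp, which lives outside the closure.
func fetch(input string) (string, int, error) {
	var (
		resp  string
		calls int
	)
	err := run(5, func() error {
		var err error
		resp, err = flakyCall(&calls, input) // assigns the outer resp
		return err
	})
	return resp, calls, err
}

func main() {
	resp, calls, err := fetch("input String")
	// prints: echo: input String 3 <nil>
	fmt.Println(resp, calls, err)
}
```

Note the plain assignment (=) to resp inside the closure; using := there would create a new local variable and leave the outer one empty.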

Run Function

For the purpose of this blog, the user does not have to choose between a normal retry method and one with user-specified intervals. Run simply checks the intervals variable: if its length is zero, we run the function in normal mode; otherwise we run it using the intervals specified.

These are a few things to keep in mind before the implementation:

Keep a count of the retries done so it can be compared to the threshold.
Put the wait not at the start of the function but after an error occurs. You don't want the system to wait before the function is called for the first time.
Check for bad errors. If the error is one of them, don't try again.
Check for retry errors. If the error is not one of them, don't try again.

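Another thing to keep in mind is that the whole request, including retries, might take a couple of seconds. We don't want the configuration to change while the retry process is going on, so Run copies the configuration into local variables before it starts. The extra space this takes is small, so it is not a concern. Keep in mind that in Go this is a shallow copy: maps and slices such as badErrors and intervals still share their underlying data with the Retrier.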

So the code may look like this



// Run runs the user method wrapped inside Action
// with a set number of retries according to the configuration
func (r *Retrier) Run(fn Action) error {

    if len(r.intervals) > 0 {
        return r.RunWithIntervals(fn)
    }
    var (
        count       int
        badErrors   = r.badErrors
        be          = r.be
        maxRetries  = r.maxRetries
        re          = r.re
        retryErrors = r.retryErrors
        sleep       = r.sleep
    )

    var rn func(fn Action) error
    rn = func(fn Action) error {

        if err := fn(); err != nil {
            if be {
                if _, ok := badErrors[err]; ok {
                    return err
                }
            }

            if re {
                if _, ok := retryErrors[err]; !ok {
                    return err
                }
            }
            count++

            if count > maxRetries {
                return ErrNoResponse
            }
            time.Sleep(sleep)
            return rn(fn)
        }
        return nil
    }

    return rn(fn)
}

Run With Intervals

The code is similar to our Run function. Since we already keep a count of retries to compare against the threshold, the same counter can be used to pick how long to sleep before the next attempt. The code for this would look like this:



// RunWithIntervals is similar to Run. The difference is that we have a slice
// of time durations corresponding to each retry here, instead of maxRetries
// and constant sleep gap.
func (r *Retrier) RunWithIntervals(fn Action) error {
    var (
        count       int
        badErrors   = r.badErrors
        be          = r.be
        re          = r.re
        retryErrors = r.retryErrors
        intervals   = r.intervals
    )

    var rn func(fn Action) error
    rn = func(fn Action) error {

        if err := fn(); err != nil {
            if be {
                if _, ok := badErrors[err]; ok {
                    return err
                }
            }

            if re {
                if _, ok := retryErrors[err]; !ok {
                    return err
                }
            }
            // The number of intervals is the retry budget: stop once
            // every interval has been used.
            if count >= len(intervals) {
                return ErrNoResponse
            }
            // Sleep before incrementing so intervals[0] is used for the
            // first retry and the index never goes out of range.
            time.Sleep(intervals[count])
            count++
            return rn(fn)
        }
        return nil
    }

    return rn(fn)
}

And that's it. The retry package is ready to use. Configuring the package is outside the scope of this post, but it can be found here. I have used the functional options pattern to set the configuration.

You can check out the package and its test cases here

DEV Community