Kenta Goto for AWS Community Builders

Posted on Aug 22, 2023 • Edited on Sep 7, 2023

Collection of retry patterns for SDK calls in AWS SDK for Go v2

#aws #go

When calling AWS APIs from Go with the AWS SDK for Go v2, you may want to retry a failed SDK call.

There are several ways to do this, so I collected them in this article.

Assumptions

The examples in this article use Go v1.18.7 and aws-sdk-go-v2 v1.17.5.

Retries can also be configured centrally for all SDK calls (set when the client instance is created), but this article assumes that you want to set or change the retry behavior for each SDK call (= API call).

For example, you might want different retry behavior for s3.ListObjects and iam.DeleteRole.

Repository

The source code for this project is available on GitHub.

[Retry Patterns 1.] Options.RetryMaxAttempts

Let's start with the simplest one.

The Options struct, which is used both for client creation and for individual API calls in the AWS SDK, has retry parameters called RetryMaxAttempts and RetryMode, as shown below.

Implementation Example

input := &iam.DeleteRoleInput{
    RoleName: roleName,
}

optFn := func(o *iam.Options) {
    o.RetryMaxAttempts = 3
    o.RetryMode = aws.RetryModeStandard
}

_, err := i.client.DeleteRole(ctx, input, optFn)

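Simply specifying these causes failed calls to be retried with exponential backoff when the SDK considers the error retryable. RetryMaxAttempts is the maximum number of attempts, including the first call, so a value of 3 allows up to two retries.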
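To make the arithmetic concrete, here is a minimal standard-library sketch of capped exponential backoff with jitter. It is not the SDK's code: the function names, the 1-second base, and the 20-second cap are assumptions chosen for illustration.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// backoffCeiling returns the upper bound of the wait before the given retry
// (retry 1 is the wait after the first failed attempt). The bound doubles
// with each retry and is capped at maxBackoff.
func backoffCeiling(retry int, base, maxBackoff time.Duration) time.Duration {
	if retry < 1 {
		return 0
	}
	ceiling := base
	for i := 1; i < retry; i++ {
		ceiling *= 2
		if ceiling >= maxBackoff {
			return maxBackoff
		}
	}
	if ceiling > maxBackoff {
		return maxBackoff
	}
	return ceiling
}

// backoffWithJitter picks a random wait between 0 and the ceiling, so that
// many clients failing at the same moment do not retry in lockstep.
func backoffWithJitter(retry int, base, maxBackoff time.Duration) time.Duration {
	ceiling := backoffCeiling(retry, base, maxBackoff)
	if ceiling <= 0 {
		return 0
	}
	return time.Duration(rand.Int63n(int64(ceiling) + 1))
}

func main() {
	const maxAttempts = 3 // the first call plus up to two retries
	for retry := 1; retry < maxAttempts; retry++ {
		fmt.Printf("retry %d: wait up to %v\n",
			retry, backoffCeiling(retry, time.Second, 20*time.Second))
	}
	// Output:
	// retry 1: wait up to 1s
	// retry 2: wait up to 2s
}
```

The loop in main shows why a maximum of 3 attempts produces only two waits: there is nothing to wait for before the first call.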

[Retry Patterns 2.] Options.Retryer

In addition, Options has a parameter called Retryer for implementing a fine-tuned retry algorithm.

If this parameter is set, the retry behavior it implements is applied instead of the RetryMaxAttempts and RetryMode settings described above.

The value assigned to Options.Retryer must implement the interface Retryer or RetryerV2.

The interface has methods for the retry decision logic (IsErrorRetryable), the maximum number of attempts (MaxAttempts), and the wait time (RetryDelay), which let you customize the retry behavior.

Specifically, IsErrorRetryable lets you specify in more detail when to retry, and RetryDelay lets you implement logic such as "wait a random number of seconds (jitter)" rather than only an exponential backoff.

Code in SDK module (not example implementation)

type Retryer interface {
    // IsErrorRetryable returns if the failed attempt is retryable. This check
    // should determine if the error can be retried, or if the error is
    // terminal.
    IsErrorRetryable(error) bool

    // MaxAttempts returns the maximum number of attempts that can be made for
    // an attempt before failing. A value of 0 implies that the attempt should
    // be retried until it succeeds if the errors are retryable.
    MaxAttempts() int

    // RetryDelay returns the delay that should be used before retrying the
    // attempt. Will return error if the if the delay could not be determined.
    RetryDelay(attempt int, opErr error) (time.Duration, error)

    // GetRetryToken attempts to deduct the retry cost from the retry token pool.
    // Returning the token release function, or error.
    GetRetryToken(ctx context.Context, opErr error) (releaseToken func(error) error, err error)

    // GetInitialToken returns the initial attempt token that can increment the
    // retry token pool if the attempt is successful.
    GetInitialToken() (releaseToken func(error) error)
}

// RetryerV2 is an interface to determine if a given error from an attempt
// should be retried, and if so what backoff delay to apply. The default
// implementation used by most services is the retry package's Standard type.
// Which contains basic retry logic using exponential backoff.
//
// RetryerV2 replaces the Retryer interface, deprecating the GetInitialToken
// method in favor of GetAttemptToken which takes a context, and can return an error.
//
// The SDK's retry package's Attempt middleware, and utilities will always
// wrap a Retryer as a RetryerV2. Delegating to GetInitialToken, only if
// GetAttemptToken is not implemented.
type RetryerV2 interface {
    Retryer

    // GetInitialToken returns the initial attempt token that can increment the
    // retry token pool if the attempt is successful.
    //
    // Deprecated: This method does not provide a way to block using Context,
    // nor can it return an error. Use RetryerV2, and GetAttemptToken instead.
    GetInitialToken() (releaseToken func(error) error)

    // GetAttemptToken returns the send token that can be used to rate limit
    // attempt calls. Will be used by the SDK's retry package's Attempt
    // middleware to get a send token prior to calling the temp and releasing
    // the send token after the attempt has been made.
    GetAttemptToken(context.Context) (func(error) error, error)
}

If you use this method, define a struct named Retryer in a separate file.

The retry decision logic behind IsErrorRetryable can be defined by the caller as a function and passed to the constructor, which keeps the struct general-purpose.

RetryDelay, in the middle of the example below, contains the logic to "retry after a random number of seconds".

Example implementation (retryer_options.go)

package retryer

import (
    "context"
    "math/rand"
    "time"

    "github.com/aws/aws-sdk-go-v2/aws"
)

const MaxRetryCount = 10

var _ aws.RetryerV2 = (*Retryer)(nil)

type Retryer struct {
    isErrorRetryableFunc func(error) bool
    delayTimeSec         int
}

func NewRetryer(isErrorRetryableFunc func(error) bool, delayTimeSec int) *Retryer {
    return &Retryer{
        isErrorRetryableFunc: isErrorRetryableFunc,
        delayTimeSec:         delayTimeSec,
    }
}

func (r *Retryer) IsErrorRetryable(err error) bool {
    return r.isErrorRetryableFunc(err)
}

func (r *Retryer) MaxAttempts() int {
    return MaxRetryCount
}

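// RetryDelay ignores the attempt number and the error, and returns a random
// wait of 1 to delayTimeSec seconds (jitter) instead of an exponential backoff.
func (r *Retryer) RetryDelay(int, error) (time.Duration, error) {
    // Reseeding on every call works, but seeding once at startup would be enough.
    rand.Seed(time.Now().UnixNano())
    waitTime := 1
    if r.delayTimeSec > 1 {
        // rand.Intn(n) returns a value in [0, n), so waitTime ends up in [1, delayTimeSec].
        waitTime += rand.Intn(r.delayTimeSec)
    }
    return time.Duration(waitTime) * time.Second, nil
}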

func (r *Retryer) GetRetryToken(context.Context, error) (func(error) error, error) {
    return func(error) error { return nil }, nil
}

func (r *Retryer) GetInitialToken() func(error) error {
    return func(error) error { return nil }
}

func (r *Retryer) GetAttemptToken(context.Context) (func(error) error, error) {
    return func(error) error { return nil }, nil
}
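The three token methods above are no-ops, so this Retryer places no quota on retries. For contrast, the sketch below shows the general idea of a retry quota as a token bucket: each retry pays a cost, and a successful retry refunds it. This is a simplified standard-library illustration, not the SDK's implementation; the `retryQuota` type, its method names, and the numbers are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var errQuotaExceeded = errors.New("retry quota exceeded")

// retryQuota is a token bucket shared by all calls that use it.
type retryQuota struct {
	mu        sync.Mutex
	capacity  int
	remaining int
}

func newRetryQuota(capacity int) *retryQuota {
	return &retryQuota{capacity: capacity, remaining: capacity}
}

// acquire deducts cost tokens for one retry. It returns a release function
// with the same shape as the one returned by GetRetryToken: calling it with
// a nil error (the retry succeeded) refunds the cost.
func (q *retryQuota) acquire(cost int) (func(error) error, error) {
	q.mu.Lock()
	defer q.mu.Unlock()

	if cost > q.remaining {
		return nil, errQuotaExceeded
	}
	q.remaining -= cost

	return func(err error) error {
		if err != nil {
			return nil // the retry failed: the tokens stay spent
		}
		q.mu.Lock()
		defer q.mu.Unlock()
		q.remaining += cost
		if q.remaining > q.capacity {
			q.remaining = q.capacity
		}
		return nil
	}, nil
}

// available reports how many tokens are left.
func (q *retryQuota) available() int {
	q.mu.Lock()
	defer q.mu.Unlock()
	return q.remaining
}

func main() {
	quota := newRetryQuota(10)

	release, err := quota.acquire(5)
	fmt.Println(quota.available(), err) // 5 <nil>

	_, err = quota.acquire(6)
	fmt.Println(err) // retry quota exceeded

	_ = release(nil) // the retry succeeded, so the cost is refunded
	fmt.Println(quota.available()) // 10
}
```

With a quota like this, a burst of failures eventually stops retrying instead of adding more load. The no-op version in the example file trades that protection for simplicity.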

Then, based on this Retryer, the retry behavior is specified when the SDK is called.

In the following implementation example, the variable retryable holds a function with decision logic such as "retry if the SDK error message contains api error Throttling: Rate exceeded".

Then optFn is defined as a function that sets Options.Retryer to a Retryer instance created from retryable and SleepTimeSec, and it is passed as the third argument of the SDK call (in this case, DeleteRole).

Example implementation (caller) (iam.go)

const SleepTimeSec = 5

...
...

input := &iam.DeleteRoleInput{
    RoleName: roleName,
}

retryable := func(err error) bool {
    return strings.Contains(err.Error(), "api error Throttling: Rate exceeded")
}
optFn := func(o *iam.Options) {
    o.Retryer = retryer.NewRetryer(retryable, SleepTimeSec)
}

_, err := i.client.DeleteRole(ctx, input, optFn)
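Matching on the full error string is brittle, because the message format is not a contract. A sturdier variant inspects the error code instead. The sketch below uses only the standard library: it searches the error chain for any value with an `ErrorCode() string` method, which is the shape exposed by the `smithy.APIError` interface. The `errorCoder` interface, the `fakeAPIError` type, and the `wrapFake` helper are assumptions made for this illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// errorCoder matches any error that reports a service error code.
type errorCoder interface {
	error
	ErrorCode() string
}

// isThrottling reports whether err, or any error it wraps, carries the
// "Throttling" error code.
func isThrottling(err error) bool {
	var coded errorCoder
	if errors.As(err, &coded) {
		return coded.ErrorCode() == "Throttling"
	}
	return false
}

// fakeAPIError stands in for a service error in this example.
type fakeAPIError struct {
	code    string
	message string
}

func (e *fakeAPIError) Error() string     { return "api error " + e.code + ": " + e.message }
func (e *fakeAPIError) ErrorCode() string { return e.code }

// wrapFake wraps a fakeAPIError the way an operation error might wrap it.
func wrapFake(code, message string) error {
	return fmt.Errorf("operation error: %w", &fakeAPIError{code: code, message: message})
}

func main() {
	fmt.Println(isThrottling(wrapFake("Throttling", "Rate exceeded"))) // true
	fmt.Println(isThrottling(wrapFake("NoSuchEntity", "not found")))   // false
	fmt.Println(isThrottling(errors.New("connection reset")))          // false
}
```

A function like `isThrottling` has the `func(error) bool` signature that NewRetryer expects, so it could be passed in place of the string-matching closure.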

[Retry Patterns 3.] Golang Generics

The Options.Retryer approach above lets you define your own logic within the official retry mechanism. The method described here instead implements the whole retry loop yourself.

This method uses generics, which were added in Go 1.18.

With Retryer, it is difficult to freely build the error message that is returned when retries are exhausted (for example, to include the name of the resource on which the error occurred).

Here I describe how to make these points more flexible.

First, define the retry function with generics in a separate file.

Example implementation (retryer_generics.go)

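package retryer

import (
    "context"
    "fmt"
    "math/rand"
    "strconv"
    "time"
)

// MaxRetryCount is defined in retryer_options.go in the same package.

// T: Input type for API Request.
// U: Output type for API Response.
// V: Options type for API Request.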
type RetryInput[T, U, V any] struct {
    Ctx              context.Context
    SleepTimeSec     int
    TargetResource   *string
    Input            *T
    ApiOptions       []func(*V)
    ApiCaller        func(ctx context.Context, input *T, optFns ...func(*V)) (*U, error)
    RetryableChecker func(error) bool
}

// T: Input type for API Request.
// U: Output type for API Response.
// V: Options type for API Request.
func Retry[T, U, V any](
    in *RetryInput[T, U, V],
) (*U, error) {
    retryCount := 0

    for {
        output, err := in.ApiCaller(in.Ctx, in.Input, in.ApiOptions...)
        if err == nil {
            return output, nil
        }

        if in.RetryableChecker(err) {
            retryCount++
            if err := waitForRetry(in.Ctx, retryCount, in.SleepTimeSec, in.TargetResource, err); err != nil {
                return nil, err
            }
            continue
        }
        return nil, err
    }
}

func waitForRetry(ctx context.Context, retryCount int, sleepTimeSec int, targetResource *string, err error) error {
    if retryCount > MaxRetryCount {
        errorDetail := err.Error() + "\nRetryCount(" + strconv.Itoa(MaxRetryCount) + ") over, but failed to delete. "
        return fmt.Errorf("RetryCountOverError: %v, %v", *targetResource, errorDetail)
    }

    select {
    case <-ctx.Done():
        return ctx.Err()
    case <-time.After(getRandomSleepTime(sleepTimeSec)):
    }
    return nil
}

func getRandomSleepTime(sleepTimeSec int) time.Duration {
    rand.Seed(time.Now().UnixNano())
    waitTime := 1
    if sleepTimeSec > 1 {
        waitTime += rand.Intn(sleepTimeSec)
    }
    return time.Duration(waitTime) * time.Second
}

Here is the explanation. Retry is a function that performs the retries and takes a RetryInput as its input.

For the type parameters of RetryInput ([T, U, V any]), the caller passes, for example, iam.DeleteRoleInput for T, iam.DeleteRoleOutput for U, and iam.Options for V.

ApiCaller is the actual SDK method itself (e.g., the client's DeleteRole).

RetryableChecker is a function that defines the logic to determine when to retry.

// T: Input type for API Request.
// U: Output type for API Response.
// V: Options type for API Request.
type RetryInput[T, U, V any] struct {
    Ctx              context.Context
    SleepTimeSec     int
    TargetResource   *string
    Input            *T
    ApiOptions       []func(*V)
    ApiCaller        func(ctx context.Context, input *T, optFns ...func(*V)) (*U, error)
    RetryableChecker func(error) bool
}

// T: Input type for API Request.
// U: Output type for API Response.
// V: Options type for API Request.
func Retry[T, U, V any](
    in *RetryInput[T, U, V],
) (*U, error) {
    retryCount := 0

    for {
        output, err := in.ApiCaller(in.Ctx, in.Input, in.ApiOptions...)
        if err == nil {
            return output, nil
        }

        if in.RetryableChecker(err) {
            retryCount++
            if err := waitForRetry(in.Ctx, retryCount, in.SleepTimeSec, in.TargetResource, err); err != nil {
                return nil, err
            }
            continue
        }
        return nil, err
    }
}

The waitForRetry function returns an error with a custom message when the number of retries exceeds the maximum, MaxRetryCount.

It also uses the context passed as an argument: while waiting for the next retry, it watches for cancellation (for example, because another part of the program failed and the whole run should stop). If the context is canceled before the wait finishes, it returns ctx.Err() immediately instead of sleeping for the remaining time.

func waitForRetry(ctx context.Context, retryCount int, sleepTimeSec int, targetResource *string, err error) error {
    if retryCount > MaxRetryCount {
        errorDetail := err.Error() + "\nRetryCount(" + strconv.Itoa(MaxRetryCount) + ") over, but failed to delete. "
        return fmt.Errorf("RetryCountOverError: %v, %v", *targetResource, errorDetail)
    }

    select {
    case <-ctx.Done():
        return ctx.Err()
    case <-time.After(getRandomSleepTime(sleepTimeSec)):
    }
    return nil
}
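The same wait can also be written with an explicit timer that is stopped when the function returns, so a canceled wait does not leave a pending timer behind. The sketch below is a standalone variant using only the standard library; `sleepWithContext` and `demo` are hypothetical names, not functions from the repository.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// sleepWithContext waits for d, or returns ctx.Err() as soon as ctx is done.
func sleepWithContext(ctx context.Context, d time.Duration) error {
	timer := time.NewTimer(d)
	defer timer.Stop() // release the timer if the context ends first

	select {
	case <-ctx.Done():
		return ctx.Err()
	case <-timer.C:
		return nil
	}
}

// demo runs one wait under an already canceled context and one under a
// context that is never canceled.
func demo() (canceledErr, completedErr error) {
	canceledCtx, cancel := context.WithCancel(context.Background())
	cancel()

	// Returns immediately even though the requested wait is one hour.
	canceledErr = sleepWithContext(canceledCtx, time.Hour)

	// Runs to completion because nothing cancels the context.
	completedErr = sleepWithContext(context.Background(), 10*time.Millisecond)
	return canceledErr, completedErr
}

func main() {
	canceledErr, completedErr := demo()
	fmt.Println(canceledErr)  // context canceled
	fmt.Println(completedErr) // <nil>
}
```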

getRandomSleepTime, which appears in waitForRetry above, contains the logic that determines the wait time between retries.

Here it waits a random number of seconds (jitter), from 1 second up to the specified upper limit (sleepTimeSec).

func getRandomSleepTime(sleepTimeSec int) time.Duration {
    rand.Seed(time.Now().UnixNano())
    waitTime := 1
    if sleepTimeSec > 1 {
        waitTime += rand.Intn(sleepTimeSec)
    }
    return time.Duration(waitTime) * time.Second
}
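One note on rand.Seed: reseeding on every call is unnecessary, and since Go 1.20 the function is deprecated because the global generator is seeded randomly at startup. A variant that owns its generator and seeds it once might look like the sketch below. The `jitter` type and its methods are hypothetical names, not part of the repository.

```go
package main

import (
	"fmt"
	"math/rand"
	"sync"
	"time"
)

// jitter owns a random source that is seeded once, at construction.
type jitter struct {
	mu  sync.Mutex // a *rand.Rand is not safe for concurrent use
	rng *rand.Rand
}

func newJitter(seed int64) *jitter {
	return &jitter{rng: rand.New(rand.NewSource(seed))}
}

// seconds returns a whole number of seconds in [1, maxSec], the same range
// as getRandomSleepTime. Any maxSec below 2 always yields 1.
func (j *jitter) seconds(maxSec int) int {
	if maxSec <= 1 {
		return 1
	}
	j.mu.Lock()
	defer j.mu.Unlock()
	return 1 + j.rng.Intn(maxSec)
}

// sleepTime converts the random number of seconds into a time.Duration.
func (j *jitter) sleepTime(maxSec int) time.Duration {
	return time.Duration(j.seconds(maxSec)) * time.Second
}

func main() {
	j := newJitter(time.Now().UnixNano())
	fmt.Println(j.sleepTime(5)) // a value between 1s and 5s
}
```

Passing the seed in also makes the wait times reproducible in tests, which the global generator does not allow.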

And here is an example implementation of the caller of this Retry function.

Example implementation (caller) (iam.go)

    input := &iam.DeleteRoleInput{
        RoleName: roleName,
    }

    retryable := func(err error) bool {
        return strings.Contains(err.Error(), "api error Throttling: Rate exceeded")
    }

    _, err := retryer.Retry(
        &retryer.RetryInput[iam.DeleteRoleInput, iam.DeleteRoleOutput, iam.Options]{
            Ctx:              ctx,
            SleepTimeSec:     SleepTimeSec,
            TargetResource:   roleName,
            Input:            input,
            ApiCaller:        i.client.DeleteRole,
            RetryableChecker: retryable,
        },
    )

The advantage of this method is that, thanks to generics, the compiler guarantees that the input, output, and options types match the API being called, even though the retry function is written by the user.

However, unless you have a particular reason, I think it is better to use the officially provided Options.Retryer.
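To see the type relationship at work without the AWS SDK, the following standard-library sketch applies the same idea to a fake client. Everything in it (`fakeInput`, `fakeOutput`, `fakeOptions`, `fakeClient`, `deleteWithRetry`, and the simplified `retry` function, which does not sleep between attempts) is hypothetical and exists only to show how the three type parameters are inferred from the method value that is passed in.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

type fakeInput struct{ Name string }
type fakeOutput struct{ Deleted bool }
type fakeOptions struct{}

var errThrottled = errors.New("throttled")

// fakeClient fails with errThrottled a fixed number of times, then succeeds.
type fakeClient struct {
	failuresLeft int
	calls        int
}

func (c *fakeClient) Delete(ctx context.Context, in *fakeInput, optFns ...func(*fakeOptions)) (*fakeOutput, error) {
	c.calls++
	if c.failuresLeft > 0 {
		c.failuresLeft--
		return nil, errThrottled
	}
	return &fakeOutput{Deleted: true}, nil
}

// retry calls api until it succeeds, the error is not retryable, or
// maxAttempts calls have been made.
func retry[T, U, V any](
	ctx context.Context,
	maxAttempts int,
	input *T,
	api func(context.Context, *T, ...func(*V)) (*U, error),
	retryable func(error) bool,
) (*U, error) {
	if maxAttempts < 1 {
		maxAttempts = 1
	}
	var lastErr error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		out, err := api(ctx, input)
		if err == nil {
			return out, nil
		}
		if !retryable(err) {
			return nil, err
		}
		lastErr = err
	}
	return nil, fmt.Errorf("gave up after %d attempts: %w", maxAttempts, lastErr)
}

// deleteWithRetry runs the fake Delete through retry and reports how many
// calls were made.
func deleteWithRetry(failures, maxAttempts int) (calls int, err error) {
	client := &fakeClient{failuresLeft: failures}
	isThrottled := func(err error) bool { return errors.Is(err, errThrottled) }

	// T, U, and V are inferred from the input pointer and client.Delete.
	// Passing an input or API function of a mismatched type would not compile.
	_, err = retry(context.Background(), maxAttempts, &fakeInput{Name: "demo"}, client.Delete, isThrottled)
	return client.calls, err
}

func main() {
	fmt.Println(deleteWithRetry(2, 5)) // 3 <nil>
	fmt.Println(deleteWithRetry(5, 3)) // 3 gave up after 3 attempts: throttled
}
```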

Finally

This article presented several retry patterns for the AWS SDK for Go v2. Retryer in particular may be unfamiliar to some of you, so please take this opportunity to try it!