Kenji Koshikawa

Posted on Feb 13, 2022 • Edited on Feb 17, 2022

Retry automatically with Exponential Backoff in Cloud Workflows

#googlecloud #cloudworkflows

Recently I build workflows using Cloud Workflows to combine some modules of Cloud Functions or Cloud Run on Google Cloud.

Sometimes I encountered errors of 429 or 500 related scaling issues when calling an endpoint of Cloud Functions or Cloud Run.

According to this official document, the solution of retrying with exponential backoff is introduced.

The solution:
For HTTP trigger-based functions, have the client implement exponential backoff and retries for requests that must not be dropped.

So this post introduces the solution in Cloud Workflows.

An example of Cloud Functions

I made a simple function of Cloud Functions to reproduce the scaling errors easily like below.

just sleeps 3 seconds
set scale settings to the minimum　(min instance:0, max instance:1)

foobar/main.py

import time

import flask

def main(request):
    time.sleep(3)
    return flask.jsonify({'result': 'ok'})

Following command can deploy to Cloud Functions.

# Deploys the function
$ gcloud functions deploy foobar \
  --entry-point main \
  --runtime python39 \
  --trigger-http \
  --region asia-northeast1 \
  --timeout 120 \
  --memory 128MB \
  --min-instances 0 \
  --max-instances 1 \
  --source ./foobar

# Grants a service account associated with workflows to execute the function
$ gcloud functions add-iam-policy-binding foobar \
    --region=asia-northeast1 \
    --member=serviceAccount:${YOUR-SERVICE-ACCOUNT} \
    --role=roles/cloudfunctions.invoker

A workflow for reproducing the scaling errors

At first, the following code is a workflow for reproducing the scaling errors.

main:
    params: [input]
    steps:
    - callFunc:
        call: http.get
        args:
            url: https://asia-northeast1-xxx.cloudfunctions.net/foobar
            auth:
                type: OIDC
        result: api_result
    - returnOutput:
        return: ${api_result.body}

Following command can deploy to Cloud Workflows.

$ gcloud workflows deploy v1 \
                --source=v1.yml \
                --location=asia-southeast1 \
                --service-account=${YOUR-SERVICE-ACCOUNT}

In order to reproduce the scaling errors, I executed the following shell over 20 times.

$ gcloud workflows run --project=${YOUR-PROJECT} --location=asia-southeast1 v1 --data='{}' &

As expected, 429 error was reproduced many times. The probability of success of the workflow executions was about 6 in 20.

In the console of Cloud Workflows and Cloud Functions, I could see below an error Information.

HTTP server responded with error code 429
in step "callFunc", routine "main", line: 5
{
  "body": "Rate exceeded.",
  "code": 429,
  "headers": {
    "Alt-Svc": "h3=\":443\"; ma=2592000,h3-29=\":443\"; ma=2592000,h3-Q050=\":443\"; ma=2592000,h3-Q046=\":443\"; ma=2592000,h3-Q043=\":443\"; ma=2592000,quic=\":443\"; ma=2592000; v=\"46,43\"",
    "Content-Length": "14",
    "Content-Type": "text/html",
    "Date": "Wed, 09 Feb 2022 08:17:19 GMT",
    "Server": "Google Frontend",
    "X-Cloud-Trace-Context": "2a8e4ba95570e4a6585a0b678d7f3b98"
  },
  "message": "HTTP server responded with error code 429",
  "tags": [
    "HttpError"
  ]
}

The solution

Next, The following code is the solution to retry automatically with exponential backoff.

In sub-workflow call_api, it is implemented to retry with exponential backoff when returning HTTP status 429 or 500 code. I set the retry count to 5 times and the initial sleep time to 10 seconds.

main:
    params: [input]
    steps:
    - callFunc:
        call: call_api
        args:
            url: https://asia-northeast1-xxx.cloudfunctions.net/foobar
        result: api_result
    - return_output:
        return: ${api_result.body}

call_api:
    params: [url]
    steps:
        - setup:
            assign:
                - retry_count: 5
                - first_sleep_sec: 10
                - sleep_time: ${first_sleep_sec}
        - try_many_times:
            for:
                value: count
                range: [1, ${retry_count}]
                steps:
                    - log_before_call:
                        call: sys.log
                        args:
                            text: ${"call_api url=" + url + " (" + string(count) + "/" + string(retry_count) + ")"}
                    - try_call_block:
                        try:
                            steps:
                                - request_url:
                                    call: http.get
                                    args:
                                        url: ${url}
                                        auth:
                                            type: OIDC
                                    result: api_result
                                - return_result:
                                    return: ${api_result}
                        except:
                            as: e
                            steps:
                                - handle_error:
                                    switch:
                                        - condition: ${count >= retry_count}
                                          raise: ${e}
                                        - condition: ${not("HttpError" in e.tags)}
                                          raise: ${e}
                                        - condition: ${(e.code == 429 or e.code == 500)}
                                          next: log_sleep_time
                                        - condition: true
                                          raise: ${e}
                                - log_sleep_time:
                                    call: sys.log
                                    args:
                                        severity: 'WARNING'
                                        text: ${"got HTTP status " + string(e.code) + ". waiting " + string(sleep_time) + " seconds."}
                                - wait:
                                    call: sys.sleep
                                    args:
                                        seconds: ${sleep_time}
                                - update_sleep_time:
                                    assign:
                                        - sleep_time: ${sleep_time * 2}
                                - next_continue:
                                    next: continue

In the same way, after deploying this workflow I executed it over 20 times.

As a result, All of the workflow executions succeeded and improved. One of them took 3 minutes over, but the retries worked expected.

According to the logs, I could see to wait exponentially each retry time like 10, 20, 40 and 80 seconds so on.

Conclusion

This post introduced an implementation to retry automatically with exponential backoff in Cloud Workflows and showed the solution is effective and enables to continue processing when occurring scaling problems.

(Addition) A simpler way using try/retry statement

There is a simpler way to use try/retry statement. Thanks to @krisbraun who shared at this article's comment.

Pattern1: using default retry policy (very simple)

If your function is idempotent, I think most of the use cases can be covered by the default ${http.default_retry}.

Simple default retry policy for idempotent targets.
Retries on 429 (Too Many Requests), 502 (Bad Gateway), 503 (Service unavailable), and 504 (Gateway Timeout), as well as on any ConnectionError and TimeoutError.
Uses max_retries of 5, and backoff as per retry.default_backoff.

main:
    params: [input]
    steps:
    - call_api:
        try:
            call: http.get
            args:
                url: https://asia-northeast1-xxx.cloudfunctions.net/foobar
                auth:
                    type: OIDC
            result: api_result
        retry: ${http.default_retry}
    - return_value:
        return: ${api_result.body}

Notice: ${http.default_retry} doesn't retry when 500 code.

Pattern2: using custom policy

The following workflow shows to add a retry condition when occurring 500 code in addition to ${http.default_retry} retry conditions. Also, the custom setting for backoff is set to initial sleep time 10 seconds, multiplier 2.

main:
    params: [input]
    steps:
    - call_api:
        try:
            call: http.get
            args:
                url: https://asia-northeast1-xxx.cloudfunctions.net/foobar
                auth:
                    type: OIDC
            result: api_result
        retry:
            predicate: ${custom_retry_policy}
            backoff:
                initial_delay: 10
                max_delay: 300
                multiplier: 2
    - return_value:
        return: ${api_result.body}

custom_retry_policy:
    params: [e]
    steps:
    - assign_retry_codes:
        assign:
            - retry_codes: [429, 500, 502, 503, 504]
    - what_to_repeat:
        switch:
          - condition: ${("code" in e) and (e.code in retry_codes)}
            return: True
          - condition: ${("tags" in e) and ("ConnectionError" in e.tags)}
            return: True
          - condition: ${("tags" in e) and ("TimeoutError" in e.tags)}
            return: True
    - otherwise:
          return: False

Top comments (4)

Kris Braun • Feb 14 '22

Hi Kenji, there's actually a simpler way to achieve this, but we (I'm the PM for Workflows) obviously haven't made it discoverable enough!

To add retries with exponential backoff to any step, simply wrap it in try... retry, providing either a default retry policy or a custom one (where you can specify the backoff parameters). Workflows will take care of retrying the step for you based on the response.

I will look into getting Workflows retry feature mentioned on the Cloud Functions doc page you mention!

Kenji Koshikawa • Feb 15 '22

Hi Kris,
Thank you for your great information.

It looks useful. I didn't know that.

I will try it. After adding verification, I will add notes to this article.

Kenji Koshikawa • Feb 17 '22

I added examples to use try/retry codes.
Thanks to @krisbraun.

blaquiere guillaume • Feb 14 '22

Great article. Maybe having a built-in component to achieve that will save a lot of lines!! I'm going to submit your article to the Workflow development team