Best practices for using AWS Step Functions

#aws #serverless #bestpractices

In this post you will learn some of the best patterns/tricks I have learned during my time creating Step Functions workflows, while working with clients at Accenture.

Table of contents (clickable)

Make use of the Step Functions Workflow Studio
Utilize the service integrations
Use .waitForTaskToken
Make use of the inbuilt retries
Utilize Heartbeats to fail fast
Define a Catch Handler
Further reading

Make use of the Step Functions Workflow Studio

Since its introduction in June, Step Functions Workflow Studio has proved its value to me several times. As it's a low-code editor with the most common configurations already baked in, the effort of writing and designing a workflow in Amazon States Language has plummeted. The minutes and hours spent handwriting workflows of 100+ lines are over for good, which is something you should not ignore when working with Step Functions.

As visualized below, the option to create a workflow from scratch in Workflow Studio directly is already present:

As is the option to edit pre-existing workflows directly in Step Functions:

Utilize the service integrations

AWS offers some 17 "optimized" service integrations (for the definitive list see here), each with custom options for integrating with the specific service. In addition, AWS has released an option to call the APIs of nearly all AWS services directly, as described in this article. This allows you to scrap some of the utility Lambdas you would otherwise write to augment a service's functionality and use Step Functions instead.
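As a minimal sketch of such a direct API call, the state below starts a Glue crawler without any helper Lambda. The state is written as a JavaScript object so it can be built and printed; the helper `sdkIntegrationArn` and the crawler name are assumptions made for illustration, not part of any AWS library.

```javascript
// Build the Resource ARN for an AWS SDK service integration.
// Pattern: arn:aws:states:::aws-sdk:<service>:<apiAction>, with the action in camelCase.
function sdkIntegrationArn(service, action) {
  return `arn:aws:states:::aws-sdk:${service}:${action}`;
}

// Hypothetical Task state that calls the Glue StartCrawler API directly.
const startCrawlerState = {
  Type: "Task",
  Resource: sdkIntegrationArn("glue", "startCrawler"),
  // Parameter names follow the service API (PascalCase); the crawler name is a placeholder.
  Parameters: { Name: "my-example-crawler" },
  End: true
};

// Serialize to the JSON form used inside a state machine definition.
console.log(JSON.stringify(startCrawlerState, null, 2));
```

The execution role of the state machine still needs permission for the called API, in this case starting the crawler.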

Use `.waitForTaskToken`

With .waitForTaskToken, you can pause the workflow until a task, such as a Lambda function, reports back that it has finished.

Be aware that you need to pass the task token in the payload for the Lambda yourself, as Step Functions does not inject it automatically.

Example

code example

This example shows how to send the task token for success/failure back to Step Functions via AWS JS SDK v3.

import {
  SFNClient,
  SendTaskFailureCommand,
  SendTaskSuccessCommand
} from "@aws-sdk/client-sfn";
const client = new SFNClient();

async function success(taskToken, input) {
  const stepFunctionsCommand = new SendTaskSuccessCommand({
    taskToken,
    output: input
  });
  await client.send(stepFunctionsCommand);
}

async function failure(taskToken, cause, error) {
  const stepFunctionsCommand = new SendTaskFailureCommand({
    taskToken,
    cause,
    error
  });
  await client.send(stepFunctionsCommand);
}

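async function main(event) {
  try {
    // Do the actual work here, then report success.
    // "output" must be a JSON string, so the event is serialized before sending.
    await success(event.MyTaskToken, JSON.stringify(event));
  } catch (error) {
    console.error(error);
    // $metadata only exists on errors thrown by the SDK, so fall back to {}.
    const { requestId, cfId, extendedRequestId } = error.$metadata ?? {};
    // "cause" and "error" must both be strings.
    await failure(
      event.MyTaskToken,
      JSON.stringify({ requestId, cfId, extendedRequestId }),
      error.name
    );
  }
}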
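On the state machine side, the task that invokes this Lambda could look roughly like the following sketch. It is written as a JavaScript object so it can be printed as JSON. The function name `my-worker-function` and the `input` key are placeholders; `MyTaskToken` matches the property the handler above reads from the event.

```javascript
// Sketch of a callback-style Task state, written as a plain JavaScript object.
// "my-worker-function" is a placeholder name, not taken from the article.
const waitForCallbackState = {
  Type: "Task",
  // The ".waitForTaskToken" suffix makes the state pause until
  // SendTaskSuccess or SendTaskFailure is called with the token.
  Resource: "arn:aws:states:::lambda:invoke.waitForTaskToken",
  Parameters: {
    FunctionName: "my-worker-function",
    Payload: {
      // "$$.Task.Token" reads the token from the context object;
      // the ".$" key suffix marks the value as a path, not a literal.
      "MyTaskToken.$": "$$.Task.Token",
      // Forward the state input as well, so the function has data to work on.
      "input.$": "$"
    }
  },
  End: true
};

// Serialize to the JSON form used inside a state machine definition.
console.log(JSON.stringify(waitForCallbackState, null, 2));
```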

Utilize Heartbeats to fail fast

Because Step Functions executions can run for up to a year (at least for Standard workflows), it is imperative to avoid stuck executions. One way of doing this when integrating with Lambda is the heartbeat API together with the HeartbeatSeconds setting. It lets developers specify a maximum interval within which a heartbeat has to be sent back to Step Functions. If the deadline is missed, the task fails with a States.Timeout error.

Example

code example

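In the following example a task is specified with a maximum heartbeat interval of 10 minutes. The Resource uses the .waitForTaskToken suffix, because sending a heartbeat requires a task token.

{
  "Type": "Task",
  "Resource": "arn:aws:states:::lambda:invoke.waitForTaskToken",
  "HeartbeatSeconds": 600
}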

This code-example shows how to send back a heartbeat with AWS JS SDK v3.

import {
  SFNClient,
  SendTaskHeartbeatCommand
} from "@aws-sdk/client-sfn";
const client = new SFNClient();

async function heartbeat(taskToken) {
  const stepFunctionsCommand = new SendTaskHeartbeatCommand({
    taskToken
  });
  await client.send(stepFunctionsCommand);
}

async function main(event) {
    await heartbeat(event.MyTaskToken)

    // some expensive calculation

    await heartbeat(event.MyTaskToken)
}
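If the expensive calculation can take longer than the heartbeat interval, two heartbeats are not enough. One possible approach, sketched below, is to send heartbeats on a timer while the work is running. `withHeartbeat` is a hypothetical helper, not part of the AWS SDK; the heartbeat sender is passed in as a function so the helper does not depend on the SDK itself.

```javascript
// Run `work` while calling `sendHeartbeat` every `intervalMs` milliseconds.
// `sendHeartbeat` would typically be () => heartbeat(event.MyTaskToken).
async function withHeartbeat(sendHeartbeat, intervalMs, work) {
  const timer = setInterval(() => {
    // A failed heartbeat is logged instead of thrown, so it cannot crash the process.
    Promise.resolve()
      .then(sendHeartbeat)
      .catch((err) => console.error("heartbeat failed:", err));
  }, intervalMs);
  try {
    return await work();
  } finally {
    // Stop the timer whether the work succeeded or threw.
    clearInterval(timer);
  }
}
```

With the HeartbeatSeconds value of 600 from above, an interval comfortably below that, for example 60000 milliseconds, leaves room for a few missed heartbeats before the task times out.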

Define a Catch Handler

Rationale

To quote Werner Vogels, Amazon CTO:

everything fails, all the time

Be prepared for the wildly different and sometimes unexpected errors the AWS APIs can throw by catching them, just as you would in a Lambda function (if you don't do that, we are having a whole different conversation).

Example

In this pretty simple example, the catch block is used on a Lambda task. It works on the other tasks as well. This example uses a catch-all error code; for other error codes see here.

As it is usually a good idea to get notified when an error occurs, this example publishes to an SNS topic, which may have an email subscription. I'd recommend using this only for really critical errors, as you may otherwise miss important errors in your then-cluttered inbox.

code example

{
  "Invoke Lambda Task": {
    // [..]
    "Catch": [
      {
        "ErrorEquals": [
          "States.ALL"
        ],
        "Next": "SNS Publish"
      }
    ]
    // [..]
  },
  "SNS Publish": {
    "Type": "Task",
    "Resource": "arn:aws:states:::sns:publish",
    "Parameters": {
      "Message.$": "$"
    },
    "Next": "Fail"
  },
  "Fail": {
    "Type": "Fail"
  }
}
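When a Catch array has more than one entry, the order matters: catchers are checked from top to bottom, and States.ALL belongs in the last one. The following sketch is a deliberately simplified model of that selection, not the actual Step Functions implementation. It ignores special cases such as States.TaskFailed, and the state name "Handle Timeout" is made up for illustration.

```javascript
// Simplified model of catcher selection: catchers are checked in order and
// the first one whose ErrorEquals contains the error name (or "States.ALL") wins.
function findNextState(catchers, errorName) {
  for (const catcher of catchers) {
    if (
      catcher.ErrorEquals.includes(errorName) ||
      catcher.ErrorEquals.includes("States.ALL")
    ) {
      return catcher.Next;
    }
  }
  return undefined; // no catcher matched: the execution fails
}

const catchers = [
  { ErrorEquals: ["States.Timeout"], Next: "Handle Timeout" }, // hypothetical state name
  { ErrorEquals: ["States.ALL"], Next: "SNS Publish" }
];

console.log(findNextState(catchers, "States.Timeout")); // Handle Timeout
console.log(findNextState(catchers, "Lambda.ServiceException")); // SNS Publish
```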

Make use of the inbuilt retries

Instead of catching an error and then terminating the flow, you can use retries to recover from the error or to wait for a desired state. For example, you can use this to await results from APIs such as those of a Glue Crawler, which need repeated polling and potentially exponential backoff.

Example

code example

In this example, a retry on specific Lambda exceptions is shown. The IntervalSeconds parameter defines the delay that has to pass before the first retry is attempted. The BackoffRate parameter specifies the multiplier applied to that delay after each unsuccessful attempt. Step Functions will retry after 2, 4, 8, 16, 32 and 64 seconds, limited by the MaxAttempts value of 6.

{
  "Invoke Lambda Task": {
    // [..]
    "Retry": [
      {
        "ErrorEquals": [
          "Lambda.ServiceException",
          "Lambda.AWSLambdaException",
          "Lambda.SdkClientException"
        ],
        "IntervalSeconds": 2,
        "MaxAttempts": 6,
        "BackoffRate": 2
      }
    ]
    // [..]
  }
}
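The retry schedule can be checked with a few lines of arithmetic. The sketch below recomputes the delays from the three parameters; `retryDelays` is a hypothetical helper written for this post, and its defaults follow the documented defaults of a Retry block (IntervalSeconds 1, MaxAttempts 3, BackoffRate 2.0).

```javascript
// Compute the wait before each retry attempt, in seconds.
// Retry n (starting at 1) waits IntervalSeconds * BackoffRate^(n - 1).
function retryDelays({ IntervalSeconds = 1, MaxAttempts = 3, BackoffRate = 2.0 }) {
  const delays = [];
  for (let attempt = 0; attempt < MaxAttempts; attempt++) {
    delays.push(IntervalSeconds * Math.pow(BackoffRate, attempt));
  }
  return delays;
}

const delays = retryDelays({ IntervalSeconds: 2, MaxAttempts: 6, BackoffRate: 2 });
console.log(delays); // [ 2, 4, 8, 16, 32, 64 ]
console.log(delays.reduce((sum, d) => sum + d, 0)); // 126
```

In the worst case this configuration adds a bit more than two minutes of waiting before the error is finally raised, which is worth keeping in mind when combining retries with timeouts.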