
Taavi Rehemägi for Dashbird


The Ultimate Guide to AWS Step Functions

Serverless computing has become a mainstream way to build applications, and some of you may already know a thing or two about Amazon Web Services offerings such as Lambda functions and Step Functions. However, if this is the first time you're hearing about them -- fantastic!

In this article, we'll discuss AWS Step Functions: what they are used for, how to use them, and the advantages and disadvantages they bring.

AWS Step Functions 101

Before we jump into Step Functions, you need to familiarize yourself with the basic structure behind them. Step Functions are an AWS-managed service that uses the Finite-State Machine (FSM) model.

Why is that important?

By coordinating multiple AWS services into serverless workflows, you can quickly build and update your apps. With Step Functions, you can both design and run workflows that bring together various services, including Amazon ECS and AWS Lambda, into feature-rich applications.

FSM Model Explained

The two keywords you need to remember are States and Transitions.

Now, why are these words so important?

The FSM model does a simple job -- it uses given states and transitions to complete the task at hand. A Finite-State Machine is a behavioral model: an abstract machine (system) that can be in exactly one state at a time and can switch between a finite number of states.

This machine is defined solely by its states and the relationships between them. A good and very straightforward example is the "closed-door example." The door can be either open or closed, and these are the only two possible states. A transition is the switch between two states, and it only happens when you provide some input. When you close the door, you're providing an input, and the resulting change from open to closed is the transition. Opening the closed door is another input, and it triggers the transition back.


StepFunctions state input and transition

You can also apply the same thing to other examples, like your daily life routines. Let's take "Work, home, bed" as states. From work (state), you take a bus (input) to get home (state), and when you arrive home, you go to bed (another input leading to another state). Tomorrow morning when you wake up and get out of your bed, you're transitioning from the last state you were in into a previous one, and yet again, taking the bus from home to work is one more transition.

There are other, more complex examples with many more states, inputs, and transitions between them, and the more states you add, the more complex the FSM model becomes.

The conclusion is simple -- FSMs are a method of modeling your system by defining the states and transitions between those states.
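The door example above can be sketched in a few lines of Python. This is only an illustration of the FSM idea, not of Step Functions itself; the state names, input names, and helper functions are made up for this sketch.

```python
# Transition table for the door example: (current state, input) -> next state.
TRANSITIONS = {
    ("closed", "open_door"): "open",
    ("open", "close_door"): "closed",
}


def step(state, event):
    """Return the next state, or raise if the input is not valid in this state."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"no transition from {state!r} on {event!r}") from None


def run(start, events):
    """Feed a sequence of inputs to the machine and return the final state."""
    state = start
    for event in events:
        state = step(state, event)
    return state


final_state = run("closed", ["open_door", "close_door", "open_door"])
print(final_state)  # open
```

Keeping the transitions in a table makes the "finite" part explicit: any (state, input) pair that isn't listed simply isn't allowed.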

What Are Step Functions?

Step Functions are Amazon's fully managed, serverless Finite-State Machine service. Step Functions are made of state machines (workflows) and tasks. Tasks are individual states that each represent a single unit of work.

In the following graph, you'll see an example of Amazon's State Machine in which the green rectangles represent the states. The result leads to another state, which then leads to the choice that depends on the given input (email or SMS). In this example, the green states were successfully executed, while the white-colored state wasn't.


Amazon's state machine

This entire graph representing the state machine is also known as a Workflow, and there are two types of workflows available.

Step Function Workflow

Types Of Workflows

Workflows are divided into two types: Standard and Express. Unlike Standard, the Express workflow is a relatively new option that has been available since December 2019. The table below shows the differences between these two workflow types.

[Image: comparison table of Standard and Express workflows]

State machines orchestrate the work of Lambda functions and other services: when one step ends, the state machine triggers the next one. Although the maximum duration of an Express workflow is much shorter (five minutes, versus up to one year for Standard), Express workflow supports far higher execution rates. Moreover, Express workflow pricing has more components, since users pay for the number of executions plus the duration and memory used by those executions. Standard workflow pricing requires users to pay only for each state transition that occurs.

It's important to note that Standard workflow is meant for long-running workflows that have to be durable and auditable. In contrast, the Express workflow type is meant for much higher workflow and event processing volumes.
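To see how the two pricing models behave, here is a rough cost sketch. The rates below are assumptions based on commonly quoted US list prices and may be out of date or differ by region; the function names are hypothetical, and the Express estimate ignores billing granularity (duration and memory are billed in rounded-up increments).

```python
# Assumed list prices in USD; check the current AWS pricing page for your region.
STANDARD_PER_TRANSITION = 25.0 / 1_000_000   # per state transition
EXPRESS_PER_REQUEST = 1.0 / 1_000_000        # per execution request
EXPRESS_PER_GB_SECOND = 0.00001667           # per GB-second of duration


def standard_cost(executions, transitions_per_execution):
    """Standard workflows: pay only for state transitions."""
    return executions * transitions_per_execution * STANDARD_PER_TRANSITION


def express_cost(executions, avg_duration_seconds, memory_mb):
    """Express workflows: pay per request plus duration multiplied by memory."""
    gb_seconds = executions * avg_duration_seconds * (memory_mb / 1024)
    return executions * EXPRESS_PER_REQUEST + gb_seconds * EXPRESS_PER_GB_SECOND


# One million short executions with five transitions each.
print(round(standard_cost(1_000_000, 5), 2))
print(round(express_cost(1_000_000, 1.0, 64), 2))
```

Under these assumed rates, short high-volume workloads come out far cheaper on Express, while the Standard cost depends only on how many transitions each execution makes.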

Workflow Execution

Now that you know the basics, the next step is execution. A workflow starts when something calls the Step Functions API to start an execution. You can call the API directly, use CloudWatch Events as a time-based trigger, or put API Gateway in front of it as a proxy.
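As a minimal sketch of calling the API directly, the snippet below builds the arguments for the `StartExecution` call and passes them to boto3's `start_execution`. The state machine ARN and the helper names are placeholders for illustration; the actual call needs boto3 installed and valid AWS credentials.

```python
import json
import uuid

# Hypothetical ARN, used only for illustration.
STATE_MACHINE_ARN = "arn:aws:states:us-east-1:123456789012:stateMachine:ExampleWorkflow"


def build_start_request(payload):
    """Build the keyword arguments for the StartExecution API call."""
    return {
        "stateMachineArn": STATE_MACHINE_ARN,
        "name": f"run-{uuid.uuid4()}",   # a distinct name per execution
        "input": json.dumps(payload),    # the input must be a JSON string
    }


def start_execution(payload):
    """Start an execution and return its ARN (requires boto3 and credentials)."""
    import boto3  # imported lazily so the request builder works without it

    client = boto3.client("stepfunctions")
    response = client.start_execution(**build_start_request(payload))
    return response["executionArn"]


request = build_start_request({"orderId": 42})
print(request["input"])
```

Separating the request builder from the call keeps the part you can test locally apart from the part that talks to AWS.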

State Types

It's essential to remember that States aren't the same thing as Tasks since Tasks are one of the State types. There are numerous State types, and all of them have a role to play in the overall workflow:

  • Pass: Passes its input to its output without doing any work.
  • Task: Represents a single unit of work; takes input and produces output.
  • Choice: Adds branching logic based on the input.
  • Wait: Delays the state machine's execution.
  • Succeed: An expected dead end that stops the execution successfully.
  • Fail: An expected dead end that stops the execution with a failure.
  • Parallel: Runs parallel branches of execution, meaning the workflow can start multiple states at once.
  • Map: Runs a set of steps for every item in the input (dynamic mapping).


Tasks are the main states, the ones in which all the work is done. Tasks can call Activities (remote executions), for example:

  • Running work on ECS, EC2 machines, or mobile devices.
  • Sending an SMS notification and waiting for a response.

Another useful property of Step Functions Tasks is that they let you reach outside your AWS environment.
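To make the state types above more concrete, here is a small Amazon States Language definition written as a Python dictionary, with a helper that checks that every referenced state exists. The state names and the `referenced_states` helper are invented for this sketch; only the field names (`StartAt`, `States`, `Type`, `Next`, `Choices`, `Default`) come from the language itself.

```python
import json

definition = {
    "Comment": "Illustrative state machine touching several state types",
    "StartAt": "Prepare",
    "States": {
        "Prepare": {
            "Type": "Pass",                 # injects a fixed result
            "Result": {"ready": True},
            "ResultPath": "$.prepare",
            "Next": "Pause",
        },
        "Pause": {"Type": "Wait", "Seconds": 5, "Next": "IsReady"},
        "IsReady": {
            "Type": "Choice",
            "Choices": [
                {"Variable": "$.prepare.ready", "BooleanEquals": True, "Next": "Done"}
            ],
            "Default": "Abort",
        },
        "Done": {"Type": "Succeed"},
        "Abort": {"Type": "Fail", "Error": "NotReady", "Cause": "Preparation did not finish"},
    },
}


def referenced_states(defn):
    """Collect every state name that StartAt, Next, Default, or a Choice rule points to."""
    names = {defn["StartAt"]}
    for state in defn["States"].values():
        for key in ("Next", "Default"):
            if key in state:
                names.add(state[key])
        for rule in state.get("Choices", []):
            names.add(rule["Next"])
    return names


missing = referenced_states(definition) - set(definition["States"])
print(missing)  # set()
print(json.dumps(definition, indent=2))
```

The JSON printed at the end is the form you would paste into the console or a deployment template.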

Error Handling

Error handling consists of retries and catches. The graph below shows a good example of how Step Functions work:


Step Function visual workflow

In this example, you can see a Parallel state with several branches. It illustrates how the entire execution fails if even one state in one branch encounters an unhandled error.

The Amazon States Language lets you catch those errors and define retries. All this is extremely important for business-critical operations.

The Amazon States Language allows you to add a comment, define which state the machine starts at, and define the states and tasks. Moreover, if a task raises a custom error, you can specify retries based on the error name, along with the retry interval, the maximum number of retry attempts, and the backoff rate, as you can see in the example below:


Amazon States Language: Error handling example
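As a rough text version of such a retry configuration, the sketch below builds a Task state with two retriers and prints it as JSON. The Lambda ARN and the error name `CustomError` are placeholders; `ErrorEquals`, `IntervalSeconds`, `MaxAttempts`, and `BackoffRate` are the retry fields of the Amazon States Language.

```python
import json

task_with_retry = {
    "Type": "Task",
    # Hypothetical function ARN, for illustration only.
    "Resource": "arn:aws:lambda:us-east-1:123456789012:function:ExampleFunction",
    "Retry": [
        {
            "ErrorEquals": ["CustomError"],  # retry a specific, named error
            "IntervalSeconds": 1,            # wait before the first retry
            "MaxAttempts": 2,                # give up after two retries
            "BackoffRate": 2.0,              # multiply the wait on each retry
        },
        {
            "ErrorEquals": ["States.ALL"],   # wildcard; must be the last retrier
            "IntervalSeconds": 5,
            "MaxAttempts": 3,
            "BackoffRate": 1.5,
        },
    ],
    "End": True,
}

print(json.dumps(task_with_retry, indent=2))
```

Retriers are evaluated in order, so the specific error comes first and the wildcard last.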

If you catch errors, you can route a failed state to a fallback state, and you'll see why some states weren't executed and which tasks have failed. See an example of how to catch an error in the graph below:


Amazon States Language: Catch Errors example

The first retry attempt starts after the pre-determined interval, and the interval before each subsequent attempt is multiplied by the backoff rate you've set.
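A catch configuration can be sketched the same way. The state names (`HandleCustomError`, `NotifyFailure`, `Continue`), the ARN, and the `pick_catcher` helper are hypothetical; the helper is a simplified local imitation of how catchers are matched in order, not the service's actual implementation.

```python
task_with_catch = {
    "Type": "Task",
    # Hypothetical function ARN, for illustration only.
    "Resource": "arn:aws:lambda:us-east-1:123456789012:function:ExampleFunction",
    "Catch": [
        {
            "ErrorEquals": ["CustomError"],
            "ResultPath": "$.error",      # keep the original input, add error details
            "Next": "HandleCustomError",
        },
        {
            "ErrorEquals": ["States.ALL"],  # fallback for everything else
            "ResultPath": "$.error",
            "Next": "NotifyFailure",
        },
    ],
    "Next": "Continue",
}


def pick_catcher(state, error_name):
    """Return the fallback state for an error, checking catchers in order."""
    for catcher in state.get("Catch", []):
        names = catcher["ErrorEquals"]
        if error_name in names or "States.ALL" in names:
            return catcher["Next"]
    return None  # no catcher matched: the execution would fail


print(pick_catcher(task_with_catch, "CustomError"))
print(pick_catcher(task_with_catch, "SomethingElse"))
```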
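The retry timing can be worked out with a small helper. This is a sketch of the arithmetic only; `retry_delays` is a made-up name, and its parameters mirror the `IntervalSeconds`, `BackoffRate`, and `MaxAttempts` retry fields.

```python
def retry_delays(interval_seconds, backoff_rate, max_attempts):
    """Return the wait (in seconds) before each retry attempt, in order."""
    return [interval_seconds * backoff_rate ** attempt for attempt in range(max_attempts)]


# IntervalSeconds=3, BackoffRate=2.0, MaxAttempts=3
print(retry_delays(3, 2.0, 3))  # [3.0, 6.0, 12.0]
```

With these values the state is retried after 3, 6, and 12 seconds, so the last attempt starts 21 seconds after the first failure, not counting the time each attempt takes to run.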

Error handling is critical because if all but one of the Parallel branches execute successfully, the single failure still fails the entire execution. However, even if the entire execution fails, changes already made by the states that completed remain intact; they aren't rolled back.

Error handling also lets you track everything that happened in the log, which gives you better insight into why errors occurred so you can address the core problem.

Step Function Demonstration

Choice Step Function Example

You'll have to set a threshold number in your state machine. For example, if you choose the number 10 and a customer buys more than ten items from you, the execution will complete successfully by following one choice. If a customer buys fewer than ten items, the execution will also be successful, but it will follow a different pre-set choice.
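A Choice state for this scenario might look like the sketch below. The field name `itemCount`, the target states `LargeOrder` and `SmallOrder`, and the `next_state` helper are assumptions made for illustration; the helper only imitates the one comparison used here so the branching can be tried locally.

```python
choice_state = {
    "Type": "Choice",
    "Choices": [
        {"Variable": "$.itemCount", "NumericGreaterThan": 10, "Next": "LargeOrder"}
    ],
    "Default": "SmallOrder",  # taken when no rule matches
}


def next_state(choice, data):
    """Evaluate NumericGreaterThan rules against top-level fields of the input."""
    for rule in choice["Choices"]:
        field = rule["Variable"].split(".", 1)[1]  # "$.itemCount" -> "itemCount"
        if "NumericGreaterThan" in rule and data[field] > rule["NumericGreaterThan"]:
            return rule["Next"]
    return choice["Default"]


print(next_state(choice_state, {"itemCount": 12}))
print(next_state(choice_state, {"itemCount": 3}))
```

Note that an order of exactly ten items falls through to the default branch here, because the rule uses a strict "greater than" comparison.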

Retry Step Function Example

Specify which errors to retry (here, all of them) and define the retry interval along with the maximum number of attempts. When you run the example, you'll see the state machine retrying the function until it finally fails.

When the execution fails, you can check the Execution Event History tab to see how many times the function was started, along with the retries in between each start. You can access more detailed information by following the CloudWatch logs link within the Event History tab.

Catch Step Function Example

Set a number range, and the function will execute based on the number it receives. If the number falls outside the pre-set range, the function will report a Range Error. However, if the number is greater than zero but below the pre-set minimum, the function will report a Custom Error.

Catching and handling errors are essential for Step Functions, since they let an execution recover from failures and still finish successfully.
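One possible reading of the catch example, sketched as a Python Lambda handler: the function raises named exceptions, and a Catch block in the state machine can then match on those names. The bounds, the `number` field, and the exact rules are assumptions, since the original demo code isn't shown here.

```python
# Hypothetical bounds for the accepted range.
MIN_VALUE = 10
MAX_VALUE = 100


class RangeError(Exception):
    """The number is outside the accepted range."""


class CustomError(Exception):
    """The number is positive but below the minimum."""


def lambda_handler(event, context):
    number = event["number"]
    if 0 < number < MIN_VALUE:
        raise CustomError(f"{number} is below the minimum of {MIN_VALUE}")
    if not MIN_VALUE <= number <= MAX_VALUE:
        raise RangeError(f"{number} is outside {MIN_VALUE}..{MAX_VALUE}")
    return {"number": number}


print(lambda_handler({"number": 50}, None))
```

For Python functions, the exception's class name is what a state's `ErrorEquals` list is matched against, so `"ErrorEquals": ["CustomError"]` would route the second case to its own fallback state.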

Step Functions Use Cases

When To Use Step Functions?

Step Functions Standard workflow is excellent for business-critical workflows and brings along numerous business benefits. It provides much better error handling logic than Lambda functions alone, and it makes them relatively easy to orchestrate. On the other hand, it's best reserved for business-critical workflows since it's pretty expensive compared to Express workflow. Standard workflows are billed per state transition, at roughly USD 25 per one million state transitions; it's the Express workflow that is billed per execution with an additional cost for memory and duration of use. If you'd like to learn more about saving money on your AWS Step Functions, then check our article on how to cut cost on Step Functions on Enterprise-Scale workflows.

Step Functions also suit complex workflows, since they can handle a very large number of states. They are excellent for orchestrating microservices, too: you don't need to build connections between the services yourself, and each service can be written in a different language.

Step Functions are also incredibly useful for long-running or delayed workflows. A Standard workflow can run for up to a year, and you can use Wait states along the way.

Step Function Best Practices

One of the best practices for Step Functions concerns large payloads. Instead of passing a large payload between states, put it in S3 and pass a reference to it. If you don't, your workflow might fail because the payload exceeds the service's size limit. You can do this by specifying the S3 location with an "arn", as shown in the example code below:


Step Functions: Import S3 payloads
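A minimal sketch of the idea: small payloads travel inline, while large ones are uploaded and replaced by an S3 ARN. The size threshold, the `Data` key, and the helper names are assumptions for illustration; the upload function is injected so the logic can be tried without AWS access.

```python
import json

# Assumed payload limit; check the current Step Functions quotas.
MAX_INLINE_BYTES = 256 * 1024


def to_state_input(payload, bucket, key, upload):
    """Return state input that either embeds the payload or points to it in S3."""
    body = json.dumps(payload).encode("utf-8")
    if len(body) <= MAX_INLINE_BYTES:
        return {"Data": payload}
    upload(bucket, key, body)
    return {"Data": f"arn:aws:s3:::{bucket}/{key}"}


def upload_to_s3(bucket, key, body):
    """Real uploader (requires boto3 and credentials)."""
    import boto3

    boto3.client("s3").put_object(Bucket=bucket, Key=key, Body=body)


# Local demonstration with an in-memory stand-in for S3.
fake_s3 = {}
big_payload = {"rows": ["x" * 1024] * 300}
result = to_state_input(
    big_payload, "example-bucket", "runs/1.json",
    lambda b, k, body: fake_s3.__setitem__((b, k), body),
)
print(result)
```

The task that receives this input then reads the object from S3 itself instead of expecting the data in its event.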

Use Step Function Timeouts

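Using timeouts will help you avoid stuck executions, since Step Functions tasks have no timeout by default. Step Functions rely entirely on the activity worker's response to know that a task has finished, so if the worker never responds and no timeout is set, the execution keeps waiting.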
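A Task state with explicit timeouts could be sketched like this. The activity ARN and the state names are placeholders; `TimeoutSeconds`, `HeartbeatSeconds`, and the `States.Timeout` error name are part of the Amazon States Language.

```python
import json

task_with_timeout = {
    "Type": "Task",
    # Hypothetical activity ARN, for illustration only.
    "Resource": "arn:aws:states:us-east-1:123456789012:activity:ExampleActivity",
    "TimeoutSeconds": 300,    # fail the task if it runs longer than 5 minutes
    "HeartbeatSeconds": 60,   # fail sooner if the worker stops reporting in
    "Catch": [
        {"ErrorEquals": ["States.Timeout"], "Next": "HandleTimeout"}
    ],
    "Next": "Continue",
}

# The heartbeat interval should be shorter than the overall timeout.
assert task_with_timeout["HeartbeatSeconds"] < task_with_timeout["TimeoutSeconds"]

print(json.dumps(task_with_timeout, indent=2))
```

Pairing the timeout with a catcher for `States.Timeout` turns a stuck task into a handled branch rather than a failed execution.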

How To Handle Lambda Exceptions?

Lambda can occasionally experience very short-lived service errors. This is why it's good to add a retry for Lambda service exceptions to your Task states: it handles these transient failures proactively, as shown in this example:


Step Functions: Handling Lambda Exceptions
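In text form, such a retrier might look like the sketch below. The three error names are the Lambda service exceptions that AWS guidance suggests retrying; the interval, attempt count, and backoff values are illustrative choices, and `with_lambda_retry` is a hypothetical helper.

```python
LAMBDA_SERVICE_RETRY = {
    "ErrorEquals": [
        "Lambda.ServiceException",
        "Lambda.AWSLambdaException",
        "Lambda.SdkClientException",
    ],
    "IntervalSeconds": 2,
    "MaxAttempts": 6,
    "BackoffRate": 2,
}


def with_lambda_retry(task_state):
    """Return a copy of a Task state with the Lambda retrier placed first."""
    updated = dict(task_state)
    updated["Retry"] = [LAMBDA_SERVICE_RETRY, *task_state.get("Retry", [])]
    return updated


task = {
    "Type": "Task",
    # Hypothetical function ARN, for illustration only.
    "Resource": "arn:aws:lambda:us-east-1:123456789012:function:ExampleFunction",
    "Retry": [{"ErrorEquals": ["States.ALL"], "MaxAttempts": 1}],
    "End": True,
}
print(with_lambda_retry(task)["Retry"])
```

Placing the service retrier first keeps any existing `States.ALL` retrier in last position, where the language requires it to be.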

Integrations & Devs

Possible Integrations

There are about a dozen services available for integration, and you can call them directly from Tasks:

  • Submit an AWS Batch job;
  • Use CodeBuild;
  • Get or put items in a DynamoDB table;
  • Run an Amazon ECS task;
  • Integrate with EMR;
  • Run an AWS Fargate task;
  • Integrate with Glue;
  • Invoke a Lambda function;
  • Use SageMaker for classification, inference, and machine learning model training;
  • Publish a message to an SNS topic;
  • Send messages to an SQS queue;
  • Start another Step Functions execution.


Step Functions: Integrations
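As a sketch of what calling an integrated service from a Task looks like, the definition below publishes a message to SNS and then sends it to SQS without any Lambda function in between. The topic ARN, queue URL, and state names are placeholders; the `arn:aws:states:::sns:publish` and `arn:aws:states:::sqs:sendMessage` resource identifiers are the service integration ARNs.

```python
import json

definition = {
    "StartAt": "Notify",
    "States": {
        "Notify": {
            "Type": "Task",
            "Resource": "arn:aws:states:::sns:publish",
            "Parameters": {
                # Hypothetical topic ARN.
                "TopicArn": "arn:aws:sns:us-east-1:123456789012:ExampleTopic",
                "Message.$": "$.message",  # take the text from the state input
            },
            "ResultPath": "$.notification",  # keep the original input for the next state
            "Next": "Enqueue",
        },
        "Enqueue": {
            "Type": "Task",
            "Resource": "arn:aws:states:::sqs:sendMessage",
            "Parameters": {
                # Hypothetical queue URL.
                "QueueUrl": "https://sqs.us-east-1.amazonaws.com/123456789012/ExampleQueue",
                "MessageBody.$": "$.message",
            },
            "End": True,
        },
    },
}

print(json.dumps(definition, indent=2))
```

A key ending in `.$` tells Step Functions to read the value from the state's input using a path instead of treating it as a literal string.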

Dev Tools

Developers can use the serverless-step-functions plugin with the Serverless Framework. It allows you to do everything Step Functions can do, while it helps devs take care of the IAM roles and many other things they would otherwise need to define.

It's possible to download Step Functions Local as a Java .jar file or a Docker image so you can run it on your machine.

It's also vital to keep on top of your Step Functions' performance. This is where third-party tools like Dashbird, for instance, come in! Step Functions publishes events and metrics to CloudTrail and CloudWatch, which are monitored by Dashbird. Dashbird's Insights engine detects errors related to state machine definitions or task execution failures in real time and notifies you immediately, via Slack or email, when something within your workflows breaks or is about to go wrong. The Insights engine is based on AWS Well-Architected best practices and constantly runs your whole serverless infrastructure's data against its rules, to help you make sure your app is optimized and reliable at any scale.


Dashbird Insights for AWS services

Step Function Advantages & Disadvantages

Although Express workflow is much cheaper than the Standard workflow, it has a visibility disadvantage. Express workflow doesn't have the visual aid that helps you monitor your executions; instead, it pushes the information to CloudWatch Logs. Although the logs provide detailed insights, the lack of a visual aid can make it challenging to work out what's failing and what's not, especially with many executions at hand.

Final Thoughts

Step Functions let you break your applications down into basic service components and then work with each of those components independently. That separation is what makes Step Functions helpful for building workflows that perform well and are easier to change.

Although the Express workflow of Step Functions is a newer workflow type, people are eager to see its full potential and learn as much about it as possible. If you have any experience with Step Functions you'd like to share with our readers, or if you know something we don't, feel free to share your thoughts in the comment box below.

Top comments (1)

starpebble

Hi Taavi! This is a very nice article. Sometimes we have to make difficult choices. For example, I really like the maxAttempt feature of the state machine execution. It is a simple way to increase the resiliency of a state machine just like the other recommendations in this helpful article. Example: a step that invokes a lambda function may fail because S3 throttled getObject(). A simple automated step retry works.