Warren Parad

Posted on Dec 20, 2021 • Originally published at dev.to on Dec 20, 2021

AWS Step Functions: Advanced

#microservices #awslambda #aws #stepfunctions

The optimized step function

This is the advanced guide to using AWS Step Functions. Step functions enable complex state machines. They track state and execute lambdas exactly when you want them to be used. That is when everything goes well, but what happens when it doesn’t?

Step functions are one of the most interesting, least understood, and least effectively utilized services in AWS. Their complexity allows for a large reduction in the effort and cost of creating services with complicated flows. However, because of that complexity, a deep understanding is required to use them effectively.

Here, I’ll be discussing our usage of Step Functions in Authress and how we use them to ensure a high level of security and a high level of quality in our services.

Choosing the right pattern

Step functions enable async execution, follow-up actions, and execution at a particular later time. For standard async execution in AWS there are exactly two straightforward options:

SQS — Simple Queue Service (specifically regular not FIFO queues)
SF — Step functions

(What’s not included here is EventBridge, SNS, and recalling the same lambda with an async invoke. These are less trivial and don’t offer either the simplicity or the control you might want.)

Both offer async processing, both expect handling to be idempotent, both offer retries, and error handling. But SQS lacks control. The trade off is that SQS is far cheaper than SF when you don’t need something complicated.

Step functions can also enable wrapping other services that don’t support concurrency. In SQS you get one handler, which means every action in the handler must be idempotent (except potentially the last step). But with SF you prevent any repeat execution by moving every action to a different step.

The problem is that you want to avoid more than one step as much as possible, since they are costly. SF charges for every state change:

Every state costs a large amount

If you’re not careful, SF will be your most expensive AWS service.

Use cases

Administrative execution: SF should only be used for admin actions and never be used for transactional execution. With a SF that has just one step, if your service has 1M calls per day, you will be spending $750/month. If your step function has 10 steps and you graduate to 10M calls per day, that is $900k/year. Sure, this may seem like a lot of calls, and the service may be worth it, but you probably want to think really hard before going down this route.

If you have a transactional use case, SQS or EventBridge are better options. Just combine the steps and make sure each one of them is idempotent. Then your 10M calls per day drops from roughly $900k per year to about $1,500. That’s a cost reduction of more than 99.8%.
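The arithmetic behind these figures can be sketched in a few lines. This is a rough estimate, not a pricing tool: the two unit prices are assumptions chosen to match the numbers above (check the current AWS pricing pages), the function names are made up for illustration, one billed state transition per step and one SQS request per call are assumed, and free tiers are ignored.

```javascript
// Assumed unit prices, NOT taken from the article: verify against current AWS pricing.
const SF_PRICE_PER_TRANSITION = 0.025 / 1000;   // Standard workflow, per state transition
const SQS_PRICE_PER_REQUEST = 0.40 / 1000000;   // Standard queue, per request

// Estimated Step Functions cost, assuming one billed transition per step.
function stepFunctionCost(callsPerDay, stepsPerCall, days) {
  return callsPerDay * stepsPerCall * days * SF_PRICE_PER_TRANSITION;
}

// Estimated SQS cost, assuming a single request per call (real usage needs more).
function sqsCost(callsPerDay, days) {
  return callsPerDay * days * SQS_PRICE_PER_REQUEST;
}

console.log('1 step, 1M calls/day, per month:', stepFunctionCost(1e6, 1, 30));
console.log('10 steps, 10M calls/day, per year:', stepFunctionCost(1e7, 10, 365));
console.log('SQS, 10M calls/day, per year:', sqsCost(1e7, 365));
```

Plugging in your own call volume and step count before committing to a design is usually enough to tell whether a flow belongs in SF or in a queue.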

Execution Delays: SF usually has a shorter delay before execution than SQS. SQS messages are great if you want them delayed for up to 15 minutes, but that’s the limit. They also have a built-in delay before triggering lambdas, which can be up to 500ms. Immediate execution is not something that is reliable with SQS. SF usually gets down to 100ms. That is, the round trip time from Lambda to SF and back to Lambda is about 100ms. In these situations SF is the right choice.

Wait times: If you expect a user to perform an action out of band, and that action allows continuity of a state machine, SF is the right choice. Assuming this is a configuration or administrative action in your app (and not transactional for the above reasons), and you expect long wait times, then it makes sense to use a step function. Below we’ll jump through the domain registration SF for Authress, so you can see exactly why wait times are valuable.

Configurable Retries: SQS allows for exactly one kind of retry: fixed time window delays. That is, retry in exactly 3 minutes, continuously, until the retry limit is reached or the problem is fixed. SF, on the other hand, has built-in configurable exponential back-off and/or a configurable retry delay driven directly from your code.

Breaking down an optimized step function

Here we’ll discuss a complex step function involving many steps, with code examples and explanations.

Here are some optimization rules:

Never use the choice step - It is always cheaper, easier, and better to execute linear steps. You can use the data exported from the previous step to drive a switch in the next step. The choice step takes up 2 extra state changes. While it seems like one extra, in practice you’ll have to structure the state change as one, and then create one for the choice state. If 99% of the time your SF takes a single path, then you are paying for 2 state changes 99% of the time without getting any value. If your step function takes one of two paths every time, then make 2 step functions. State machines are free to create, choice steps are not.
Also don't use DDB or other SDK steps - This goes hand in hand with the previous point. There is a raw cost here, but also storing DB objects in the step function just isn't going to work reliably in practice. For one, the maximum size of the data passed through a step function is 256KB, whereas a DDB item can be 400KB, and a single query response can return up to 1MB. Long term, you'll run into a critical failure. The other aspect here is consistency: you are duplicating database data, which means if there is a step failure or the data in the DB changes simultaneously you'll have a data sync issue. Your next step is going to be a lambda, so just load it there. It's easier, faster, consistent, and uses the data tools you are already using. You don't need to set up an extra step. Leave the step function to manage the workflow, not the dataflow.
Always have a Retry States.ALL error handler with an explicit number of retries - If your execution is time bound, keep the retries low, but if the execution is critical, make sure the total retry period is at least 2 days. In most situations we recommend a week. This means if there is an error it will still attempt to retry a week from now. Why a week? What happens if your team is on vacation? You want to be able to resolve the error with code, without having to do some manual labor to start the step function again. You never want the step function to crash.
Step functions should only finish if they worked. A step function that crashes and errors out, stopping its execution, creates a real error message to log. If you catch the error using the Catch configuration, you’ll never track the problem. The completion of a successful step function and the crash of a failed one allow ease of investigation and ease of retries. Set the logging configuration of the step function to be Level: Fatal.
Step functions can be used to ensure idempotency by reusing the name of the step function as a unique key.
Store the inputs and outputs from every state in a separate object in the step function. Never throw away data in the step function, everything can be valuable, so keep track in separate pieces and then when you need data it is there. In some cases the data in the step function represents an important state difference from the actual data in the DB. This is an opportunity for usage.
Every action in a step must be idempotent (except for the last action). This should be an obvious observation. Your step will be retried, so you can only guarantee the last action won’t get double executed. But in every situation, you should plan for some possibility of duplicate execution of this last action in every step.
Always reuse the same lambda for every step. Creating separate lambdas for different steps is a great way to make it impossible to deliver consistent behavior and expectations on steps. When you deploy a new version of your step function, only new executions will trigger the new configuration. Avoiding the need to change steps in a function offers huge value.
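The idempotency rule above can be sketched as follows. This is a minimal illustration, not the Authress implementation: the helper names, the `domain-` prefix, and the choice of account ID plus domain as the business key are all assumptions. The idea is that a deterministic execution name makes a duplicate start request collide with the existing execution instead of launching a second one.

```javascript
const crypto = require('crypto');

// Hypothetical helper: derive a stable execution name from the business key.
// Execution names are limited to 80 characters, so a hex digest prefix keeps
// the name short and restricted to safe characters.
function executionNameFor(accountId, domain) {
  const digest = crypto.createHash('sha256')
    .update(`${accountId}|${domain.toLowerCase()}`)
    .digest('hex');
  return `domain-${digest.slice(0, 40)}`;
}

// Hypothetical helper: build the request for the StartExecution API call.
// Sending it requires the AWS SDK and credentials, which are left out here.
function buildStartExecutionParams(stateMachineArn, accountId, domain) {
  return {
    stateMachineArn,
    name: executionNameFor(accountId, domain),
    input: JSON.stringify({ accountId, domain })
  };
}

console.log(buildStartExecutionParams(
  'arn:aws:states:eu-west-1:000000000000:stateMachine:StepFunctionName', // placeholder ARN
  'acc_001',
  'login.example.com'
));
```

How a repeated name is reported (an error versus returning the running execution) depends on the workflow type and on whether the input matches, so treat the collision as something the caller must handle explicitly.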

The Code

There are three parts to a step function:

Configuration — The step function configuration
Handler — A code wrapper for the execution of the lambda which optimizes reusing a step function, as well as error handling
Lambda execution code — The actual code relevant to the step

Let’s take a look at the custom domain configuration step function in Authress. Remember this is an administrative action. It happens about once per account, since accounts don’t change their domain unless there is an error or a rare one-off scenario.

The Step Function Configuration

The optimized step function

You’ll notice that there are no loops, there are no branch steps, and there are no choice functions. It is, however, incredibly optimized. Do not assume that because there are only three linear steps that it is simple or needs to be improved. If your step function doesn’t look like this, you’ll want to start thinking about how it can be improved.

{
  "StartAt": "Start",
  "States": {
    "Start": {
      "Type": "Task",
      "Resource": { "Ref": "LambdaFunctionProductionAlias" },
      "Parameters": {
        "context.$": "$$",
        "parameters.$": "$"
      },
      "ResultPath": "$.startResult",
      "Next": "Verify",
      "Retry": [
        {
          "ErrorEquals": [
            "States.ALL"
          ],
          "IntervalSeconds": 10,
          "MaxAttempts": 16,
          "BackoffRate": 2
        }
      ]
    },
    "Verify": {
      "Type": "Task",
      "Resource": { "Ref": "LambdaFunctionProductionAlias" },
      "Parameters": {
        "context.$": "$$",
        "parameters.$": "$"
      },
      "ResultPath": "$.verifyResult",
      "Next": "Cleanup",
      "Retry": [
        {
          "ErrorEquals": [
            "States.ALL"
          ],
          "IntervalSeconds": 10,
          "MaxAttempts": 19,
          "BackoffRate": 2
        }
      ]
    },
    "Cleanup": {
      "Type": "Task",
      "Resource": { "Ref": "LambdaFunctionProductionAlias" },
      "Parameters": {
        "context.$": "$$",
        "parameters.$": "$"
      },
      "ResultPath": "$.cleanResult",
      "End": true,
      "Retry": [
        {
          "ErrorEquals": [
            "States.ALL"
          ],
          "IntervalSeconds": 10,
          "MaxAttempts": 19,
          "BackoffRate": 2
        }
      ]
    }
  }
}


Where is the Catch configuration? There is none. If the step fails more than 19 times, then it hard errors. Remember this should never happen, and when it does we want to log a Critical error for us to track down later. Otherwise this should retry.

How long is 19 retries? The wait before retry n is IntervalSeconds × BackoffRate^(n-1), so with a 10 second interval and a back-off rate of 2 the last retry waits 10 × 2^18 seconds, which is about 30 days, and the whole sequence spans roughly 60 days. (With 16 retries the last wait is just under 4 days and the total is about a week.) It’s possible that our infra to handle custom domains is down for multiple days. There isn’t anything we can do, so logging an error is wrong. Instead we’ll just retry, but make sure the user is aware of the problem as it is exposed. Why are we retrying for so long? Each failure here is one step. If we had a retry loop, that would be step => choice => step, which would add an extra choice state. Instead we just throw an error and let the Retry configuration run the step again.

These retries exist because Authress expects the user to configure their DNS records. That step can happen at any point, but we don’t know when. For this we have chosen to use exponential back-off.
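To check how long a given Retry block keeps an execution alive, the schedule can be computed directly. The sketch below assumes the wait before retry n is IntervalSeconds × BackoffRate^(n-1) and that no maximum delay cap is configured; the function name is made up for illustration.

```javascript
// Compute the wait before each retry and the total time spent waiting.
// Assumption: wait before retry n = intervalSeconds * backoffRate^(n - 1).
function retrySchedule({ intervalSeconds, maxAttempts, backoffRate }) {
  const waits = [];
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    waits.push(intervalSeconds * Math.pow(backoffRate, attempt - 1));
  }
  const totalSeconds = waits.reduce((sum, wait) => sum + wait, 0);
  return { waits, lastWaitSeconds: waits[waits.length - 1], totalSeconds };
}

const SECONDS_PER_DAY = 24 * 60 * 60;
for (const maxAttempts of [16, 19]) {
  const schedule = retrySchedule({ intervalSeconds: 10, maxAttempts, backoffRate: 2 });
  console.log(
    `${maxAttempts} retries: last wait ${(schedule.lastWaitSeconds / SECONDS_PER_DAY).toFixed(1)} days,`,
    `total ${(schedule.totalSeconds / SECONDS_PER_DAY).toFixed(1)} days`
  );
}
```

Because every extra attempt doubles the final wait, a small change in MaxAttempts moves the total window from days to months, so it is worth running the numbers for each state.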

You’ll notice two important pieces here:

"Parameters": { 
  "context.$": "$$",
  "parameters.$": "$"
}

And

"ResultPath": "$.startResult"

Parameters context.$ just creates a parameter called context that we pass into our lambda. The .$ suffix is the SF convention that marks the value as a path rather than a literal, so ignore that. The important part is that $$ means pass the entire SF context object (the execution, state machine, and state details) to the context parameter. If we didn’t do this, the lambda wouldn’t know which state machine and state invoked it, and it wouldn’t see the original execution input. Additionally, you might be thinking, isn’t it better to only pass $.input or maybe $$.State? But you’ll find out that using InputPath is a waste when you really want the whole SF context passed in.

The ResultPath says to store the response of the step in the startResult property so that it can be referenced in later steps as parameters.startResult.subProperty. If you never need it, then it doesn’t matter, but during investigation it will show up in the console.

The parameters.$ gets the current tracking state of all the previous states’ result objects.
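How the result objects accumulate can be modeled in a few lines. This is a simplified model for illustration only: it handles just ResultPath values of the form `$.name`, and the function name, input fields, and result values are hypothetical.

```javascript
// Simplified model of ResultPath: merge a step's result into the state data
// under the named property. Only paths of the form "$.name" are supported here.
function applyResultPath(stateInput, resultPath, result) {
  const match = /^\$\.([A-Za-z_][A-Za-z0-9_]*)$/.exec(resultPath);
  if (!match) {
    throw new Error(`Unsupported ResultPath in this sketch: ${resultPath}`);
  }
  return { ...stateInput, [match[1]]: result };
}

// Hypothetical execution input and step results.
const executionInput = { accountId: 'acc_001', domain: 'login.example.com' };
const afterStart = applyResultPath(executionInput, '$.startResult', { recordsCreated: true });
const afterVerify = applyResultPath(afterStart, '$.verifyResult', { exitCode: 'VERIFIED' });

// This is the shape the Cleanup step would receive as its parameters.
console.log(JSON.stringify(afterVerify, null, 2));
```

Each state sees everything the earlier states returned, which is what lets a later step branch in code on an earlier result.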

Step function step output

The Lambda Handler

The simple handler below automatically figures out which method to call based on the trigger passed from SF to lambda. Here we utilize the StateMachine name and the name of the State to drive the switch. Additionally, an error whose code is ForceRetryExecution is an expected retry signal, so we don’t log it unnecessarily and contaminate the logs. The function just retries according to the configuration, and that’s the end of the story.

async onEvent(trigger) {
    // Step Functions invocations carry the SF context object ($$) under trigger.context.
    if (trigger.context && trigger.context.StateMachine) {
      // Route on "<state machine name>|<state name>" so one lambda serves every step.
      const processorId =
            `${trigger.context.StateMachine.Name}|${trigger.context.State.Name}`;
      const payload = trigger.context.Execution.Input;
      const parameters = trigger.parameters;
      // Details of the current state, including its RetryCount.
      const stepFunctionContext = trigger.context.State;
      const processors = {
        'StepFunctionName|Start':
          () => handler.start(payload, parameters, stepFunctionContext),
        'StepFunctionName|Verify':
          () => handler.verify(payload, parameters, stepFunctionContext),
        'StepFunctionName|Cleanup':
          () => handler.cleanup(payload, parameters, stepFunctionContext)
      };

      if (!processors[processorId]) {
        logger.log({
          title: 'Step Function processor does not exist for type',
          level: 'ERROR', trigger, payload, stepFunctionContext
        });
        return {};
      }

      try {
        const result = await processors[processorId]();
        return result;
      } catch (error) {
        // ForceRetryExecution is an expected retry signal, anything else is a real problem.
        if (error.code !== 'ForceRetryExecution'
            && error.message !== 'ForceRetryExecution') {
          logger.log({
            title: 'Error handling StepFunction trigger',
            level: 'CRITICAL',
            error, trigger, payload, stepFunctionContext
          });
        }

        // Surface long-running retry loops even when each failure is expected.
        if (stepFunctionContext && stepFunctionContext.RetryCount > 10) {
          logger.log({
            title: 'Too many retries, log as ERROR.',
            level: 'ERROR',
            error, trigger, payload, stepFunctionContext
          });
        }
        // Rethrow so the Retry configuration of the state takes over.
        throw error;
      }
    }
    // Handle other types ...
}


The code steps

You can imagine what the actual functionality might be in each of the steps:

Start: Creates the custom domain and validation data. Authress configures our DB and prepares the custom domain
Verify: The step function waits for the user to have completed the necessary actions. It will throw an error to retry if the user hasn’t completed the necessary actions yet.
Cleanup: It doesn’t matter what the previous step has done. Verify will always eventually result in a success, but with different exit codes. Depending on the state of the Step Function, the configuration in the DB, and the setup of the custom domain, Cleanup executes different code. It might exit with no action, or delete all unnecessary stored data.

It doesn’t matter how the external state changes. Every action is idempotent and validates the current state to make sure it is what the step function expects. The immutable data passed between steps is stored in the execution input or in the result objects so that it can be used without unnecessary DB calls.
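A minimal sketch of how Verify and Cleanup could be shaped follows. It is not the Authress code: the `forceRetry` helper, the injected `checkDns` function, and the status, exit code, and action values are all hypothetical. It only shows the pattern of throwing to trigger a retry and switching on the previous result in code instead of using a choice state.

```javascript
// Hypothetical helper: build an error the handler treats as an expected retry signal.
function forceRetry(message) {
  const error = new Error(message);
  error.code = 'ForceRetryExecution';
  return error;
}

// Verify: throw while the user has not acted yet so the Retry block reschedules
// the step, and return an exit code once there is a final answer.
// checkDns is an injected, hypothetical function returning 'PENDING', 'VALID' or 'INVALID'.
async function verify(payload, parameters, stepFunctionContext, checkDns) {
  const status = await checkDns(payload.domain);
  if (status === 'PENDING') {
    throw forceRetry(`DNS not configured yet for ${payload.domain}`);
  }
  return { exitCode: status === 'VALID' ? 'VERIFIED' : 'ABANDONED' };
}

// Cleanup: a plain switch on the previous step's result replaces a choice state.
async function cleanup(payload, parameters) {
  const exitCode = parameters.verifyResult && parameters.verifyResult.exitCode;
  switch (exitCode) {
    case 'VERIFIED':
      return { action: 'NONE' };
    case 'ABANDONED':
      return { action: 'DELETE_STORED_DATA' };
    default:
      // Unexpected state: fail loudly so it is logged and retried.
      throw new Error(`Unknown verify exit code: ${exitCode}`);
  }
}
```

Both functions read only their arguments and return plain objects, which keeps them easy to retry and easy to test without a running state machine.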

The Conclusion

There is a lot of opportunity here for complexity. By selectively picking which features to use and how to configure each of them, you avoid a lot of unnecessary confusion later. Logging problems, reusing lambda functions, and driving configuration directly from the step function state names create the best developer experience you could want in your team.

Don’t make things complex. This is true for step functions as well. They should be linear, they should work from beginning to end, and they should bubble up exceptions with log messages when the RetryCount rises above your personal tolerance. Sometimes this is 10, other times it is 20. Have retries set at 33.
