Yan Cui

Originally published at theburningmonk.com

How to do blue-green deployment for Step Functions

A client asked me the other day: “What happens to the running executions when I update a state machine?”

Sadly, the answer is likely that existing executions would break if you have changed the input/output of the Lambda functions they call. The solution is to use specific versions or aliases of the functions instead.

But first, let’s see what happens when you do update a state machine.

The Problem

Day 1, my state machine looks like this.

StartAt: Wait
States:
  Wait:
    Type: Wait
    Seconds: 300
    Next: Hello
  Hello:
    Type: Task
    Resource: arn:aws:lambda:us-east-1:123456789012:function:hello
    Next: Decide
  Decide:
    Type: Choice
    Choices:
      - Variable: $
        StringEquals: Approved
        Next: Success
    Default: Failed
  Success:
    Type: Succeed
  Failed:
    Type: Fail

The hello function just returns “Approved” every time.

module.exports.hello = async () => { 
  return "Approved"
}

Day 2, we need to change “Approved” to “ThumbsUp”, and we no longer need to wait for 5 mins before giving a rating.

So we update the hello function to return “ThumbsUp”.

module.exports.hello = async () => { 
  return "ThumbsUp"
}

And we update the state machine definition accordingly:

StartAt: Hello
States:  
  Hello:
    Type: Task
    Resource: arn:aws:lambda:us-east-1:123456789012:function:hello
    Next: Decide
  Decide:
    Type: Choice
    Choices:
      - Variable: $
        StringEquals: ThumbsUp
        Next: Success
    Default: Failed
  Success:
    Type: Succeed
  Failed:
    Type: Fail

Now it’s time to deploy the update.

Oh, wait! There is already a running execution. What’s going to happen to this execution if we deploy the update?

The good news is that updating the state machine definition does not affect running executions. Existing executions continue to run against the definition they were started with.

The bad news is that everything else an execution depends on could have changed. This includes the Lambda functions it needs to call and its IAM role. In our case, the existing execution would not receive an “Approved” message when it eventually calls the hello function. It would, therefore, transition to the Failed state as a result of the changes to the hello function.
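
To make the failure concrete, here is a minimal sketch in plain Node.js. It is not the Step Functions runtime: day1Definition, day2Definition, latestHello, decide and runFromHello are hypothetical names that only mimic the Hello and Decide states above, and the handler is written synchronously to keep the sketch short.

```javascript
// Hypothetical simulation, not the Step Functions runtime.
// Each definition remembers the string its Choice state compares against.
const day1Definition = { expected: "Approved" };
const day2Definition = { expected: "ThumbsUp" };

// The unqualified ARN always resolves to the latest code, so after the
// Day 2 deploy every execution gets this handler.
const latestHello = () => "ThumbsUp";

// Mimics the Decide state: StringEquals on the whole input ($).
const decide = (definition, output) =>
  output === definition.expected ? "Success" : "Failed";

// Mimics Hello -> Decide for a single execution.
const runFromHello = (definition, hello) => decide(definition, hello());

console.log(runFromHello(day1Definition, latestHello)); // execution started on Day 1
console.log(runFromHello(day2Definition, latestHello)); // execution started on Day 2
```

The execution that started on Day 1 still compares against “Approved”, so it ends in Failed, while a fresh execution ends in Success.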

Instead, can we tie existing executions to the version of the hello function that they were created with?

The Solution

You can invoke a specific version or alias of a function by appending the version number or alias to its ARN. For example, version 2 of the hello function is at arn:aws:lambda:us-east-1:123456789012:function:hello:2.

Since versions are immutable, we can ensure that existing executions would always run against the correct versions of our code.
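
As a small illustration, appending the qualifier could look like the sketch below. qualifiedArn is a hypothetical helper, not part of any AWS SDK, and the check for seven colon-separated parts is only an assumption about what an unqualified function ARN looks like.

```javascript
// Hypothetical helper: append a version number or alias to a function ARN.
const qualifiedArn = (functionArn, qualifier) => {
  // An unqualified function ARN has 7 colon-separated parts:
  // arn:aws:lambda:<region>:<account>:function:<name>
  if (functionArn.split(":").length !== 7) {
    throw new Error(`expected an unqualified function ARN, got ${functionArn}`);
  }
  return `${functionArn}:${qualifier}`;
};

const helloArn = "arn:aws:lambda:us-east-1:123456789012:function:hello";

console.log(qualifiedArn(helloArn, 2));    // version 2
console.log(qualifiedArn(helloArn, "v1")); // alias v1
```

Either result can be used as the Resource of a Task state in place of the unqualified ARN.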

We also need to make sure the state machine’s IAM role has the necessary permissions. Since all current and future executions would share the same role (which is a whole other problem…), it’s best to grant lambda:InvokeFunction permission for all versions. For example:

Effect: Allow
Action: lambda:InvokeFunction
Resource:
  - arn:aws:lambda:us-east-1:123456789012:function:hello
  - arn:aws:lambda:us-east-1:123456789012:function:hello:*
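
To see why the statement lists both entries, here is a rough sketch of wildcard matching against those two resources. matchesResource and isAllowed are hypothetical helpers and a deliberately simplified stand-in: real IAM policy evaluation involves far more than pattern matching.

```javascript
// Simplified stand-in for IAM resource matching: * matches any run of
// characters and ? matches exactly one. This is an illustration only.
const matchesResource = (pattern, arn) => {
  // Escape regex metacharacters other than * and ?.
  const escaped = pattern.replace(/[.+^${}()|[\]\\]/g, "\\$&");
  const source = escaped.replace(/\*/g, ".*").replace(/\?/g, ".");
  return new RegExp(`^${source}$`).test(arn);
};

const allowedResources = [
  "arn:aws:lambda:us-east-1:123456789012:function:hello",
  "arn:aws:lambda:us-east-1:123456789012:function:hello:*",
];

const isAllowed = (arn) =>
  allowedResources.some((pattern) => matchesResource(pattern, arn));

const base = "arn:aws:lambda:us-east-1:123456789012:function:hello";
console.log(isAllowed(base));                           // unqualified ARN
console.log(isAllowed(`${base}:v1`));                   // alias
console.log(matchesResource(allowedResources[0], `${base}:v1`)); // first entry alone
```

In this sketch the first entry only matches the unqualified ARN, so it is the second, wildcard entry that covers every version and alias.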

Ok, let’s see this in action. I published a demo project, and you can check out the source code on GitHub.

The Demo

With the Serverless framework, it’s not easy to find the function versions because the AWS::Lambda::Version resources have randomized logical IDs.

So instead, I opted to use aliases with the help of the serverless-aws-alias plugin.

The first version of our code is published with the alias v1, using sls deploy --alias v1 (which the serverless-aws-alias plugin gives you).

I’m able to capture the alias CLI option using ${opt:alias}, which lets me construct the ARN for the alias.

As discussed earlier, I also had to define a custom IAM role to ensure the state machine has permissions to invoke all aliases of the hello function.

Once deployed, the state machine definition would reference the correct resource ARN.

Next, I start an execution of the state machine. The initial Wait state gives me 5 mins to update the state machine and hello function!

Now quickly switch to the v2 branch and run sls deploy --alias v2 to deploy the update. The hello function would now return ThumbsUp instead, and the state machine has also been updated.

Once deployed, notice that the Resource ARN for the Hello state is now pointing at the v2 alias. And the Decide state is now looking for the string value ThumbsUp instead of Approved.

If we start another execution, it will complete successfully right away.

But if we go back to the Step Functions console, we can see the first execution is still running because of the original Wait state.

After some time, the execution comes out of the Wait state and transitions to the Hello state. Its Decide state is still looking for the string value Approved, but since the Hello state invokes the v1 alias of the hello function, everything still works as expected.

Both executions completed successfully despite us changing the return value on the hello function.
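
The outcome of the demo can be mimicked with another small Node.js sketch. It is only a simulation under stated assumptions: aliases, definitions and run are hypothetical names, each alias is modelled as a function that keeps returning what it was deployed with, and the handlers are synchronous for brevity.

```javascript
// Hypothetical simulation: each alias keeps pointing at the code it was
// deployed with, and each definition pins the alias it invokes.
const aliases = {
  v1: () => "Approved",
  v2: () => "ThumbsUp",
};

const definitions = {
  v1: { alias: "v1", expected: "Approved" },
  v2: { alias: "v2", expected: "ThumbsUp" },
};

// Mimics Hello -> Decide using the definition the execution started with.
const run = (definition) => {
  const output = aliases[definition.alias]();
  return output === definition.expected ? "Success" : "Failed";
};

console.log(run(definitions.v1)); // first execution, started before the update
console.log(run(definitions.v2)); // second execution, started after the update
```

Because each definition calls the alias it was deployed alongside, both runs end in Success.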

Future Work

Unlike versions, aliases are not immutable. And with the current setup, you still have to remember to increment the alias every time you introduce a breaking change.

This is something we can address in the tooling. I have opened a feature request for the serverless-step-functions plugin. Please let me know if you think it’s a good idea to incorporate this behaviour into the plugin. I will hopefully find time to implement the feature in the coming days/weeks unless there’s strong opposition to it.

Hi, my name is Yan Cui. I’m an AWS Serverless Hero and the author of Production-Ready Serverless. I have run production workloads at scale in AWS for nearly 10 years and I have been an architect or principal engineer in a variety of industries, ranging from banking, e-commerce and sports streaming to mobile gaming. I currently work as an independent consultant focused on AWS and serverless.

You can contact me via Email, Twitter and LinkedIn.

Hire me.

The post How to do blue-green deployment for Step Functions appeared first on theburningmonk.com.
