A client asked me the other day: “What happens to the running executions when I update a state machine?”
Sadly, the answer is likely that existing executions would break if you have changed the input/output of the Lambda functions they call. The solution is to use specific versions or aliases of the functions instead.
But first, let’s see what happens when you do update a state machine.
The Problem
Day 1, my state machine looks like this.
StartAt: Wait
States:
  Wait:
    Type: Wait
    Seconds: 300
    Next: Hello
  Hello:
    Type: Task
    Resource: arn:aws:lambda:us-east-1:123456789012:function:hello
    Next: Decide
  Decide:
    Type: Choice
    Choices:
      - Variable: $
        StringEquals: Approved
        Next: Success
    Default: Failed
  Success:
    Type: Succeed
  Failed:
    Type: Fail
The hello function just returns “Approved” every time.
module.exports.hello = async () => {
  return "Approved"
}
Day 2, we need to change “Approved” to “ThumbsUp”, and we no longer need to wait for 5 mins before giving a rating.
So we update the hello function to return “ThumbsUp”.
module.exports.hello = async () => {
  return "ThumbsUp"
}
And we update the state machine definition accordingly:
StartAt: Hello
States:
  Hello:
    Type: Task
    Resource: arn:aws:lambda:us-east-1:123456789012:function:hello
    Next: Decide
  Decide:
    Type: Choice
    Choices:
      - Variable: $
        StringEquals: ThumbsUp
        Next: Success
    Default: Failed
  Success:
    Type: Succeed
  Failed:
    Type: Fail
Now it’s time to deploy the update.
Oh, wait! There is already a running execution. What’s going to happen to this execution if we deploy the update?
The good news is that updating the state machine definition does not affect running executions. Existing executions continue to use the definition they were started with.
The bad news is that everything else the execution depends on could have changed. This includes the Lambda functions it needs to call and its IAM role. In our case, the existing execution would not receive an “Approved” message when it eventually calls the hello function. It would, therefore, transition to the Failed state as a result of the changes to the hello function.
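To make the failure concrete, here is a minimal sketch that models the scenario in plain JavaScript. It is only a simplified model, not how Step Functions works internally, and the names day1Definition, helloAfterDeploy, decide and runExecution are made up for illustration.

```javascript
// Simplified model of the scenario above; not how Step Functions is implemented.
// A running execution keeps the definition it started with (day 1)...
const day1Definition = { expected: "Approved" };

// ...but the unqualified ARN resolves to whatever code is deployed now (day 2).
const helloAfterDeploy = async () => "ThumbsUp";

// Mimics the Decide state: compare the task output against the expected string.
const decide = (definition, output) =>
  output === definition.expected ? "Success" : "Failed";

// Runs the Hello task, then the Decide state, for one execution.
const runExecution = async (definition, hello) =>
  decide(definition, await hello());

runExecution(day1Definition, helloAfterDeploy).then(console.log); // Failed
```

The old definition still expects “Approved”, the new code returns “ThumbsUp”, and so the execution falls through to the default branch.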
Instead, can we tie existing executions to the version of the hello function that they were created with?
The Solution
You can invoke a specific version or alias of a function by appending the version number or alias to its ARN. For example, version 2 of the hello function is at arn:aws:lambda:us-east-1:123456789012:function:hello:2.
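As a small illustration of the ARN format, the sketch below appends a qualifier to an unqualified function ARN. The qualifiedArn helper is hypothetical (it is not part of any AWS SDK), and the segment-count check is just a guard I am assuming is useful.

```javascript
// Hypothetical helper: appends a version number or alias name to a function ARN.
const qualifiedArn = (functionArn, qualifier) => {
  // An unqualified function ARN has exactly 7 colon-separated parts:
  // arn:aws:lambda:<region>:<account>:function:<name>
  if (functionArn.split(":").length !== 7) {
    throw new Error(`expected an unqualified function ARN, got: ${functionArn}`);
  }
  return `${functionArn}:${qualifier}`;
};

const helloArn = "arn:aws:lambda:us-east-1:123456789012:function:hello";
console.log(qualifiedArn(helloArn, 2));
// arn:aws:lambda:us-east-1:123456789012:function:hello:2
console.log(qualifiedArn(helloArn, "v1"));
// arn:aws:lambda:us-east-1:123456789012:function:hello:v1
```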
Since versions are immutable, we can ensure that existing executions would always run against the correct versions of our code.
We also need to make sure the state machine’s IAM role has the necessary permissions. Since all current and future executions would share the same role (which is a whole other problem…), it’s best to grant lambda:InvokeFunction permission for all versions. For example:
Effect: Allow
Action: lambda:InvokeFunction
Resource:
- arn:aws:lambda:us-east-1:123456789012:function:hello
- arn:aws:lambda:us-east-1:123456789012:function:hello:*
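To see why both entries are listed, here is a rough sketch of how the two patterns line up against qualified and unqualified ARNs. It is a deliberately simplified stand-in for IAM's resource matching (only the * wildcard is handled), and matchesResource and isAllowed are made-up names.

```javascript
// Simplified stand-in for IAM resource matching: only "*" is treated as a wildcard.
const matchesResource = (pattern, arn) => {
  const escaped = pattern
    .split("*")
    .map((part) => part.replace(/[.*+?^${}()|[\]\\]/g, "\\$&")) // escape regex chars
    .join(".*");
  return new RegExp(`^${escaped}$`).test(arn);
};

// The two Resource entries from the policy statement above.
const allowed = [
  "arn:aws:lambda:us-east-1:123456789012:function:hello",
  "arn:aws:lambda:us-east-1:123456789012:function:hello:*",
];
const isAllowed = (arn) =>
  allowed.some((pattern) => matchesResource(pattern, arn));

const base = "arn:aws:lambda:us-east-1:123456789012:function:hello";
console.log(isAllowed(base));            // true (first entry)
console.log(isAllowed(`${base}:2`));     // true (second entry, a version)
console.log(isAllowed(`${base}:v1`));    // true (second entry, an alias)
console.log(isAllowed(`${base}-other`)); // false (a different function)
```

The first entry covers the unqualified ARN, and the second covers every version and alias without also matching other functions whose names merely start with hello.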
Ok, let’s see this in action. I published a demo project, and you can check out the source code on GitHub here.
The Demo
With the Serverless framework, it’s not easy to find the function versions because the AWS::Lambda::Version resources have randomized logical IDs.
So instead, I opted to use aliases with the help of the serverless-aws-alias plugin.
The first version of our code is published with the alias v1, using sls deploy --alias v1 (which the serverless-aws-alias plugin gives you).
I’m able to capture the CLI option alias using ${opt:alias}, which lets me construct the ARN for the alias by appending it to the function’s ARN.
As discussed earlier, I also had to define a custom IAM role to ensure the state machine has permissions to invoke all aliases of the hello function.
Once deployed, the state machine definition would reference the correct resource ARN.
I would start an execution of the state machine. The initial Wait state gives me 5 mins to update the state machine and the hello function!
Now quickly switch to the v2 branch and run sls deploy --alias v2 to deploy the update. The hello function would now return ThumbsUp instead, and the state machine has also been updated.
Once deployed, notice that the Resource ARN for the Hello state is now pointing at the v2 alias. And the Decide state is now looking for the string value ThumbsUp instead of Approved.
If we start another execution, it will complete successfully right away.
But if we go back to the Step Functions console, we can see the first execution is still running because of the original Wait state.
After some time, the execution comes out of the Wait state and transitions to the Hello state, and its Decide state is still looking for the string value of Approved. But since the Hello state invokes the v1 alias of the hello function, everything still works as expected.
Both executions completed successfully despite us changing the return value on the hello function.
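The outcome of the demo can be modelled with a short sketch, again as a simplification rather than a picture of what Step Functions does internally. The aliases map, the two execution objects and the run function are all hypothetical stand-ins.

```javascript
// Hypothetical stand-ins for the two deployed aliases of the hello function.
const aliases = {
  v1: async () => "Approved",
  v2: async () => "ThumbsUp",
};

// Each execution keeps the definition it started with,
// including which alias its Hello state calls.
const firstExecution = { alias: "v1", expected: "Approved" };
const secondExecution = { alias: "v2", expected: "ThumbsUp" };

// Runs the Hello task against the pinned alias, then applies the Decide check.
const run = async ({ alias, expected }) =>
  (await aliases[alias]()) === expected ? "Success" : "Failed";

Promise.all([run(firstExecution), run(secondExecution)]).then(console.log);
// [ 'Success', 'Success' ]
```

Because each definition and the code it calls are deployed as a matching pair, neither execution sees the other’s changes.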
Future Work
Unlike versions, aliases are not immutable. And with the current setup, you still have to remember to increment the alias every time you introduce a breaking change.
This is something we can address in the tooling. I have opened a feature request for the serverless-step-functions plugin. Please let me know if you think it’s a good idea to incorporate this behaviour into the plugin. I will hopefully find time to implement the feature in the coming days/weeks unless there’s strong opposition to it.
Hi, my name is Yan Cui. I’m an AWS Serverless Hero and the author of Production-Ready Serverless. I have run production workload at scale in AWS for nearly 10 years and I have been an architect or principal engineer with a variety of industries ranging from banking, e-commerce, sports streaming to mobile gaming. I currently work as an independent consultant focused on AWS and serverless.
You can contact me via Email, Twitter and LinkedIn.
The post How to do blue-green deployment for Step Functions appeared first on theburningmonk.com.