A common pattern for teams moving from cron-based jobs to event-driven architectures is to start using AWS Lambda as the compute engine for asynchronous event handlers. This can work out very well for simple event handlers, but I've observed a common antipattern that can appear when workflows grow more complex. Sooner or later workflows with multiple steps enter the picture and we'll need to be careful in our approach lest the complexity get the best of us.
Asynchronous Event Handler
Using Lambda as an event handler can be a pretty good plan. A large number of AWS services can target Lambda, making integrations easy. For third parties, you can use a Lambda Function URL (FURL) to create a low-friction webhook. Lambda's pay-per-use pricing model is well suited to irregularly scheduled events. It's also relatively easy to compose and deploy a Lambda function.
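As a rough illustration of how small such a handler can be, here is a minimal sketch of a webhook handler behind a Function URL. The payload shape (a JSON object with a `type` field) is an assumption made up for this example, not something a particular third party sends.

```python
import base64
import json


def handler(event, context):
    """Minimal webhook handler sketch; the 'type' field is a hypothetical payload key."""
    raw = event.get("body") or "{}"
    # Function URL events may carry a base64-encoded body.
    if event.get("isBase64Encoded"):
        raw = base64.b64decode(raw).decode("utf-8")
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return {"statusCode": 400, "body": json.dumps({"error": "invalid JSON"})}
    if not isinstance(payload, dict):
        return {"statusCode": 400, "body": json.dumps({"error": "expected an object"})}

    # Business logic would go here; this sketch only echoes the event type.
    event_type = payload.get("type", "unknown")
    return {"statusCode": 200, "body": json.dumps({"received": event_type})}
```

Everything the function needs arrives in `event`, which is what keeps this kind of handler easy to write and deploy.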
Fan-Out with SQS
This simple architecture can cover a lot of use cases, but not all of them. What happens when complexity grows? The main driver I've seen for adopting more complex architectures is a job that won't complete within Lambda's 15-minute maximum timeout, but there can be other reasons as well, such as asynchronously communicating with some other system in the middle of the job.
Let's say our job is running long and we decide to adopt a fan-out pattern. We can achieve this by putting messages in a queue using SQS and subscribing another Lambda function to those messages. This will give us the opportunity for batching and throttling to make sure our workflow can run in parallel, but not overwhelm other resources.
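Here is a minimal sketch of the producer side of that fan-out. It only builds the batches; the helper name `to_batches` and the message shape (`jobId`, `item`) are assumptions for illustration. With boto3, each batch could then be passed as `Entries` to `send_message_batch`, which accepts at most 10 entries per call.

```python
import json
from typing import Any, Dict, Iterable, Iterator, List

SQS_MAX_BATCH = 10  # SendMessageBatch accepts at most 10 entries per call


def to_batches(job_id: str, items: Iterable[Any]) -> Iterator[List[Dict[str, str]]]:
    """Group work items into SQS batch entries (hypothetical message shape)."""
    batch: List[Dict[str, str]] = []
    for index, item in enumerate(items):
        batch.append({
            # Ids only need to be unique within a single batch request.
            "Id": f"msg-{index}",
            "MessageBody": json.dumps({"jobId": job_id, "item": item}),
        })
        if len(batch) == SQS_MAX_BATCH:
            yield batch
            batch = []
    if batch:
        yield batch  # final, possibly partial batch


# With boto3, sending might look like this (not executed here):
#   for entries in to_batches("job-1", items):
#       sqs.send_message_batch(QueueUrl=queue_url, Entries=entries)
```

On the consumer side, the batch size and concurrency settings of the Lambda event source are what give us the throttling described above.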
Fan-Out with Job State Tracking
Okay, now our workload is broken down some, but we have a new problem. How do we know when the job has finished? How do we know the success/failure rate of each step in the job? What if we only want one instance of our job to run at a time? We can't prevent a new job from kicking off if we don't know whether or not the old one has finished.
SQS is a great service, but it's not a workflow engine! It won't maintain job state for us, so if that's important, it will be up to us to implement it. We could store the current job state in a DynamoDB table, writing the number of expected downstream jobs and tallying them as we go, maybe including another step at the end to give us a summary report. Okay, now our architecture looks like this:
To achieve this architecture, each of our Lambda Functions is now responsible for job tracking. We've more or less had to write our own workflow engine. We'll own that code and need to maintain it. This is an antipattern that violates the promise of managed services, misuses SQS, and forces us to add boilerplate, workflow-aware code to every Lambda Function.
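To make that cost concrete, here is a rough sketch of the bookkeeping we would end up owning. It uses an in-memory dictionary as a stand-in for the DynamoDB table, and the names `JobTracker` and `JobRecord` are invented for this example. A real version would also need atomic counter updates and a conditional write on job start, which this sketch does not attempt.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class JobRecord:
    expected: int       # number of downstream tasks we fanned out
    succeeded: int = 0
    failed: int = 0

    @property
    def finished(self) -> bool:
        return self.succeeded + self.failed >= self.expected


class JobTracker:
    """In-memory stand-in for a job state table keyed by job name."""

    def __init__(self) -> None:
        self._jobs: Dict[str, JobRecord] = {}

    def start(self, name: str, expected: int) -> None:
        current = self._jobs.get(name)
        # Refuse to start while a previous run is still in flight.
        if current is not None and not current.finished:
            raise RuntimeError(f"job {name!r} is still running")
        self._jobs[name] = JobRecord(expected)

    def record(self, name: str, ok: bool) -> bool:
        """Tally one task result; returns True when the job is complete."""
        job = self._jobs[name]
        if ok:
            job.succeeded += 1
        else:
            job.failed += 1
        return job.finished

    def summary(self, name: str) -> Dict[str, object]:
        job = self._jobs[name]
        return {"expected": job.expected, "succeeded": job.succeeded,
                "failed": job.failed, "finished": job.finished}
```

Every worker has to call something like `record`, and the starter has to call `start`, which is exactly the workflow-aware boilerplate described above.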
Enter Step Functions
AWS Step Functions is a great tool to solve this problem. Step Functions allows us to focus on our business logic while workflow and job state are handled by a managed service. We can query the service or check out the console visualization to find out whether a job is in progress or finished.
Building the workflow with Step Functions solves all of the problems outlined above. In the provided example, we still have the Job Start function, which might perform an initial database query to drive the rest of the job, or query the Step Functions service to see whether another execution is underway (for use cases where we want to limit concurrency). If none of that applies to our workflow, we can skip this step! Likewise, we could skip the summary step, since Step Functions can provide the output of the work done by each of our workers. A summary step would only be needed if we wanted to format a human-readable report.
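For a sense of what the fan-out portion might look like, here is a minimal sketch that builds an Amazon States Language definition as a Python dictionary. The state names, the `$.items` input path, and the worker ARN are assumptions for illustration; the Map state runs one worker invocation per item, with `MaxConcurrency` doing the throttling.

```python
import json


def build_definition(worker_arn: str, max_concurrency: int = 5) -> dict:
    """Sketch of a fan-out state machine; names and paths are hypothetical."""
    return {
        "Comment": "Fan-out sketch: one worker invocation per item",
        "StartAt": "ProcessItems",
        "States": {
            "ProcessItems": {
                "Type": "Map",
                "ItemsPath": "$.items",             # assumes input like {"items": [...]}
                "MaxConcurrency": max_concurrency,  # cap on parallel workers
                "ItemProcessor": {
                    "ProcessorConfig": {"Mode": "INLINE"},
                    "StartAt": "Worker",
                    "States": {
                        "Worker": {
                            "Type": "Task",
                            "Resource": "arn:aws:states:::lambda:invoke",
                            "Parameters": {
                                "FunctionName": worker_arn,
                                "Payload.$": "$",   # pass the current item to the worker
                            },
                            "OutputPath": "$.Payload",
                            "End": True,
                        }
                    },
                },
                "End": True,
            }
        },
    }


if __name__ == "__main__":
    arn = "arn:aws:lambda:us-east-1:123456789012:function:worker"  # placeholder ARN
    print(json.dumps(build_definition(arn), indent=2))
```

The execution output of the Map state is the list of worker results, which is why a separate summary step is often optional. For the concurrency check, one option is the ListExecutions API filtered to RUNNING executions.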
Learning Step Functions
Step Functions can be complex and challenging to learn. Workflows are described using the Amazon States Language (ASL), a JSON-based format. This complexity can drive developers to avoid the service, leading to subpar architectures. I acknowledge there's a lot to learn here, but it's well worth it! And it doesn't need to be an uphill battle.
The Workflow Studio is a great way to get productive fast. You may also be interested in an article I wrote about learning to write Step Functions using AWS CDK. If that doesn't do it for you, there are a number of other free resources on the web that can put you on the right path.
You Also Get...
I left off a lot of the great features of Step Functions by focusing on the comparison to an SQS-driven workflow. Step Functions also support:
- synchronous express workflows
- native integrations (Lambda-less)
- asynchronous invocation and callback with Task Tokens
- time-based Wait steps
- Intrinsic Functions (again Lambda-less)
- automatic Retry
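A few of these features can be sketched together in one small set of states. This is only an illustration under assumptions: the state names, the queue URL parameter, and the `$.jobId` input field are invented for the example. It shows an intrinsic function, a Lambda-less SQS integration that pauses for a task token callback, a Retry policy, and a Wait step. The `retry_delays` helper just works out the backoff schedule implied by the Retry settings, ignoring options such as a maximum delay or jitter.

```python
from typing import Dict, List

RETRY = {
    "ErrorEquals": ["States.TaskFailed"],
    "IntervalSeconds": 2,
    "BackoffRate": 2.0,
    "MaxAttempts": 3,
}


def retry_delays(interval_seconds: float, backoff_rate: float, max_attempts: int) -> List[float]:
    """Seconds waited before each retry attempt under exponential backoff."""
    return [interval_seconds * backoff_rate ** n for n in range(max_attempts)]


def build_states(queue_url: str) -> Dict[str, dict]:
    """Hypothetical states illustrating several Step Functions features."""
    return {
        "LabelJob": {
            "Type": "Pass",
            # Intrinsic function: string formatting without a Lambda function.
            "Parameters": {"label.$": "States.Format('job-{}', $.jobId)"},
            "Next": "NotifyAndWait",
        },
        "NotifyAndWait": {
            "Type": "Task",
            # Native SQS integration that pauses until the task token is returned.
            "Resource": "arn:aws:states:::sqs:sendMessage.waitForTaskToken",
            "Parameters": {
                "QueueUrl": queue_url,
                "MessageBody": {"taskToken.$": "$$.Task.Token", "input.$": "$"},
            },
            "Retry": [RETRY],
            "Next": "CoolDown",
        },
        # Time-based Wait step before the workflow ends.
        "CoolDown": {"Type": "Wait", "Seconds": 60, "End": True},
    }
```

With the settings above, `retry_delays(2, 2.0, 3)` gives waits of 2, 4, and 8 seconds before the task is considered failed.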
Know and Plan Your Architecture
The best path to success in the cloud is to understand our objectives and know which services will help us achieve them. Sometimes all we need is a single Lambda handler. If "fire and forget" is good enough, then an SQS-based fan-out is probably fine. Still, the need for more complex workflows is real, and more often than not we'll need the ability to audit them. Step Functions is a great tool for achieving that aim.
More on Step Functions
As mentioned above, there are a lot of great articles out there on Step Functions to help in your journey. Here are a few of my favorites, and please feel free to add to the list in the comments below!