Robert Slootjes

Posted on Aug 29, 2022

Dynamic Rate Cron using Step Functions

#aws #serverless #architecture

This is a verbatim copy of an article I published on Medium.

When I was reading the great article Another Way to Trigger a Lambda Function Every 5–10 Seconds, it looked familiar to me. For a project a few years ago, I took a similar approach and used Step Functions to let functions run as often as every second. But instead of running the Lambda function at a fixed rate, the rate needed to be adjustable through an admin panel and take effect immediately. In this article I will explain how I achieved that back in 2018 and how I improved it while writing this article.

CronJob — Author: Seobility — License: CC BY-SA 4.0

The Why

First, let me explain why my project needed functionality like this, as the rest might not make sense otherwise. The requirement came from a custom serverless webshop we built for our client ID&T, organizer of world-leading dance festivals like Defqon.1, Mysteryland and Qlimax. The demand for tickets is very high: when ticket sales open, tens of thousands of eager fans are waiting in a queue to buy tickets. During what we call “peak sale mode”, we want certain processes like statistics and cleanups to run every few seconds. Outside of peak sale these processes are less important and can run once every few hours. The client can put the webshop into peak sale mode themselves, and doing so needs to change the rate of these processes immediately. We could have chosen to always run these processes as often as during peak sale mode, but that would obviously cost far more money for something we only need a few times per year. The great thing about Step Functions Standard Workflows is that you don't pay for the time an execution spends waiting. I'm planning to write more articles about the serverless webshop in the future, as we did some other things in this project that might be useful to others.

Concept

It boils down to passing the rate in seconds and the function as the input of a Step Functions execution. The state machine invokes the Lambda using a Task state and then waits for the specified number of seconds using a Wait state. When the rate changes, start a new execution, stop the current execution (if any) using the Step Functions API, and continue as usual. This way you end up with an infinite loop of executions that only stops when you tell it to.
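As a rough illustration of this loop, here is a minimal state machine definition in Amazon States Language, built as a Python dict so it can be serialized with json.dumps. It is a sketch under assumptions: it uses the optimized arn:aws:states:::lambda:invoke integration, and the state names RunTask and WaitForRate are made up. It deliberately has no exit yet, so on its own it would eventually run into the execution limits discussed below.

```python
import json


def build_naive_cron_definition() -> dict:
    """Endless loop: invoke the task Lambda, wait `rate` seconds, repeat."""
    return {
        "Comment": "Naive dynamic-rate cron (no execution-limit handling yet)",
        "StartAt": "RunTask",
        "States": {
            "RunTask": {
                "Type": "Task",
                # Optimized Lambda integration; the function name or ARN comes from the input.
                "Resource": "arn:aws:states:::lambda:invoke",
                "Parameters": {"FunctionName.$": "$.task", "Payload.$": "$"},
                # Discard the Lambda result so $.task and $.rate reach the next state unchanged.
                "ResultPath": None,
                "Next": "WaitForRate",
            },
            "WaitForRate": {
                "Type": "Wait",
                # Number of seconds to wait is read from the execution input.
                "SecondsPath": "$.rate",
                "Next": "RunTask",
            },
        },
    }


print(json.dumps(build_naive_cron_definition(), indent=2))
```

Because the Wait state reads SecondsPath from the input, changing the rate only requires a new execution with a different input, not a new state machine.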

In both solutions the input is roughly the same: you define a rate and the name or ARN of a Lambda:

{"task": "lambda-name-or-arn", "rate": 10}
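Because this one input drives every execution, it can help to validate it before starting one. The helper below is a hypothetical sketch; the one-second lower bound mirrors the “every second” goal from the introduction and is an assumption, not a Step Functions requirement.

```python
import json


def parse_cron_input(raw: str) -> dict:
    """Parse and validate a {"task": ..., "rate": ...} execution input (hypothetical helper)."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("input must be a JSON object")
    task = data.get("task")
    rate = data.get("rate")
    if not isinstance(task, str) or not task:
        raise ValueError("'task' must be a non-empty Lambda name or ARN")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not isinstance(rate, int) or isinstance(rate, bool) or rate < 1:
        raise ValueError("'rate' must be a whole number of seconds >= 1")
    return {"task": task, "rate": rate}


print(parse_cron_input('{"task": "lambda-name-or-arn", "rate": 10}'))
```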

We do need to keep in mind that a Standard Workflow execution has a maximum running time (1 year) and a maximum execution history (25,000 events), as documented. To avoid the risk that an execution times out or exceeds the history limit, we need to keep track of how many times the function was invoked. Before we risk reaching that limit, the workflow starts a new execution of itself with its current input and lets the current one end. With that, we have a simple framework to run background tasks at any rate without managing infrastructure.
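To make the two limits concrete, the sketch below computes how many task-plus-wait iterations fit in one execution. The number of history events per iteration is an assumption you would measure from a real execution's history (a Task state and a Wait state each add several events); the reserved events and the one-day time margin are arbitrary safety buffers, not AWS values.

```python
MAX_HISTORY_EVENTS = 25_000                # Standard Workflow execution history quota
MAX_DURATION_SECONDS = 365 * 24 * 60 * 60  # one year


def safe_iterations(
    rate_seconds: int,
    events_per_iteration: int,
    reserved_events: int = 100,               # assumed buffer for start/restart/end events
    time_margin_seconds: int = 24 * 60 * 60,  # assumed one-day buffer below the 1-year cap
) -> int:
    """How many iterations one execution can safely run before continuing as new."""
    by_events = (MAX_HISTORY_EVENTS - reserved_events) // events_per_iteration
    by_time = (MAX_DURATION_SECONDS - time_margin_seconds) // rate_seconds
    # Always run at least one iteration so the loop makes progress.
    return max(1, min(by_events, by_time))


print(safe_iterations(rate_seconds=10, events_per_iteration=10))      # 2490
print(safe_iterations(rate_seconds=86_400, events_per_iteration=10))  # 364
```

At a 10-second rate the event quota is the binding limit; at a daily rate the one-year duration is. The time bound ignores how long the task itself runs, so a slow task would need a larger margin.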

To summarize, we end up with these steps:

Pass the Lambda name or ARN as the task, and the rate in seconds at which we want to call it
Stop any existing execution that is running for the same task, to make sure there is just one instance running at a time
Execute the actual task
Wait for the specified amount of time
As long as we are comfortably within the time and history limits, repeat the previous two steps
Once we are no longer comfortable, start a new execution with the same parameters
Let the current execution end
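The “stop any existing execution” and “start a new execution” steps map onto real Step Functions API calls. Below is a hedged sketch of the rate-change path using the boto3 Step Functions client methods list_executions, describe_execution, stop_execution and start_execution. Passing the client in (rather than creating it) keeps the sketch testable; matching executions on the task field of their input, and the restart_task name itself, are assumptions.

```python
import json


def restart_task(sfn, state_machine_arn: str, task: str, rate: int) -> str:
    """Stop running executions for `task`, then start a fresh one at the new `rate`.

    `sfn` is assumed to be boto3.client("stepfunctions") or an object with the same methods.
    Returns the ARN of the newly started execution.
    """
    kwargs = {"stateMachineArn": state_machine_arn, "statusFilter": "RUNNING"}
    while True:
        page = sfn.list_executions(**kwargs)
        for execution in page["executions"]:
            arn = execution["executionArn"]
            # list_executions does not include the input, so fetch it per execution.
            current = json.loads(sfn.describe_execution(executionArn=arn)["input"])
            if current.get("task") == task:
                sfn.stop_execution(executionArn=arn, cause="Rate changed")
        token = page.get("nextToken")
        if not token:
            break
        kwargs["nextToken"] = token

    started = sfn.start_execution(
        stateMachineArn=state_machine_arn,
        input=json.dumps({"task": task, "rate": rate}),
    )
    return started["executionArn"]


# Example (hypothetical names):
# import boto3
# restart_task(boto3.client("stepfunctions"), STATE_MACHINE_ARN, "my-cleanup-function", 5)
```

Stopped executions end with status ABORTED, which is worth remembering when setting up failure alerts.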

Solution in 2018

When I took on this task in 2018, I found the AWS tutorial Continuing as a New Execution, and its concept became the basis of my solution. Basically, you need a counter that you increment by 1 every time the task has been executed. I introduced a Wait state that waits for the number of seconds specified in the input. It then continues to a Choice state that checks whether the counter is higher than a specific (hardcoded) number of allowed iterations. If not, the function runs again, the counter is incremented, the workflow waits, and so forth. Once the allowed number of iterations is reached, the workflow breaks out of the loop and runs a task that triggers a Lambda to create a new execution with the same parameters, letting the current one end.
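A sketch of what this counter-based loop could look like, again as an ASL definition built in Python. This is not the original 2018 code: the limit of 1000, the state names and the restart-cron helper Lambda are hypothetical. Note how task_handler has to return the incremented counter, which is the coupling the next section objects to.

```python
MAX_ITERATIONS = 1000  # hypothetical hardcoded limit


def build_2018_definition(max_iterations: int = MAX_ITERATIONS) -> dict:
    """Counter-based loop in the spirit of 'Continuing as a New Execution'."""
    return {
        "StartAt": "RunTask",
        "States": {
            "RunTask": {
                "Type": "Task",
                "Resource": "arn:aws:states:::lambda:invoke",
                "Parameters": {"FunctionName.$": "$.task", "Payload.$": "$"},
                # The task's return value (with the incremented counter) becomes the state.
                "OutputPath": "$.Payload",
                "Next": "WaitForRate",
            },
            "WaitForRate": {"Type": "Wait", "SecondsPath": "$.rate", "Next": "CheckIterations"},
            "CheckIterations": {
                "Type": "Choice",
                "Choices": [
                    {
                        "Variable": "$.counter",
                        "NumericGreaterThanEquals": max_iterations,
                        "Next": "StartNewExecution",
                    }
                ],
                "Default": "RunTask",
            },
            "StartNewExecution": {
                # Hypothetical helper Lambda that starts a new execution with task/rate only,
                # so the counter starts again from zero.
                "Type": "Task",
                "Resource": "arn:aws:states:::lambda:invoke",
                "Parameters": {"FunctionName": "restart-cron", "Payload.$": "$"},
                "End": True,
            },
        },
    }


def task_handler(event, context=None):
    """The scheduled job itself, which in this design must also bump the counter."""
    # ... the actual background work would happen here ...
    return {**event, "counter": event.get("counter", 0) + 1}
```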

Solution in 2021

While writing this article I realized there is a major downside to the above approach. The Lambda I wanted to execute periodically had to increment the counter, which tightly couples it to the cron mechanism; not preferable at all. I also didn't want to introduce a Lambda just for incrementing the counter, as the AWS tutorial does, because it adds cost and latency. Inspired by the article I linked earlier, I decided to use the Map state as well (a feature not available back in 2018). When a new execution is started, I simply create an array with one empty element per iteration I want to run and set the MaxConcurrency of the Map state to 1, so Step Functions runs the inner state machine (call the Lambda, then wait) one iteration at a time. This removes the need to keep a counter myself and lets the Lambda run without any knowledge of the cron. Perfect! Added bonuses are that there is no latency from incrementing the counter through Lambda, and I can now take the wait time into account when calculating how many iterations an execution will run.
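Here is a hedged sketch of the Map-based variant. It uses the current ASL field names ItemProcessor and ItemSelector (older definitions use Iterator and Parameters), and it restarts itself with the native arn:aws:states:::states:startExecution integration and the $$.StateMachine.Id context field instead of a helper Lambda; that restart choice is my assumption and may differ from the example project. build_input and the state names are hypothetical.

```python
import json


def build_input(task: str, rate: int, iterations: int) -> dict:
    """Execution input: one empty placeholder item per loop iteration."""
    return {"task": task, "rate": rate, "iterations": [{} for _ in range(iterations)]}


def build_map_definition() -> dict:
    return {
        "StartAt": "Loop",
        "States": {
            "Loop": {
                "Type": "Map",
                "ItemsPath": "$.iterations",
                "MaxConcurrency": 1,  # run iterations strictly one after another
                # Give each iteration task/rate from the execution input instead of the empty item.
                "ItemSelector": {"task.$": "$.task", "rate.$": "$.rate"},
                "ItemProcessor": {
                    "StartAt": "RunTask",
                    "States": {
                        "RunTask": {
                            "Type": "Task",
                            "Resource": "arn:aws:states:::lambda:invoke",
                            "Parameters": {"FunctionName.$": "$.task", "Payload.$": "$"},
                            "ResultPath": None,
                            "Next": "WaitForRate",
                        },
                        "WaitForRate": {"Type": "Wait", "SecondsPath": "$.rate", "End": True},
                    },
                },
                "ResultPath": None,  # keep the original input for the restart step
                "Next": "ContinueAsNew",
            },
            "ContinueAsNew": {
                # Native "start execution" integration instead of a helper Lambda (assumption).
                "Type": "Task",
                "Resource": "arn:aws:states:::states:startExecution",
                "Parameters": {
                    "StateMachineArn.$": "$$.StateMachine.Id",
                    "Input": {
                        "task.$": "$.task",
                        "rate.$": "$.rate",
                        "iterations.$": "$.iterations",
                    },
                },
                "End": True,
            },
        },
    }


print(json.dumps(build_input("my-cleanup-function", 10, 3)))
# {"task": "my-cleanup-function", "rate": 10, "iterations": [{}, {}, {}]}
```

The task Lambda receives only task and rate and returns nothing the workflow depends on, which is what decouples it from the cron.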

For your convenience I’ve made an example project on GitHub using Serverless Framework so you can try it for yourself.

Retries & Error Handling

Obviously things can go wrong. The worst thing that can happen is that, for some reason, the Step Functions execution stops or starting a new execution fails, after which the Lambda is no longer triggered. Luckily, Step Functions provides decent error handling and lets us retry states, keeping failures to a minimum. As a second line of defense, you could act on CloudWatch Events (now Amazon EventBridge) and trigger a notification of some kind when an execution fails unexpectedly.
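Both lines of defense can be sketched as plain data. The Retry and Catch fields below are standard ASL, and the listed Lambda error names are the transient errors AWS suggests retrying; the attempt counts and intervals are assumptions. The event pattern matches the “Step Functions Execution Status Change” events that Step Functions sends to EventBridge; the state machine ARN is hypothetical, and ABORTED is left out because stopping an execution on a rate change is intentional.

```python
import json

# Retry policy for transient Lambda service errors (interval/attempt values are assumptions).
LAMBDA_RETRY = [
    {
        "ErrorEquals": [
            "Lambda.ServiceException",
            "Lambda.AWSLambdaException",
            "Lambda.SdkClientException",
            "Lambda.TooManyRequestsException",
        ],
        "IntervalSeconds": 2,
        "MaxAttempts": 6,
        "BackoffRate": 2,
    }
]


def resilient_task(next_state: str) -> dict:
    """Task state that retries transient errors and never ends the loop on a task failure."""
    return {
        "Type": "Task",
        "Resource": "arn:aws:states:::lambda:invoke",
        "Parameters": {"FunctionName.$": "$.task", "Payload.$": "$"},
        "ResultPath": None,
        "Retry": LAMBDA_RETRY,
        # After retries are exhausted, discard the error and keep the cron going.
        "Catch": [{"ErrorEquals": ["States.ALL"], "ResultPath": None, "Next": next_state}],
        "Next": next_state,
    }


# EventBridge rule pattern for executions that end unexpectedly.
FAILURE_ALERT_PATTERN = {
    "source": ["aws.states"],
    "detail-type": ["Step Functions Execution Status Change"],
    "detail": {
        "status": ["FAILED", "TIMED_OUT"],
        # Hypothetical state machine ARN; replace with your own.
        "stateMachineArn": ["arn:aws:states:eu-west-1:123456789012:stateMachine:dynamic-rate-cron"],
    },
}

print(json.dumps(resilient_task("WaitForRate"), indent=2))
```

Catching everything on the task means one failing run of the job does not stop future runs; whether that is desirable depends on the job.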

Administration

It would be really nice to have an administration tool where I can keep track of which tasks should be running at what rate, with the option to change the rate and to start or stop tasks manually. I think storing this data in DynamoDB would be a great option, as it's easy to use and can be configured with on-demand capacity mode, so you don't have to worry about provisioning capacity or paying for capacity you don't use.
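One possible shape for such a tool, sketched under assumptions: each DynamoDB item describes one task, and saving an item in the admin panel is translated into a start, stop, restart or no-op. The CronTask fields and plan_change are hypothetical names for illustration; nothing here talks to DynamoDB itself.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CronTask:
    """Hypothetical shape of one item in a DynamoDB table keyed on `task`."""
    task: str                            # Lambda name or ARN (partition key)
    rate: int                            # seconds between runs
    enabled: bool                        # manual start/stop switch from the admin panel
    execution_arn: Optional[str] = None  # currently running execution, if any


def plan_change(current: Optional[CronTask], desired: CronTask) -> str:
    """Decide what the admin panel should do when a task record is saved."""
    running = current is not None and current.execution_arn is not None
    if not desired.enabled:
        return "stop" if running else "noop"
    if not running:
        return "start"
    if current.rate != desired.rate:
        return "restart"  # stop the old execution, start one with the new rate
    return "noop"


current = CronTask("stats-function", rate=3600, enabled=True, execution_arn="arn:example:1")
print(plan_change(current, CronTask("stats-function", rate=5, enabled=True)))  # restart
```

Keeping the decision separate from the API calls makes it easy to show in the admin panel what will happen before it happens.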

Big thanks to Zac Charles for giving me a fresh perspective on how to improve my initial solution. Shoutout to Michael Guntenaar for the amazing project to work on.
