Avoid Costly Loops in AWS Step Functions

#aws #cloud #technology

Most of us have played around with AWS Step Functions at some point in our careers. It is a fantastic orchestration tool! But there are some scenarios where your cloud bill can explode in your face! In this blog, I will walk you through how you can end up in this situation and, maybe more importantly, how to avoid it!

Looping

Consider a simple polling pattern. You might think, okay, we wait for 5 seconds and use a Lambda function to query something. If the response is true, we are done; if not, we wait again and check the status once more. The problem with this pattern is that the response might never be true, causing an endless loop, which will generate cost.
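To make the loop concrete, here is a minimal sketch of such a state machine written as a Python dictionary in Amazon States Language. The state names, the Lambda ARN, and the `$.done` field are assumptions made for illustration; they are not taken from a real project.

```python
import json

# Hypothetical ARN of the Lambda function that checks the status.
CHECK_STATUS_LAMBDA_ARN = "arn:aws:lambda:eu-west-1:123456789012:function:check-status"

# Wait -> check -> choice, and back to the wait when the answer is not true.
polling_loop = {
    "StartAt": "Wait5Seconds",
    "States": {
        "Wait5Seconds": {"Type": "Wait", "Seconds": 5, "Next": "CheckStatus"},
        "CheckStatus": {
            "Type": "Task",
            "Resource": CHECK_STATUS_LAMBDA_ARN,
            "Next": "IsDone",
        },
        "IsDone": {
            "Type": "Choice",
            # Assumes the Lambda function returns {"done": true or false}.
            "Choices": [
                {"Variable": "$.done", "BooleanEquals": True, "Next": "Done"}
            ],
            "Default": "Wait5Seconds",
        },
        "Done": {"Type": "Succeed"},
    },
}


def transitions_per_iteration(definition, loop_start):
    """Follow Next/Default from loop_start until we are back at loop_start."""
    states = definition["States"]
    current, count = loop_start, 0
    while True:
        state = states[current]
        current = state.get("Next") or state["Default"]
        count += 1
        if current == loop_start:
            return count


if __name__ == "__main__":
    print(json.dumps(polling_loop, indent=2))
    print(transitions_per_iteration(polling_loop, "Wait5Seconds"))  # 3
```

Nothing in this definition bounds the number of iterations: as long as `$.done` stays false, the `Default` branch sends the execution back to the wait state.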

There are no endless loops in Step Functions

If you find yourself in such an “endless” loop, AWS will stop the execution once its history exceeds 25,000 events. So even if you implement an endless loop, AWS will eventually stop it. But each loop iteration has 3 state transitions, and you must pay for these transitions. They are pretty cheap individually, but they add up quickly, especially if this pattern is part of a distributed map that allows you to run 10,000 parallel executions.

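Assume your Step Functions state machine is invoked based on an S3 event trigger. We upload 100 files to our S3 bucket. This will trigger 100 executions. If each of them fans out to 10,000 parallel loops with 3 transitions per iteration, that is 100 × 10,000 × 3 = 3,000,000 state transitions every 5 seconds until they exceed the 25,000 events.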
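The arithmetic behind that number can be sketched in a few lines. The price per 1,000 state transitions used here is an assumption (a commonly listed rate for Standard workflows); check the pricing page for your region before relying on it.

```python
from decimal import Decimal

EXECUTIONS = 100              # one execution per uploaded file
PARALLEL_ITERATIONS = 10_000  # parallel loops per execution
TRANSITIONS_PER_LOOP = 3      # wait, check, choice

# Assumption: USD per 1,000 state transitions; verify for your region.
PRICE_PER_1000_TRANSITIONS_USD = Decimal("0.025")


def transitions_per_tick(executions, parallel, per_loop):
    """State transitions made during one 5-second loop iteration."""
    return executions * parallel * per_loop


def cost_per_tick(transitions, price_per_1000):
    """Cost in USD of one loop iteration across all executions."""
    return Decimal(transitions) * price_per_1000 / 1000


if __name__ == "__main__":
    total = transitions_per_tick(
        EXECUTIONS, PARALLEL_ITERATIONS, TRANSITIONS_PER_LOOP
    )
    print(total)  # 3000000
    print(cost_per_tick(total, PRICE_PER_1000_TRANSITIONS_USD))
```

Under the assumed price, every 5-second round of the loop costs about 75 USD, and the loop keeps going until each execution hits the event limit.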

You can already see that this will ramp up your AWS bill.

The callback pattern

You can solve this by implementing a callback pattern. It’s quite simple: the state machine places a message in an SQS queue and then waits. Any consumer can pick up the message and do the task at hand. When it’s done, the consumer notifies the Step Functions execution and passes the result back. While the execution waits, it makes no further state transitions, so you do not pay for polling.
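As a rough sketch, the state that places the message and waits for the callback could look like this. The queue URL, the heartbeat value, and the `input` and `taskToken` keys in the message body are assumptions chosen for this example.

```python
import json


def build_callback_state(queue_url, heartbeat_seconds=60, next_state=None):
    """Build a Task state that sends an SQS message and waits for a callback."""
    state = {
        "Type": "Task",
        # The .waitForTaskToken suffix pauses the execution until a consumer
        # reports success or failure for the task token.
        "Resource": "arn:aws:states:::sqs:sendMessage.waitForTaskToken",
        "Parameters": {
            "QueueUrl": queue_url,
            "MessageBody": {
                "input.$": "$",                  # the state input
                "taskToken.$": "$$.Task.Token",  # token the consumer sends back
            },
        },
        # The task fails when no heartbeat arrives within this window.
        "HeartbeatSeconds": heartbeat_seconds,
    }
    if next_state:
        state["Next"] = next_state
    else:
        state["End"] = True
    return state


if __name__ == "__main__":
    # Hypothetical queue URL.
    url = "https://sqs.eu-west-1.amazonaws.com/123456789012/tasks"
    print(json.dumps(build_callback_state(url), indent=2))
```

The consumer needs the task token to report back, which is why it travels inside the message body.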

This is a known pattern, but we design our systems for failure! The cool thing about an SQS queue is its retry mechanism.
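The retry behaviour is configured on the queue itself. A minimal sketch of the relevant queue attributes, assuming a dead letter queue already exists; the ARN, the retry count, and the timeout are example values only.

```python
import json


def build_queue_attributes(dlq_arn, max_receive_count=5, visibility_timeout=60):
    """Attributes for an SQS queue that retries and then dead-letters."""
    return {
        # A received message stays hidden for this many seconds; if it is not
        # deleted in that time, it becomes visible again and is retried.
        "VisibilityTimeout": str(visibility_timeout),
        # After max_receive_count receives, SQS moves the message to the DLQ.
        "RedrivePolicy": json.dumps(
            {
                "deadLetterTargetArn": dlq_arn,
                "maxReceiveCount": str(max_receive_count),
            }
        ),
    }


if __name__ == "__main__":
    # Hypothetical dead letter queue ARN.
    attrs = build_queue_attributes("arn:aws:sqs:eu-west-1:123456789012:tasks-dlq")
    print(json.dumps(attrs, indent=2))
```

With boto3 installed, these attributes can be passed as `Attributes=` to `boto3.client("sqs").create_queue(...)`.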

Now, we can design it as follows:

The Step Functions execution places the task in the SQS queue.
The Processor receives a batch of messages.
The function processes each message, and for each, it will:
1. Send a heartbeat to the Step Functions execution. This prevents the execution from timing out.
2. Fail the Step Functions execution directly if the message is invalid.
Depending on the outcome, the function will:
1. On failure, keep the message in the queue. This triggers the retry mechanism based on the visibility timeout of the queue.
2. On success, send a success to the Step Functions execution and remove the message from the queue.
When the message has reached the maximum number of retries, it is delivered to the dead letter queue.
The Dead Letter Queue Processor receives the message and will:
1. Fail the Step Functions execution with some additional context.
2. Remove the message from the dead letter queue.
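The Processor part of this flow can be sketched as a Lambda handler. This is an illustration under several assumptions: the message body carries `taskToken` and `input` keys, `process_task` stands in for the real work, the Step Functions client is passed in so the logic can be tried without AWS, and the event source mapping has partial batch responses (`ReportBatchItemFailures`) enabled so that only failed messages return to the queue.

```python
import json


class InvalidMessage(Exception):
    """The message can never be processed, so retrying is pointless."""


def process_task(payload):
    """Hypothetical placeholder for the actual processing job."""
    if "job" not in payload:
        raise InvalidMessage("missing 'job' field")
    return {"job": payload["job"], "status": "done"}


def handle_batch(event, sfn, process=process_task):
    """Process an SQS batch and report the outcome to Step Functions."""
    failures = []
    for record in event["Records"]:
        try:
            body = json.loads(record["body"])
            token = body["taskToken"]
            # Tell the execution we are alive so it does not time out.
            sfn.send_task_heartbeat(taskToken=token)
            result = process(body["input"])
            sfn.send_task_success(taskToken=token, output=json.dumps(result))
        except InvalidMessage as exc:
            # Invalid input: fail the execution right away. The message is
            # not reported as failed, so it is removed from the queue.
            sfn.send_task_failure(
                taskToken=token, error="InvalidMessage", cause=str(exc)
            )
        except Exception:
            # Any other failure: keep the message in the queue for a retry.
            failures.append({"itemIdentifier": record["messageId"]})
    # Messages not listed here are deleted from the queue by Lambda.
    return {"batchItemFailures": failures}
```

In a real Lambda function, `sfn` would be `boto3.client("stepfunctions")`, created once outside the handler. Passing it in keeps the sketch testable with a stand-in object.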

Why is this a good design? It accounts for failure! We left the actual processing job out of scope here. However, the Lambda function doing the processing can fail for many reasons:

It can reach AWS Service limits.
It might depend on another external system that is unavailable.
You’re scaling so fast that the serverless components can’t handle the load yet.

You have a better chance of succeeding with the SQS queue and the retry mechanism. If the retries are exhausted, the Dead Letter Queue Processor will mark the failure in the Step Functions execution itself. This will keep all the execution information in a single place for you to analyze.
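A Dead Letter Queue Processor that reports the failure back could look roughly like this. As before, the `taskToken` and `input` keys in the message body and the `RetriesExhausted` error name are assumptions for this sketch, and the Step Functions client is passed in so the logic can be exercised without AWS.

```python
import json


def handle_dead_letters(event, sfn):
    """Fail the waiting execution for every message that exhausted its retries."""
    for record in event["Records"]:
        body = json.loads(record["body"])
        # Attach some context so the failure can be analyzed from the
        # execution history alone.
        context = {
            "messageId": record["messageId"],
            "input": body.get("input"),
        }
        sfn.send_task_failure(
            taskToken=body["taskToken"],
            error="RetriesExhausted",
            cause=json.dumps(context),
        )
    # Returning normally lets Lambda delete the batch from the dead letter queue.
```

In a real Lambda function, `sfn` would be `boto3.client("stepfunctions")`.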