DEV Community

Brandon Gautama

Message Delivery System using AWS Step Functions

Background

My team owns a cloud microservice that orchestrates the collection of content from other microservices and delivers it to customer-facing devices.

Problem

Recently, I was tasked with designing a cloud system that ensures delivery of a given message to any target device. A delivery is successful if the target device sends back an ACK (acknowledgement) response.

Normally, this would be an easy task. If the message is dropped in the communication channel, or if the target device is offline or misbehaving, the service can simply retry until it receives the ACK. Since many HTTP client libraries support retries out of the box or make them easy to add (e.g. around Java's built-in HttpClient), such a retry system is straightforward to build.

However, this is only true for synchronous communication, where a client opens a connection, sends a request and awaits the response on the same channel. If no response or a bad response arrives on that channel, the service knows that it has to retry. Things get complicated in an asynchronous system, where the response comes back through a different channel from the one the request was sent on. The sender can no longer just block and wait, so how does it schedule the next attempt?

In addition, the retries have to back off with 'seconds' precision. That is, after the first attempt, wait 15 seconds before the second attempt, then wait 30 seconds before the third attempt; every additional attempt has to wait 4 hours.
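The schedule above can be written down as a small lookup. This is only a sketch of the delays described here; the names `INITIAL_DELAYS`, `STEADY_DELAY` and `delay_before_attempt` are made up for illustration and are not part of the actual service.

```python
# Delays (in seconds) before the 2nd and 3rd attempts, as described above.
INITIAL_DELAYS = [15, 30]
# Every attempt after the third waits 4 hours.
STEADY_DELAY = 4 * 60 * 60


def delay_before_attempt(attempt: int) -> int:
    """Return how many seconds to wait before the given attempt (1-based)."""
    if attempt < 1:
        raise ValueError("attempt numbers start at 1")
    if attempt == 1:
        return 0  # the first attempt is sent immediately
    index = attempt - 2  # attempt 2 uses the first configured delay
    if index < len(INITIAL_DELAYS):
        return INITIAL_DELAYS[index]
    return STEADY_DELAY
```

For example, `delay_before_attempt(2)` is 15 and `delay_before_attempt(4)` is 14400 (4 hours).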

Design

Given the above requirements, we knew we needed a scheduler.

Something similar to a cron job would work, so we looked for AWS products that can provide this 'scheduling' behavior.

We first looked at Amazon EventBridge Scheduler, but it only supports schedules down to 'minute' precision, while we needed 'seconds' precision.

We then came across Amazon SQS. SQS has a 'message delay' feature where a message in the queue only becomes visible after a delay set in seconds. So we took advantage of that delay as a way to schedule the next retry. Here is how it works:

Scheduler System using AWS SQS

  1. The first attempt to deliver the message to the device is accompanied by a message to the 15-second delay queue.
  2. When the message from the 15-second queue becomes visible (after 15 seconds), we first check whether the message has been ACKed. If it hasn't, we attempt another delivery to the device and, at the same time, enqueue the message into the 30-second delay queue. If it has been ACKed, the process stops here.
  3. When the message from the 30-second queue becomes visible (after 30 seconds), we again check whether the message has been ACKed. If it hasn't, we attempt another delivery to the device and, at the same time, enqueue the message into the 4-hour delay queue. If it has been ACKed, the process stops here.
  4. The above repeats until we receive the ACK.
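The steps above can be sketched as a single handler that runs whenever a message becomes visible. This is a minimal sketch, not our production code: the queue names and the `is_acked`, `deliver` and `enqueue` callables are hypothetical stand-ins for the ACK lookup, the delivery call and the SQS send, and how each queue implements its delay is left out.

```python
from typing import Callable, Optional

# Hypothetical queue names. Each delay queue hands the message on to the
# next one; the 4-hour queue feeds itself so retries continue until the ACK.
NEXT_QUEUE = {
    "delay-15s": "delay-30s",
    "delay-30s": "delay-4h",
    "delay-4h": "delay-4h",
}


def handle_visible_message(
    queue: str,
    message_id: str,
    is_acked: Callable[[str], bool],
    deliver: Callable[[str], None],
    enqueue: Callable[[str, str], None],
) -> Optional[str]:
    """Run when a message becomes visible on one of the delay queues.

    Returns the queue the message was re-enqueued on, or None if the
    device already ACKed and the process stops.
    """
    if is_acked(message_id):
        return None  # nothing left to do for this message
    next_queue = NEXT_QUEUE[queue]
    deliver(message_id)              # attempt another delivery to the device
    enqueue(next_queue, message_id)  # schedule the next ACK check
    return next_queue
```

Note that the handler still runs once even when the first attempt was ACKed; it only finds out after the message becomes visible.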

Although the above works, note that every message always generates one extra queue message. For example, if the message is ACKed on the first attempt, no retry is needed. Yet the message has already been enqueued into the 15-second delay queue, and when it becomes visible, we check that it has been ACKed and do not attempt another delivery. That is wasted traffic. SQS also limits the delay to 15 minutes, so the 4-hour step would need extra work on top. It felt like SQS might not be the right solution: we were repurposing its 'message delay' feature and its 'seconds' precision to force-fit our problem. So we decided to keep looking for better alternatives.

At that point, we recognized that the scheduler has a notion of 'states'. It behaves differently depending on its state, i.e. the second attempt has to wait 15 seconds while the third attempt has to wait 30 seconds. Depending on the number of attempts made, the scheduler has to set a different delay. That's when we realized the problem can be modeled as a Finite State Machine (FSM). Step Functions to the rescue!

Message Delivery Workflow

Here is how it works:

  1. The machine starts with the first message delivery using API Gateway: Invoke.
  2. The machine then goes into a 'Wait' state of 15 seconds.
  3. After the 15-second timer is up, the machine proceeds to the 'Choice' state and checks whether the message has been ACKed. If yes, the machine goes to the 'End' state. Otherwise, the machine attempts another message delivery with another API Gateway: Invoke step.
  4. Steps 2 and 3 repeat with a 30-second timer.
  5. Steps 2 and 3 repeat with a 4-hour timer.
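As a rough illustration, the workflow above could be expressed in Amazon States Language, built here as a Python dictionary. Several things are assumptions and not taken from our real definition: the state names, the `/deliver` path, the `$.message` input field, and in particular the `LookupAck` task. The post does not say how the 'Choice' state learns about the ACK, so this sketch assumes a hypothetical Lambda function that returns `true` or `false`.

```python
import json

# Wait times for the three rounds: 15 seconds, 30 seconds, 4 hours.
DELAYS_SECONDS = [15, 30, 4 * 60 * 60]


def build_definition(api_endpoint: str, stage: str, check_ack_lambda_arn: str) -> dict:
    """Build a state machine definition for the delivery workflow."""
    states = {}
    last_round = len(DELAYS_SECONDS)
    for round_no, delay in enumerate(DELAYS_SECONDS, start=1):
        # After the last timer, keep retrying on the same 4-hour round.
        next_round = min(round_no + 1, last_round)
        states[f"Deliver{round_no}"] = {
            "Type": "Task",
            "Resource": "arn:aws:states:::apigateway:invoke",
            "Parameters": {
                "ApiEndpoint": api_endpoint,
                "Method": "POST",
                "Stage": stage,
                "Path": "/deliver",            # hypothetical path
                "RequestBody.$": "$.message",  # hypothetical input field
            },
            "ResultPath": None,  # discard the response, keep the input
            "Next": f"Wait{round_no}",
        }
        states[f"Wait{round_no}"] = {
            "Type": "Wait",
            "Seconds": delay,
            "Next": f"LookupAck{round_no}",
        }
        # Assumption: a Lambda that returns true/false for this message.
        states[f"LookupAck{round_no}"] = {
            "Type": "Task",
            "Resource": check_ack_lambda_arn,
            "ResultPath": "$.acked",
            "Next": f"IsAcked{round_no}",
        }
        states[f"IsAcked{round_no}"] = {
            "Type": "Choice",
            "Choices": [
                {"Variable": "$.acked", "BooleanEquals": True, "Next": "Done"}
            ],
            "Default": f"Deliver{next_round}",
        }
    states["Done"] = {"Type": "Succeed"}
    return {"StartAt": "Deliver1", "States": states}


def to_json(definition: dict) -> str:
    """Serialize the definition in the form Step Functions expects."""
    return json.dumps(definition, indent=2)
```

Generating the rounds in a loop keeps the three timers in one place; a hand-written definition with the same states would work just as well.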

Note that we can simplify this further. If, on receiving the ACK from the device, we also stop this Step Functions execution, we can remove the 'Choice' state. In other words, the execution only keeps running while we have not received the ACK.

Simplified Message Delivery Workflow
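The ACK side of the simplified workflow might look like the sketch below. It assumes the service keeps a mapping from message ID to execution ARN (the `executions` store and the `handle_ack` name are hypothetical), and that `sfn_client` is a Step Functions client such as the one returned by `boto3.client("stepfunctions")`, whose `stop_execution` call ends a running execution.

```python
from typing import Any, MutableMapping


def handle_ack(
    message_id: str,
    executions: MutableMapping[str, str],
    sfn_client: Any,
) -> bool:
    """Stop the retry workflow for a message once its ACK arrives.

    Returns True if an execution was stopped, False if no running
    execution is known for this message.
    """
    execution_arn = executions.pop(message_id, None)
    if execution_arn is None:
        return False  # unknown message, or its ACK was already handled
    sfn_client.stop_execution(
        executionArn=execution_arn,
        cause="ACK received from device",
    )
    return True
```

Passing the client in as a parameter keeps the handler easy to test with a fake; an execution stopped this way ends with the status `ABORTED` and not `SUCCEEDED`, which is worth keeping in mind when reading execution metrics.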

Summary

AWS Step Functions is a workflow orchestration service. It can be a very powerful tool, especially for problems that involve scheduling and state machine workflows.

It can support both event-driven and schedule-based architectures. Its SDK integrations with other AWS services ease development, especially if you're already building on AWS infrastructure.

I encourage you to check the official documentation. Perhaps surprisingly, Step Functions can be used to solve a lot of design problems!
AWS Step Functions
