DEV Community

Oleksandr Hanhaliuk

Posted on • Originally published at Medium

Scaling AWS FIFO SQS (queues) without blocking customers

Let’s imagine you have two services (for example, on AWS Lambda). Service 1 communicates with Service 2 via commands: messages in FIFO SQS (commands VS events).

FIFO (First-In-First-Out) is necessary here to guarantee the order of command processing in Service 2 (a standard AWS SQS queue guarantees neither ordering nor exactly-once delivery, so messages can arrive out of order or be duplicated).

To achieve parallelism, you can use message groups (message group ID), where each group is processed sequentially, but several groups are processed in parallel. This allows you to have separate “mini-queues” for different message sources, for example, for each customer.
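To make the message-group idea concrete, here is a minimal sketch of how the SendMessage parameters for a FIFO queue would be built. The queue URL, customer ids, and command shape are placeholder assumptions; with boto3 you would pass the resulting dict to `sqs.send_message(**params)`.

```python
import json

def build_fifo_send_params(queue_url: str, customer_id: str,
                           command: dict, command_id: str) -> dict:
    """Build the parameters for an SQS SendMessage call to a FIFO queue.

    MessageGroupId partitions the queue into per-customer "mini-queues":
    ordering is guaranteed within a group, while different groups can be
    consumed in parallel. MessageDeduplicationId lets FIFO drop retries
    of the same command within the 5-minute deduplication window.
    """
    return {
        "QueueUrl": queue_url,
        "MessageBody": json.dumps(command),
        "MessageGroupId": customer_id,         # one group per customer
        "MessageDeduplicationId": command_id,  # idempotency key
    }

params = build_fifo_send_params(
    "https://sqs.eu-west-1.amazonaws.com/123456789012/commands.fifo",
    customer_id="customer-a",
    command={"type": "CreateOrder", "orderId": 42},
    command_id="cmd-42",
)
```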

Let’s imagine you have two customers whose messages flow from Service 1 to Service 2:

  • Customer A: 1,000 commands per hour.

  • Customer B: 500,000 commands per hour.

Each of them has its own message group ID (which ensures that they do not block each other). Service 2 can process approximately 100,000 commands per hour. Thus, Customer A’s messages should, in theory, be processed within that hour, as they are not blocked.

This would indeed be the case if it weren’t for an AWS FIFO SQS limitation: the queue only looks at the first 120,000 messages (until recently, 20,000) when searching for message groups that are available for processing.
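The effect of that scan window can be sketched with a small simulation (the window size and group ids are the ones from the example above; the real delivery mechanics are more involved, but the starvation effect is the same):

```python
def visible_groups(backlog: list[str], scan_window: int) -> set[str]:
    """FIFO SQS only inspects the first `scan_window` messages in the
    backlog when looking for message groups it can deliver from.
    Groups whose messages all sit beyond that window are invisible,
    even though they are not blocked by any in-flight message."""
    return set(backlog[:scan_window])

# Customer B's 500,000 commands land before Customer A's 1,000:
backlog = ["B"] * 500_000 + ["A"] * 1_000

print(visible_groups(backlog, scan_window=120_000))  # {'B'} — A is starved
```

Until enough of Customer B’s backlog drains for Customer A’s messages to enter the first 120,000, Customer A waits — despite having its own message group.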

AWS FIFO SQS limit

Solution

We need an architecture that prevents one Customer from blocking another while preserving per-customer message order. There are several options:

  • Store messages in additional storage if SQS has more than X messages in the queue

  • Increase processing throughput in Service 2

  • Dynamically create an additional FIFO SQS for a group of messages with high traffic

I will describe the last scenario below, as I have experience creating this type of architecture.

Architecture description

Tracking SQS load:

  • Each enqueued message increments a counter in DynamoDB for the corresponding userId.

  • After processing a message, Service 2 publishes an SNS notification about completion; Service 1 is subscribed to it and decrements the counter in DynamoDB.
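The counter update can be done atomically with DynamoDB’s `ADD` update expression, so concurrent Lambdas never lose increments. A sketch of the UpdateItem request (table and attribute names `inFlight`/`userId` are illustrative; with boto3 you would pass this to `dynamodb.update_item(**params)`):

```python
def counter_update_params(table: str, customer_id: str, delta: int) -> dict:
    """Build an UpdateItem request that atomically adjusts a customer's
    in-flight message counter. `delta` is +1 when Service 1 enqueues a
    command and -1 when the SNS completion notification arrives."""
    return {
        "TableName": table,
        "Key": {"userId": {"S": customer_id}},
        # ADD creates the attribute at 0 if missing, then adds delta.
        "UpdateExpression": "ADD inFlight :d",
        "ExpressionAttributeValues": {":d": {"N": str(delta)}},
        "ReturnValues": "UPDATED_NEW",
    }

inc = counter_update_params("sqs-load", "customer-b", +1)
dec = counter_update_params("sqs-load", "customer-b", -1)
```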

Separate queue allocation:

  • If a Customer exceeds X (e.g., > 100,000) active messages, a separate FIFO SQS queue is created for it (using the AWS SDK SQS client).

  • All new messages from this Customer are redirected to this queue.

  • A separate Lambda trigger (event source mapping) is added to Service 2, which processes this queue (using the AWS SDK Lambda client). PS: this violates the SoC principle, since the Lambda trigger belongs to Service 2. However, you can refine the architecture by creating a separate service that monitors SQS and manages the Lambda triggers.
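The routing decision in Service 1 can be sketched as a pure function (the threshold, queue-naming scheme, and the in-memory `overflow_queues` registry are assumptions — in the real system the registry would live in DynamoDB, and the comment marks where the AWS SDK calls would go):

```python
OVERFLOW_THRESHOLD = 100_000  # X from the description above

def resolve_queue(customer_id: str, in_flight: int,
                  overflow_queues: dict[str, str],
                  default_queue: str) -> str:
    """Decide which FIFO queue a new command goes to. Once a customer's
    in-flight counter crosses the threshold, a dedicated queue is recorded
    for them and all subsequent commands are routed there (sticky, so
    ordering within the customer's group is preserved in the new queue)."""
    if customer_id in overflow_queues:
        return overflow_queues[customer_id]
    if in_flight > OVERFLOW_THRESHOLD:
        # Real system: SQS CreateQueue (FifoQueue=true) via the SDK, then
        # Lambda CreateEventSourceMapping so Service 2 consumes the queue.
        overflow_queues[customer_id] = f"{customer_id}-overflow.fifo"
        return overflow_queues[customer_id]
    return default_queue

queues: dict[str, str] = {}
resolve_queue("customer-b", 150_000, queues, "commands.fifo")
```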

Cleaning

  1. All records in DynamoDB have a TTL of 1 day.

  2. After it expires, the record is deleted.

  3. The deletion event triggers a Lambda function.

  4. The newly created FIFO SQS queue and its Lambda trigger are deleted.
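TTL deletions surface in a DynamoDB Stream as `REMOVE` records, which the cleanup Lambda can consume. A sketch of the handler’s filtering step (the key name `userId` matches the tracking table above; the actual queue/trigger deletion calls are left as comments):

```python
def expired_customers(stream_event: dict) -> list[str]:
    """Extract customer ids from DynamoDB Stream REMOVE records.
    For each id, the cleanup Lambda would delete the overflow FIFO queue
    (SQS DeleteQueue) and its event source mapping
    (Lambda DeleteEventSourceMapping).

    In practice you may also want to check the record's userIdentity to
    confirm the REMOVE came from TTL expiry rather than a manual delete."""
    ids = []
    for record in stream_event.get("Records", []):
        if record.get("eventName") != "REMOVE":
            continue
        ids.append(record["dynamodb"]["Keys"]["userId"]["S"])
    return ids
```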

Pros

  • Real-time SQS load monitoring

  • Guaranteed non-blocking for customers with moderate traffic

Cons

  • An additional DynamoDB table and extra business logic in Lambda require implementation, maintenance, and cost

  • Simpler implementations based on CloudWatch Alarms are possible, but they add a delay before the additional AWS FIFO SQS queue is created

  • Violation of the SoC (Separation of Concerns) principle, because Service 1 creates Lambda triggers in Service 2

Conclusion

If you have a system where different customers can have very different loads, but you need to maintain message order, it is important to have a mechanism for dynamic traffic separation. The FIFO scan limit of 120,000 messages is easy to overlook but critical.

This approach ensures stability and independence between Customers even during peak periods.
