Nicola Apicella

Posted on Jun 22, 2019 • Edited on Jul 7, 2019

AWS SQS - Deduplicating messages

#aws #serverless #cloud #showdev

In this article I am going to describe a few approaches to deduplicate messages.
Although the article takes SQS as reference, the patterns can be applied for any service which provides at-least once delivery.

Duplicate messages are a common challenge when dealing with SQS and serverless architectures, but the good thing is that there are established patterns we can reuse.
I am going to talk about:

SQS basics (very quick intro)
Common pitfalls in deduping solutions
Depuping Patterns

If you are only interested in the patterns, feel free to jump to the end of the article.

SQS basics

SQS is a queue service that enables async communication between services. It allows:

decoupling services. Services communicate using the Queue as a channel. I.e. Service A sends messages to an SQS queue, Service B polls the queue for new messages; neither of the services need to know about each other.
resiliency. The queue can serve as a buffer for outstanding requests. Service A sends messages to an SQS queue, Service B is temporary down (or slow). None of the messages will be lost, since the queues is effectively working as a buffer.
scaling. The queue size can be monitored to trigger alarms or auto scaling events. Service A experiences a spike in traffic which causes many messages (more than usual) to be sent to the queue. Service B can define an alarm to automatically scale based on the number of unprocessed messages in the queue.

Message duplication

From the AWS docs:

Standard queues guarantee at-least-once delivery, storing messages across multiple servers to maintain high availability.

On rare occasions one of these servers might be unavailable at the time our message is consumed or deleted. Subsequently the server, when available again, may re-send the message the service.

The doc tells us a message could be consumed multiple times (although it's a rare occasion), but it does not mention that messages could be duplicated when pushed into the queue too.
That is the service(s) pushing messages in the queue, sends the same message multiple times. Avoiding this possibility is quite tricky and I think it's far more likely than the other source of duplication.

Unfortunately, SQS only provides tools to avoid this issue with the new SQS FIFO, via the so called SQS Message Deduplication ID.

Common pitfall in deduping solutions

...about the use of absolute time

One of the common solutions to dedupe messages is to keep track of them in a DynamoDB Table, establishing which are in progress, and which are complete.

When a consumer receives a message it needs to check if the message exists in the table.

If it does not exist, it processes it. If it does exist, it needs to establish whether the message is a duplicate or it's a retry caused by previous consumers failing processing it.

One way to do that is to keep track of the moment the item was created/updated in DynamoDB:
if the difference between the current system time and the item's updated timestamp exceeds the operation timeout then the message is a retry.
Unfortunately, the solution suffers from a serious flaw: the use of absolute clock time.
If different machines disagree about what time it is, they will end up processing the message multiple times.
Let's consider this example with Lambdas as SQS consumers:

Lambda timeout is 5 minutes
Lambda 1 gets the message and adds it to dynamo with its current time stamp.
At about the same time Lambda 2 receives the same message. It detects that by looking up the item in DynamoDB.
The status of the message is IN_PROGRESS, so Lambda 2 check if the difference between its system time and the item update time is greater than the lambda timeout.
Because Lambda 2 is affected by a clock screw of 5 minutes, it will erroneously assume the message is a retry.
Lambda 2 will process the message again (concurrently with Lambda 1)

The problem can be partly mitigated by tuning the threshold after which it is assumed to be safe re ingesting a duplicate message.
This threshold must take into account what’s the biggest clock screw we can expect (at least with high probability).

...about the ledger consistency

Solutions that relies on an external ledger to keep track of the messages, need to embrace the consistency model of the ledger they are going to use.

For example, when using DynamoDB as ledger, it is necessary to use:

consistent reads and writes
Conditional expression to deal with race conditions between multiple consumers

This turns out to be quite tricky to do (and test!).

...about reinventing the wheel

Most of the solutions, end up writing code which effectively build both:

a distributed lock built on top of a ledger
a simple state machine backed by a ledger

These are very common problems, isn't there anything already build for that?

Deduping patterns

Idempotency

Make the consumers of the message idempotent. Message duplication is only an issue if the consumers are not idempotent.
If the activity the consumers perform is inherently non-idempotent, isolate the non-idempotent step from the idempotent ones.

Step functions

For the non idempotent piece of the activity, use Step functions to create a fully managed and scalable state machine.
Step functions executions are idempotent, if StartExecution is called with the same name and input as a running execution, the call will succeed and return the same response as the original request.
If the execution is closed or if the input is different, it will return a 400 ExecutionAlreadyExists error.

Dynamo DB Lock

If Step functions is not a good fit for you (costs?) use DynamoDB lock library to lock processing a message.

The library is battle-tested and provides a lock abstraction on top of DyanmoDB.
The library never stores absolute times in DynamoDB, only the relative "lease duration" time is stored in DynamoDB.
What this means is that, even if two different machines disagree about what time it is, they will still avoid clobbering each other's locks.

Conclusions

TL;DR
Strive to make your consumers idempotent and if unavoidable isolate the non-idempotent part.
Consider using Step Functions to dedupe executions. If you can't, use DynamoDB Lock library.

Readings

If you are interested in the topic, there is an interesting article about the theory behind message deduplication via a deduplication id, like the one adopted by SQS FIFO. The technique has been called fencing: https://martin.kleppmann.com/2016/02/08/how-to-do-distributed-locking.html
DynamoDB Locks: https://github.com/awslabs/dynamodb-lock-client
Step functions: https://aws.amazon.com/step-functions/

You can find me on twitter and github.

Have fun!

Top comments (3)

Alexandru Antochi • Jul 9 '19 • Edited

What about consuming the message and if success, then check the DB for that message and if not present, add message to DB and return data.

Thus if there was a failure the message will not be registered in the DB and there is no need to compare timestamps.

It adds extra processing (consuming the message and then checking if it's a duplicate), but if duplicates are rare, it would it make the system more reliable by eliminating the timestamps?

Nicola Apicella • Jul 11 '19

Hi! The communication is asynchronous, some client(producer) sends a message to a queue and some server(consumer) pull it from the queue.
If you have multiple consumers, they might consume the same message concurrently, which you can avoid by storing a message id in a table.
But you also want to deal with failures in the consumer (consumer gets the message but fails midway while processing it). All this logic can be abstracted away to some extend by using Step functions or DynamoDB locks.

Mario Bittencourt • Jul 15 '20

Hi Nicola, can you provide more details on why (and how) the use of the step function would solve the problem for non-idempotent consumers/handlers of the messages that can be duplicated?