In this article I am going to describe a few approaches to deduplicate messages.
Although the article takes SQS as reference, the patterns can be applied for any service which provides at-least once delivery.
Duplicate messages are a common challenge when dealing with SQS and serverless architectures, but the good thing is that there are established patterns we can reuse.
I am going to talk about:
- SQS basics (very quick intro)
- Common pitfalls in deduping solutions
- Depuping Patterns
If you are only interested in the patterns, feel free to jump to the end of the article.
SQS is a queue service that enables async communication between services. It allows:
- decoupling services. Services communicate using the Queue as a channel. I.e. Service A sends messages to an SQS queue, Service B polls the queue for new messages; neither of the services need to know about each other.
- resiliency. The queue can serve as a buffer for outstanding requests. Service A sends messages to an SQS queue, Service B is temporary down (or slow). None of the messages will be lost, since the queues is effectively working as a buffer.
- scaling. The queue size can be monitored to trigger alarms or auto scaling events. Service A experiences a spike in traffic which causes many messages (more than usual) to be sent to the queue. Service B can define an alarm to automatically scale based on the number of unprocessed messages in the queue.
From the AWS docs:
Standard queues guarantee at-least-once delivery, storing messages across multiple servers to maintain high availability.
On rare occasions one of these servers might be unavailable at the time our message is consumed or deleted. Subsequently the server, when available again, may re-send the message the service.
The doc tells us a message could be consumed multiple times (although it's a rare occasion), but it does not mention that messages could be duplicated when pushed into the queue too.
That is the service(s) pushing messages in the queue, sends the same message multiple times. Avoiding this possibility is quite tricky and I think it's far more likely than the other source of duplication.
Unfortunately, SQS only provides tools to avoid this issue with the new SQS FIFO, via the so called SQS Message Deduplication ID.
One of the common solutions to dedupe messages is to keep track of them in a DynamoDB Table, establishing which are in progress, and which are complete.
When a consumer receives a message it needs to check if the message exists in the table.
If it does not exist, it processes it. If it does exist, it needs to establish whether the message is a duplicate or it's a retry caused by previous consumers failing processing it.
One way to do that is to keep track of the moment the item was created/updated in DynamoDB:
if the difference between the current system time and the item's updated timestamp exceeds the operation timeout then the message is a retry.
Unfortunately, the solution suffers from a serious flaw: the use of absolute clock time.
If different machines disagree about what time it is, they will end up processing the message multiple times.
Let's consider this example with Lambdas as SQS consumers:
- Lambda timeout is 5 minutes
- Lambda 1 gets the message and adds it to dynamo with its current time stamp.
- At about the same time Lambda 2 receives the same message. It detects that by looking up the item in DynamoDB.
- The status of the message is IN_PROGRESS, so Lambda 2 check if the difference between its system time and the item update time is greater than the lambda timeout.
- Because Lambda 2 is affected by a clock screw of 5 minutes, it will erroneously assume the message is a retry.
- Lambda 2 will process the message again (concurrently with Lambda 1)
The problem can be partly mitigated by tuning the threshold after which it is assumed to be safe re ingesting a duplicate message.
This threshold must take into account what’s the biggest clock screw we can expect (at least with high probability).
Solutions that relies on an external ledger to keep track of the messages, need to embrace the consistency model of the ledger they are going to use.
For example, when using DynamoDB as ledger, it is necessary to use:
- consistent reads and writes
- Conditional expression to deal with race conditions between multiple consumers
This turns out to be quite tricky to do (and test!).
Most of the solutions, end up writing code which effectively build both:
- a distributed lock built on top of a ledger
- a simple state machine backed by a ledger
These are very common problems, isn't there anything already build for that?
Make the consumers of the message idempotent. Message duplication is only an issue if the consumers are not idempotent.
If the activity the consumers perform is inherently non-idempotent, isolate the non-idempotent step from the idempotent ones.
For the non idempotent piece of the activity, use Step functions to create a fully managed and scalable state machine.
Step functions executions are idempotent, if StartExecution is called with the same name and input as a running execution, the call will succeed and return the same response as the original request.
If the execution is closed or if the input is different, it will return a 400 ExecutionAlreadyExists error.
If Step functions is not a good fit for you (costs?) use DynamoDB lock library to lock processing a message.
The library is battle-tested and provides a lock abstraction on top of DyanmoDB.
The library never stores absolute times in DynamoDB, only the relative "lease duration" time is stored in DynamoDB.
What this means is that, even if two different machines disagree about what time it is, they will still avoid clobbering each other's locks.
Strive to make your consumers idempotent and if unavoidable isolate the non-idempotent part.
Consider using Step Functions to dedupe executions. If you can't, use DynamoDB Lock library.
- If you are interested in the topic, there is an interesting article about the theory behind message deduplication via a deduplication id, like the one adopted by SQS FIFO. The technique has been called fencing: https://martin.kleppmann.com/2016/02/08/how-to-do-distributed-locking.html
- DynamoDB Locks: https://github.com/awslabs/dynamodb-lock-client
- Step functions: https://aws.amazon.com/step-functions/