What countermeasures do you use when an ack message (whether RabbitMQ, SQS, or other) is lost due to a network fault (or any other reason the broker is unreachable) and a message is resent? How do you avoid processing the message a second (or more) time?
My name is Matteo and I'm a cloud solution architect and tech enthusiast. In my spare time, I work on open source software as much as I can. I simply enjoy writing software that is actually useful.
It depends on the way the message is processed.
Is it some kind of background job (e.g. updating a DB table or uploading a file to object storage)? Then do your best to encapsulate the computation in a single, atomic transaction: it either succeeds entirely or fails entirely. The ack is sent back once the transaction completes successfully. The chances of an acknowledgement being lost at that point are very low, but a best practice is to have some kind of "commit" functionality, so you can process your job, send the ack, and then commit once the broker confirms. This way, if something happens and the message cannot be acknowledged, you simply roll back the transaction and re-enqueue the job. This is basically what job queues like Sidekiq do.
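The ack-after-commit pattern described above could be sketched roughly like this. It is only a sketch under stated assumptions: `sqlite3` stands in for whatever datastore the job writes to, and the `ack` callback stands in for a real broker client (e.g. a RabbitMQ channel); `handle_message` and the `jobs` table are illustrative names, not part of any broker API.

```python
import sqlite3

# Stand-in datastore for the job's side effects.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (id TEXT PRIMARY KEY, result TEXT)")

def handle_message(message_id, payload, ack):
    """Do the work inside one atomic transaction; ack only after it commits."""
    try:
        with conn:  # opens a transaction: commits on success, rolls back on error
            conn.execute(
                "INSERT INTO jobs (id, result) VALUES (?, ?)",
                (message_id, payload.upper()),  # the actual "work"
            )
    except sqlite3.Error:
        # Transaction rolled back; skip the ack so the broker redelivers.
        return False
    ack(message_id)  # ack only after the work is durably committed
    return True
```

Because the `jobs` primary key doubles as the message id here, a redelivered message fails the insert, the transaction rolls back, and no duplicate side effect is committed.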
Is it some kind of RPC? Then you don't protect against duplicate messages. If an acknowledgement is lost, the call response will probably never be delivered, so it is better to re-enqueue the message and potentially duplicate it than to risk losing it.
I'm specifically talking about service architecture, and not background jobs. I recognize that the two are only tangentially related. Definitely not talking about RPC.
So, distributed transactions that span a message acknowledgement and whatever actions a service takes are anathema to service architecture. They'd also increase the likelihood of deadlocks, which would contradict the non-locking benefits of the architecture.
The three commonly recognized guarantees of distributed, message-based systems are that messages will arrive out of order, that messages will not arrive at all, and that messages will arrive more than once. This includes ACK signals, especially with regard to messages not arriving at all.
However infrequently it happens, it will happen. Whether it happens one in a million times or a million in a million times, the work of implementing the countermeasures is the same. The presence of networks, computers, and electricity guarantees that ACK messages will be lost and that messages will have to be reprocessed (messages will arrive more than once).
So, what I'm interested in is how you specifically account for the occurrence of message reprocessing that messaging systems guarantee.
Here's my understanding: microservices have smart endpoints and dumb pipes, so it's the service's responsibility to account for message redeliveries, as you already pointed out.
One way out would be to make the message processing operation idempotent. Two cases here:
The operation is idempotent in and of itself. Nothing extra is needed here, since reprocessing causes no additional side effects.
The operation is not idempotent in nature. Use an identifier (e.g. a random unique GUID, or a hash of the message) to ensure each message is processed only once.
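A minimal sketch of the identifier-based approach, assuming a relational table as the dedup store. The names (`processed`, `message_key`, `process_once`) are illustrative, not part of any broker API; recording the id and doing the work in the same transaction keeps the two from drifting apart.

```python
import hashlib
import sqlite3

# Stand-in dedup store: one row per message already processed.
dedup_conn = sqlite3.connect(":memory:")
dedup_conn.execute("CREATE TABLE processed (msg_id TEXT PRIMARY KEY)")

def message_key(body: bytes) -> str:
    # When the broker supplies no stable id, a content hash can serve as one.
    return hashlib.sha256(body).hexdigest()

def process_once(msg_id: str, body: bytes, side_effect) -> bool:
    """Run side_effect at most once per msg_id; return True if it ran."""
    with dedup_conn:  # the dedup insert and the work commit (or roll back) together
        try:
            dedup_conn.execute("INSERT INTO processed (msg_id) VALUES (?)", (msg_id,))
        except sqlite3.IntegrityError:
            return False  # already processed: ack the redelivery, but do nothing
        side_effect(body)
    return True
```

If `side_effect` raises, the transaction rolls back and the id is not recorded, so a later redelivery gets a fresh attempt.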
Your thoughts?

Indeed. However, a unique message identifier would have to be persisted in order to know that it has been previously processed. Where possible, and where the messaging tool is homogeneous (within a logical service, rather than integrating disparate logical services), a message sequence number can be a better option (especially in cases where event sourcing is in use).
The message sequence number is interesting, thanks!
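For what it's worth, the sequence-number check might look something like this. It is a sketch under two assumptions: each producer stamps its messages with a monotonically increasing number, and delivery within one source is ordered (as in a single event-sourced stream); out-of-order arrival would additionally need gap tracking.

```python
# Highest sequence number applied so far, per source/stream.
last_seen: dict[str, int] = {}

def accept(source: str, seq: int) -> bool:
    """Return True if this (source, seq) has not been applied yet."""
    if seq <= last_seen.get(source, 0):
        return False  # redelivery of something already applied: skip it
    last_seen[source] = seq
    return True
```

Compared with persisting one id per message, this needs only a single integer per source, which is why it pairs so well with event sourcing.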