Matteo Joliveau

Posted on Feb 23, 2018

Microservices communications. Why you should switch to message queues.

#microservices #cloud #amqp #rest

This is actually a more elaborate version of an old comment I wrote here

When doing microservices, a fairly crucial design point is: how should my services communicate with each other?
Many people usually choose to design some RESTful HTTP API that each service expose and then have the other services invoke it with a normal HTTP client.
This has some advantages, primarily making it easy to discover services by using DNS resolution and API Gateways, but it has many drawbacks too.
For example, what if the called service has crashed down and cannot respond? Your client service has to implement some kind of reconnection or failover logic, otherwise, you risk to lose requests and pieces of information. A cloud architecture should be resilient and recover gracefully from failures. Also, an HTTP request is a blocking API, making it asynchronous implies some tricky shenanigans on the client side.

Entering message queues

Using message queues is actually a fairly old solution when dealing with multiple intercommunicating services. Think about old approaches like Enterprise Service Buses or the Java JMS specification.

Let's start by introducing some terminology:

Message: a package for information, usually composed of two parts. Some headers, key-value pairs containing metadata, and a body, a binary package containing the actual message itself.
Producer: whoever creates and sends a message
Consumer: whoever receives and reads a message
Queue: a communication channel that enqueues messages for later retrieval by one or more consumers
Exchange: a queue aggregator that abstract away message queues and routes messages to the appropriate queue based on some predefined logic.

A message queue architecture requires an additional service called a message broker that is tasked with gathering, routing and distributing your messages from senders to the right receivers.

Think of it as a mail service. You (the producer) are sending a letter (your message) to someone (the consumer), and you do this by specifying the address (the routing logic for the message, such as the topic on which it is published) and by giving the letter to the local Post office (the message broker). After you deliver the letter, it's no longer your business to ensure that it actually arrives at your friend. The postal service will take care of that.

In a microservice environment, it's actually a really solid solution because it adds a layer of abstraction (the message broker itself) between loosely coupled services and allows for fully asynchronous communications.

Errors will occur. Let's deal with them instead of simply avoiding them

Remember, a cloud environment should be error-resilient and services must fail gracefully and not take down the whole application if one of them dies. With message queues if a microservice dies unexpectedly the broker can still receive incoming messages, store them for later (so that the dead service can recover them when it comes back online) and optionally send back a Last Will and Testament message to the other service saying that the receiver has died. Also, Push/Pull, Publish/Subscribe and Topic queue mechanisms give more flexibility and efficiency when designing inter-service communications, and the ability to send binary payloads allow for easy use of formats like Google ProtoBuff or MessagePack instead of JSON, which are much more efficient in terms of bandwidth usage.

Security is key

Advanced broker servers like RabbitMQ support multiple message protocols (AMQP, MQTT, STOMP...), have flexible authorization mechanism (access control, virtual hosts and even third party/custom auth services via many different transports such as HTTP or AMQP) and will basically remove the burden of orchestrating requests authorization from your service. Plug a custom microservice into Rabbit's auth backend and code your policies in there. The rest will be taken care of by the broker.

Infrastructure at scale

Using message queues with brokers like Rabbit helps a lot with scalability too, which is another crucial aspect of microservices.
If one of the services is struggling under load we want to be able to spin up more instances quickly and without having to reconfigure stuff around. With HTTP as our communication method we would normally have a self-registering Service Discovery Server (like Netflix Eureka or a container orchestration system like Kubernetes or Rancher) and some kind of load balancer integrated with it so that our traffic is distributed and routed to the various instances. With message queues, our broker IS the load balancer. If multiple consumers are listening to a topic at the same time messages will be delivered following a configured strategy (more on RabbitMQ's QoS here).

So, for example, we can have four instances of our service processing messages in a round robin fashion, but since the QoS parameter is customizable at runtime we could also switch to a fair dispatching strategy if one of the consumers is taking too long to complete its work and we need to balance the load out. Also, note that the client configuration is entirely done at runtime (and so is exchanges, queues and topics declarations by the way) so there is no need to tinker with the broker's configuration or restart anything.

Async == Efficiency

We said at the beginning that one of the key features of messages compared to HTTP requests is that they allow for fully async communications. If I need to make a request to another service I can send a message on it's topic and then move on with my work. I can deal with the response when it comes in on my own topic, without having to wait for it. Correlation IDs are used to keep track of which response message refers to which request. This works in concert with what we said about resiliency.
In an HTTP scenario, if I make a request and the callee is down, the connection would be refused and I would have to try again and again until it comes back up.
With message systems, I can send my request and forget about it. If the callee cannot receive it the broker will keep the message in the queue, and then deliver it when the consumer connects back. My response will come back when it can and I won't have to block or wait for it.

In conclusion, messages are great. Show them some love

This is by no mean a rant against HTTP (which is still great, especially for public front facing APIs) but rather a try to guide you towards a more efficient and proficient strategy to coordinate microservice communications and all of this can be achieved with open source software, open protocols such as AMQP that prevent vendor lock-in and with a low (and scalable) infrastructure cost (see RabbitMQ's requirements).

I used RabbitMQ for all my microservices project and it has been a great experience. Let me know in the comments what are your thoughts! Have you used message queues before? What is your experience with it?

Top comments (30)

Scott Bellware • Jun 9 '18

What countermeasures do you use for when an ack message (whether RabbitMQ, SQS, or other) is lost due to either a network fault (or any other reason why a broker is unreachable) and a message is resent. How do you avoid processing the message a second (or more) time?

Matteo Joliveau • Jun 11 '18 • Edited

It depends on the way the message is processed.
Is it some kind of background job (e.g. updating a DB table or uploading a file to an object storage)? Then do your best to encapsulate the computation in a single, atomic transaction. It either succeeds entirely or fails entirely. Acks will be sent back once the transaction is successfully completed. The chances of an acknowledgement being lost at this point are very low, but a best practice is to have some kind of "commit" functionality, so you can process your job, send the ack and then commit when the broker confirms. This way, if something happens and the message cannot be acknowledged, you simply rollback the transaction and re-enqueue the job. This is basically what job queues like Sidekiq do.

Is it some kind of RPC? Then you don't protect against duplicate messages. If an acknowledgement is lost, it means that the call response will probably never be delivered, so it is better to re-enqueue the message and potentially duplicate it than risk loosing it.

Scott Bellware • Jun 11 '18 • Edited

I'm specifically talking about service architecture, and not background jobs. I recognize that the two are only tangentially related. Definitely not talking about RPC.

So, distributed transactions that span a message acknowledgement and whatever actions a service takes are anathema to service architecture. They'd also increase the likelihood of dead locks, which would contradict the non-locking benefits of the architecture.

The three commonly-recognized guarantees of distributed, message-based systems are that messages will arrive out of order, that messages will not arrive at all, and that message will arrive more than once. This includes ACK signals - especially with regard to messages not arriving at all.

Irrespective of whether it happens infrequently, it will happen. Whether it happens one in a million times or a million in a million times, the work of implementing the countermeasures is the same. The presence of networks and computers and electricity guarantees that ACK messages will be lost and that messages will have to be reprocessed (messages will arrive more than once).

So, what I'm interested in is how you specifically account for the occurrence of message reprocessing that messaging systems guarantee.

Galdin Raphael • May 15 '19

Here's my understanding: microservices have smart endpoint and dumb pipes. So it's the service's responsibility to account for message redeliveries like you already pointed out.

One way out would be to make the message processing operation idempotent. Two cases here:

The operation is idempotent in and of itself. Doing nothing here will cause no side-effects.
The operation is not idempotent in nature. Use an identifier (eg. a random unique guid, or hash of the message, etc.) to ensure a message is processed once per request.

Your thoughts?

Scott Bellware • Jun 6 '19

Indeed. However, a unique message identifier would have to be persisted in order to know that it has been previously processed. Where possible, and where the messaging tool is homogenous (within a logical service, rather than integrating disparate logical services), then a message sequence number can be a better option (especially in cases where event sourcing is in use).

Galdin Raphael • Jun 6 '19

The message sequence number is interesting, thanks!

Gergely Mészáros • May 21 '20

Maybe I misunderstood what you want to say, but I can't see the inherent superiority. Here you seems to be talking about different architectural designs and all the shortcomings/profits are direct result of the architecture not the protocol used. Message queue broker is not really async (at least at protocol level) since the messages must be ack-ed. The message broker will become your single point of failure. So what did we achieve? We transferred the whole problem to some (arguably more regulated) dedicated service.

I think the real difference between the two is that HTTP (originally) was not designed to handle "push" notification therefore only one side could initiate the information xchange. Using HTTP/2 push or websockets we could easily implement full blown messaging over http (and still use REST).

It seems you implicitly suppose the service must immediately execute the whole business logic in response of REST request. That's not true. REST is about state transfer, not about business logic. If it's a HTTP REST (most common) we even have standard response code for delayed execution: HTTP 202.
It's completely valid to send POST: BeginMyTask: uuid and getting HTTP 202 "Task started." and later getting some notification of the finish. Just like you would do in case of message broker.

We may have bad architecture and good architecture. It might be harder to create bad architecture with message broker but it's still possible. However if I try to solve everything with message brokers, the threat of over-engineering is high and I may get inferior design.

For example if I have 5 services communicating independently over REST calls I have a very resilent network without single point of failure. If one service goes down, some feature will be inaccessible but the system will run happily. If I have a message broker and the message broker goes down... end of story.

Neither method is superior, everything depends on what would you like to achieve. What do you think?

Ferdinand Mütsch • Mar 12 '18

Great article with good arguments! I really like the asynchronous messaging concept for inter-service communication.
Lately, I went to a talk by Spring Data project lead Oliver Gierke. He presented some kind of a hybrid approach between Microservice- and monolith architectures as well as a pretty interesting communication concept, which somehow is a mixture of direct API invocation and messaging. If you're interested, here's a recording.

Matteo Joliveau • Mar 14 '18

That's a very interesting talk! Thanks for sharing it :D

Aldo Sinanaj • Feb 25 '18

Hi Matteo, nice article! I have a question about correlation id. Where do you store it so you can keep track of it once you get a response on the response topic? I was wondering if you store it in a memory cache it will not be reliable if the producer will crash, or if you have more running instances of this component then you need a shared cache?! Can you share your experience how you have deal with this scenario? Thanks, Aldo

Matteo Joliveau • Feb 26 '18

Hi Aldo, thanks for the question, it's actually a very good point I missed in the article!
Correlation IDs are only useful if someone is actually waiting for a response in particular. Since most microservices are stateless and routing is taken care of by the broker itself, the Correlation ID can be kept in the message headers and passed around (and logged, always log CorIDs!) between services without them caring much about it. The only one that needs to keep track of it is the request initiator (e.g. the HTTP API Gateway). It will create an ID and assign it to the request message, then wait on its response topic for a message with the same ID.
The other services will not need to care about it since the topic on which to send responses to is provided by the request message itself (in the reply_to RabbitMQ property for example) so every req/res is self-contained.
You receive a request, you process it and you send it to the provided topic.

On the other hand, if you need to contact another service in order to answer a request you just received, you become an initiator yourself, so when you send your secondary request you can use the same Correlation ID when waiting for the response.

Now to actually answer your question, the IDs you are traking can be kept in memory (normally you use a thread pool or an event loop to deal with requests/responses so the ID can be maintained there) or you can use a cache like Redis, but by no means, it must be shared. Remember that microservices are independent and must only care about themselves.

I will update the article to add this and other enhancements :)

Aldo Sinanaj • Feb 27 '18

Thanks for the reply! Waiting for the update :-)

Bruno Oliveira • Oct 11 '19

Hi Matteo! Very nice article we have here, thank you for it.

I'm starting to study strategies for microservices and async based communications and I have some questions that, maybe, you can help me with.

The client should connect directly to the broker and send/receive messages? Is this a good or bad practice? What are your thoughts on security regarding the connection being "available" for the client (on the browser)?
If I decide to make an API that I call and the API sends the message to the broker, how should the client receive the message? Does the API have to hang the request while consuming the broker until receives a message that IS the waited response?

I'm kinda stuck into how this flux should be. If you have any resources that I could read/watch on this subject (some github repos, maybe), I believe they can be very helpful too.

Matteo Joliveau • Oct 11 '19 • Edited

Hi Bruno, thank you for reading it!

The client COULD connect directly to the broker and use an async API to consume data from the backend, for example by using a protocol such as MQTT (which RabbitMQ supports) that is designed for transmission over unstable networks. But I generally do not proceed this way, and I instead implement a "gateway" service (for example exposing a GraphQL o REST API) that will translate the requests to async messages.
Yes, the API has to wait for a response to come back on the reply queue. While this sounds bad and potentially slow, I assure you it is not (normally). Actually, with the correct setup I often see this approach being faster than regular HTTP calls between microservices, being AMQP capable of transmitting binary payloads (compared to text for HTTP/1.1) and the generally lower overhead introduced by AMQP compared to HTTP.
From the client point of view, it's just an API call and it will return with a response in acceptable time. Nonetheless, whenever you have to wait for a reply you should implement timeouts so that you don't hang indefinitely. Maybe the calling service has gone away or has crashed, the response may never arrive. So as for HTTP, request timeouts are very important.

TARUN GOEL • Aug 24 '18

Hi Matteo, great article! I have a question that in what format messages must be communicated among the microservices. I am currently using hashmaps or json strings and dont think they are the best industrial practices. Is there anything you could advice.

Matteo Joliveau • Aug 24 '18

If you're using brokers that support binary payloads, I recommend you look at MessagePack if you want a schema-less format (very akin to JSON). If on the other hand, you want to have a strongly typed message schema, with known data structures very similar to programming language classes, then Protocol Buffers or Apache Thrift are much more robust solutions.

I recommend the latter if you have many different microservices. MessagePack is easier but can be less ideal when scaling.

Dorian Neto • Dec 25 '20

How the producer know that a specific or all messages were processed by consumer? I mean, how it works the consumer's response?

Rafael Jesus • Mar 2 '18

Thanks for the article, quick question, how do you produce messages when the Broker is unavailable? Yes I know there are several ways to manage this, I am just curious how you're doing ^{_^}

Matteo Joliveau • Apr 4 '18 • Edited

Ideally, your broker will never be unavailable.
I know, this is silly because we don't live in an ideal world, but you should consider your broker a supporting service similar to a database or email sender.
In a cloud environment it is best to have the broker set up as a managed service (e.g. Amazon SQS) or in a high-availability cluster separated from your core application (Rabbit supports clustering as a primary feature).

If by any chance your broker actually went down, it is an unexpected crisis that must be dealt with depending on your project and infrastructure. Your services are allowed to ungracefully die (but it is always best to implement some failover, by logging the issue and halting operations while the service attempt to reconnect, maybe with some exponential backoff) in this case because it's a disaster situation and not an operational incident.