I stumbled upon a blog post by Qiscus.com describing the architecture that lets them handle 10 million concurrent connections and a throughput of 5 million messages per minute. This is the high-level Qiscus architecture shared in the blog post, which jibes with what I learned from a talk their CTO Evan Purnama gave in Jakarta a couple of years ago.
Pretty cool, right?
In that talk, Evan spoke about the architecture and its design decisions, and the post covers some of the same information. Here are a few of the key points I noted from this architecture.
- MQTT was chosen as the fronting layer because it’s lightweight, requires minimal client resources, and supports very specific topics as well as wildcard patterns in subscriptions.
- They opted to decouple the layer that handles the huge volume of incoming traffic from the layer that serves the many different consumers of those events, for performance and independent scalability.
- Message persistence is required so the system can tolerate failed consumers, letting reconnected consumers pick up from the last message they received, and so it can absorb bursts of traffic, letting downstream apps consume incoming data at whatever pace works for them.
- Kafka was chosen for the consumption layer because of its scalability and replay capability.
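To make the wildcard point concrete, here is a minimal Python sketch of how an MQTT topic filter is matched against a topic name: '+' matches exactly one level, and '#' (allowed only as the last level) matches the parent level plus any number of child levels. The function name and the chat topics are hypothetical, and the sketch assumes a valid filter and ignores the special handling of topics that start with '$' (such as $SYS).

```python
def mqtt_topic_matches(topic_filter: str, topic: str) -> bool:
    """Return True if an MQTT topic name matches a subscription topic filter.

    Assumes the filter is valid ('#' only as the last level) and ignores the
    special rule for topics starting with '$'.
    """
    filter_levels = topic_filter.split("/")
    topic_levels = topic.split("/")
    for i, level in enumerate(filter_levels):
        if level == "#":
            # Multi-level wildcard: matches the parent level and everything below it,
            # so "chat/#" matches both "chat" and "chat/room42/typing".
            return True
        if i >= len(topic_levels):
            # The filter has more levels than the topic.
            return False
        if level != "+" and level != topic_levels[i]:
            # '+' matches any single level; anything else must match exactly.
            return False
    # Without a trailing '#', the topic must have exactly as many levels.
    return len(filter_levels) == len(topic_levels)


# Hypothetical chat topics, one level per attribute.
print(mqtt_topic_matches("chat/+/messages", "chat/room42/messages"))  # True
print(mqtt_topic_matches("chat/+/messages", "chat/room42/typing"))    # False
print(mqtt_topic_matches("chat/#", "chat"))                           # True
```

Because subscriptions are just patterns over topic levels, a client can narrow or widen what it receives without anything being provisioned on the broker.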
Looking at the requirements and the assumptions behind the design decisions, I wondered whether I could come up with a different approach. After a little thinking and a lot of procrastination, here’s what I came up with. Naturally, I’m using the Solace PubSub+ Platform as the event platform.
Figure 2: Alternative architecture with a Solace-enabled event mesh.
With this alternative architecture, let’s start with the MQTT requirement. Solace PubSub+ Event Broker supports MQTT 3.1 and 5, along with REST and WebSocket, all without any add-ons or proxies. Events coming in via MQTT can be consumed through any of the other protocols and APIs used by the consumers on the right-hand side, such as JMS, AMQP, WebSocket, MQTT, or even a REST webhook mechanism.
Message persistence is taken care of by the guaranteed messaging capability of Solace PubSub+ Event Broker, and yes, it can do replay as well. No biggie there. In fact, shock absorption and lossless guaranteed delivery are among the things our customers around the world know Solace for.
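The persistence and replay behavior described above boils down to a durable log plus a per-consumer position. The toy Python model below is a hypothetical illustration of that idea only; it is not how PubSub+ guaranteed messaging is implemented, and the class and method names are made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class DurableLog:
    """Hypothetical toy model: a persistent event log with per-consumer positions."""

    events: list[str] = field(default_factory=list)
    positions: dict[str, int] = field(default_factory=dict)  # consumer -> next index

    def publish(self, event: str) -> None:
        # Events are stored first, so a burst of publishes is absorbed by the log
        # no matter how fast (or slow) the consumers are.
        self.events.append(event)

    def poll(self, consumer: str, max_events: int) -> list[str]:
        # Deliver up to max_events, starting after the last acknowledged event.
        start = self.positions.get(consumer, 0)
        return self.events[start:start + max_events]

    def ack(self, consumer: str, count: int) -> None:
        # Advance the position only after the consumer has processed the events;
        # unacknowledged events are delivered again on the next poll.
        start = self.positions.get(consumer, 0)
        self.positions[consumer] = min(start + count, len(self.events))

    def replay(self, consumer: str, from_index: int = 0) -> None:
        # Rewind the consumer so earlier events are delivered again.
        self.positions[consumer] = max(0, min(from_index, len(self.events)))


log = DurableLog()
for i in range(5):
    log.publish(f"msg-{i}")

batch = log.poll("analytics", 2)   # ['msg-0', 'msg-1']
log.ack("analytics", len(batch))
# The consumer disconnects here; on reconnect it resumes where it left off.
print(log.poll("analytics", 10))   # ['msg-2', 'msg-3', 'msg-4']
log.replay("analytics")
print(log.poll("analytics", 1))    # ['msg-0']
```

Each consumer keeps its own position, so a slow or failed consumer never holds back the others, which is the shock-absorption property in miniature.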
The key distinction between Qiscus’ architecture and the one I’ve presented here is that there is logically just one fabric for the messaging platform, called an “event mesh”: like a service mesh, but for events instead of services. With an event mesh, there is no need for applications to move or translate events from the connectivity layer to the processing layer.
This architecture is still decoupled for performance and scalability, with separate layers for the connectivity side and the processing/consumption side. The connectivity layer consists of multiple PubSub+ brokers, deployed to meet the required scale as well as location requirements. After all, these brokers can run on most, if not all, public and private clouds, container platforms, and virtualization platforms.
The connectivity layer then streams the events to the processing layers in a smart, efficient, and guaranteed manner, so we don’t lose messages along the way. The processing layer then scales as needed to stream those events to the back-end applications.
One major thing to note here: these back-end applications have all the flexibility of wildcard topic subscriptions, just like those you see with the MQTT protocol. If you don’t really get what this means, it could be because you’ve been led to believe that topics must be created before anything else, that creating them is expensive, and that you can’t afford the luxury or agility of fine-grained topic addressing. In that case, you’re missing out on one of the great things about event-driven architecture (EDA), and you should really go watch a video on this topic.
Figure 3: A sneak peek of how topic routing with wildcards works
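On the PubSub+ side, the same idea uses different wildcard characters. As I understand Solace topic syntax, '*' on its own matches exactly one level, a level ending in '*' (for example 'room4*') is a prefix match, and '>' as the last level matches one or more remaining levels. The Python sketch below is a simplified model of those rules, not the broker’s implementation, and the chat_topic helper and its topic layout are hypothetical. The point is that the publisher simply builds a topic from event attributes, nothing has to be created up front, and each subscriber picks the slice it cares about.

```python
def chat_topic(tenant: str, room: str, event_type: str) -> str:
    # Hypothetical topic layout: one level per attribute, built per event.
    # No topic has to be created on the broker beforehand.
    return f"chat/{tenant}/{room}/{event_type}"


def solace_topic_matches(subscription: str, topic: str) -> bool:
    """Simplified Solace-style wildcard matching (an illustration only)."""
    sub_levels = subscription.split("/")
    topic_levels = topic.split("/")
    for i, level in enumerate(sub_levels):
        if level == ">" and i == len(sub_levels) - 1:
            # A trailing '>' matches one or more remaining levels.
            return len(topic_levels) > i
        if i >= len(topic_levels):
            return False
        if level.endswith("*"):
            # '*' alone matches any single level; 'room4*' matches levels
            # that start with 'room4'.
            if not topic_levels[i].startswith(level[:-1]):
                return False
        elif level != topic_levels[i]:
            return False
    return len(sub_levels) == len(topic_levels)


topic = chat_topic("acme", "room42", "message")
print(solace_topic_matches("chat/acme/>", topic))            # True: everything for one tenant
print(solace_topic_matches("chat/*/room4*/message", topic))  # True: rooms starting with "room4"
print(solace_topic_matches("chat/*/*/typing", topic))        # False: typing events only
```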
An event mesh, as you should know by now, isn’t a single event broker, nor a cluster of brokers acting as one, but a network of event brokers (single nodes, pairs, or clusters) working together, much like IP routers work together to form a TCP/IP network.
In this alternative architecture, there are no “subscriber” applications moving events from the connectivity layer to the processing layer, because the PubSub+ brokers are linked to form an event mesh. Another application of an event mesh is distributing events across sites simply by leveraging topic routing. This is really useful for multi-site or even hybrid-cloud topologies, instead of relying on cumbersome and expensive cross-site replication.
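To illustrate the router analogy, here is a toy Python model of a mesh: brokers are linked like routers, each broker holds the subscriptions of its local clients, and an event published at one broker travels only toward brokers that have a matching subscription. This is a hypothetical sketch of the idea, not how PubSub+ brokers actually exchange subscriptions and route events; the broker names are made up, and matching is simplified to exact topics plus a trailing '>'.

```python
from collections import deque
from typing import Optional


def matches(subscription: str, topic: str) -> bool:
    # Simplified matching: an exact topic, or a trailing '>' that matches
    # one or more remaining levels (e.g. "chat/>" matches "chat/acme/room42").
    if subscription.endswith("/>"):
        prefix = subscription[:-1]  # keeps the trailing '/'
        return topic.startswith(prefix) and len(topic) > len(prefix)
    return subscription == topic


class EventMesh:
    """Toy model of linked brokers (hypothetical, not the PubSub+ routing protocol)."""

    def __init__(self) -> None:
        self.links: dict[str, set[str]] = {}       # broker -> neighboring brokers
        self.local_subs: dict[str, set[str]] = {}  # broker -> its clients' subscriptions

    def add_broker(self, name: str) -> None:
        self.links.setdefault(name, set())
        self.local_subs.setdefault(name, set())

    def link(self, a: str, b: str) -> None:
        self.add_broker(a)
        self.add_broker(b)
        self.links[a].add(b)
        self.links[b].add(a)

    def subscribe(self, broker: str, subscription: str) -> None:
        # A real mesh would advertise this subscription to the other brokers;
        # in this model every broker's subscriptions are simply visible.
        self.add_broker(broker)
        self.local_subs[broker].add(subscription)

    def route(self, ingress: str, topic: str) -> dict[str, list[str]]:
        """Map each broker with a matching subscription to its hop path from ingress."""
        # Breadth-first search from the ingress broker gives shortest hop paths.
        parent: dict[str, Optional[str]] = {ingress: None}
        queue = deque([ingress])
        while queue:
            node = queue.popleft()
            for neighbor in sorted(self.links.get(node, set())):
                if neighbor not in parent:
                    parent[neighbor] = node
                    queue.append(neighbor)
        # Only brokers with interested subscribers receive the event.
        routes: dict[str, list[str]] = {}
        for broker in parent:
            if any(matches(s, topic) for s in self.local_subs.get(broker, set())):
                path: list[str] = []
                hop: Optional[str] = broker
                while hop is not None:
                    path.append(hop)
                    hop = parent[hop]
                routes[broker] = path[::-1]
        return routes


mesh = EventMesh()
mesh.link("edge-jakarta", "core")    # connectivity-layer broker
mesh.link("edge-singapore", "core")  # connectivity-layer broker
mesh.link("core", "analytics")       # processing-layer brokers
mesh.subscribe("core", "chat/>")
mesh.subscribe("analytics", "chat/acme/>")
mesh.subscribe("edge-singapore", "presence/>")
print(mesh.route("edge-jakarta", "chat/acme/room42/message"))
# {'core': ['edge-jakarta', 'core'], 'analytics': ['edge-jakarta', 'core', 'analytics']}
```

No application sits between the layers: the event reaches the processing side purely because brokers there hold matching subscriptions, and edge-singapore never sees it.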
The Internet of Things (IoT) is closely associated with the MQTT protocol, but MQTT is used for more than just IoT: Qiscus and many other mobile applications use it, as do real-time dashboards.
An event mesh has a few more tricks up its sleeve for IoT solutions. One is tiering or layering brokers into a tree topology that can support larger connection counts and greater geographic distribution. Another is bidirectional communication, i.e., not just streaming events from devices to back-end servers, but also sending events out to devices or mobile apps as part of command-and-control interactions.
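The tiering idea is mostly arithmetic: edge brokers terminate the client connections, and each upper tier aggregates a handful of brokers from the tier below until a single root remains. The Python sketch below estimates brokers per tier. The per-broker connection limit and the fan-in used in the example are hypothetical placeholders, not Solace sizing figures, so plug in numbers from your own capacity planning.

```python
import math


def brokers_per_tier(connections: int, clients_per_edge_broker: int, fan_in: int) -> list[int]:
    """Estimate how many brokers each tier of a tree topology needs.

    Hypothetical assumptions: every edge broker holds at most
    clients_per_edge_broker client connections, and each upstream broker
    aggregates at most fan_in brokers from the tier below.
    """
    if clients_per_edge_broker < 1 or fan_in < 2:
        raise ValueError("need at least 1 client per broker and a fan-in of 2 or more")
    tiers = [math.ceil(connections / clients_per_edge_broker)]
    while tiers[-1] > 1:
        # Each upper tier only has to hold links to the brokers below it.
        tiers.append(math.ceil(tiers[-1] / fan_in))
    return tiers  # edge tier first, root tier last


print(brokers_per_tier(10_000_000, 100_000, 10))  # [100, 10, 1]
```

With those made-up numbers, the 10 million connections from the Qiscus example spread across 100 edge brokers, while the root broker only holds 10 broker-to-broker links.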
Our CTO Shawn McAllister produced a fun demo of those capabilities, and I hope you’ll watch what I playfully like to call “The Architects’ Guide to Honking the Horn”.
While I didn’t run a performance test for this architecture, there is a public report on Solace PubSub+ performance numbers for several combinations of deployment types, payload sizes, delivery modes, and client connections.
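For context when reading such a report, the Qiscus figures from the start of this post work out as follows, assuming (unrealistically) that traffic is spread evenly over time and across connections:

```latex
\[
\frac{5\,000\,000\ \text{messages/min}}{60\ \text{s/min}} \approx 83\,333\ \text{messages/s}
\]
\[
\frac{5\,000\,000\ \text{messages/min}}{10\,000\,000\ \text{connections}} = 0.5\ \text{messages per connection per minute}
\]
```

On average each connection is quiet, which suggests that holding millions of mostly idle connections is at least as much of the challenge as raw message rate. Real traffic is bursty, so peak rates will sit well above these averages.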
To recap, check out this summary table for the features and requirements discussed in this post.
What I want to highlight with this alternative architecture is simplicity. Yes, simplicity. Fewer moving parts is better. Leave event distribution and delivery to the infrastructure, and support the many open standard protocols and APIs to avoid rewrites.
I hope this architecture gives you some ideas for your own architecture and set of challenges. Get in touch to exchange ideas or just for a quick chat on LinkedIn, or post your questions in the Solace Developer Community. And lastly, head over to Solace.Dev if you are a developer and want to get hands-on with an event mesh!
The post A Simpler Architecture for Handling High Connection Counts and Throughput appeared first on Solace.