Michael Ade-Kunle for Ably

Posted on Sep 7, 2020

Publish-Subscribe: Introduction to Scalable Messaging

#webdev #beginners #pubsub #mqtt

The publish-subscribe (or pub/sub) messaging pattern is a design pattern for exchanging messages between senders (publishers) and receivers (subscribers). Subscribers receive only the messages on topics they subscribe to, which allows for loose coupling and scaling between the two sides.

Messages are sent (pushed) from a publisher to subscribers as they become available. The host (publisher) publishes messages (events) to channels (topics). Subscribers can sign up for the topics they are interested in.

This is different from the standard request/response (pull) model, in which clients must repeatedly check whether new data has become available. This makes the pub/sub method well suited to streaming data in real-time.

It also means that dynamic networks can be built at internet scale. However, building a messaging infrastructure at such a scale can be problematic.

This introduction to the pub/sub messaging pattern describes what it is and why developers use it, and discusses the difficulties that must be overcome when building a messaging system at scale.

The Ably realtime platform uses the publish-subscribe pattern at internet scale for delivering messages in real-time.

What Is Pub/Sub? Loose Coupling and Scaling

In the pub/sub messaging pattern, publishers do not send messages directly to subscribers; instead, messages are sent via brokers. Publishers do not know who the subscribers are or to which (if any) topics they subscribe. This means publishers and subscribers can operate independently of each other. This is known as loose coupling, and it removes service dependencies that would otherwise exist in traditional messaging patterns.
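As a rough illustration of the broker role described above, here is a minimal in-memory sketch in Python. The `Broker` class, its method names, and the topic names are hypothetical and chosen only for this example; a production broker would also deal with networking, persistence, and failures.

```python
from collections import defaultdict
from typing import Any, Callable, DefaultDict, List

Handler = Callable[[Any], None]


class Broker:
    """Hypothetical in-memory broker: routes each message to a topic's subscribers."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, message: Any) -> int:
        """Deliver the message to every handler on the topic; return the handler count."""
        handlers = list(self._subscribers.get(topic, []))
        for handler in handlers:
            handler(message)
        return len(handlers)


broker = Broker()
scores: List[str] = []
broker.subscribe("tennis/scores", scores.append)

# The publisher only names a topic; it never references a subscriber.
delivered = broker.publish("tennis/scores", "15-0")
# Publishing to a topic nobody subscribed to is not an error.
ignored = broker.publish("football/scores", "1-0")
```

The publisher's code is the same whether a topic has zero, one, or many subscribers, which is the loose coupling the pattern aims for.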

Pub/sub differs from the standard request/response model, in which clients must repeatedly poll (pull) to check whether new data is available. Because messages are pushed as soon as they are published, pub/sub is central to effective streaming of data in real-time.

The pub/sub pattern allows extremely dynamic networks to be built at scale without overloading the publishing components or causing unnecessary costs. However, there are difficulties associated with scaling and different ways of getting around these difficulties that need consideration.

Typical uses of the pub/sub pattern include event messaging, instant messaging, and data streaming (such as live-streaming sporting events). Pub/sub is also used for workload balancing and with asynchronous workflows.

Communication infrastructure for a pub/sub system (Diagram adapted from msn).

A Background to Messaging Systems and Pub/Sub

A simple information system can follow a simple pattern: input–processing–output. At a reasonable scale, the system will need multiple input and output modules for handling concurrent requests. A problem then arises of routing messages from input modules to their respective output modules.

To solve this routing problem, the input and output modules need an addressing mechanism. It is the processing module’s job to route each message to the correct recipient based on its address.

At internet scale, the system will handle thousands or even tens of thousands of concurrent connections. It also needs to handle a high volume of messages and a global geographical spread of users.

At such a massive scale, the system needs to solve the following problems:

Because of the high volume and geographical spread, the load needs to be distributed between multiple processing modules.
Predefined addressing between the modules becomes a huge overhead.

In short, the problems come down to minimizing the shared knowledge of addresses. Pub/sub solves the problems by using a data pipe through which modules can post and retrieve their messages.

The modules do not need to maintain shared knowledge of the whereabouts of other modules. The input modules only accept user input, processing modules only process the data, and the output modules only display the output.

In pub/sub, there is one channel for posting messages and one for retrieving. It happens in steps like this:

The input module will gather the user input and post the message in the preprocessing channel.
The processing module will pick up each message from this channel, process it, and post it to the post-processing channel.
Lastly, the output module will collect the message from the post-processing channel and display it on the users’ screens.
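The three steps above can be sketched with two in-process queues standing in for the channels. This is only an illustration: the function names are hypothetical, the "processing" is a placeholder string transformation, and a real system would run each module as a separate service connected through a broker.

```python
from queue import Queue
from typing import List

# Two channels, named after the steps above.
preprocessing: "Queue[str]" = Queue()
postprocessing: "Queue[str]" = Queue()


def input_module(user_input: str) -> None:
    """Post raw user input to the preprocessing channel."""
    preprocessing.put(user_input)


def processing_module() -> None:
    """Process every waiting message and post it to the post-processing channel."""
    while not preprocessing.empty():
        message = preprocessing.get()
        postprocessing.put(message.strip().upper())  # stand-in for real processing


def output_module() -> List[str]:
    """Collect processed messages, ready to be displayed."""
    shown = []
    while not postprocessing.empty():
        shown.append(postprocessing.get())
    return shown


input_module("  hello ")
input_module("world")
processing_module()
displayed = output_module()
```

None of the three modules holds a reference to another module; each one only knows the channel it reads from or writes to.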

The same pattern works at any scale.

In pub/sub messaging, pre- and post-processing channels are used to address routing problems at internet scale.

Why Developers Use Pub/Sub

Consider a logistics company. It would typically have a mix of customer data and generic data, and a highly variable customer load. The data channels between the customers, the drivers, and the delivery office may also be unreliable. It is important that subscribers receive all of the messages customers are sending, but they do not need to know who the customers are or how many there are.

It is also important that the company does not over-provision its service (which would be costly) or over-provision load balancing (which would add complexity and hurt the performance of the network).

It is important to remember that the pub/sub pattern is suited to conveying information whose relevance fades fast. (What is the score now? And now?) As information is frequently replaced, there is no pressing need to store it. Usually, it is enough to keep the most recent message, or enough information to recreate a view of fairly recent events.
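To make the idea of keeping only the most recent message concrete, here is a small sketch of a broker that retains the latest value per topic and replays it to late subscribers. The class name `LastValueBroker` and the topic name are assumptions made for this example, not part of any particular product.

```python
from typing import Any, Callable, Dict, List

Handler = Callable[[Any], None]


class LastValueBroker:
    """Hypothetical broker that retains only the newest message per topic."""

    def __init__(self) -> None:
        self._latest: Dict[str, Any] = {}
        self._subscribers: Dict[str, List[Handler]] = {}

    def publish(self, topic: str, message: Any) -> None:
        self._latest[topic] = message  # overwrite: older values are discarded
        for handler in self._subscribers.get(topic, []):
            handler(message)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._subscribers.setdefault(topic, []).append(handler)
        if topic in self._latest:
            handler(self._latest[topic])  # replay the retained message to a late joiner


broker = LastValueBroker()
broker.publish("match/score", "1-0")
broker.publish("match/score", "2-0")

seen: List[str] = []
broker.subscribe("match/score", seen.append)  # joins after both publishes
broker.publish("match/score", "2-1")
```

The late subscriber sees the current score immediately and then live updates, but never the superseded "1-0".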

Developers use pub/sub to take advantage of edge computing and the network backbone:

Edge computing allows you to scale the system at the edge. This is where scaling is easier to implement and also where it is most cost-effective.
Using the network backbone and multiple points of presence means message delivery can be much faster and more reliable.

How Pub/Sub Is Adopted in the Real World

Event messaging: pub/sub powers many realtime interactions across domains like EdTech, B2B platforms, and delivery logistics. As we shop online more frequently for a wider variety of goods, package delivery has become commonplace. Logistics companies need to use delivery resources more efficiently. To optimize delivery, dispatching systems need up-to-date information on where their drivers are. Pub/sub event messaging helps logistics companies do this.

Dispatchers need to access drivers’ location information on demand, ideally continually. Having this data at the ready allows them to better predict arrival times and improve routing solutions. Dispatching systems also send out information such as cancellations, traffic information, and new package pickups.

As the day goes on, this information becomes more critical. It gets harder to maintain delivery time windows, and schedule adjustments must be made to maximize the number of on-time deliveries.

This is a lot of data, and not all of it is relevant at any given time. To get around this problem, devices need to be able to subscribe to updates that matter to them. With a pattern like pub/sub, all parties only subscribe to whatever is relevant to them:

Driver devices can subscribe to traffic and route information.
Dispatching and ERP systems can subscribe to the completed delivery updates.
Tracking and dispatching systems can get live position updates when they need them.
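One way to picture these subscriptions is with topic patterns, so that a tracking system can follow every driver without listing each one. The sketch below assumes a hypothetical topic naming scheme (`traffic/...`, `drivers/<id>/position`, `deliveries/completed`) and uses shell-style wildcards from Python's standard `fnmatch` module; real brokers define their own wildcard rules.

```python
from fnmatch import fnmatchcase
from typing import Any, Callable, List, Tuple

Handler = Callable[[str, Any], None]


class PatternBroker:
    """Hypothetical broker that matches topics against subscriber patterns."""

    def __init__(self) -> None:
        self._subscriptions: List[Tuple[str, Handler]] = []

    def subscribe(self, pattern: str, handler: Handler) -> None:
        self._subscriptions.append((pattern, handler))

    def publish(self, topic: str, message: Any) -> None:
        for pattern, handler in self._subscriptions:
            if fnmatchcase(topic, pattern):  # "*" matches any run of characters
                handler(topic, message)


broker = PatternBroker()
driver_inbox: List[str] = []
erp_inbox: List[str] = []
tracking_inbox: List[str] = []

broker.subscribe("traffic/*", lambda topic, msg: driver_inbox.append(topic))
broker.subscribe("deliveries/completed", lambda topic, msg: erp_inbox.append(topic))
broker.subscribe("drivers/*/position", lambda topic, msg: tracking_inbox.append(topic))

broker.publish("traffic/north", "road closed")
broker.publish("drivers/17/position", (51.5, -0.12))
broker.publish("deliveries/completed", {"package": "A1"})
```

Each party receives only the topics it asked for, even though all three messages pass through the same broker.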

These systems enable customers to track deliveries in real-time and, for example, to reschedule a package in transit. They also make it possible to alert drivers to pickups to be made en route, which allows for more effective routing, reduces fuel costs, and improves efficiency.

Other use-case examples include:

Instant messaging: Services that provide near-instantaneous interaction, for example, a notification that the person you’re conversing with is typing.
Data streaming: Applications can provide data instantly to clients for processing, saving or live preview. For example, providing the latest match scores in a tennis tournament and making sure they are available to a new website visitor the moment the page loads. See the Tennis Australia case study.
Workload balancing: Knowing capacity and location of parts of a system allows for better utilization of effort. This includes, for example, allowing logistics dispatchers to use partly empty delivery vehicles for pickup and on-demand delivery.
Asynchronous workflows: As an example, think of factory machines and power, water, and other utility sensors updating central control systems live. Improving the efficiency of the supply chain allows for just-in-time manufacturing, and capacity control.

Pub/sub code examples

Here are two examples of pub/sub applications with code snippets.

Faye

Faye is an open source system used by Aha! Roadmap software and Shopify. It is based on pub/sub messaging. The following code sample shows how to start a server, create a client, and send messages:

Ably Realtime Chat App

Here is an example of how you might add pub/sub functionality to a chat app using one of Ably’s Realtime SDKs.

When the app launches, the SDK initializes and subscribes to the topic that represents a public chat room.

Subsequently, when the user wants to send a chat message, the chat app publishes the message on the same topic.

The app unsubscribes from the channel when the user logs out or leaves the chat room.
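The subscribe, publish, and unsubscribe lifecycle described above can be sketched as follows. This is not the Ably SDK: `Channel` is a hypothetical in-memory stand-in written for this example, and the channel name `public-chat` is an assumption. A real client would connect to the service over the network and authenticate first.

```python
from typing import Callable, Dict, List

Listener = Callable[[str], None]


class Channel:
    """Hypothetical stand-in for a realtime channel; not the Ably SDK."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._listeners: Dict[int, Listener] = {}
        self._next_id = 0

    def subscribe(self, listener: Listener) -> int:
        """Register a listener and return an id that can be used to unsubscribe."""
        self._next_id += 1
        self._listeners[self._next_id] = listener
        return self._next_id

    def unsubscribe(self, subscription_id: int) -> None:
        self._listeners.pop(subscription_id, None)

    def publish(self, text: str) -> None:
        for listener in list(self._listeners.values()):
            listener(text)


room = Channel("public-chat")
alice_screen: List[str] = []

sub_id = room.subscribe(alice_screen.append)  # app launch: join the chat room
room.publish("hello everyone")                # user sends a chat message
room.unsubscribe(sub_id)                      # user logs out or leaves the room
room.publish("anyone still here?")            # no longer delivered to Alice
```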

What to Consider When Pub/Sub Is Deployed and Scaled

It is straightforward to implement a single-channel pub/sub messaging framework. But when you start to scale, the classic problems of distributed systems engineering emerge. When scaling to multiple channels and increasing complexity to any significant degree, the problems increase, and maintaining reliability becomes difficult.

The Problems of Building a Messaging System at Scale

Distributed messaging systems should ideally have the three elements of reliability, speed, and ordering. However, it’s usually the case that you only get to have two of them. To create a system that allows all three, you have to start at the design level with a watertight mathematical model. It is just about impossible to add in the missing third element later.

These are the problems to deal with:

Ordering of messages. As you start distributing messages over a large network, problems arise with reliably reconstructing the order in which the messages are meant to be delivered. To be reliably fast, you have to send messages using multiple routes in parallel, but you also have to be able to re-order and maintain their original sequence.
Queuing and auto-persistence of messages. For fault-tolerant, reliable messaging you must build in auto-persistence — otherwise reconstruction is impossible if a system goes down and the records vanish. If you don’t queue messages you can’t reliably reconstruct an order, or handle fluctuations in bandwidth.
Send exactly once. To send a message once and for it to be received only once at its required destination is a classic problem. If you don’t know who is receiving the message, it has to go everywhere. You have to have logic either in the network to stop it from arriving twice; or in the application to stop it from being processed twice. Otherwise, you might trigger an event twice with unintended consequences. For example, while making an online payment a user is disconnected and quickly reconnects. If exactly-once semantics are not supported, the user can end up getting charged a second time when they reconnect.
Distributed storage. Fault tolerance requires multiple points of redundancy, failover storage, storage in different physical locations, and auto-healing networks. True reliability requires redundant physical hardware along with multiple cloud instances. The trade-off with such redundancy is complexity vs. security and safety.
Load surge and slowdown. A very transient load has to be scaled dynamically, with quick scale-up and slower scale-down, to maintain a fair and available network for users.
Rate limitation. Fair workload balancing is complicated. When your system becomes complex you need to consider how to manage customer usage. You have to provision service capacity for different customers fairly, without imposing hard limits.
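The ordering and exactly-once items in the list above are often handled together on the receiving side. The sketch below assumes the publisher stamps each message with a sequence number starting at 1, which is an assumption of this example rather than something the pattern guarantees. The hypothetical `OrderedReceiver` drops duplicates and holds back messages until the gap before them is filled. It illustrates the application-level logic the list mentions and is not a description of how any specific platform implements it.

```python
from typing import Dict, List


class OrderedReceiver:
    """Hypothetical receiver: drops duplicates and releases messages in sequence order."""

    def __init__(self) -> None:
        self._next_seq = 1
        self._pending: Dict[int, str] = {}
        self.delivered: List[str] = []

    def receive(self, seq: int, payload: str) -> None:
        # A sequence number already delivered or already buffered is a duplicate.
        if seq < self._next_seq or seq in self._pending:
            return
        self._pending[seq] = payload
        # Release every message that is now contiguous with what was delivered.
        while self._next_seq in self._pending:
            self.delivered.append(self._pending.pop(self._next_seq))
            self._next_seq += 1


receiver = OrderedReceiver()
# Messages arrive out of order, and message 2 arrives twice (for example, after a reconnect).
arrivals = [(2, "charge card"), (1, "open order"), (2, "charge card"), (3, "send receipt")]
for seq, payload in arrivals:
    receiver.receive(seq, payload)
```

With this logic the card is charged once, and the application sees the messages in their original order even though the network delivered them differently.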

These are all problems of building a system at scale. Because you don’t necessarily know all the information you might need about your system at any given time, either the framework needs to be clever enough to handle it, or all the applications in your system need to be quite advanced.
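For the rate-limitation item in the list above, one common technique is a token bucket per customer: each customer may burst up to a capacity and then continues at a steady refill rate, so usage is shaped rather than cut off at a fixed quota. The class, the customer names, and the numbers below are hypothetical, and the clock is passed in explicitly to keep the sketch deterministic.

```python
from typing import Dict


class TokenBucket:
    """Hypothetical per-customer token bucket; time is supplied by the caller."""

    def __init__(self, capacity: float, refill_per_second: float) -> None:
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.tokens = capacity
        self.updated_at = 0.0

    def allow(self, now: float) -> bool:
        """Return True if one message may be sent at time `now` (in seconds)."""
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.updated_at = max(self.updated_at, now)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


buckets: Dict[str, TokenBucket] = {
    "customer-a": TokenBucket(capacity=2, refill_per_second=1),
    "customer-b": TokenBucket(capacity=2, refill_per_second=1),
}

# Customer A bursts three messages at once; only the first two fit in the bucket.
burst = [buckets["customer-a"].allow(now=0.0) for _ in range(3)]
# Customer B is unaffected by customer A's burst.
other = buckets["customer-b"].allow(now=0.0)
# One second later, customer A has earned one more token.
later = buckets["customer-a"].allow(now=1.0)
```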

Ably balances the above concerns through judicious use of the TCP layer. By generating multiple paths, we gain reliability without sacrificing speed: we can do fast pathing because we control the path we follow. Also, because of the way the network is set up, we can maintain ordering, which is often lost in the trade-off with speed of delivery.

This is baked in at the design stage, because the problems that arise when building in a global framework are almost impossible to correct at a later stage.

SaaS or Self-Deploy?

You can either build a pub/sub messaging infrastructure yourself (self-deploy) or adopt a cloud-native Software-as-a-Service (SaaS) infrastructure, such as Ably.

Solving the design considerations of building a globally scaling system is far from easy for reasons described in the previous section. Building your own messaging system requires budgeting for more design upfront.

If choosing to self-deploy, there are also considerations such as infrastructure setup, installing, and framework configuration. Doing these yourself gives you oversight of building the features you want in your system, but is also time-consuming and expensive.

The advantages of “as-a-service” pub/sub infrastructure over self-deployment are:

Reduced development time. A managed pub/sub service isolates application development from the messaging infrastructure.
Managed infrastructure is preconfigured. System tuning, security and design considerations are costly and time-consuming.
Programming options. Managed services support popular programming languages and frameworks. On the other hand, message broker frameworks support only a few languages. Building and maintaining SDKs for your own message broker is a diversion of development effort and time.
Skills. Hiring distributed systems engineers is difficult. If putting together a systems engineering team becomes part of your core infrastructure, you then have to maintain their skill set.
Cost. Most SaaS business models offer controllable levels of expenditure. You pay according to your needs and usage. Although it might seem cheaper to self-deploy, this hides the amount of investment required to build, run, and maintain the software. Your cloud bills are not the only expense.

Publish-Subscribe at Ably

Ably is an enterprise-ready pub/sub messaging platform. We make it easy to efficiently design, quickly ship, and seamlessly scale critical realtime functionality delivered directly to end-users. Every day we deliver billions of realtime messages to millions of users for thousands of companies.

We power the apps that people, organizations, and enterprises depend on every day, like Lightspeed Systems’ realtime device management platform for over seven million school-owned devices, Vitac’s live captioning for hundreds of millions of multilingual viewers for events like the Olympic Games, and Split’s realtime feature flagging for one trillion feature flags per month.

We’re the only pub/sub platform with a suite of baked-in services to build complete realtime functionality: presence shows a driver’s live GPS location on a home-delivery app, history instantly loads the most recent score when opening a sports app, stream resume automatically handles reconnection when swapping networks, and our integrations extend Ably into third-party clouds and systems like AWS Kinesis and RabbitMQ. With 25+ SDKs we target every major platform across web, mobile, and IoT.

Our platform is mathematically modeled around Four Pillars of Dependability so we’re able to ensure messages don’t get lost while still being delivered at low latency over a secure, reliable, and highly available global edge network.

Developers from startups to industrial giants choose to build on Ably because it simplifies engineering, minimizes DevOps overhead, and increases development velocity.