Queues are necessary for distributed web architectures. They provide the glue for inter-process communication while letting each service operate independently.
SaaS applications generally have different tenants or accounts containing data that a collection of users can see and share. Ideally, each tenant is fully isolated from every other tenant. In reality, there are different levels of isolation, each with its own tradeoffs. In this article, I will specifically cover logical isolation, not physical isolation, and focus on how to achieve that with Redis queues.
So why should we care about isolation in the context of SaaS applications? Most simply put: it's what the user expects. Users expect small operations to finish quickly regardless of what's going on in the system. We don't want large tenants hoarding system resources, leaving small tenants waiting longer for their data to be processed. This degradation of user experience is especially amplified whenever there's slowness or throttling downstream of the application, such as in the database.
So how do we support isolation with message queues? The simplest solution is to give each tenant its own queue. As messages are read from each queue, we want to balance the reading across all tenants equally. Another way to put this: the messages are read in a round-robin fashion. Some basic controller logic needs to keep track of the order in which tenants were previously serviced, as in the naive sketch below.
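Here's a minimal sketch of that naive approach in Python with redis-py, assuming a fixed tenant list known up front and hypothetical key names like `queue:<tenant>` (producers would LPUSH to those keys). The design described later in this article removes the fixed-list assumption.

```python
import time

import redis

r = redis.Redis()
TENANTS = ["tenant-a", "tenant-b", "tenant-c"]  # fixed list, for this sketch only

def consume_round_robin() -> None:
    while True:
        idle = True
        for tenant in TENANTS:
            # RPOP returns None when this tenant's queue is empty,
            # so an idle tenant simply loses its turn this cycle.
            message = r.rpop(f"queue:{tenant}")
            if message is not None:
                idle = False
                handle(message)
        if idle:
            time.sleep(0.1)  # avoid busy-waiting when every queue is drained

def handle(message: bytes) -> None:
    ...  # application-specific processing
```

The obvious weakness: the consumer has to know every tenant ahead of time and wastes cycles polling empty queues, which is exactly what the controller design below fixes.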
Because queues in Redis are just lists, they're very cheap and easy to create. There are some important assurances you don't get with simple list-based queues; however, in many cases, simple error detection and retry logic can overcome most concerns. Note: if at-least-once guarantees are required, Redis Streams can be a better fit. If you don't need those guarantees, list-based queues are still very effective and reliable.
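As an illustration of what that retry logic might look like, here's a hedged sketch; the dead-letter key name and attempt count are my own choices, not part of the design.

```python
import redis

r = redis.Redis()

def push(queue: str, payload: bytes) -> None:
    r.lpush(queue, payload)

def pop_and_process(queue: str, max_attempts: int = 3) -> None:
    # BRPOP blocks up to 5 seconds waiting for a message (FIFO with LPUSH).
    item = r.brpop(queue, timeout=5)
    if item is None:
        return
    _, payload = item
    for _ in range(max_attempts):
        try:
            process(payload)
            return
        except Exception:
            continue  # simple retry on any processing error
    # After repeated failures, park the message for inspection
    # rather than dropping it silently.
    r.lpush(f"{queue}:dead", payload)

def process(payload: bytes) -> None:
    ...  # application-specific processing
```

Note the gap this leaves: a consumer crash after BRPOP but before processing loses the message. That's exactly the case where Redis Streams' at-least-once semantics earn their keep.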
Other advantages of using Redis queues are:
- No real restrictions on message size. Most queue technologies restrict messages to a few MB or less. Redis message payloads, by contrast, can contain rich data rather than just an id or reference to a document stored in something like S3. This also reduces I/O calls to another data store.
- Redis queues can be created dynamically and in large numbers - prerequisites for the design below.
- Redis is probably already in your stack. It's widely used, and your org likely already has familiarity with it.
- Redis queues are fast. Benchmarks show Redis handling millions of messages per second. I've never needed that kind of throughput, since any task that performs external network calls becomes the dominant factor in overall latency. Performance can also be tuned by batching multiple messages into one larger message, as sketched after this list.
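A minimal sketch of that batching idea, assuming JSON-serializable messages (the encoding choice is arbitrary):

```python
import json

import redis

r = redis.Redis()

def push_batch(queue: str, messages: list[dict]) -> None:
    # One LPUSH carries many logical messages, amortizing per-message overhead.
    r.lpush(queue, json.dumps(messages))

def pop_batch(queue: str) -> list[dict]:
    payload = r.rpop(queue)
    return json.loads(payload) if payload else []
```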
The diagram above shows the controller logic and how data flows through the queue. The producer pushes a message to the data queue that corresponds to the requested partitionId. In the diagram, there are three partitions: A, B and C - just three unique string identifiers, which could also represent three different tenant ids. The producer checks whether the provided partitionId is in the Active Partition Set (APS). If not, the producer adds the partitionId to the APS and also writes it to the Round Robin Queue (RRQ). All of the producer logic is executed as a Lua script in Redis, which guarantees atomic behavior while evaluating and updating multiple keys.
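Here's a hedged sketch of that producer step, using redis-py's `register_script` to run the Lua atomically. The key names (`aps`, `rrq`, `dq:<partitionId>`) and modeling the APS as a Redis set are my assumptions; the article specifies only the behavior.

```python
import redis

r = redis.Redis()

PRODUCER_LUA = """
-- KEYS[1] = data queue for this partition
-- KEYS[2] = Active Partition Set (APS)
-- KEYS[3] = Round Robin Queue (RRQ)
-- ARGV[1] = partitionId, ARGV[2] = message payload
redis.call('LPUSH', KEYS[1], ARGV[2])
-- SADD returns 1 only when the partition was not already active,
-- so each partition enters the RRQ at most once.
if redis.call('SADD', KEYS[2], ARGV[1]) == 1 then
  redis.call('LPUSH', KEYS[3], ARGV[1])
end
"""

produce = r.register_script(PRODUCER_LUA)

def push_message(partition_id: str, payload: bytes) -> None:
    produce(keys=[f"dq:{partition_id}", "aps", "rrq"],
            args=[partition_id, payload])
```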
The consumer has its own logic. It first reads from the RRQ to determine which partition to read from. The consumer then reads the message from the respective data queue and inserts the partitionId at the back of the RRQ. If instead there's no data in the data queue, the consumer removes the partitionId from the APS and does NOT re-insert it into the RRQ. Essentially, the consumer keeps re-adding the partitionId to the RRQ until nothing is read from the respective data queue. That partition becomes inactive and can only become active again when the producer re-adds the partitionId to the APS and RRQ.
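A matching sketch of the consumer step, under the same key-name assumptions. Because the whole script runs atomically, there's no race between draining a queue and a producer reactivating the partition. The dynamically built key (`ARGV[1] .. partition`) only works when all the controller keys live on one Redis node, which this design already requires.

```python
import redis

r = redis.Redis()

CONSUMER_LUA = """
-- KEYS[1] = Active Partition Set (APS)
-- KEYS[2] = Round Robin Queue (RRQ)
-- ARGV[1] = data queue key prefix, e.g. 'dq:'
local partition = redis.call('RPOP', KEYS[2])
if not partition then
  return nil  -- no active partitions right now
end
local message = redis.call('RPOP', ARGV[1] .. partition)
if message then
  -- Partition may still have data: rejoin the round-robin rotation.
  redis.call('LPUSH', KEYS[2], partition)
else
  -- Queue drained: deactivate until a producer re-adds the partition.
  redis.call('SREM', KEYS[1], partition)
end
return message
"""

consume = r.register_script(CONSUMER_LUA)

def poll_once() -> bytes | None:
    # Returns one message, or None when nothing is ready to be read.
    return consume(keys=["aps", "rrq"], args=["dq:"])
```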
This queue design could also support user-to-user isolation. The main concern there is the practical ceiling on how fast the producer logic can run. The keys involved in the controller logic also can't be scaled horizontally, because they must live on the same Redis node. Overall queue latency is affected by message sizes as well. If you actually reach these limits, kudos to you, but I doubt most people will. I estimate the controller logic takes roughly 1 ms to execute; ignoring network latency for now, that caps throughput at about 1,000 messages per second. If you really need to overcome that threshold, you can create more logical queues and run them in parallel, maybe even in a different availability zone, region or datacenter - one way to shard is sketched below. In addition, batching data in each message can also greatly improve throughput.
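For completeness, here's one hedged way to shard the controller keys into N independent sets so producers and consumers can run in parallel; the shard count and hash function are arbitrary choices on my part.

```python
import zlib

NUM_SHARDS = 4  # arbitrary; one controller (APS + RRQ) per shard

def shard_keys(partition_id: str) -> tuple[str, str]:
    # A stable hash keeps each partition pinned to one shard,
    # preserving per-partition FIFO ordering.
    shard = zlib.crc32(partition_id.encode()) % NUM_SHARDS
    return f"aps:{shard}", f"rrq:{shard}"
```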