<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Gaurav Bansal</title>
    <description>The latest articles on DEV Community by Gaurav Bansal (@gaurav_bansal_2e1ca573623).</description>
    <link>https://dev.to/gaurav_bansal_2e1ca573623</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2872330%2F299b5089-749d-4f7f-b67f-86895537391b.JPG</url>
      <title>DEV Community: Gaurav Bansal</title>
      <link>https://dev.to/gaurav_bansal_2e1ca573623</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/gaurav_bansal_2e1ca573623"/>
    <language>en</language>
    <item>
      <title>Messaging Queues in Distributed Systems: Design, Challenges, and Innovations</title>
      <dc:creator>Gaurav Bansal</dc:creator>
      <pubDate>Wed, 11 Jun 2025 21:31:00 +0000</pubDate>
      <link>https://dev.to/gaurav_bansal_2e1ca573623/messaging-queues-in-distributed-systems-design-challenges-and-innovations-2i58</link>
      <guid>https://dev.to/gaurav_bansal_2e1ca573623/messaging-queues-in-distributed-systems-design-challenges-and-innovations-2i58</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;In modern distributed architectures, services often need to communicate asynchronously to ensure decoupling, scalability, and fault tolerance. A distributed messaging queue system enables reliable communication between producers and consumers without requiring direct synchronous interaction. This document explores key components, architecture, design considerations, and future trends of distributed messaging queues.&lt;/p&gt;

&lt;h2&gt;
  
  
  System Overview
&lt;/h2&gt;

&lt;p&gt;A messaging queue system[1][3] enables asynchronous communication between a producer, which sends messages, and a consumer, which processes them. Instead of directly calling the consumer, the producer sends messages to a queue, ensuring reliability and scalability. This system supports two primary message delivery models:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Queue Model&lt;/strong&gt;: A message is delivered to exactly one consumer.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Topic Model&lt;/strong&gt;: A message is delivered to all subscribers.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The system must be designed for high availability, fault tolerance, scalability, and performance while ensuring data durability and integrity.&lt;/p&gt;
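&lt;p&gt;The two models can be sketched in a few lines of Python (class names like &lt;code&gt;QueueBroker&lt;/code&gt; and &lt;code&gt;TopicBroker&lt;/code&gt; are illustrative, not from any particular product):&lt;/p&gt;

```python
from collections import deque

class QueueBroker:
    """Queue model: each message goes to exactly one consumer."""
    def __init__(self):
        self.queue = deque()
        self.consumers = []   # callables, one per consumer
        self._next = 0        # round-robin cursor

    def publish(self, message):
        self.queue.append(message)

    def deliver(self):
        # Hand each pending message to a single consumer, round-robin.
        while self.queue:
            message = self.queue.popleft()
            consumer = self.consumers[self._next % len(self.consumers)]
            consumer(message)
            self._next += 1

class TopicBroker:
    """Topic model: each message is fanned out to every subscriber."""
    def __init__(self):
        self.subscribers = []

    def publish(self, message):
        for subscriber in self.subscribers:
            subscriber(message)
```

&lt;p&gt;With two consumers registered, &lt;code&gt;QueueBroker&lt;/code&gt; hands each message to exactly one of them, while &lt;code&gt;TopicBroker&lt;/code&gt; fans every message out to all subscribers.&lt;/p&gt;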

&lt;h2&gt;
  
  
  System Architecture
&lt;/h2&gt;

&lt;p&gt;The system architecture of a distributed messaging queue must ensure reliability, scalability, and fault tolerance. To achieve this, several key components work together seamlessly:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Load Balancer (LB)&lt;/strong&gt;: Distributes incoming requests across multiple frontend nodes to ensure even load distribution and redundancy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Frontend Web Service:&lt;/strong&gt; Handles request validation, authentication, caching, and rate limiting before forwarding requests to the backend.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Metadata Manager:&lt;/strong&gt; Stores queue-related information, including queue names, creation timestamps, and ownership details.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Backend Message Store:&lt;/strong&gt; Persists messages reliably and ensures efficient retrieval.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Replication Mechanism:&lt;/strong&gt; Ensures message durability by replicating data across multiple storage nodes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Failure Detection and Recovery:&lt;/strong&gt; Monitors system health and automatically recovers from failures to maintain high availability.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Approaches to Message Delivery
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Approach 1: Synchronous Communication (Direct API Calls)
&lt;/h3&gt;

&lt;p&gt;In synchronous communication, the producer directly calls the consumer and waits for a response. This approach is simple and offers low latency since there is no intermediary component. However, it has significant downsides, including difficulties in handling failures, the risk of overwhelming the consumer, and poor scalability. If the consumer service becomes slow or unavailable, the producer may experience delays or failures.&lt;/p&gt;

&lt;h3&gt;
  
  
  Approach 2: Asynchronous Communication (Message Queue)
&lt;/h3&gt;

&lt;p&gt;With an asynchronous message queue, the producer sends messages to a queue, and a consumer retrieves them at a later time. This approach improves fault tolerance, scalability, and load handling, making it suitable for high-traffic environments. However, it introduces additional complexity, as it requires infrastructure to manage the queue, ensure message persistence, and handle delayed processing.&lt;/p&gt;
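&lt;p&gt;The decoupling can be seen even in a single process, using Python’s standard &lt;code&gt;queue&lt;/code&gt; module as a stand-in for a real broker: the producer enqueues and moves on, and the consumer drains the queue at its own pace.&lt;/p&gt;

```python
import queue
import threading

def consumer(q, processed):
    # Drain the queue until the shutdown sentinel (None) arrives.
    while True:
        message = q.get()
        if message is None:
            break
        processed.append(message)
        q.task_done()

work_queue = queue.Queue()
processed = []
worker = threading.Thread(target=consumer, args=(work_queue, processed))
worker.start()

# The producer never waits on the consumer; it just enqueues.
for i in range(3):
    work_queue.put(f"msg-{i}")

work_queue.join()      # block until every enqueued message is processed
work_queue.put(None)   # tell the worker to shut down
worker.join()
```

&lt;p&gt;A real broker adds what this sketch lacks: persistence, replication, and delivery across process and machine boundaries.&lt;/p&gt;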

&lt;h3&gt;
  
  
  Approach 3: Hybrid Approach
&lt;/h3&gt;

&lt;p&gt;A hybrid approach combines synchronous and asynchronous messaging, allowing producers to send messages directly when real-time responses are needed while leveraging a queue for background tasks. This approach balances performance, reliability, and cost-effectiveness. However, it requires careful coordination to determine when to use each method and ensure seamless integration.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F07ghq9toxmna25fwcv03.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F07ghq9toxmna25fwcv03.png" alt="Image description" width="800" height="295"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Detailed Design
&lt;/h2&gt;

&lt;p&gt;The Load Balancer is responsible for routing client requests to available frontend nodes, ensuring even distribution of traffic. It supports failover by using primary and secondary nodes to maintain high availability. Additionally, VIP partitioning is used to scale the system by mapping multiple DNS records to different load balancers.&lt;br&gt;
The Frontend Web Service plays a critical role in handling client requests. It ensures request validation, checking that necessary parameters like queue names and message size limits are met. The service also manages authentication and authorization, verifying the sender’s identity and permissions. Caching is implemented to store frequently accessed metadata, improving response times. To prevent overload, the service employs rate limiting, which restricts excessive requests. Additionally, request deduplication is used to maintain exactly-once message processing, ensuring reliability.&lt;br&gt;
The Queue Metadata Manager is designed to store queue configurations and maintain a consistent mapping of queues to backend nodes. It optimizes performance through caching strategies and ensures efficient lookups with consistent hashing, reducing data retrieval overhead.&lt;br&gt;
The Backend Message Store serves as a durable, fault-tolerant storage system for messages. It supports data replication to maintain high availability and is optimized for high-throughput writes, ensuring efficient message retrieval. This component guarantees that messages are stored reliably and can be processed with minimal latency.&lt;/p&gt;
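&lt;p&gt;The consistent-hashing lookup mentioned above can be sketched as a toy hash ring (real systems typically add virtual nodes and replication on top of this idea):&lt;/p&gt;

```python
import bisect
import hashlib

def _hash(key):
    # Stable 64-bit position on the ring, derived from an MD5 digest.
    return int(hashlib.md5(key.encode()).hexdigest()[:16], 16)

class ConsistentHashRing:
    """Maps queue names to backend nodes; adding or removing a node
    only remaps the keys that fall on the affected arc of the ring."""
    def __init__(self, nodes=()):
        self._ring = sorted((_hash(n), n) for n in nodes)

    def add(self, node):
        bisect.insort(self._ring, (_hash(node), node))

    def lookup(self, queue_name):
        # Walk clockwise to the first node at or after the key's position.
        positions = [h for h, _ in self._ring]
        i = bisect.bisect(positions, _hash(queue_name)) % len(self._ring)
        return self._ring[i][1]
```

&lt;p&gt;Because only the keys on the affected arc move when a node joins or leaves, queue-to-node assignments stay mostly stable as the backend fleet scales.&lt;/p&gt;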

&lt;h2&gt;
  
  
  Handling Failures and Scalability
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz3ac8i1ouwd8n6jdyvvc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz3ac8i1ouwd8n6jdyvvc.png" alt="Image description" width="800" height="348"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Future Trends and Innovations
&lt;/h2&gt;

&lt;p&gt;The evolution of distributed messaging queues continues to be driven by advancements in cloud computing, AI-driven optimizations, and edge technologies. Below are some key trends that will shape the future of messaging systems:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Serverless Messaging Queues: Cloud providers offer serverless messaging solutions like AWS SQS[1] and Google Pub/Sub[2], reducing operational overhead while improving scalability and reliability. These managed solutions eliminate the need for infrastructure management, allowing developers to focus on application logic.&lt;/li&gt;
&lt;li&gt;AI and Predictive Analytics: Machine learning models are being increasingly used to analyze messaging patterns, predict peak loads, and optimize message routing dynamically. AI-driven insights[5] can help scale resources efficiently, reducing latency and improving system performance.&lt;/li&gt;
&lt;li&gt;Multi-Cloud and Hybrid Deployments: Organizations are increasingly adopting multi-cloud strategies, where messaging queues span multiple cloud providers. This approach enhances redundancy, prevents vendor lock-in, and optimizes cost efficiency[4].&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;A distributed messaging queue system is a fundamental building block of modern scalable architectures. By implementing a well-structured queueing mechanism, organizations can ensure fault tolerance, scalability, durability, and performance. Future advancements in AI-driven traffic management, serverless architecture, and event-driven messaging will further enhance the capabilities of distributed message queue systems.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;AWS Blogs.&lt;/strong&gt; (2023). AWS Simple Queue Service (SQS) Documentation &lt;a href="https://docs.aws.amazon.com/sqs/" rel="noopener noreferrer"&gt;https://docs.aws.amazon.com/sqs/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Google Cloud.&lt;/strong&gt; (2023). Pub/Sub Overview &lt;a href="https://cloud.google.com/pubsub" rel="noopener noreferrer"&gt;https://cloud.google.com/pubsub&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Microsoft Azure.&lt;/strong&gt; (2023). Service Bus Messaging &lt;a href="https://learn.microsoft.com/en-us/azure/service-bus-messaging/" rel="noopener noreferrer"&gt;https://learn.microsoft.com/en-us/azure/service-bus-messaging/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;IBM Topics.&lt;/strong&gt; (2023). Multi-Cloud Strategies for Enterprise &lt;a href="https://www.ibm.com/think/topics/multicloud" rel="noopener noreferrer"&gt;https://www.ibm.com/think/topics/multicloud&lt;/a&gt; &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Medium&lt;/strong&gt; (2025). AI-Driven Infrastructure Scaling &lt;a href="https://medium.com/doctor-ai/scaling-ai-infrastructure-lessons-from-building-high-performance-systems-65e9c062bac9" rel="noopener noreferrer"&gt;https://medium.com/doctor-ai/scaling-ai-infrastructure-lessons-from-building-high-performance-systems-65e9c062bac9&lt;/a&gt; &lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  About me
&lt;/h2&gt;

&lt;p&gt;I am a Senior Staff Software Engineer at Uber with over a decade of experience in scalable, high-performance distributed systems. I have worked on cloud-native architectures, database optimization, and building large-scale distributed systems.&lt;br&gt;
Connect with me on &lt;a href="https://www.linkedin.com/in/gaurav-bansal-8465363a/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Monitoring What Matters: Practical Alerting for Scalable Systems</title>
      <dc:creator>Gaurav Bansal</dc:creator>
      <pubDate>Fri, 04 Apr 2025 20:57:03 +0000</pubDate>
      <link>https://dev.to/gaurav_bansal_2e1ca573623/alert-strategy-best-practices-1b79</link>
      <guid>https://dev.to/gaurav_bansal_2e1ca573623/alert-strategy-best-practices-1b79</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;In modern distributed systems, performance isn't just about speed—it's about balancing latency, availability, and resource efficiency at scale. Effective alerting is essential for maintaining this balance. Without it, teams may miss real failures, overreact to false positives, or remain blind to slow degradation. This guide lays out a practical approach to designing alerts that matter—so you can catch what’s broken, ignore what’s not, and scale confidently. [1][2]&lt;/p&gt;

&lt;p&gt;Alerts depend on the type of service being monitored. Customer-facing services need different alerts than backend systems or batch jobs. But no matter the service, you should at least set alerts for these key areas.&lt;/p&gt;

&lt;h3&gt;
  
  
  Availability
&lt;/h3&gt;

&lt;p&gt;Availability just means that the system is ready to do its job. For services that take requests, it means the service is running and can handle any incoming requests. For backend processes, it means the system is either working right now or ready to start when there are tasks to handle.&lt;/p&gt;

&lt;h3&gt;
  
  
  Latency
&lt;/h3&gt;

&lt;p&gt;Latency is how long it takes for your system to do something. For a service, it’s the time it takes from when a request comes in to when the response is sent out. This is the total time for the request to travel to the service and the response to go back. However, we can only measure latency from the service’s side, so things like network delays or other issues outside the service aren’t counted.&lt;/p&gt;

&lt;p&gt;For backend processes, latency is how long it takes to finish a task—like processing a message from a queue, completing a step in a workflow, or finishing a batch job.&lt;/p&gt;

&lt;h3&gt;
  
  
  Compute metrics
&lt;/h3&gt;

&lt;p&gt;Compute metrics track things like CPU usage, memory usage, disk space, etc. These alerts help make sure individual servers aren’t causing problems that might get hidden in overall system metrics.&lt;/p&gt;

&lt;h3&gt;
  
  
  Call volume
&lt;/h3&gt;

&lt;p&gt;Call volume alerts help protect your service from hitting its limits and let you know when you need to scale up. These alerts should track the total number of requests your service can handle at its current size, including limits on your servers and any dependencies.&lt;/p&gt;

&lt;h2&gt;
  
  
  Deciding Severity
&lt;/h2&gt;

&lt;p&gt;Severity should reflect how critical the system is and how urgent the issue is to fix. If you create a high severity alert, it's a good idea to also create a low severity alert at a more sensitive level. This acts as an early alert to catch issues before they escalate.&lt;/p&gt;

&lt;p&gt;For services with very low customer impact, low severity alerts alone may be sufficient.&lt;/p&gt;
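&lt;p&gt;The pairing of a sensitive low severity alert with a stricter high severity one can be expressed as a small evaluator (the metric values and thresholds here are illustrative):&lt;/p&gt;

```python
def evaluate_severity(value, low_threshold, high_threshold):
    """Map a metric value to an alert severity.

    The low threshold fires earlier, acting as an early warning
    before the high severity threshold is breached.
    """
    if value >= high_threshold:
        return "high"
    if value >= low_threshold:
        return "low"
    return "ok"

# Example: page at 80% CPU, warn at 70%.
severity = evaluate_severity(75, low_threshold=70, high_threshold=80)
```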

&lt;h2&gt;
  
  
  Setting thresholds
&lt;/h2&gt;

&lt;p&gt;Threshold setting depends on the SLA of the service. If the service impacts a large customer base, the thresholds should be more sensitive. &lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fof6f1nspqvj8xdldy98b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fof6f1nspqvj8xdldy98b.png" alt="Image description" width="800" height="415"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Availability
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvj303qetxvx4406p2p2h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvj303qetxvx4406p2p2h.png" alt="Image description" width="800" height="372"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Latency
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh3fu61kc0pnjklgr5hjh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh3fu61kc0pnjklgr5hjh.png" alt="Image description" width="798" height="269"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;P50 (50th percentile) is the median latency, meaning half of the requests are processed faster, and half take longer.&lt;/p&gt;

&lt;p&gt;P90 (90th percentile) is the latency where 90% of the requests are faster, and 10% take longer.&lt;/p&gt;

&lt;p&gt;P99 (99th percentile) is the latency where 99% of the requests are faster, and only 1% take longer.&lt;/p&gt;

&lt;p&gt;P99.9 (99.9th percentile) is the latency where 99.9% of the requests are faster, and only 0.1% take longer.&lt;/p&gt;
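&lt;p&gt;These percentiles can be computed directly from recorded latency samples. A minimal sketch using the nearest-rank method (monitoring systems may interpolate instead, so exact values can differ slightly):&lt;/p&gt;

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at
    least p percent of the samples are at or below it."""
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[max(rank - 1, 0)]

latencies_ms = [12, 15, 18, 20, 22, 25, 30, 45, 80, 300]
p50 = percentile(latencies_ms, 50)   # 22
p90 = percentile(latencies_ms, 90)   # 80
p99 = percentile(latencies_ms, 99)   # 300
```

&lt;p&gt;Note how the single 300 ms outlier is invisible at P50 and P90 but dominates P99, which is why tail percentiles matter for alerting.&lt;/p&gt;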

&lt;p&gt;Threshold values should be set based on your SLA and the past performance of your application. The highest threshold should match your SLA or your client’s expectations. If your service is causing latency, it’s important that your system alerts you before your clients notice.&lt;/p&gt;

&lt;p&gt;To set the maximum threshold, check with your clients to understand their SLAs.&lt;/p&gt;

&lt;p&gt;For the minimum threshold, review your service’s performance over at least the last 45 days. Consider any recent or upcoming changes, such as increased traffic, a new API, or new dependencies.&lt;/p&gt;

&lt;h3&gt;
  
  
  Compute metrics
&lt;/h3&gt;

&lt;p&gt;Compute metric alerts should be set on CPU usage (e.g., alert when CPU usage exceeds 80%), and similarly on memory usage, disk space, and other host-level resources.&lt;/p&gt;

&lt;h3&gt;
  
  
  Call volume
&lt;/h3&gt;

&lt;p&gt;The call volume threshold should be set just below the point where the service starts to break or fail.&lt;/p&gt;

&lt;h2&gt;
  
  
  Notification channels
&lt;/h2&gt;

&lt;p&gt;To notify the on-call engineer so they can take immediate action, various notification channels can be integrated with the alerting framework.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwgl0fu2w819ne4i8yxr0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwgl0fu2w819ne4i8yxr0.png" alt="Image description" width="800" height="430"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Good alerts help teams catch problems early and fix them before they impact users. It’s important to set alerts for key areas like availability, latency, compute usage, and call volume. Each service is different, so alerts should match the type of service and how critical it is.&lt;/p&gt;

&lt;p&gt;Setting the right severity and thresholds helps reduce noise and focus attention on real issues. Using the right notification channels makes sure the right people get alerted in time. With clear alerts and smart settings, teams can keep systems healthy and reliable.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Datadoghq Blogs.&lt;/strong&gt; (2015). &lt;em&gt;Monitoring 101: Alerting on what matters&lt;/em&gt; &lt;br&gt;
&lt;a href="https://www.datadoghq.com/blog/monitoring-101-alerting" rel="noopener noreferrer"&gt;https://www.datadoghq.com/blog/monitoring-101-alerting&lt;/a&gt; &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Microsoft Blogs&lt;/strong&gt;. (2014). &lt;em&gt;Recommendations for designing a reliable monitoring and alerting strategy&lt;/em&gt; &lt;a href="https://learn.microsoft.com/en-us/azure/well-architected/reliability/monitoring-alerting-strategy" rel="noopener noreferrer"&gt;https://learn.microsoft.com/en-us/azure/well-architected/reliability/monitoring-alerting-strategy&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Medium&lt;/strong&gt;. (2024). &lt;em&gt;Managing Critical Alerts through PagerDuty’s Event Rules&lt;/em&gt; &lt;a href="https://medium.com/@davidcesc/managing-critical-alerts-through-pagerdutys-event-rules-2c7014eded3d" rel="noopener noreferrer"&gt;https://medium.com/@davidcesc/managing-critical-alerts-through-pagerdutys-event-rules-2c7014eded3d&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  About me
&lt;/h2&gt;

&lt;p&gt;I am a Senior Staff Software Engineer at Uber with over a decade of experience in scalable, high-performance distributed systems. I have worked on cloud-native architectures, database optimization, and building large-scale distributed systems.&lt;br&gt;
Connect with me on &lt;a href="https://www.linkedin.com/in/gaurav-bansal-8465363a/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;&lt;/p&gt;

</description>
      <category>distributedsystems</category>
      <category>alerts</category>
      <category>operations</category>
    </item>
    <item>
      <title>Designing a Scalable and Real-Time Messaging System</title>
      <dc:creator>Gaurav Bansal</dc:creator>
      <pubDate>Sun, 02 Mar 2025 20:25:34 +0000</pubDate>
      <link>https://dev.to/gaurav_bansal_2e1ca573623/designing-a-scalable-and-real-time-messaging-system-4664</link>
      <guid>https://dev.to/gaurav_bansal_2e1ca573623/designing-a-scalable-and-real-time-messaging-system-4664</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;In this article, we will explore building a highly scalable distributed[1] messaging system like WhatsApp.&lt;/p&gt;

&lt;h2&gt;
  
  
  Requirements
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Functional Requirements&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;1:1 chat&lt;/li&gt;
&lt;li&gt;Send Text/Images&lt;/li&gt;
&lt;li&gt;Last seen&lt;/li&gt;
&lt;li&gt;Notify users of messages they received while offline, once they come back online&lt;/li&gt;
&lt;li&gt;Read receipts (single, double, and blue ticks)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Non-functional requirements&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Low latency - messages should be delivered immediately&lt;/li&gt;
&lt;li&gt;High availability - the system should not go down&lt;/li&gt;
&lt;li&gt;No lag - it is a real-time system&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;API&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;POST /api/v1/chat/{conversationId}  Body - {Message text}&lt;/li&gt;
&lt;li&gt;GET /api/v1/chat/{conversationId} Returns {List}&lt;/li&gt;
&lt;li&gt;GET /api/v1/chat/status/{userId}&lt;/li&gt;
&lt;li&gt;GET /api/v1/lastseen/{userId}&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let’s look at the overall architecture of the whole system. First we will discuss the chatting solution and then we’ll discuss other pieces that surround it.&lt;/p&gt;

&lt;h2&gt;
  
  
  High Level Design
&lt;/h2&gt;

&lt;p&gt;At a high level, the system consists of chat servers, clients, and a database. &lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Deep Dive&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The system needs to receive incoming messages and deliver outgoing messages.&lt;/li&gt;
&lt;li&gt;Store and retrieve messages from the DB&lt;/li&gt;
&lt;li&gt;Store each user’s last-seen record&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Message Delivery Models
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Pull Model:&lt;/strong&gt;&lt;/em&gt; In this approach, clients periodically check the server for new messages. The server stores undelivered messages and provides them when the recipient requests updates. To minimize latency, clients must poll frequently, often receiving empty responses when no messages are pending. This method can be inefficient as it consumes unnecessary resources.&lt;br&gt;
&lt;em&gt;&lt;strong&gt;Push Model:&lt;/strong&gt;&lt;/em&gt; Active users maintain an open connection with the server, allowing messages to be delivered the instant they arrive. This eliminates the need to track pending messages and ensures low-latency communication. WebSockets[3] are commonly used to implement this model.[2]&lt;/p&gt;

&lt;h2&gt;
  
  
  WebSocket Handling
&lt;/h2&gt;

&lt;p&gt;A WebSocket handler (WSH) on the backend maintains open connections with all active users who have an internet connection. These connections enable real-time message transmission across various platforms, including mobile apps, web browsers, and smartwatches.&lt;/p&gt;

&lt;p&gt;WebSocket connections are bi-directional: either party (client or server) can send messages to the other.&lt;/p&gt;

&lt;h2&gt;
  
  
  WebSocket and Message Management
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;WebSocket Manager (WSM)&lt;/strong&gt;&lt;/em&gt;: WSM tracks which devices are connected to which users. It is backed by a database that stores the mapping between users and WebSocket servers. If a connection drops, the user reconnects to a different WebSocket server, and WSM updates this information in the database.&lt;br&gt;
&lt;em&gt;&lt;strong&gt;Message Service (MS)&lt;/strong&gt;&lt;/em&gt;: This component stores all system messages and retrieves unread messages for users. It is also backed by a database to ensure reliability.&lt;br&gt;
&lt;em&gt;Facebook:&lt;/em&gt; Stores all messages permanently in its database.&lt;br&gt;
&lt;em&gt;WhatsApp:&lt;/em&gt; Stores messages temporarily—once a message is delivered and acknowledged, it is deleted from the system.&lt;/p&gt;
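&lt;p&gt;A toy in-memory version of the WSM’s bookkeeping (a real deployment keeps this mapping in a database, as described above; the names here are illustrative):&lt;/p&gt;

```python
import time

class WebSocketManager:
    """Tracks which WebSocket handler each user is connected to,
    and records a last-seen timestamp on disconnect."""
    def __init__(self):
        self.connections = {}   # user_id -> handler_id
        self.last_seen = {}     # user_id -> unix timestamp

    def connect(self, user_id, handler_id):
        # Reconnecting to a different handler simply overwrites the entry.
        self.connections[user_id] = handler_id

    def disconnect(self, user_id):
        self.connections.pop(user_id, None)
        self.last_seen[user_id] = time.time()

    def get_handler(self, user_id):
        # None means the user is offline; the message goes to the MS store.
        return self.connections.get(user_id)
```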

&lt;h2&gt;
  
  
  Use cases
&lt;/h2&gt;

&lt;p&gt;Assumptions:&lt;br&gt;
User U1 is connected to WSH1 and wants to send message M1 to user U2.&lt;br&gt;
WSM returns WSH2, which is connected to U2.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Case 1: Both U1 and U2 are online and sending message to each other&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzgx5775rql0ljopulumm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzgx5775rql0ljopulumm.png" alt="Image description" width="800" height="251"&gt;&lt;/a&gt;&lt;br&gt;
There are multiple calls going to WSM, so as an optimization we can keep a cache in front of WSM containing all users (online/offline).&lt;br&gt;
Steps 2.1 and 2.2 happen in parallel. &lt;br&gt;
U2 is using the app and reads the delivered message, so U2 sends a “Received and Read” status.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Case 2 - U1 sends a msg and U2 is offline&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdimzokpol6n1wtfkowp0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdimzokpol6n1wtfkowp0.png" alt="Image description" width="800" height="510"&gt;&lt;/a&gt;&lt;br&gt;
If U2 is offline, the message is saved in db via MS and WSH1 sends Sent status to U1.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Case 3 - U2 comes online&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F691w4gh048cjhb57v43f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F691w4gh048cjhb57v43f.png" alt="Image description" width="800" height="270"&gt;&lt;/a&gt;U2 requests for all messages which are not received or not read.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Case 4 - U1 is offline and sends a msg&lt;/strong&gt;&lt;/em&gt;&lt;br&gt;
Messages are stored locally in the phone’s DB. Whenever the device comes online, it pushes the messages from the local DB to the WebSocket handler.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Send File&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;U1 -&amp;gt; U2&lt;br&gt;
In the approach below, the image itself is sent instead of text, so each connection requires more network bandwidth because the image travels over the wire.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4pdanexjq6uyiw5rsace.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4pdanexjq6uyiw5rsace.png" alt="Image description" width="800" height="505"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Optimized&lt;/p&gt;

&lt;p&gt;WSH1 will get the URL from the Image server and give it to U1. U1 directly uploads the image to the given URL and sends a message to WSH1. Then the URL is sent to U2 as a text message. Once U2 receives the URL, it directly downloads the image from the image server.&lt;br&gt;
The device can compress the image before uploading it to the image server.&lt;/p&gt;

&lt;h2&gt;
  
  
  DB Schema
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;WSM DB&lt;/strong&gt;&lt;/em&gt;&lt;br&gt;
&lt;em&gt;Schema&lt;/em&gt; -&amp;gt; UserId, WSH id, timestamp (last seen)&lt;br&gt;
&lt;em&gt;Queries&lt;/em&gt; -&amp;gt; GetWSH(userId)&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Messaging Service DB&lt;/strong&gt;&lt;/em&gt;&lt;br&gt;
&lt;em&gt;Schema&lt;/em&gt; -&amp;gt; conversationId, userTo, userFrom, timestamp, status, fileUrl, type (type of file image, video, text)&lt;br&gt;
Partition key -&amp;gt; conversationId, sortKey -&amp;gt; timestamp_uuid&lt;br&gt;
&lt;em&gt;Queries&lt;/em&gt;&lt;br&gt;
getMessageGreaterThanTimestamp(conversationId, timestamp, maxCount) -&amp;gt; will paginate results if the result is greater than maxCount &lt;br&gt;
getMessageInfo(conversationId, timestamp)&lt;br&gt;
Puts&lt;br&gt;
putMessage(conversationId, userFrom, userTo, timestamp...)&lt;/p&gt;
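&lt;p&gt;The &lt;code&gt;getMessageGreaterThanTimestamp&lt;/code&gt; query can be sketched against an in-memory list of message records (in DynamoDB this corresponds to a Query on the partition key with a sort-key condition and a page limit; this stand-in only shows the pagination logic):&lt;/p&gt;

```python
def get_messages_greater_than_timestamp(store, conversation_id, timestamp, max_count):
    """Return up to max_count messages newer than `timestamp`, plus a
    cursor (the last timestamp returned) for fetching the next page."""
    matches = sorted(
        (m for m in store
         if m["conversationId"] == conversation_id and m["timestamp"] > timestamp),
        key=lambda m: m["timestamp"],
    )
    page = matches[:max_count]
    # A full page may mean more results remain; hand back a cursor.
    next_cursor = page[-1]["timestamp"] if len(page) == max_count else None
    return page, next_cursor
```

&lt;p&gt;The caller repeats the query with the returned cursor as the new timestamp until the cursor comes back empty, which mirrors how U2 drains unread messages after coming online.&lt;/p&gt;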

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Conversation DB&lt;/strong&gt;&lt;/em&gt;&lt;br&gt;
&lt;em&gt;Schema&lt;/em&gt; -&amp;gt; UserId1, UserId2, ConversationId; Partition key -&amp;gt; User1_User2&lt;br&gt;
&lt;em&gt;Queries&lt;/em&gt; &lt;br&gt;
getConvers(U1, U2);&lt;br&gt;
getConversation(U1) - secondary index on both U1 and U2&lt;br&gt;
We can use NoSQL databases[4] like AWS DynamoDB&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;A messaging system should be fast, reliable, and scalable. The push model with WebSockets enables real-time communication, while efficient WebSocket management ensures smooth interactions. Whether storing messages permanently or temporarily, the goal remains the same—delivering messages instantly while keeping the system efficient.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Splunk. 2024.&lt;/strong&gt; &lt;em&gt;Distributed System&lt;/em&gt; &lt;a href="https://www.splunk.com/en_us/blog/learn/distributed-systems.html" rel="noopener noreferrer"&gt;https://www.splunk.com/en_us/blog/learn/distributed-systems.html&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Medium. 2018.&lt;/strong&gt; &lt;em&gt;Push Vs Pull model&lt;/em&gt; &lt;a href="https://medium.com/@_JeffPoole/thoughts-on-push-vs-pull-architectures-666f1eab20c2" rel="noopener noreferrer"&gt;https://medium.com/@_JeffPoole/thoughts-on-push-vs-pull-architectures-666f1eab20c2&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GeeksForGeeks. 2024.&lt;/strong&gt; &lt;em&gt;What is web socket.&lt;/em&gt; &lt;a href="https://www.geeksforgeeks.org/what-is-web-socket-and-how-it-is-different-from-the-http/" rel="noopener noreferrer"&gt;https://www.geeksforgeeks.org/what-is-web-socket-and-how-it-is-different-from-the-http/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mongodb Blog. 2023.&lt;/strong&gt; &lt;em&gt;No SQL Data base&lt;/em&gt; &lt;a href="https://www.mongodb.com/resources/basics/databases/nosql-explained" rel="noopener noreferrer"&gt;https://www.mongodb.com/resources/basics/databases/nosql-explained&lt;/a&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  About me
&lt;/h2&gt;

&lt;p&gt;I am a Senior Staff Software Engineer with over a decade of experience in scalable, high-performance distributed systems. I have worked on cloud-native architectures, database optimization, and large-scale distributed systems.&lt;br&gt;
Connect with me on &lt;a href="https://www.linkedin.com/in/gaurav-bansal-8465363a/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;&lt;/p&gt;

</description>
      <category>distributedsystems</category>
      <category>websocket</category>
      <category>design</category>
    </item>
  </channel>
</rss>
