
Deepak Singh Solanki

Posted on • Originally published at deepakinsights.Medium on

Message Queue in System Design

The Day My Server Gave Up in 60 Seconds

April 2018, our project launch day. I had been working on the project for the last six months, spending many restless nights to shape it into something I was truly proud of.

We launched at 9:30 AM, and by 9:31 AM, the server had crashed.

Everyone was shocked. No one had a clue what had just happened. I still remember the silence in the room, the horrible, heavy silence that falls when something goes wrong in front of everyone. Later, we learned that thousands of OTP requests had been hitting our system at the same time. And we were handling all of them one by one. Synchronously. Imagine a single ticket counter and an entire stadium full of people.

That one minute changed everything for me.

And the solution? A queue. The same thing you stand in while buying a movie ticket, the same one you curse while waiting in it. But inside a system, it is the difference between a crash and calm.

That day in 2018, I wished I had known this earlier. Now you will.

Why Servers Break Under Pressure

Think about WhatsApp for a second.

Every second, millions of people are sending messages to each other. Birthday wishes, memes, voice notes, office updates. Every single message is a request that WhatsApp's servers have to handle.

Now imagine you are the server.

One request comes in, and it's easy to handle. Ten requests, still fine. A thousand requests get hard. A million requests in the same second, and you collapse.

WhatsApp isn't the only one facing this problem. All popular apps face it, and you may face it too: the day your app goes from 100 users to 100,000 users, the day an influencer shares your product, or the day your sale goes live. Suddenly, your server is not handling requests; it is drowning in them.

Do you know the worst part? Most servers are built to handle requests sequentially, i.e. one by one. A server finishes the first request, then moves to the next. That works fine until it gets high traffic. The moment traffic spikes, everything slows down, requests pile up, and eventually, the server gives up.

I know this feeling as I lived it in April 2018.

Now think about the real question: how do apps like WhatsApp, Instagram, and BookMyShow handle this without breaking? How do their systems stay calm and running even when millions of people are using them continuously?

That’s exactly what we are going to discuss hereafter.

So What Exactly Saved My System?

We all remember the days when our parents or grandparents used to write letters. The postman collected those letters and delivered each one to its destination. We trusted that the letter would reach, without tracking and without any confirmation.

A message queue works the same way.

In a system, one part of your application writes a message and drops it into the queue, just like giving a letter to the postman. Another part of your application picks it up from the queue and processes it, just like the person receiving the letter. Neither side needs to talk to the other directly. Neither side needs to wait for the other.

This is what developers mean when they say “decoupled architecture.” Neither side knows the other exists and never waits for the other. If one side is busy or temporarily down, then the message just sits patiently in the queue until someone picks it up.

You can consider a message queue as a waiting room for data. Requests are taken one by one. No chaos. No crashes. No one stepping on each other’s toes.

Remember the OTP crash in 2018? A message queue could have avoided it completely.
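Here is a minimal sketch of that waiting-room idea, using Python's standard-library `queue.Queue` and a single worker thread. The `handle_otp_request` function is made up for illustration; in the real system it would generate and send an OTP.

```python
import queue
import threading

# The "waiting room": requests sit here until a worker is free.
otp_queue = queue.Queue()

def handle_otp_request(user_id):
    # Placeholder for the real work (generate and send an OTP).
    return f"OTP sent to user {user_id}"

results = []

def consumer():
    # Takes requests one by one; nothing is dropped, nothing crashes.
    while True:
        user_id = otp_queue.get()
        if user_id is None:          # sentinel value: stop the worker
            break
        results.append(handle_otp_request(user_id))
        otp_queue.task_done()

worker = threading.Thread(target=consumer)
worker.start()

# The producer drops requests into the queue and moves on;
# it never waits for the processing to finish.
for user_id in range(5):
    otp_queue.put(user_id)

otp_queue.put(None)   # tell the worker to stop
worker.join()
print(results)        # all five requests handled, one at a time
```

Even with a thousand producers hammering `put()`, the consumer still drains the queue at its own pace instead of collapsing.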

Meet the Five Players Behind Every Message

Before we dive deeper, let me first introduce the five key players that make this work. We will stick with our postman story; it makes everything click faster.

1. Producer/Publisher: The Letter Writer

We never worried about how our letter would reach its destination. We wrote it and handed it over to the postman. A producer works exactly the same way: it creates a message and drops it into the queue, and its job is done. Every message contains the actual data it wants to send (the payload) and some background details like when it was created and how urgent it is (the metadata).

2. Queue: The Post Office

The postman does not deliver letters immediately; it takes time. Think of the queue as a doctor's clinic, where a receptionist takes your information and asks you to wait until someone is free to help. The queue never does any processing and never reads the message; it just keeps the message safely stored.

3. Consumer/Subscriber: The Letter Receiver

Think about receiving a new iPhone parcel. You walk to the door to receive it, open it, and start clicking pictures. The consumer works the same way. It connects to the queue to consume a message, then puts the message to work: processing it, storing it, or triggering whatever needs to happen next. This is where things finally happen. Some systems have one consumer reading the queue. Others have ten. It depends on how much traffic your system handles.

4. Broker/Queue Manager: The Postman

Without the postman, letters pile up and never get delivered. The broker receives each message from the producer and drops it into the right queue. It makes sure the correct consumer picks it up. A message got lost? It retries. A wrong destination? It reroutes. It manages everything, and no one can bypass it.

5. Message: The Letter Itself

Every letter has two things. What's written inside is your actual data; we call it the payload. And the envelope, with the address, date, and stamp, all the details written outside; we call that the metadata. The metadata tells the system about the message's origin, its destination, and how urgent it is.

The Complete Journey of a Single Message

Now, let's understand how these players work together. We will follow one message from start to finish, the same way a letter travels from origin to destination.

Step 1: Message Creation

You wrote the letter and the address; it is now ready for the post. Similarly, the producer creates a message with the actual data (payload) and some extra details like timestamp and priority (metadata), and the message is ready to travel.

Step 2: Message Enqueue

You hand it to the postman. The producer sends the message to the queue and moves on. It does not wait around checking for further communication. The message waits there until someone picks it up.

Step 3: Message Storage

The postman collects letters and keeps them in his bag. Depending on the system's requirements, the queue stores the message in memory (if speed matters) or on disk (if you can't risk losing it).

Step 4: Message Dequeue

The postman reaches the destination and knocks on the door. The consumer grabs the message and starts working on it. Some systems process one message at a time. Others throw ten consumers into the queue simultaneously. Both work fine.

Step 5: Acknowledgment

The receiver signs for the delivery. The consumer sends a signal back saying the message was processed successfully. Acknowledgment is critical; never skip it.

Step 6: Message Deletion

Once it gets the delivery confirmation, the broker removes the message from the queue permanently, just like you stop tracking your parcel once it has been received. If no confirmation comes back, the broker keeps retrying until someone actually finishes the job. The message stays safe and is never lost.

These six steps make up the full journey. I keep thinking about that day. If a message queue had been there in 2018, those OTP requests would have just waited their turn. No pile-up. No crash. No silence in the room. Just a big celebration.
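The six steps above can be sketched as a toy in-memory broker. `TinyBroker` and its method names are made up for illustration, not a real library; real brokers add timeouts, persistence, and concurrency on top of exactly this shape.

```python
from collections import deque

class TinyBroker:
    """A toy broker sketching the six steps: create, enqueue, store,
    dequeue, acknowledge, delete."""

    def __init__(self):
        self.queue = deque()      # Step 3: storage (in memory here)
        self.in_flight = {}       # dequeued but not yet acknowledged

    def enqueue(self, msg_id, payload):
        # Steps 1-2: producer creates a message and drops it in the queue.
        self.queue.append((msg_id, payload))

    def dequeue(self):
        # Step 4: consumer grabs the next message; the broker keeps a
        # copy until the consumer acknowledges it.
        msg_id, payload = self.queue.popleft()
        self.in_flight[msg_id] = payload
        return msg_id, payload

    def ack(self, msg_id):
        # Steps 5-6: acknowledgment received, message deleted for good.
        del self.in_flight[msg_id]

    def redeliver_unacked(self):
        # No ack came back: put the message back so someone retries it.
        for msg_id, payload in self.in_flight.items():
            self.queue.append((msg_id, payload))
        self.in_flight.clear()

broker = TinyBroker()
broker.enqueue("m1", {"otp": "482913", "user": 42})

msg_id, payload = broker.dequeue()
# ... consumer processes the payload here ...
broker.ack(msg_id)            # processed fine: the message is gone forever

broker.enqueue("m2", {"otp": "771200", "user": 7})
broker.dequeue()              # consumer crashes before acking...
broker.redeliver_unacked()    # ...so the broker puts it back in the queue
```

Note how the unacknowledged message survives the crash: it simply goes back to the queue, which is the whole point of step 5.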

Not All Queues Work the Same Way

Different queues solve different problems. Here are the four types you'll come across most often.

1. Point-to-Point (P2P) Queue

One sender and one receiver: that's the Point-to-Point queue. Even if multiple consumers are listening, only one gets each message. Once it's processed and acknowledged, it's gone. Good for tasks that should run exactly once, like charging a payment or sending an invoice.
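A minimal sketch of the P2P pattern with Python's standard library: two consumers compete over one queue, yet each message is handed to exactly one of them. Real brokers enforce this guarantee for you; here the `None` sentinels are just a simple way to stop the workers.

```python
import queue
import threading
from collections import Counter

task_queue = queue.Queue()
processed = Counter()
lock = threading.Lock()

def worker():
    # Several consumers compete over the same queue, but the queue
    # hands each message to exactly one of them.
    while True:
        task = task_queue.get()
        if task is None:        # sentinel value: stop this worker
            break
        with lock:
            processed[task] += 1

# Enqueue the work first, then one stop signal per consumer.
for task_id in range(10):
    task_queue.put(task_id)
for _ in range(2):
    task_queue.put(None)

consumers = [threading.Thread(target=worker) for _ in range(2)]
for t in consumers:
    t.start()
for t in consumers:
    t.join()

print(dict(processed))   # every task appears exactly once
```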

2. Publish/Subscribe (Pub/Sub) Queue

One sender. Many receivers. That's it. The producer drops a message onto a topic and forgets about it. Every subscriber of that topic gets its own copy. The sender has no idea who's listening and doesn't need to. When you place a Swiggy order, that one event notifies your app, the restaurant, and the delivery partner all at once. That's Pub/Sub.
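A toy version of Pub/Sub, assuming a hypothetical `Topic` class built on Python's `queue.Queue` (real systems would use Kafka topics, Redis channels, or similar):

```python
import queue

class Topic:
    """A toy topic: every subscriber gets its own copy of each message."""

    def __init__(self):
        self.subscribers = {}

    def subscribe(self, name):
        # Each subscriber gets its own private queue.
        q = queue.Queue()
        self.subscribers[name] = q
        return q

    def publish(self, message):
        # The producer fires one event and forgets about it.
        for q in self.subscribers.values():
            q.put(message)

orders = Topic()
app = orders.subscribe("customer_app")
restaurant = orders.subscribe("restaurant")
rider = orders.subscribe("delivery_partner")

# One event, three independent receivers.
orders.publish({"order_id": 101, "item": "biryani"})

a, r, d = app.get(), restaurant.get(), rider.get()
print(a == r == d)   # each subscriber received the same event
```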

3. Priority Queue

Some messages are more important than others. A payment failure alert can't wait behind a promotional email. A priority queue solves this problem. You assign urgency: critical goes first, and the rest waits.
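Python's standard library ships a `queue.PriorityQueue` that sketches this idea. The convention here (lower number means higher urgency) is just a choice for this example:

```python
import queue

pq = queue.PriorityQueue()

# The first item of each tuple decides the order: lower = more urgent.
pq.put((1, "payment failure alert"))
pq.put((5, "promotional email"))
pq.put((2, "password reset OTP"))
pq.put((5, "newsletter"))

order = []
while not pq.empty():
    priority, message = pq.get()
    order.append(message)

# Critical messages come out first, regardless of when they were enqueued.
print(order)
```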

4. Dead Letter Queue (DLQ)

Sometimes a message just refuses to process. Wrong format, failed retries, some unexpected error. If you leave these broken messages in the main queue, they block everything else. So the system moves them to a separate place, which we call a Dead Letter Queue.

I have used this more times than I want to admit. It lets you investigate what went wrong, fix it, and move on. The main queue never gets disturbed.
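A minimal sketch of retry-then-DLQ logic, assuming a hypothetical `process` function and simple in-memory queues; real brokers track retry counts and DLQ routing for you.

```python
import queue

main_queue = queue.Queue()
dead_letter_queue = queue.Queue()
MAX_RETRIES = 3

def process(message):
    # Hypothetical handler: anything containing "corrupt" always fails.
    if "corrupt" in message:
        raise ValueError("bad message format")
    return message.upper()

results = []

# Messages carry their retry count: (attempts_so_far, payload).
for payload in ["order placed", "corrupt-data", "payment done"]:
    main_queue.put((0, payload))

while not main_queue.empty():
    attempts, payload = main_queue.get()
    try:
        results.append(process(payload))
    except ValueError:
        if attempts + 1 >= MAX_RETRIES:
            # Give up: park it in the DLQ for a human to inspect later.
            dead_letter_queue.put(payload)
        else:
            # Put it back at the end of the queue for another try.
            main_queue.put((attempts + 1, payload))

failed = dead_letter_queue.get()
print(results)   # the good messages got processed
print(failed)    # the broken one ended up in the DLQ
```

The key property: the poisoned message burned its retries and moved aside, while the healthy messages behind it flowed through untouched.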

You Are Already Using This Every Day

Message queues are already part of our daily life; you just didn't realize it. Any app that survives high traffic, trust me, has a queue somewhere in the background doing its job quietly.

1. WhatsApp Messages

Send a message, get one tick, then two. That first tick indicates that your message reached the queue. The second confirms that it reached your friend. That small delay is a message queue working in the background to deliver your message.

2. Swiggy / Zomato Orders

Place an order, and within seconds, the restaurant gets a notification, the delivery partner gets assigned, and you get a confirmation. That's not happening simultaneously. A queue takes your order and routes it to each party, one step at a time.

3. OTP Messages

Sometimes it takes time for the OTP to arrive, because thousands of users requested it at the same time. The system adds every request to a queue and sends them out in order. In 2018, we didn’t have this. Our system tried to process all OTP requests at once, and the server collapsed in 60 seconds.

4. Email Notifications

A company sends a promotional email to five million users, but it never hits send for all five million at once. Every email goes into a queue first, then gets processed in batches. Steady and controlled. No spike.

5. BookMyShow Ticket Booking

IPL tickets go live on BookMyShow. A million people try to book at the exact same second. Without a queue, the server crashes in seconds. With a queue, requests just line up. Some people still don't get tickets, but at least the site doesn't crash. That's the win.

Which Tool Should You Actually Start With?

After the 2018 crash, I started learning about message queue tools. I have personally worked with RabbitMQ and SQS. Both are solid. But there are other options too, depending on your use case.

1. RabbitMQ: Start Here

Simple. Well documented. Easy to run locally. I built my first message queue with RabbitMQ, and honestly, it taught me a lot. If you are just starting out, this is where you begin. It works well for email and notification systems, and it's a good fit for small to medium projects with flexible routing needs.

2. Apache Kafka: When Scale Gets Serious

Kafka is different. You throw millions of messages per second at it, and it just keeps going. It doesn't even slow down. Swiggy, Ola, real-time analytics systems: they are all running Kafka somewhere in the background. It takes time to learn. But once you understand it, you see why big systems trust it.

3. Amazon SQS: For AWS Teams

You don't need to manage any servers, and you have no setup headaches; AWS does it for you. If you're already on AWS, SQS is the best choice. You pay only for what you use, and it scales automatically. Let AWS worry about the infrastructure.

4. IBM MQ: For Systems That Can’t Lose Messages

Banks and large enterprises trust this, because in financial systems, losing even a single message is not an option. Yes, it's expensive. Yes, it's complex. But when money is involved, that tradeoff makes complete sense.

5. Apache ActiveMQ (Artemis): Middle Ground

Working on a small or medium project with some routing needs? ActiveMQ is a good fit. It's not as powerful as Kafka and not as managed as SQS, but it's open source, and it adapts to almost anything.

6. NATS: For Speed

Lightweight yet extremely fast. If your system needs low latency and doesn’t need to persist messages, NATS is worth a look.

Powerful, But Not Without a Cost

Benefits

Message queues do a lot more than just prevent crashes. Here’s what you actually gain.

Decoupling : The producer and consumer never need to know each other exists. You want to update one side? Go ahead. Replace it? Fine. Scale it? No problem. The other side is never affected. I have seen this save so much pain on teams where multiple developers work on the same system. Everyone works independently. No one steps on anyone else's toes.

Scalability : Sudden traffic spike? Add more consumers. They pick up from the queue on their own, and the load gets shared automatically. You don't even need to redeploy anything.

Reliability : If a consumer crashes mid-processing, the message stays in the queue. When the consumer recovers, it picks up where it left off. We learned this the hard way in 2018.

Async Processing : The producer drops a message and keeps moving. It doesn't wait for anyone. The user gets a response immediately, and the background task finishes on its own later. This is why your app feels faster than it actually is.

Tradeoffs

Message queues are not a perfect solution. They come with real costs.

Complexity : A message queue is one more thing to manage. Queue, broker, consumers, all running separately alongside your actual application. For small projects, honestly ask yourself: is this complexity really worth it? Sometimes simple is better.

Debugging is Hard : In synchronous systems, debugging is easy; when something breaks, it's easy to trace. Async systems are a different story. When something goes wrong inside a queue, it takes time to find the exact problem. I have spent some very long nights because of this. Trust me on this one.

Message Ordering : Message order may change in a queue, because most queues don't guarantee that messages arrive in the order they were sent. If your system depends on sequence, you need to plan for this upfront.

Extra Cost : More infrastructure means more money. Managed services like SQS or Kafka on Confluent aren’t free. Factor this in early.

Message Duplication : If an acknowledgment gets lost, the same message can arrive more than once. Your system needs to handle this gracefully, or you'll process the same request twice.
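One common defense is an idempotent consumer. Here's a sketch, assuming a hypothetical message `id` field and an in-memory set of seen IDs; a production system would persist those IDs in a database or cache so they survive restarts.

```python
processed_ids = set()
account_balance = {"user_42": 0}

def handle_payment(message):
    # Idempotent consumer: remember which message IDs we have already
    # processed, and silently skip any duplicates.
    if message["id"] in processed_ids:
        return "duplicate, skipped"
    processed_ids.add(message["id"])
    account_balance["user_42"] += message["amount"]
    return "processed"

payment = {"id": "msg-001", "amount": 500}

handle_payment(payment)   # first delivery: the balance goes up
handle_payment(payment)   # redelivered after a lost ack: skipped

print(account_balance["user_42"])   # charged once, not twice
```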

What That 60 Second Crash Taught Me

April 2018 was embarrassing. Six months of work, a room full of people watching, and the server went down in sixty seconds.

But I’m glad it happened.

That one failure taught me how large systems actually handle scale. It introduced me to message queues and, honestly, it made me a much better developer than I would have been otherwise.

If you're building something that real users will touch, you need to understand this. You don't have to start with Kafka. Start with RabbitMQ on a small side project. Build, break, and fix. That experience will teach you more than reading about it ever could.

Including this article. 😄

Next time you see that single tick on WhatsApp or watch your Swiggy order get confirmed in three seconds, don't forget to thank a queue.
