<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Deepak Singh Solanki</title>
    <description>The latest articles on DEV Community by Deepak Singh Solanki (@deepakinsights).</description>
    <link>https://dev.to/deepakinsights</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3857390%2F146eedf6-36a4-4d16-ad60-bf9445feaea3.png</url>
      <title>DEV Community: Deepak Singh Solanki</title>
      <link>https://dev.to/deepakinsights</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/deepakinsights"/>
    <language>en</language>
    <item>
      <title>Stateful vs Stateless: How a Perfect System Failed</title>
      <dc:creator>Deepak Singh Solanki</dc:creator>
      <pubDate>Sat, 11 Apr 2026 04:31:03 +0000</pubDate>
      <link>https://dev.to/deepakinsights/stateful-vs-stateless-how-a-perfect-system-failed-10af</link>
      <guid>https://dev.to/deepakinsights/stateful-vs-stateless-how-a-perfect-system-failed-10af</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4c0ita1j2zxl2anb0ynh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4c0ita1j2zxl2anb0ynh.png" width="800" height="430"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In 2020, COVID had changed everything. Schools and colleges were shut down and education moved online. Classes, attendance, assignments, exams, fees — everything was online.&lt;/p&gt;

&lt;p&gt;Our client had built their ERP years before COVID. It was working fine. They had onboarded many schools and institutions. But COVID pushed the usage to a different level. Teachers who were used to teaching in classrooms were now taking online classes. Students who were used to sitting in classrooms were now attending from home. Using the ERP was no longer a choice, it was mandatory.&lt;/p&gt;

&lt;p&gt;The system started breaking.&lt;/p&gt;

&lt;p&gt;Everyone was getting logged out randomly. Teachers lost work in the middle of sessions. Students could not access homework or online classes. Even the staff were struggling with their daily work.&lt;/p&gt;

&lt;p&gt;The client’s tech team started working on it. They checked everything: load balancer config, server health, database. Everything looked correct.&lt;/p&gt;

&lt;p&gt;They could not find the problem.&lt;/p&gt;

&lt;p&gt;That is when they called us. For expert advice.&lt;/p&gt;

&lt;p&gt;I remember looking at their architecture for the first time. Three servers with an Nginx load balancer on top. A perfect setup. I also agreed with their team. The design was correct. Nothing was obviously wrong.&lt;/p&gt;

&lt;p&gt;Everything was correct and still the system was failing.&lt;/p&gt;

&lt;p&gt;That is the day I learned something I will never forget.&lt;/p&gt;

&lt;p&gt;Most developers never think about small configuration details, but a single one of them can silently break the entire system.&lt;/p&gt;

&lt;p&gt;That configuration was session storage, and that story is what I want to share with you today.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Investigation
&lt;/h3&gt;

&lt;p&gt;When we got access to their system, we started from scratch. No assumptions.&lt;/p&gt;

&lt;p&gt;We checked the Nginx load balancer configuration first. Round robin was set up correctly. All three servers were getting equal traffic. We did not find any issue there.&lt;/p&gt;
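
&lt;p&gt;For reference, the setup described here looks roughly like this in Nginx. This is only a hypothetical sketch (the upstream name and addresses are placeholders, and round robin is simply Nginx's default when no other balancing method is specified):&lt;/p&gt;

```nginx
# Three app servers behind one load balancer.
# Round robin is the default balancing method, no directive needed.
upstream erp_servers {
    server 10.0.0.1;
    server 10.0.0.2;
    server 10.0.0.3;
}

server {
    listen 80;
    location / {
        proxy_pass http://erp_servers;
    }
}
```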

&lt;p&gt;Then we checked the servers. CPU, memory, logs. Everything was healthy. There were no crashes, no errors and no unusual spikes.&lt;/p&gt;

&lt;p&gt;Then we checked the database. Queries were fine. No slow queries and no connection issues.&lt;/p&gt;

&lt;p&gt;We spent almost a week and were clueless. Load balancer, servers, database, deployment config. Nothing was wrong. The system was built correctly.&lt;/p&gt;

&lt;p&gt;Then we moved to the code.&lt;/p&gt;

&lt;p&gt;We started checking the authentication flow and the most complained-about endpoints one by one. Login, dashboard, fee management, homework. Suddenly, something clicked in my mind.&lt;/p&gt;

&lt;p&gt;Sessions.&lt;/p&gt;

&lt;p&gt;The application was using Laravel session-based authentication. I checked and found that sessions were not stored in the database or Redis. They were stored locally on each server.&lt;/p&gt;

&lt;p&gt;That was it. That was the problem.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Root Cause
&lt;/h3&gt;

&lt;p&gt;Now, let’s discuss what was actually happening.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fstcwk5u4jksfl0mh5j7c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fstcwk5u4jksfl0mh5j7c.png" width="800" height="494"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When a user logs in to a Laravel application, the server creates a session. That session stores the user’s identity. Who they are, what they can access. By default, Laravel stores this session in files on the same server.&lt;/p&gt;
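
&lt;p&gt;In Laravel this behaviour comes down to one setting. A sketch of the relevant default (the exact default varies by Laravel version; in the setup this story describes, it was the file driver):&lt;/p&gt;

```shell
# .env — the file driver writes sessions to
# storage/framework/sessions on that one server only
SESSION_DRIVER=file
```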

&lt;p&gt;They had set up three servers behind a load balancer.&lt;/p&gt;

&lt;p&gt;A user visits the website. The request goes to Server A. Server A authenticates the user and creates a session. The session is stored locally on Server A. The user is logged in.&lt;/p&gt;

&lt;p&gt;The next request from the same user comes in. The load balancer sends it to Server B. Server B has no session for this user. It does not know who this person is. So it throws them out.&lt;/p&gt;

&lt;p&gt;That is it. That is why everyone was getting logged out randomly. It was not a bug in the code and not a misconfiguration in the load balancer. The load balancer was doing exactly what it was supposed to do: distributing traffic across all three servers.&lt;/p&gt;

&lt;p&gt;The problem was that each server was living in its own world. Servers had no shared memory and no shared session. Three servers working as three separate brains.&lt;/p&gt;

&lt;p&gt;This is what we call a stateful system.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Analogy
&lt;/h3&gt;

&lt;p&gt;Let me explain this with something we all have experienced.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frl8izymfyvt6ankw6o2x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frl8izymfyvt6ankw6o2x.png" width="800" height="341"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Think about a guard standing at the school gate. He knows students by face and remembers everyone. When a student arrives, he allows entry without checking an ID card. No ID card required. The guard remembers.&lt;/p&gt;

&lt;p&gt;Now that guard retires.&lt;/p&gt;

&lt;p&gt;A new guard joins the next day. He does not know anyone. Rahul comes to school. The new guard stops him. Rahul says he is a student here. But the new guard has no memory of Rahul. No record. Nothing. So Rahul is not allowed inside even though he is a valid student.&lt;/p&gt;

&lt;p&gt;This is exactly what was happening in their system. Server A was the old guard; it remembered the user. But the load balancer sent the next request to Server B, and Server B was the new guard. He had no memory, and the user got kicked out.&lt;/p&gt;

&lt;p&gt;This is a stateful system. The server remembers you. But that memory is private. All other servers will treat you as a stranger.&lt;/p&gt;

&lt;p&gt;Now think about a different school that has fingerprint machines at every gate. Gate 1, Gate 2, Gate 3. It does not matter which gate you use. The machine scans your fingerprint and verifies it instantly. The machine does not need to remember you. Your fingerprint works as the proof.&lt;/p&gt;

&lt;p&gt;This is a stateless system. The server does not remember you. But every request carries enough information to verify you. No memory needed. Any server can handle your request.&lt;/p&gt;

&lt;p&gt;That is the difference.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is a Stateful System?
&lt;/h3&gt;

&lt;p&gt;A stateful system remembers the client. Every time a request comes in, the server uses previously stored information to handle it.&lt;/p&gt;

&lt;p&gt;Session-based authentication is the most common example of this. When you log in, the server creates a session. That session stores who you are. On every subsequent request, the server checks that session and responds accordingly. The server is keeping your state.&lt;/p&gt;

&lt;p&gt;This works perfectly fine for a single server. One server, one memory, no confusion.&lt;/p&gt;

&lt;p&gt;But when you add more servers, the problem starts. Each server has its own memory. They do not share it. So if Server A created your session and Server B receives your next request, Server B has no idea who you are.&lt;/p&gt;
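
&lt;p&gt;The failure mode can be simulated in a few lines. This is a toy sketch, not the client's actual code; the server names and the round-robin order are made up. Each "server" keeps a private session store, and a round-robin "load balancer" alternates between them.&lt;/p&gt;

```python
import itertools

# Two stateful servers, each with its own private session store.
servers = [{"name": "A", "sessions": {}}, {"name": "B", "sessions": {}}]
round_robin = itertools.cycle(servers)  # the "load balancer"

def login(user):
    # The next server in rotation authenticates and stores the session locally.
    server = next(round_robin)
    server["sessions"][user] = {"logged_in": True}
    return server["name"]

def request(user):
    # The next request lands on a different server with no shared memory.
    server = next(round_robin)
    if user in server["sessions"]:
        return server["name"] + ": ok"
    return server["name"] + ": logged out"

first = login("rahul")      # handled by Server A, session created there
second = request("rahul")   # handled by Server B, which has never seen rahul
```

Pointing both servers at one shared session store (database or Redis) is exactly what removes this failure.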

&lt;p&gt;That is the stateful problem in distributed systems.&lt;/p&gt;

&lt;p&gt;Some common examples of stateful systems:&lt;/p&gt;

&lt;p&gt;Laravel and PHP session-based authentication. WebSocket servers that track connection state. Online multiplayer games that remember player positions and scores.&lt;/p&gt;

&lt;p&gt;Stateful systems are not wrong. They are just not built for horizontal scaling.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl9uj6epc4w106azjmpi6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl9uj6epc4w106azjmpi6.png" width="800" height="437"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  What is a Stateless System?
&lt;/h3&gt;

&lt;p&gt;A stateless system does not remember anything. It treats every request as new, and every request must carry all the information needed to verify and process it.&lt;/p&gt;

&lt;p&gt;JWT is the most common example of this. When you log in, the server creates a token. That token contains your identity, your role, the expiry time. Everything. The server does not store anything. On the next request, you attach this token. The server reads the token, verifies it and responds. No memory needed.&lt;/p&gt;
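
&lt;p&gt;To make the idea concrete, here is a toy, stdlib-only sketch of how such a token can work. This is not a real JWT implementation (a production system should use a vetted JWT library, and secret handling here is deliberately simplified); it only shows that any server holding the shared secret can verify the token without storing anything.&lt;/p&gt;

```python
import base64, hashlib, hmac, json, time

SECRET = b"shared-secret"  # hypothetical: the same secret lives on every server

def issue_token(user_id, role):
    # The token itself carries the identity, role and expiry.
    payload = json.dumps({"sub": user_id, "role": role,
                          "exp": time.time() + 3600}).encode()
    body = base64.urlsafe_b64encode(payload).decode()
    sig = hmac.new(SECRET, body.encode(), hashlib.sha256).hexdigest()
    return body + "." + sig

def verify_token(token):
    # Recompute the signature; no session lookup, no database call.
    body, sig = token.split(".")
    expected = hmac.new(SECRET, body.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        return None  # tampered token
    claims = json.loads(base64.urlsafe_b64decode(body))
    if claims["exp"] > time.time():
        return claims
    return None  # expired token

token = issue_token("student-42", "student")
claims = verify_token(token)  # any of Server A, B or C could do this
```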

&lt;p&gt;Any server can handle any request. Server A, Server B, Server C. It does not matter. Every request carries all the required information itself.&lt;/p&gt;

&lt;p&gt;This is why stateless systems scale so well. You can add 10 servers or 100 servers. Every server can handle every request equally, without any shared memory or session sync.&lt;/p&gt;

&lt;p&gt;Some common examples of stateless systems:&lt;/p&gt;

&lt;p&gt;REST APIs with JWT authentication. AWS Lambda functions. Microservices communicating over HTTP.&lt;/p&gt;

&lt;p&gt;Stateless systems are not always better. But when you need to scale horizontally, stateless is the right choice.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Fix
&lt;/h3&gt;

&lt;p&gt;Once we identified the problem, we explained it to the client team.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdv9jrxuylho580yabt18.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdv9jrxuylho580yabt18.png" width="800" height="437"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;They were surprised at first. Then they laughed. All this time, three servers, a load balancer, proper architecture. And the problem was just session storage. That was it.&lt;/p&gt;

&lt;p&gt;We decided to fix it in two steps.&lt;/p&gt;

&lt;p&gt;First, we moved the sessions to the database. Laravel supports the database session driver out of the box. One configuration change. Now all three servers were reading and writing sessions from the same database. Server A creates a session, Server B can read it, Server C can read it. No more random logouts.&lt;/p&gt;
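
&lt;p&gt;For reference, this quick fix is roughly the following in Laravel. Command names vary slightly between Laravel releases, so treat this as a sketch and check the session documentation for your version:&lt;/p&gt;

```shell
# 1. Point the session driver at the shared database (.env)
#    SESSION_DRIVER=database

# 2. Generate and run the migration for the sessions table
php artisan session:table
php artisan migrate
```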

&lt;p&gt;This was a quick fix. It worked. But we knew it was not the final solution. Every request was now hitting the database to validate the session. That adds load on the database. Not a good idea at scale.&lt;/p&gt;

&lt;p&gt;So we planned the second step.&lt;/p&gt;

&lt;p&gt;We replaced session-based authentication with JWT. The server creates a JWT on login. The token goes to the client in the response, and the client sends the token with every request. The server verifies it and responds. No database call needed for validation. No session storage needed. Completely stateless.&lt;/p&gt;

&lt;p&gt;But we did not stop here.&lt;/p&gt;

&lt;p&gt;The client had multiple applications. Web application, student app, teacher app and many more. Each app had its own login with separate unique sessions and separate auth. We saw an opportunity to fix this properly.&lt;/p&gt;

&lt;p&gt;We built an SSO. Single Sign-On. One central authentication server that authenticates users for all applications. Log in once, access everything.&lt;/p&gt;

&lt;p&gt;We deployed the SSO on a separate server with JWT. Now it was stateless, horizontally scalable and ready for any load.&lt;/p&gt;

&lt;h3&gt;
  
  
  Testing
&lt;/h3&gt;

&lt;p&gt;Once SSO was ready, we did not go live directly.&lt;/p&gt;

&lt;p&gt;We tested it with JMeter first and simulated thousands of concurrent users hitting the system.&lt;/p&gt;

&lt;p&gt;We tested login, dashboard, fee management, homework and other endpoints. Everything worked. Not a single failure. No more random logouts and no session errors. Every request was going to the right place.&lt;/p&gt;

&lt;p&gt;After this testing, we went to the real users.&lt;/p&gt;

&lt;p&gt;We contacted the students, teachers and school staff who had complained earlier and asked them to use the beta system. We collected their feedback one by one, and everyone was happy with the new system. No one faced any random logouts, and the system was working fine.&lt;/p&gt;

&lt;h3&gt;
  
  
  SSO Integration
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkck9siwgegsuprbon7xl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkck9siwgegsuprbon7xl.png" width="800" height="406"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once testing was done, we started integrating SSO with all client applications.&lt;/p&gt;

&lt;p&gt;The web application was first, then the student app, then the teacher app and so on. One by one, we replaced their existing auth with our SSO. Every application was now authenticating through one central SSO server. Every system used the same JWT tokens and the same stateless flow.&lt;/p&gt;

&lt;p&gt;The client had many schools and institutions onboarded. All of them were now using the same SSO. One login for everything.&lt;/p&gt;

&lt;p&gt;This also opened new possibilities for the client. Adding a new application in the future was easy now. Just integrate with the SSO and authentication is done. No need to build auth from scratch every time; a few lines of code make authentication ready.&lt;/p&gt;

&lt;p&gt;The system that was failing with 1000 concurrent users was now ready to handle much more. We did not add more servers, just fixed the root cause: stateless architecture with central auth.&lt;/p&gt;

&lt;p&gt;If you want to read about what happened next with this system, I have written about it in my article on the &lt;a href="https://dev.to/deepakinsights/the-thundering-herd-problem-4i0k-temp-slug-7865828"&gt;Thundering Herd Problem&lt;/a&gt;. That story starts exactly where this one ends and teaches you how a perfect system can fail.&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;That 2020 incident taught me more than any course or tutorial ever did.&lt;/p&gt;

&lt;p&gt;We had a perfectly designed system with three servers and a load balancer. Everything was correct, and still it was failing. It was not because of bad architecture, but because of one small configuration that nobody thought about.&lt;/p&gt;

&lt;p&gt;Session storage.&lt;/p&gt;

&lt;p&gt;That is the thing about system design. You can design a perfect system and still miss one small detail that breaks it. This is why every decision matters and needs careful consideration. Where you store sessions, how you authenticate users, how your servers communicate. Every small thing has an impact at scale.&lt;/p&gt;

&lt;p&gt;If you are building a system today with multiple servers, ask yourself one question: is my authentication stateful or stateless? If it is stateful, make sure your sessions are shared. Database, Redis, anything. Just not local server storage.&lt;/p&gt;

&lt;p&gt;And if you are planning to scale, move to stateless. JWT, SSO, whatever fits your use case. Your future self will thank you for this decision.&lt;/p&gt;

&lt;p&gt;The guard will retire someday. Make sure your system does not depend on his memory.&lt;/p&gt;

&lt;h3&gt;
  
  
  Continue the Journey with more articles
&lt;/h3&gt;

&lt;p&gt;Still on the platform and enjoyed this ride? Here are more system design trains to catch:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://dev.to/deepakinsights/kafka-explained-like-youre-5-296o-temp-slug-1718432"&gt;Kafka Explained Like You’re 5&lt;/a&gt;: How I finally understood Kafka after years of avoiding it.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/deepakinsights/the-thundering-herd-problem-4i0k-temp-slug-7865828"&gt;The Thundering Herd Problem&lt;/a&gt;: What happens when 25,000 students hit my system at the same time.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/deepakinsights/message-queue-in-system-design-25pc-temp-slug-1504810"&gt;Message Queue in System Design&lt;/a&gt;: How my server crashed in 60 seconds and what I learned from it.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/deepakinsights/cache-strategies-in-distributed-systems-4bi0-temp-slug-9876529"&gt;Cache Strategies in Distributed Systems&lt;/a&gt;: The day our cache expired and 25,000 students lost their exam.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Looking for more? All articles are available on my &lt;a href="https://deepakinsights.medium.com/" rel="noopener noreferrer"&gt;Medium profile&lt;/a&gt; and many are coming soon. &lt;strong&gt;Follow or subscribe&lt;/strong&gt; to get notified when the next one drops.&lt;/p&gt;

</description>
      <category>authentication</category>
      <category>architecture</category>
      <category>systemdesignconcepts</category>
      <category>jwt</category>
    </item>
    <item>
      <title>Kafka Explained Like You’re 5</title>
      <dc:creator>Deepak Singh Solanki</dc:creator>
      <pubDate>Sun, 29 Mar 2026 17:52:54 +0000</pubDate>
      <link>https://dev.to/deepakinsights/kafka-explained-like-youre-5-25bp</link>
      <guid>https://dev.to/deepakinsights/kafka-explained-like-youre-5-25bp</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo6wleqt0uzlpqdxu50on.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo6wleqt0uzlpqdxu50on.png" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Kafka is not a new name. It has been around for years. Every architecture blog, every system design video, every conversation in developer circles, Kafka comes up. I heard about it many times but never really stopped to learn it. I never went beyond the surface. Not even once.&lt;/p&gt;

&lt;p&gt;I won’t lie. I always thought Kafka was just another &lt;strong&gt;message queue&lt;/strong&gt;. A fancier &lt;strong&gt;RabbitMQ&lt;/strong&gt;. Something I will learn “someday.” That someday never came.&lt;/p&gt;

&lt;p&gt;Then I joined a System Design Cohort led by Hitesh Choudhary and Piyush Garg on ChaiCode. One evening, Piyush sir started explaining Kafka.&lt;/p&gt;

&lt;p&gt;I was listening. But my mind was doing something else.&lt;/p&gt;

&lt;p&gt;Steve Jobs once said: “You can’t connect the dots looking forward; you can only connect them looking backward.”&lt;/p&gt;

&lt;p&gt;That evening, my mind was connecting dots backward to the Indian railway station.&lt;/p&gt;

&lt;p&gt;We all have been to a railway station at least once in life, and if you have, you know that feeling. It is not calm. It is not organised. Somehow it just works. Nobody is managing you personally. No one is telling each passenger where to go. Still, people love to travel in trains. They reach where they have to go. The whole madness somehow just keeps running every single day.&lt;/p&gt;

&lt;p&gt;I was surprised that Kafka was solving a similar problem in the software world.&lt;/p&gt;

&lt;p&gt;Piyush sir was explaining Kafka in class, and I was seeing every single thing play out at that station in my mind. For the first time, Kafka started making sense to me.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Kafka Exists: The Problem First
&lt;/h3&gt;

&lt;p&gt;Before I tell you what Kafka is, let me tell you why it even exists.&lt;/p&gt;

&lt;p&gt;Everyone uses &lt;strong&gt;IRCTC&lt;/strong&gt; for train ticket booking, and you have definitely experienced the &lt;strong&gt;tatkal booking&lt;/strong&gt; chaos. Everyone wants to book a ticket at the same time. You finally get that booking confirmation. But did you ever wonder what happens in that one second? Money got deducted from your account. Seats got booked for you. A message landed on your phone. Your ticket was ready somewhere in their system. The PNR got updated. All of this happened in a few seconds. Not for you alone. For lakhs of people at the same time.&lt;/p&gt;

&lt;p&gt;Now what if any one of these steps fails, slows down or misses the request? This kind of simultaneous load is something I wrote about earlier — it is called the &lt;a href="https://dev.to/deepakinsights/the-thundering-herd-problem-4i0k-temp-slug-7865828"&gt;&lt;strong&gt;Thundering Herd Problem&lt;/strong&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp436t11ojvxuq37g8f85.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp436t11ojvxuq37g8f85.jpeg" width="800" height="437"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That is the exact problem Kafka was built to solve.&lt;/p&gt;

&lt;p&gt;This is the problem: modern applications are not simple anymore. They are not one service talking to one other service. They are dozens of services, each generating events, each needing to talk to the others, all at once.&lt;/p&gt;

&lt;p&gt;And here is where most people make a mistake. They think Kafka is just a &lt;a href="https://dev.to/deepakinsights/message-queue-in-system-design-25pc-temp-slug-1504810"&gt;&lt;strong&gt;message queue&lt;/strong&gt;&lt;/a&gt;. I thought the same.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It is not.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A &lt;a href="https://dev.to/deepakinsights/message-queue-in-system-design-25pc-temp-slug-1504810"&gt;&lt;strong&gt;message queue&lt;/strong&gt;&lt;/a&gt; is like a postman. One sender, one receiver, message delivered and gone. Simple.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Kafka is a message broker.&lt;/strong&gt; Not a queue. It sits between all your services like a central hub. Many things push data into it at the same time. Many others pull from it. Kafka stores every message, and the same message can be read by multiple services. Nothing gets lost in between.&lt;/p&gt;

&lt;p&gt;That difference, trust me, matters a lot.&lt;/p&gt;

&lt;h3&gt;
  
  
  Kafka is the Railway Station
&lt;/h3&gt;

&lt;p&gt;Close your eyes for a second.&lt;/p&gt;

&lt;p&gt;You are standing at a busy railway station. Not a small one. A big one. Hundreds of passengers arriving every minute. Some coming from Jaipur. Some going to Mumbai. Some waiting for a local train. Everyone with a different destination.&lt;/p&gt;

&lt;p&gt;We all have stood at a railway station at some point. That chaos is not new to us. You are looking for your platform, someone is running with luggage, someone is asking for directions, a tea vendor is shouting from a corner. Nobody is in charge of you personally. Still you reach. Still everyone reaches.&lt;/p&gt;

&lt;p&gt;Now think about this. That station never stops working. Too many passengers? It handles them. Someone misses their train? Another one is coming. No matter how crowded it gets, the station just keeps doing its job.&lt;/p&gt;

&lt;p&gt;That is exactly what Kafka does for your application.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Kafka is that station.&lt;/strong&gt; Every message in your system is like a passenger who needs to reach somewhere. The station manages thousands of people without losing a single one. Kafka does exactly that for your software.&lt;/p&gt;

&lt;p&gt;Everything connects to Kafka. Every message passes through it, and from there, each message reaches exactly where it needs to go.&lt;/p&gt;

&lt;h3&gt;
  
  
  Producer: Who Sends the Message
&lt;/h3&gt;

&lt;p&gt;We all have reached a railway station at some point, and we all had someone who dropped us there. Could be a father, a friend, or just an auto driver. Once you are inside the station, their job is done. They don’t come with you. They don’t worry about where you will go next.&lt;/p&gt;

&lt;p&gt;In Kafka, that person is called a &lt;strong&gt;Producer&lt;/strong&gt;. Any service that generates and sends messages to Kafka is a &lt;strong&gt;Producer&lt;/strong&gt;. Your payment service sends a payment event. Your app sends a booking request. It does not matter who sends it. Once the message reaches Kafka, the producer’s job is finished.&lt;/p&gt;

&lt;p&gt;A &lt;strong&gt;Producer&lt;/strong&gt; does not care what happens next. Who reads the message. When they read it. That is not the producer’s problem. The producer completed its job the moment the message reached Kafka.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9u7uz8t2l5wmy1xd5ejk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9u7uz8t2l5wmy1xd5ejk.png" width="800" height="437"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Topics: The Platforms
&lt;/h3&gt;

&lt;p&gt;You are inside the station now. You are that message, remember? The first thing you look for is your platform. Not randomly. You check your ticket. Platform 4. That is where your train is. Every passenger goes to their specific platform. Nobody gets mixed up.&lt;/p&gt;

&lt;p&gt;In Kafka, these platforms are called &lt;strong&gt;Topics&lt;/strong&gt;. Every message goes to a specific &lt;strong&gt;topic&lt;/strong&gt;. Order messages go to the orders topic. Payment messages go to the payments topic. Nobody gets mixed up there either.&lt;/p&gt;

&lt;p&gt;A &lt;strong&gt;Topic&lt;/strong&gt; is a category. Think of it as a named channel. Order-related messages always find their own place. Payment messages find theirs. Notifications find theirs. Nothing lands in the wrong place.&lt;/p&gt;

&lt;p&gt;Think of it this way: a producer does not just throw a message into Kafka randomly. It sends the message to a specific &lt;strong&gt;topic&lt;/strong&gt;, just like a passenger goes to a specific platform.&lt;/p&gt;

&lt;p&gt;Order-related messages will always find the &lt;strong&gt;“orders”&lt;/strong&gt; platform. Payment messages will always find &lt;strong&gt;“payments”&lt;/strong&gt;. No mixing. No confusion.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7gnfqy3yf31kjuzfrhv7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7gnfqy3yf31kjuzfrhv7.png" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Partitions: The Bogies
&lt;/h3&gt;

&lt;p&gt;Now here is where it gets really interesting.&lt;/p&gt;

&lt;p&gt;You found your platform. The train is right there. But when you step closer, you notice something. The train is not just one big space. It is divided into bogies. S1, S2, S3, A1, B2. Each bogie carries different passengers. Together, they all go to the same destination.&lt;/p&gt;

&lt;p&gt;In Kafka, these bogies are called &lt;strong&gt;Partitions&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;One &lt;strong&gt;Topic&lt;/strong&gt; can have multiple &lt;strong&gt;Partitions&lt;/strong&gt;. They belong to the same category, but the work is split into smaller lanes. This is where Kafka actually gets its real power.&lt;/p&gt;

&lt;p&gt;But here is the important thing Piyush sir explained that day. This is what changed everything for me.&lt;/p&gt;

&lt;p&gt;Why do we need these partitions? Why not just one big bogie?&lt;/p&gt;

&lt;p&gt;We all know Indian train classes. General, Sleeper, Third AC, Second AC and First AC. Each class has its own coach. General passengers never sit in a First AC coach. First AC passengers never end up in Sleeper. Everyone belongs to their own coach. That is just how it works.&lt;/p&gt;

&lt;p&gt;Kafka follows the same rule. &lt;strong&gt;Related messages always go to the same partition.&lt;/strong&gt; The same order ID will always go to the same partition. The sequence stays intact. Every time.&lt;/p&gt;
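
&lt;p&gt;This key-to-partition mapping is just hashing. Here is a toy version of the idea. Note that Kafka's default partitioner actually hashes the message key with murmur2; the CRC32 used below is only for illustration, and the partition count is a made-up example:&lt;/p&gt;

```python
import zlib

NUM_PARTITIONS = 4  # hypothetical partition count for an "orders" topic

def partition_for(key):
    # Hash the key and map it onto one partition.
    # The same key always hashes the same way, so it always lands
    # in the same partition, which is what keeps its order intact.
    return zlib.crc32(key.encode()) % NUM_PARTITIONS

p1 = partition_for("order-1001")
p2 = partition_for("order-1001")  # same order ID, same partition
p3 = partition_for("order-2002")  # a different key may land anywhere
```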

&lt;p&gt;Because there are multiple partitions, multiple consumers can work in parallel. More partitions, more speed. That is how Kafka scales without breaking the ordering.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftjwmop1rnqaxg5hkkwe9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftjwmop1rnqaxg5hkkwe9.png" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Consumers: Who Receives the Message
&lt;/h3&gt;

&lt;p&gt;Every passenger on that train is going somewhere. Someone is waiting at the destination. Could be a family member standing at the exit gate. Could be a friend who came to pick you up. Could be a colleague. Their job is simple. Receive and take action.&lt;/p&gt;

&lt;p&gt;In Kafka, that person is called a &lt;strong&gt;Consumer&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;A &lt;strong&gt;Consumer&lt;/strong&gt; is any service that reads messages from Kafka. Your notification service consumes the order event and sends a push notification. Your analytics service consumes the same event and updates the dashboard. Your inventory service consumes it and reduces the stock count.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Consumers&lt;/strong&gt; sit and wait. When a message arrives in their topic, they read it. They process it. They take action.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Producers&lt;/strong&gt; send. &lt;strong&gt;Kafka&lt;/strong&gt; holds. &lt;strong&gt;Consumers&lt;/strong&gt; receive. That is the basic flow.&lt;/p&gt;
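&lt;p&gt;That basic flow can be sketched as a toy, in-memory log. Real Kafka persists the log durably across brokers; the names here are purely illustrative.&lt;/p&gt;

```python
# A toy version of the flow: the producer appends, the broker's
# log holds, the consumer reads at its own pace.

topic_log = []          # the broker: an append-only log

def produce(event):
    # The producer: drop the event in and move on.
    topic_log.append(event)

def consume(offset):
    # The consumer: read everything from its last position.
    messages = topic_log[offset:]
    return messages, offset + len(messages)

produce({"order_id": 1, "status": "PLACED"})
produce({"order_id": 2, "status": "PLACED"})

batch, next_offset = consume(0)
print(len(batch), next_offset)   # → 2 2
```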

&lt;h3&gt;
  
  
  Consumer Groups: Sharing the Work
&lt;/h3&gt;

&lt;p&gt;We have all travelled in a train with a big group at some point. Could be a family trip, college friends, or an office team. Everyone boards the same train. But when the train reaches the destination, everyone scatters. Some head to the hotel. Some go to a relative’s place. Some go directly to the event. Nobody waits for anyone else. Each group knows where it has to go.&lt;/p&gt;

&lt;p&gt;In Kafka, this is called a &lt;strong&gt;Consumer Group&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;A &lt;strong&gt;Consumer Group&lt;/strong&gt; is a set of consumers that work together to read from a topic, and here is what makes it powerful: each partition is read by only one consumer within a group. The work is divided evenly. No two consumers in the same group read the same message.&lt;/p&gt;

&lt;p&gt;So if your “orders” topic has 4 partitions, and you have 4 consumers in a group, each consumer handles one partition. &lt;strong&gt;Parallel processing.&lt;/strong&gt; Fast. Efficient.&lt;/p&gt;
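&lt;p&gt;Here is a toy sketch of that split, assuming a simple round-robin assignment. Real Kafka runs a rebalance protocol with pluggable assignment strategies; the consumer names below are made up for illustration.&lt;/p&gt;

```python
def assign(partitions, consumers):
    # Hand partitions out round-robin: inside one group, a
    # partition is never shared between two consumers.
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

orders_partitions = [0, 1, 2, 3]
group = ["consumer-a", "consumer-b", "consumer-c", "consumer-d"]
print(assign(orders_partitions, group))
# → {'consumer-a': [0], 'consumer-b': [1], 'consumer-c': [2], 'consumer-d': [3]}
```

With 4 partitions and 4 consumers, each consumer ends up with exactly one lane to itself.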

&lt;p&gt;But now here is something even more interesting.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1tyq9dkhfddfzgozhtnw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1tyq9dkhfddfzgozhtnw.png" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can have &lt;strong&gt;multiple Consumer Groups&lt;/strong&gt; reading the same topic completely independently.&lt;/p&gt;

&lt;p&gt;Your notification service is reading from the orders topic. Your analytics service is also reading from the same orders topic. Both get every message. One sends notifications. The other updates the dashboard. Neither knows about the other. Neither disturbs the other.&lt;/p&gt;
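&lt;p&gt;A minimal sketch of that independence: one shared log, and each group tracks its own read position (offset). Reading never removes a message, so every group sees everything. The group names are illustrative.&lt;/p&gt;

```python
topic = ["order-1", "order-2", "order-3"]   # one shared log

# Each consumer group keeps its own offset into the same log.
offsets = {"notifications": 0, "analytics": 0}

def poll(group):
    # Return everything this group has not seen yet, and
    # advance only this group's offset.
    new = topic[offsets[group]:]
    offsets[group] += len(new)
    return new

print(poll("notifications"))  # → ['order-1', 'order-2', 'order-3']
print(poll("analytics"))      # → ['order-1', 'order-2', 'order-3']
```

Both groups read every message, and neither one's progress affects the other.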

&lt;p&gt;Same train. Different families. Each going where they need to go.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fydr6mpgh0s4vho1suiuy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fydr6mpgh0s4vho1suiuy.png" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Kafka is Fast and Scalable
&lt;/h3&gt;

&lt;p&gt;One last thing before we wrap this up.&lt;/p&gt;

&lt;p&gt;How does a big railway station handle lakhs of passengers every single day without shutting down?&lt;/p&gt;

&lt;p&gt;Simple. It doesn’t rely on one platform. One train. One bogie. It has many platforms running in parallel. Many trains on each platform. Many bogies on each train. The work is spread out.&lt;/p&gt;

&lt;p&gt;Kafka works the same way.&lt;/p&gt;

&lt;p&gt;More &lt;strong&gt;topics&lt;/strong&gt; for more categories. More &lt;strong&gt;partitions&lt;/strong&gt; for more parallel processing. More &lt;strong&gt;consumers&lt;/strong&gt; in a group to share the load. Need to handle more messages? Add more partitions. Add more consumers. The system &lt;strong&gt;scales horizontally&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;No single point doing all the work. No bottleneck. No crash.&lt;/p&gt;

&lt;p&gt;Any application that survives millions of events per second, trust me, has something like Kafka running quietly in the background doing this job.&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion: You Are the Message
&lt;/h3&gt;

&lt;p&gt;So that evening in cohort class, the dots finally connected for me. Backward. Just like Steve Jobs said.&lt;/p&gt;

&lt;p&gt;Kafka is not just another &lt;strong&gt;message queue&lt;/strong&gt;, it is a &lt;strong&gt;message broker&lt;/strong&gt;. A central station for your entire application. Scale does not break it. Order does not get lost. Multiple teams can read the same data. Nobody disturbs anyone else.&lt;/p&gt;

&lt;p&gt;Everything we went through today, &lt;strong&gt;Producers&lt;/strong&gt;, &lt;strong&gt;Topics&lt;/strong&gt;, &lt;strong&gt;Partitions&lt;/strong&gt;, &lt;strong&gt;Consumers&lt;/strong&gt;, &lt;strong&gt;Consumer Groups&lt;/strong&gt;, that whole journey, was all happening at that one busy station.&lt;/p&gt;

&lt;p&gt;So go back to that station one last time.&lt;/p&gt;

&lt;p&gt;This time do not stand outside watching. Step in.&lt;/p&gt;

&lt;p&gt;You are not standing outside anymore. You are inside. You are that passenger. &lt;strong&gt;You are the message.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Someone dropped you at the station. That was your &lt;strong&gt;Producer&lt;/strong&gt;. You walked to your platform. That is your &lt;strong&gt;Topic&lt;/strong&gt;. You boarded your coach. That is your &lt;strong&gt;Partition&lt;/strong&gt;. You found your seat. That is your &lt;strong&gt;offset&lt;/strong&gt;, your exact position, your order guaranteed.&lt;/p&gt;

&lt;p&gt;The train moves. It reaches the destination. Someone is waiting for you there.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;That is Kafka.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F726attnxrdn34yck3mog.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F726attnxrdn34yck3mog.png" width="800" height="437"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next time someone says “we use Kafka for event streaming”, you will smile. Because this time, you know exactly what is happening inside that station.&lt;/p&gt;

&lt;h3&gt;
  
  
  Continue the Journey with more articles
&lt;/h3&gt;

&lt;p&gt;Still on the platform and enjoyed this ride? Here are more system design trains to catch:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://dev.to/deepakinsights/the-thundering-herd-problem-4i0k-temp-slug-7865828"&gt;The Thundering Herd Problem&lt;/a&gt;: What happens when 25,000 students hit my system at the same time.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/deepakinsights/message-queue-in-system-design-25pc-temp-slug-1504810"&gt;Message Queue in System Design&lt;/a&gt;: How my server crashed in 60 seconds and what I learned from it.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/deepakinsights/cache-strategies-in-distributed-systems-4bi0-temp-slug-9876529"&gt;Cache Strategies in Distributed Systems&lt;/a&gt;: The day our cache expired and 25,000 students lost their exam.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Looking for more? All articles are available on my &lt;a href="https://deepakinsights.medium.com/" rel="noopener noreferrer"&gt;Medium profile&lt;/a&gt; and many are coming soon. &lt;strong&gt;Follow or subscribe&lt;/strong&gt; to get notified when the next one drops.&lt;/p&gt;

</description>
      <category>kafka</category>
      <category>systemdesignconcepts</category>
      <category>architecture</category>
      <category>distributedsystems</category>
    </item>
    <item>
      <title>Message Queue in System Design</title>
      <dc:creator>Deepak Singh Solanki</dc:creator>
      <pubDate>Sat, 14 Mar 2026 07:02:52 +0000</pubDate>
      <link>https://dev.to/deepakinsights/message-queue-in-system-design-3ioh</link>
      <guid>https://dev.to/deepakinsights/message-queue-in-system-design-3ioh</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxhvsw9b1d7xfbqclx6uz.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxhvsw9b1d7xfbqclx6uz.jpeg" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The Day My Server Gave Up in 60 Seconds
&lt;/h3&gt;

&lt;p&gt;In April 2018, it was our project launch day. I had been working on the project for the previous six months, spending many restless nights to shape it into something I was truly proud of.&lt;/p&gt;

&lt;p&gt;We launched at 9:30 AM, and by 9:31 AM, the server had crashed.&lt;/p&gt;

&lt;p&gt;Everyone was shocked. No one had a clue what had just happened. I still remember the silence in the room. It was the kind of horrible, heavy silence you get when something goes wrong in front of everyone. Later, we came to know that thousands of OTP requests were hitting our system at the same time. And we were handling all of them one by one. Synchronously. Imagine a single ticket counter and an entire stadium full of people.&lt;/p&gt;

&lt;p&gt;That one minute, it changed everything for me.&lt;/p&gt;

&lt;p&gt;And the solution? A queue. The same thing you stand in while buying a movie ticket, the same one you curse while waiting. But inside a system, it decides the difference between a crash and calm.&lt;/p&gt;

&lt;p&gt;That day in 2018, I wished I had known this earlier. Now you will.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Servers Break Under Pressure
&lt;/h3&gt;

&lt;p&gt;Think about WhatsApp for a second.&lt;/p&gt;

&lt;p&gt;Every second, millions of people are sending messages to each other. It may be birthday wishes, memes, voice notes, or office updates. Every single message is a request that WhatsApp’s servers have to handle.&lt;/p&gt;

&lt;p&gt;Now imagine you are the server.&lt;/p&gt;

&lt;p&gt;One request comes in, it’s easy to handle. Ten requests, it’s still fine. A thousand requests make it hard. A million requests in the same second make you collapse.&lt;/p&gt;

&lt;p&gt;WhatsApp isn’t the only one facing this problem. All popular apps face it, and you may face it in the future too. The day your app goes from 100 users to 100,000 users, an influencer shares your product, or your sale goes live. Suddenly, your server is not handling requests; it is drowning in them.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuw8bb5gw6y8msbln5xv6.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuw8bb5gw6y8msbln5xv6.jpeg" width="800" height="526"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Do you know the worst part? Most servers are built to handle requests sequentially, i.e. one by one. A server finishes the first request, then moves to the next. That works fine until traffic gets high. The moment traffic spikes, everything slows down, requests pile up, and eventually the server gives up.&lt;/p&gt;

&lt;p&gt;I know this feeling as I lived it in April 2018.&lt;/p&gt;

&lt;p&gt;Now think about the real question: how do apps like WhatsApp, Instagram, and BookMyShow handle this without breaking? How do their systems stay calm and running even when millions of people are using them continuously?&lt;/p&gt;

&lt;p&gt;That’s exactly what we are going to discuss hereafter.&lt;/p&gt;

&lt;h3&gt;
  
  
  So What Exactly Saved My System?
&lt;/h3&gt;

&lt;p&gt;We all remember those days when our parents or grandparents used to write letters. The postman collected these letters and delivered them to their destinations. We always trusted that the letter would reach, without tracking and without any confirmation.&lt;/p&gt;

&lt;p&gt;A message queue works the same way.&lt;/p&gt;

&lt;p&gt;In a system, one part of your application writes a message and drops it into the queue, just like giving a letter to the postman. Another part of your application picks it up from the queue and processes it, just like the person receiving the letter. Neither side needs to talk to the other directly. Neither side needs to wait for the other.&lt;/p&gt;

&lt;p&gt;This is what developers mean when they say “decoupled architecture.” Neither side knows the other exists and never waits for the other. If one side is busy or temporarily down, then the message just sits patiently in the queue until someone picks it up.&lt;/p&gt;

&lt;p&gt;You can consider a message queue as a waiting room for data. Requests are taken one by one. No chaos. No crashes. No one stepping on each other’s toes.&lt;/p&gt;
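&lt;p&gt;A minimal sketch of this decoupling, using Python’s standard-library &lt;code&gt;queue.Queue&lt;/code&gt; as a stand-in for a real broker: the producer thread drops requests in and moves on, and the consumer thread drains them at its own pace.&lt;/p&gt;

```python
import queue
import threading

q = queue.Queue()   # the waiting room between the two sides

def producer():
    # Drops work into the queue and moves on; it never talks
    # to the consumer directly.
    for i in range(3):
        q.put("otp-request-%d" % i)

def consumer(results):
    # Picks requests up one by one, at its own pace.
    # q.get() simply waits if nothing has arrived yet.
    for _ in range(3):
        results.append(q.get())
        q.task_done()

results = []
t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer, args=(results,))
t1.start(); t2.start()
t1.join(); t2.join()
print(results)  # → ['otp-request-0', 'otp-request-1', 'otp-request-2']
```

Neither function knows the other exists; the queue in the middle is the only contract between them.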

&lt;p&gt;Remember the OTP crash in 2018? A message queue could have avoided it completely.&lt;/p&gt;

&lt;h3&gt;
  
  
  Meet the Five Players Behind Every Message
&lt;/h3&gt;

&lt;p&gt;Before we dive deeper, let me first introduce you to the five key players that make this work. We will stick with our postman story; it makes everything click faster.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj0gceo41ajwxe3uerwdo.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj0gceo41ajwxe3uerwdo.jpeg" width="800" height="542"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Producer/Publisher: The Letter Writer&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
We never worried about how our letter would reach. We wrote it and handed it over to the postman. A producer works exactly the same: it creates a message and drops it into the queue. Its job is done. Every message contains the actual data it wants to send (the payload) and some background details like when it was created and how urgent it is (the metadata).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Queue: The Post Office&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
The postman does not deliver letters immediately. It takes time. You can think of the queue as a doctor’s clinic, where a receptionist takes your information and asks you to wait until someone is free to help. The queue never does any processing and never reads the message; it just keeps the message safely stored.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Consumer/Subscriber: The Letter Receiver&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Think about receiving a new iPhone parcel. You walk to the door to receive it, open it, and start clicking pictures. The consumer works the same way. It connects to the queue to consume a message, then puts that message to work: processing it, storing it, or triggering whatever needs to happen next. Things finally happen here. Some systems have one consumer reading the queue. Others have ten. It depends on how much traffic your system handles.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Broker/Queue Manager: The Postman&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Without the postman, letters pile up and never get delivered. The broker receives each message from the producer and drops it into the right queue. It makes sure the correct consumer picks it up. A message lost? It retries. A wrong destination? It reroutes. It manages everything, and no one can bypass it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Message: The Letter Itself&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Every letter has two things. What’s written inside, that is your actual data; we call it the payload. And the envelope, with the address, date, and stamp, all the details written outside; we call that the metadata. The metadata tells the system about the message’s origin, destination, and how urgent it is.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Complete Journey of a Single Message
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffmvzfd4afqe98jj1gxe3.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffmvzfd4afqe98jj1gxe3.jpeg" width="800" height="441"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now, let’s understand how these players work together. We will follow one message from start to finish, the same way a letter travels from origin to destination.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Message Creation&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
You wrote the letter and the address; it is now ready for the post. Similarly, the producer creates a message with the actual data (payload) and some extra details like timestamp and priority (metadata), and the message is ready to travel.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Message Enqueue&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
You hand it to the postman. The producer sends the message to the queue and moves on. It does not worry about checking back or waiting for further communication. The message waits there until someone picks it up.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3: Message Storage&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
The postman collects letters and keeps them in his bag. Depending on the system requirements, the queue stores the message in memory (if speed matters) or on disk (if you can’t risk losing it).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 4: Message Dequeue&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
The postman reaches the destination and knocks on the door. The consumer grabs the message and starts working on it. Some systems process one message at a time. Others throw ten consumers into the queue simultaneously. Both work fine.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 5: Acknowledgment&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
The receiver signs for the delivery. The consumer sends a signal back saying the message was processed successfully. Acknowledgment is the most important step; never skip it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 6: Message Deletion&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Once it gets the delivery confirmation, the broker removes the message from the queue permanently, the same way you stop tracking your parcel once it is received. If no confirmation comes back, the broker keeps retrying until someone actually finishes the job. The message stays safe and is never lost.&lt;/p&gt;
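&lt;p&gt;The six steps above can be sketched as a toy broker in Python. This is a simplified illustration, not any real broker’s implementation: a dequeued message is held until it is acknowledged, and an unacknowledged message gets redelivered.&lt;/p&gt;

```python
import collections

class TinyQueue:
    """Toy broker: a message only disappears once it is acked."""
    def __init__(self):
        self._pending = collections.deque()   # Step 3: storage
        self._unacked = {}                    # delivered but not confirmed

    def enqueue(self, msg_id, payload):       # Step 2: enqueue
        self._pending.append((msg_id, payload))

    def dequeue(self):                        # Step 4: dequeue
        msg = self._pending.popleft()
        self._unacked[msg[0]] = msg           # kept safe until acked
        return msg

    def ack(self, msg_id):                    # Steps 5 + 6: ack, then delete
        del self._unacked[msg_id]

    def redeliver_unacked(self):              # no ack came back? retry
        while self._unacked:
            _, msg = self._unacked.popitem()
            self._pending.appendleft(msg)

q = TinyQueue()
q.enqueue("m1", {"otp": "123456"})            # Step 1 happened in the producer
msg_id, payload = q.dequeue()
# Consumer crashes before acking...
q.redeliver_unacked()
msg_id, payload = q.dequeue()                 # the same message comes back
q.ack(msg_id)                                 # now it is gone for good
print(msg_id, payload)
```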

&lt;p&gt;These six steps make the full journey. I keep thinking about that day. If a message queue had been there in 2018, those OTP requests would have just waited their turn. No pile-up. No crash. This time, no silence in the room. Just a big celebration.&lt;/p&gt;

&lt;h3&gt;
  
  
  Not All Queues Work the Same Way
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F839%2F1%2AgJgbCWtU0lpTtrn5WdFifQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F839%2F1%2AgJgbCWtU0lpTtrn5WdFifQ.png" width="800" height="593"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Different queues solve different problems. Here are the four types you’ll come across most often.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Point-to-Point (P2P) Queue&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
One sender and one receiver: that’s the Point-to-Point queue. Even if multiple consumers are listening, only one gets each message. Once it’s processed and acknowledged, it’s gone. Good for tasks that should run exactly once, like charging a payment or sending an invoice.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Publish/Subscribe (Pub/Sub) Queue&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
One sender. Many receivers. That’s it. The producer drops a message onto a topic and forgets about it. Every subscriber of that topic gets its own copy. The sender has no idea who’s listening, and doesn’t need to. When you place a Swiggy order, that one event notifies your app, the restaurant, and the delivery partner all at once. That’s Pub/Sub.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Priority Queue&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Some messages are more important than others. A payment failure alert can’t wait behind a promotional email. A priority queue solves this problem. You assign urgency. Critical goes first, and the rest waits.&lt;/p&gt;
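&lt;p&gt;A minimal priority queue sketch using Python’s standard-library &lt;code&gt;heapq&lt;/code&gt;, where a lower number means more urgent. The message labels are illustrative.&lt;/p&gt;

```python
import heapq

# Each entry is (priority, message); heapq always pops the
# smallest priority first, regardless of insertion order.
pq = []
heapq.heappush(pq, (2, "promotional email"))
heapq.heappush(pq, (0, "payment failure alert"))
heapq.heappush(pq, (1, "order confirmation"))

while pq:
    priority, message = heapq.heappop(pq)
    print(priority, message)
# → 0 payment failure alert
# → 1 order confirmation
# → 2 promotional email
```

The payment failure alert was pushed second, but it jumps the line because it carries the highest urgency.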

&lt;p&gt;&lt;strong&gt;4. Dead Letter Queue (DLQ)&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Sometimes a message just refuses to process. Wrong format, failed retries, some unexpected error. If you leave these broken messages in the main queue, they block everything else. So the system moves them to a separate place. We call it the Dead Letter Queue.&lt;/p&gt;

&lt;p&gt;I have used this more times than I want to admit. It lets you investigate what went wrong, fix it, and move on. The main queue never gets disturbed.&lt;/p&gt;
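&lt;p&gt;A toy sketch of the DLQ pattern: retry a message a few times, and if it still fails, park it in a separate list instead of letting it block the main queue. All names and the failure condition are made up for illustration.&lt;/p&gt;

```python
MAX_RETRIES = 3

main_queue = [{"id": "m1", "body": "ok"}, {"id": "m2", "body": "broken"}]
dead_letter_queue = []

def process(msg):
    # Stand-in for real work that can fail on a bad message.
    if msg["body"] == "broken":
        raise ValueError("cannot parse message")

for msg in list(main_queue):
    for attempt in range(MAX_RETRIES):
        try:
            process(msg)
            break          # processed fine, stop retrying
        except ValueError:
            continue       # retry
    else:
        # Every retry failed: park it in the DLQ so it can be
        # inspected later instead of blocking the main queue.
        dead_letter_queue.append(msg)
    main_queue.remove(msg)

print([m["id"] for m in dead_letter_queue])  # → ['m2']
```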

&lt;h3&gt;
  
  
  &lt;strong&gt;You Are Already Using This Every Day&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Message queues are already part of our daily life. You just didn’t realize it. Any app that survives high traffic, trust me, has a queue somewhere in the background doing its job quietly.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F971%2F1%2AFOBSOdNaP1OCeEa55Lm7Sg.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F971%2F1%2AFOBSOdNaP1OCeEa55Lm7Sg.jpeg" width="800" height="451"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. WhatsApp Messages&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Send a message, get one tick, then two. The first tick indicates that your message reached the queue. The second confirms that it reached your friend. That small delay is a message queue working in the background to deliver the message.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Swiggy / Zomato Orders&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Place an order, and within seconds, the restaurant gets a notification, the delivery partner gets assigned, and you get a confirmation. That is not one synchronous call. A queue takes your order event and routes it to each party, one step at a time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. OTP Messages&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Sometimes it takes time for the OTP to arrive, because thousands of users requested it at the same time. The system adds every request to a queue and sends them out in order. In 2018, we didn’t have this. Our system tried to process all OTP requests at once, and the server collapsed in 60 seconds.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Email Notifications&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
A company sends a promotional email to five million users, but it never hits send for all five million at once. Every email goes into a queue first, then gets processed in batches. Steady and controlled. No spike.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. BookMyShow Ticket Booking&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
IPL tickets go live on BookMyShow. A million people try to book at the exact same second. Without a queue, the server crashes in seconds. With a queue, requests just line up. Some people still don’t get tickets, but at least the site doesn’t crash. That’s the win.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Which Tool Should You Actually Start With?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;After the 2018 crash, I started learning about message queue tools. I have personally worked with RabbitMQ and SQS. Both are solid. But there are other options too, depending on your use case.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F855%2F1%2AxTfUP8J5CbB--u8MoWy7Ig.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F855%2F1%2AxTfUP8J5CbB--u8MoWy7Ig.jpeg" width="800" height="460"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. RabbitMQ: Start Here&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Simple. Well documented. Easy to run locally. I built my first message queue with RabbitMQ, and honestly it taught me a lot. If you are just starting out, this is where you begin. It works well for email and notification systems, and for small to medium projects that need flexible routing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Apache Kafka: When Scale Gets Serious&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Kafka is different. You throw millions of messages per second at it, and it just keeps going. It doesn’t even slow down. Swiggy, Ola, real-time analytics systems, all run Kafka somewhere in the background. It takes time to learn. But once you understand it, you see why big systems trust it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Amazon SQS: For AWS Teams&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
You don’t need to manage any servers, and you don’t have setup headaches. AWS does it for you. If you’re already on AWS, SQS is the best choice. You pay only for what you use, and it scales automatically. Let AWS worry about the infrastructure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. IBM MQ: For Systems That Can’t Lose Messages&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Banks and large enterprises trust this one, because in financial systems, losing even a single message is not an option. Yes, it’s expensive. Yes, it’s complex. But when money is involved, that tradeoff makes complete sense.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Apache ActiveMQ (Artemis): Middle Ground&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Working on a small or medium project with some routing needs? ActiveMQ is a good fit. Not as powerful as Kafka. Not as managed as SQS. But it is open source, and it adapts to almost anything.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;6. NATS: For Speed&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Lightweight yet extremely fast. If your system needs low latency and doesn’t need to persist messages, NATS is worth a look.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Powerful, But Not Without a Cost&lt;/strong&gt;
&lt;/h3&gt;

&lt;h3&gt;
  
  
  Benefits
&lt;/h3&gt;

&lt;p&gt;Message queues do a lot more than just prevent crashes. Here’s what you actually gain.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F979%2F1%2A7kc61sES-6x1NM44kZW-lw.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F979%2F1%2A7kc61sES-6x1NM44kZW-lw.jpeg" width="800" height="359"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Decoupling&lt;/strong&gt;: The producer and consumer never need to know the other exists. Want to update one side? Go ahead. Replace it? Fine. Scale it? No problem. The other side never gets affected. I have seen this save so much pain in teams where multiple developers work on the same system. Everyone works independently. No one steps on anyone else’s toes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scalability&lt;/strong&gt;: Sudden traffic spike? Add more consumers. They pick up from the queue on their own. The load gets shared automatically. You don’t even need to redeploy anything.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reliability&lt;/strong&gt;: If a consumer crashes mid-processing, the message stays in the queue. When the consumer recovers, it picks up where it left off. We learned this the hard way in 2018.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Async Processing&lt;/strong&gt;: The producer drops the message and keeps moving. It doesn’t wait for anyone. The user gets a response immediately, and the background task finishes on its own later. This is why your app feels faster than it actually is.&lt;/p&gt;

&lt;h3&gt;
  
  
  Tradeoffs
&lt;/h3&gt;

&lt;p&gt;Message queues are not a perfect solution. They come with real costs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F984%2F1%2Abu8zrF-Xns7LWAS4fE8NgA.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F984%2F1%2Abu8zrF-Xns7LWAS4fE8NgA.jpeg" width="800" height="366"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Complexity&lt;/strong&gt;: A message queue is one more thing to manage. Queue, broker, consumers, all running separately alongside your actual application. For small projects, honestly ask yourself: is this complexity really worth it? Sometimes simple is better.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Debugging is Hard&lt;/strong&gt;: In synchronous systems, debugging is easy; when something breaks, it is easy to trace. Async systems are a different story. It takes time to find the exact problem when something goes wrong inside a queue. I have spent some very long nights because of this. Trust me on this one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Message Ordering&lt;/strong&gt;: Message ordering may change inside a queue, because most queues don’t guarantee that messages arrive in the order they were sent. If your system depends on sequence, you need to plan for this upfront.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Extra Cost&lt;/strong&gt;: More infrastructure means more money. Managed services like SQS or Kafka on Confluent aren’t free. Factor this in early.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Message Duplication&lt;/strong&gt;: If an acknowledgment gets lost, the same message can arrive more than once. Your system needs to handle this gracefully, or you’ll process the same request twice.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;What That 60 Second Crash Taught Me&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;April 2018 was embarrassing. Six months of work, a room full of people watching, and the server went down in sixty seconds.&lt;/p&gt;

&lt;p&gt;But I’m glad it happened.&lt;/p&gt;

&lt;p&gt;That one failure taught me how large systems actually handle scale. It introduced me to message queues, and honestly, it made me a much better developer than I would have been otherwise.&lt;/p&gt;

&lt;p&gt;If you’re building something that real users will touch, you need to understand this. You don’t have to start with Kafka. Start with RabbitMQ on a small side project. Build, break, and fix. That experience will teach you more than reading about it ever could.&lt;/p&gt;

&lt;p&gt;Including this article. 😄&lt;/p&gt;

&lt;p&gt;Next time you see that single tick on WhatsApp or watch your Swiggy order get confirmed in three seconds, don’t forget to thank a queue.&lt;/p&gt;

</description>
      <category>systemdesignconcepts</category>
      <category>messagequeue</category>
      <category>systemdesignintervie</category>
    </item>
    <item>
      <title>Cache Strategies in Distributed Systems</title>
      <dc:creator>Deepak Singh Solanki</dc:creator>
      <pubDate>Sun, 08 Mar 2026 18:11:34 +0000</pubDate>
      <link>https://dev.to/deepakinsights/cache-strategies-in-distributed-systems-1120</link>
      <guid>https://dev.to/deepakinsights/cache-strategies-in-distributed-systems-1120</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F1%2AbYmCQqy3-sWq_wBMj4Kxsg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F1%2AbYmCQqy3-sWq_wBMj4Kxsg.png" width="800" height="534"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Introduction
&lt;/h3&gt;

&lt;p&gt;In 2021, during a mock exam for 25,000 students, our system crashed. The reason was not just high traffic. All our cache keys expired at the same time. Every request hit the database directly. The database slowed down. The exam failed before it even started.&lt;/p&gt;

&lt;p&gt;That day taught me one thing: caching alone is not enough.&lt;/p&gt;

&lt;p&gt;In distributed systems, cache is your first line of defence. It stores frequently accessed data in high-speed memory, reduces database load, and improves response time. But if you don’t manage it properly, cache itself becomes the reason your system goes down.&lt;/p&gt;

&lt;p&gt;In this article, I will walk you through why basic TTL breaks under pressure and what I learned about TTL Jitter, Probabilistic Early Re-computation, Mutex Locking, Stale-While-Revalidate, and Cache Warming the hard way.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Basic TTL Is Not Enough
&lt;/h3&gt;

&lt;p&gt;When I first started working with caching, TTL felt like a complete solution. Set an expiry time, cache refreshes automatically. Simple and clean.&lt;/p&gt;

&lt;p&gt;But here is the problem. TTL does not care about what is happening in your system. It does not know if it is 2 AM with zero traffic or an IPL match day with millions of users online. It just expires. Every single time.&lt;/p&gt;

&lt;p&gt;In a single server system, this is manageable. But in distributed systems, you are not dealing with one cache key. You are dealing with thousands of them. And when all of them share the same TTL, they all expire at the same time.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F1%2AMKULbliIZrmY8YZxgAVx1A.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F1%2AMKULbliIZrmY8YZxgAVx1A.png" width="800" height="534"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That is exactly what happened with us. Our cache TTL was set to expire between 5 PM and 6 PM every Sunday, the same time our exam started. 25,000 students hit the system, the cache expired simultaneously, and every request went straight to the database.&lt;/p&gt;

&lt;p&gt;Basic TTL has one job: expire the cache. It does that job well. But it has no strategy for what happens next. No coordination. No awareness. Just expiry.&lt;/p&gt;

&lt;p&gt;That gap is where distributed systems break.&lt;/p&gt;

&lt;h3&gt;
  
  
  How Cache Expiry Causes Traffic Spikes
&lt;/h3&gt;

&lt;p&gt;In the normal flow, a request comes in, the server looks for the data in cache, finds it, and sends the response back. The database is never involved.&lt;/p&gt;

&lt;p&gt;But what if thousands of cache keys expire together? Suddenly, every request finds an empty cache and hits the database directly. Connection pool fills up. Queries start queuing. Response time increases.&lt;/p&gt;

&lt;p&gt;Users see a slow response and start refreshing. More requests. More load. More failures. The system keeps struggling until it crashes.&lt;/p&gt;

&lt;p&gt;On our exam day, this is exactly what happened. 25,000 students hit the database directly. Students refreshed. The system could not recover in time.&lt;/p&gt;

&lt;p&gt;This chain reaction is called Thundering Herd. And basic TTL has no answer for it.&lt;/p&gt;

&lt;h3&gt;
  
  
  TTL Jitter
&lt;/h3&gt;

&lt;p&gt;Different roads get a green signal at different times to avoid all traffic moving simultaneously. If all roads get a green signal at once, it results in chaos. Same thing happens when all cache keys expire at the same time. The database gets hit by thousands of requests together and the system struggles to recover.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F1%2AIZyH-edVtVWN43O5b4yq9g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F1%2AIZyH-edVtVWN43O5b4yq9g.png" width="800" height="534"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;TTL Jitter solves this by adding a small random value to each cache key’s expiry time. Some keys expire a little earlier, some a little later. No two keys expire at exactly the same time. This spreads the database load across a window instead of hitting it all at once.&lt;/p&gt;

&lt;p&gt;It is a small change but it makes a big difference during high traffic events like IPL match day or a Big Billion Day sale.&lt;/p&gt;
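&lt;p&gt;In code, jitter is nothing more than a random offset added to the base TTL. A minimal sketch; the one-hour base and the 10% jitter window are just example numbers to tune for your own workload:&lt;/p&gt;

```python
import random

def ttl_with_jitter(base_ttl, jitter_fraction=0.1):
    """Return base_ttl plus a random offset of up to jitter_fraction of it."""
    jitter = random.uniform(0, base_ttl * jitter_fraction)
    return base_ttl + jitter

# Example: a nominal 1-hour TTL actually lands somewhere in [3600, 3960] seconds,
# so thousands of keys written together no longer expire together:
# cache.set(key, value, ex=int(ttl_with_jitter(3600)))
```

&lt;p&gt;Because every key gets a slightly different expiry, the database sees a gentle trickle of rebuilds instead of one synchronized wave.&lt;/p&gt;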

&lt;h3&gt;
  
  
  Probabilistic Early Re-computation
&lt;/h3&gt;

&lt;p&gt;A car fuel indicator does not wait for the tank to go empty before warning you. It alerts you early, while there is still enough fuel to reach a petrol pump. You always refill it before it goes empty.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F1%2A9AlfMWPgaXnJTX9PFpcM_w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F1%2A9AlfMWPgaXnJTX9PFpcM_w.png" width="800" height="534"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Probabilistic Early Re-computation works the same way. Instead of waiting for a cache key to fully expire, the system starts refreshing it early. But not always. It uses a probability calculation to decide whether to trigger the refresh. The closer the cache gets to its expiry, the higher the chance of early refresh. This way, the cache is always ready before it expires, and the database never gets a sudden spike of requests.&lt;/p&gt;

&lt;p&gt;This is especially useful during high-traffic events. When millions of users are hitting the same cache key, you cannot afford to let it expire. Probabilistic Early Re-computation ensures the cache is silently refreshed in the background before expiry hits, and the thundering herd never gets a chance to start.&lt;/p&gt;
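&lt;p&gt;A well-known formulation of this idea is the XFetch algorithm: refresh when the current time, pushed forward by a random amount proportional to how long recomputation takes, passes the expiry time. A sketch; the parameter names and the &lt;code&gt;beta&lt;/code&gt; default of 1.0 are illustrative, and higher beta values refresh more eagerly:&lt;/p&gt;

```python
import math
import random
import time

def should_refresh_early(expiry_ts, recompute_secs, beta=1.0):
    """XFetch-style check: the closer to expiry, the more likely an early refresh.

    expiry_ts      - absolute expiry timestamp of the cache key
    recompute_secs - roughly how long rebuilding the value takes (delta)
    beta           - aggressiveness; higher means refresh earlier
    """
    # log(random()) is negative, so subtracting it pushes "now" forward in time.
    effective_now = time.time() - recompute_secs * beta * math.log(random.random())
    return effective_now >= expiry_ts
```

&lt;p&gt;On a cache hit you run this check; if it returns true, one request quietly rebuilds the value before real expiry, so the herd never sees an empty key.&lt;/p&gt;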

&lt;h3&gt;
  
  
  Mutex / Cache Locking
&lt;/h3&gt;

&lt;p&gt;Imagine a busy grocery store with a single billing counter. When one customer is being billed, others wait in the queue. Nobody jumps the counter to bill themselves. One at a time, in order.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F1%2AOfWqCiZz4uw4q5E0hHw6jQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F1%2AOfWqCiZz4uw4q5E0hHw6jQ.png" width="800" height="534"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Mutex / Cache Locking works the same way. When cache expires, only the first request acquires the lock and hits the database to regenerate the cache. All other requests wait in the queue till the cache is ready. Once the cache is refreshed, the lock is released and all waiting requests get the data directly from the cache.&lt;/p&gt;

&lt;p&gt;One important thing. Always set an expiry on the lock. If the request that acquired the lock crashes, the lock must auto-release. Otherwise, all waiting requests will be stuck forever and your system will freeze.&lt;/p&gt;
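&lt;p&gt;With Redis, this pattern is usually built on &lt;code&gt;SET key token NX EX ttl&lt;/code&gt;, where the TTL gives you the auto-release. Here is a minimal single-process sketch of the same logic; the 10-second lock TTL and the &lt;code&gt;fetch_from_db&lt;/code&gt; callback are illustrative:&lt;/p&gt;

```python
import time

cache = {}
locks = {}       # lock key -> expiry timestamp
LOCK_TTL = 10    # seconds; auto-release if the lock holder crashes

def acquire_lock(key):
    now = time.time()
    expiry = locks.get(key, 0)
    if now > expiry:                  # free, or a stale lock left by a crash
        locks[key] = now + LOCK_TTL
        return True
    return False

def get_with_lock(key, fetch_from_db):
    if key in cache:
        return cache[key]
    if acquire_lock(key):
        try:
            cache[key] = fetch_from_db()   # only the lock holder rebuilds
        finally:
            locks.pop(key, None)           # release the lock
        return cache[key]
    # someone else holds the lock: wait briefly, then read the refreshed cache
    time.sleep(0.05)
    return cache.get(key)
```

&lt;p&gt;Note how the expiry check inside &lt;code&gt;acquire_lock&lt;/code&gt; is exactly the auto-release rule described above: a crashed holder’s lock simply times out.&lt;/p&gt;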

&lt;h3&gt;
  
  
  Stale-While-Revalidate
&lt;/h3&gt;

&lt;p&gt;The government announces a petrol price hike effective from midnight. But you can still buy petrol at the old price today, because the new price applies only from tomorrow.&lt;/p&gt;

&lt;p&gt;CDN is another example. Platforms like Cloudflare continue serving cached content to users while fetching fresh content from the origin server in the background.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F1%2AqLVG0dTJ46ZA7cbzoy1_VQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F1%2AqLVG0dTJ46ZA7cbzoy1_VQ.png" width="800" height="534"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Stale-While-Revalidate works the same way. When a cache key expires, the system continues to serve old cached data and refreshes cache in background at the same time. Once the new data is ready, future requests get the updated cache.&lt;/p&gt;

&lt;p&gt;This ensures users never face a slow response because of cache expiry. The system stays responsive even during cache regeneration. No waiting. No database spike. No thundering herd.&lt;/p&gt;
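&lt;p&gt;One way to sketch this is with a soft expiry: after it passes, the stale value is still served immediately while a background thread rebuilds it. The names and the 60-second soft TTL below are illustrative:&lt;/p&gt;

```python
import threading
import time

cache = {}          # key -> (value, soft_expiry_ts)
refreshing = set()  # keys currently being rebuilt in the background

def get_swr(key, rebuild, soft_ttl=60):
    entry = cache.get(key)
    now = time.time()
    if entry is None:
        # cold cache: the very first request pays the rebuild cost
        value = rebuild()
        cache[key] = (value, now + soft_ttl)
        return value
    value, soft_expiry = entry
    if now > soft_expiry and key not in refreshing:
        refreshing.add(key)
        def refresh():
            try:
                cache[key] = (rebuild(), time.time() + soft_ttl)
            finally:
                refreshing.discard(key)
        threading.Thread(target=refresh, daemon=True).start()
    return value  # always answer immediately, possibly with stale data
```

&lt;p&gt;The &lt;code&gt;refreshing&lt;/code&gt; set ensures only one background rebuild runs per key; a production version would also put a hard TTL behind the soft one so truly ancient data is never served.&lt;/p&gt;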

&lt;h3&gt;
  
  
  Cache Warming / Pre-Warming
&lt;/h3&gt;

&lt;p&gt;Before every IPL match, Hotstar engineers start preparing the servers for live streaming to millions of users. They load match data, player stats, team lineups, and streaming configurations into cache well before the first ball is bowled. Everything is already ready before users open the app.&lt;/p&gt;

&lt;p&gt;Netflix does this every night. They pre-compute the homepage for every user profile and load it into cache before you even open the app. By the time you login, your homepage is already ready.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F1%2AGAHjqbZkmM46-qlSEQzSZg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F1%2AGAHjqbZkmM46-qlSEQzSZg.png" width="800" height="534"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is called Cache Warming. Instead of waiting for the first request to hit the database and build the cache, the system proactively loads the cache before traffic arrives. This ensures the cache is already hot and ready to serve before users start coming in.&lt;/p&gt;

&lt;p&gt;Without cache warming, the first wave of users after a big event launch hits an empty cache. Database gets slammed. The system slows down. That is exactly the thundering herd situation we want to avoid. Cache warming ensures the herd never finds an empty cache.&lt;/p&gt;
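&lt;p&gt;Operationally, cache warming is just a job that runs before the event and pushes the known hot keys into cache. A minimal sketch; the key list and loader function are illustrative, and in practice this runs as a scheduled job or a pre-deploy step:&lt;/p&gt;

```python
cache = {}

def warm_cache(hot_keys, load_from_db):
    """Pre-load known hot keys so the first wave of users never misses."""
    warmed = 0
    for key in hot_keys:
        cache[key] = load_from_db(key)
        warmed += 1
    return warmed

# Example: run this before the match starts, e.g.
# warm_cache(["match:stats", "match:lineup"], fetch_row_from_db)
```

&lt;p&gt;The loop is deliberately sequential here; a real warmer would batch or parallelise the loads so the warming job itself does not hammer the database.&lt;/p&gt;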

&lt;h3&gt;
  
  
  Tradeoffs
&lt;/h3&gt;

&lt;p&gt;Every caching strategy is a tradeoff between Freshness, Speed and Consistency. Freshness means how up-to-date your cached data is. Speed means how fast your system responds. Consistency means how uniform the data is across all users at the same time.&lt;/p&gt;

&lt;p&gt;If you want fresh and consistent data every time, your database may slow down under high traffic, because you are never serving older data from cache. If you want to reduce latency, caching helps, but it may not always serve the freshest data. And if you want reduced latency with consistency across users, you may need to serve slightly older data to everyone until the cache is updated.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F1%2AGgsV-mU_HSx5JWqjqodjTA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F1%2AGgsV-mU_HSx5JWqjqodjTA.png" width="800" height="534"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  When to Use Which Strategy
&lt;/h3&gt;

&lt;p&gt;There is no single rule of thumb for choosing a caching strategy. It depends entirely on your use case, your traffic patterns, and how critical data freshness is for your system.&lt;/p&gt;

&lt;p&gt;Here is a simple guide to help you decide:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If your system has many cache keys expiring around the same time, start with &lt;strong&gt;TTL Jitter&lt;/strong&gt;. It is the simplest fix and works well as a default for all distributed systems.&lt;/li&gt;
&lt;li&gt;If you have hot keys that are hit by millions of requests, use &lt;strong&gt;Probabilistic Early Re-computation&lt;/strong&gt;. It ensures cache is always ready before expiry hits.&lt;/li&gt;
&lt;li&gt;If rebuilding cache is expensive and you cannot afford duplicate recomputation, use &lt;strong&gt;Mutex / Cache Locking&lt;/strong&gt;. One request rebuilds, others wait.&lt;/li&gt;
&lt;li&gt;If your system is read-heavy and you can tolerate slightly old data, use &lt;strong&gt;Stale-While-Revalidate&lt;/strong&gt;. Speed is the priority here.&lt;/li&gt;
&lt;li&gt;If you know a traffic spike is coming, like an IPL match or a sale day, use &lt;strong&gt;Cache Warming&lt;/strong&gt;. Prepare before the herd arrives.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And remember, these strategies are not mutually exclusive. In real systems like Hotstar or Amazon, engineers often combine multiple strategies together. For example, TTL Jitter with Cache Warming before a big event, or Mutex with Stale-While-Revalidate for read-heavy APIs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;After that failed exam, we made two important changes in our cache strategies. We started using multiple Cache TTL values and set them to expire at different times to avoid simultaneous expiry. We also pre-warmed our servers before every big event to ensure cache was ready before traffic arrived.&lt;/p&gt;

&lt;p&gt;These two changes made all the difference. We tested the system for 30,000 students and this time, everything worked smoothly.&lt;/p&gt;

&lt;p&gt;That experience taught me that caching is not just a performance tool. In distributed systems, how you manage cache can be the difference between a smooth experience and a total system failure.&lt;/p&gt;

&lt;p&gt;Start simple. Use TTL Jitter as your default. Add more strategies as your system grows. And always prepare before the spike hits, not after.&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>distributedsystems</category>
      <category>performance</category>
      <category>systemdesign</category>
    </item>
    <item>
      <title>The Thundering Herd Problem</title>
      <dc:creator>Deepak Singh Solanki</dc:creator>
      <pubDate>Sun, 08 Mar 2026 17:06:53 +0000</pubDate>
      <link>https://dev.to/deepakinsights/the-thundering-herd-problem-j7b</link>
      <guid>https://dev.to/deepakinsights/the-thundering-herd-problem-j7b</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F1%2AD5NYf2JmFC2Hli9fOMrWug.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F1%2AD5NYf2JmFC2Hli9fOMrWug.png" width="800" height="534"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Introduction
&lt;/h3&gt;

&lt;p&gt;In 2021, I was working on an EdTech ERP system. COVID had pushed everything online, and institutions everywhere were struggling with online education. We built this system and deployed it on the AWS cloud. We were using &lt;strong&gt;horizontal scaling&lt;/strong&gt; techniques to handle high traffic. We set up a &lt;strong&gt;load balancer&lt;/strong&gt; , Docker, &lt;strong&gt;auto scaling&lt;/strong&gt; , and a &lt;strong&gt;cache&lt;/strong&gt; to make sure the servers could handle high load.&lt;/p&gt;

&lt;p&gt;The client was expecting 10,000 student logins, so we had done &lt;strong&gt;load testing&lt;/strong&gt; for 10,000 users, as the client asked. After this testing, we decided to organise a mock test with real students on Sunday at 5 PM. We were ready.&lt;/p&gt;

&lt;p&gt;The exam started at 5 PM, and we got traffic from 25,000 students simultaneously. Still, everyone stayed calm because we had already set up &lt;strong&gt;auto scaling&lt;/strong&gt; for the server. But we forgot one thing: &lt;strong&gt;auto scaling&lt;/strong&gt; isn’t instant. Scaling time can vary from seconds to minutes, and the server went down. After 5 minutes, the server was ready to handle the load, but another problem hit us. All the &lt;strong&gt;cache&lt;/strong&gt; keys expired simultaneously, as the &lt;strong&gt;cache TTL&lt;/strong&gt; was set to expire between 5 PM and 6 PM every Sunday.&lt;/p&gt;

&lt;p&gt;So 25,000 students hit the system, the &lt;strong&gt;cache&lt;/strong&gt; expired simultaneously, all requests went straight to the &lt;strong&gt;DB&lt;/strong&gt; , and the &lt;strong&gt;DB&lt;/strong&gt; slowed everything down. The exam hadn’t even started properly. We were frantically checking logs, and students were refreshing their screens.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Thundering Herd Problem
&lt;/h3&gt;

&lt;p&gt;Before getting to the technical definition, we need to understand how users react when an online sale opens. Everyone is ready with their wishlists, fingers on the ‘Buy Now’ button, so the website faces high traffic. But when the website starts slowing down, users start refreshing the page continuously. So here is the situation: the website is slow because of high load, and everyone refreshes the page to get a faster response, which increases the load even further and may lead to system failure.&lt;/p&gt;

&lt;p&gt;This is exactly what happens inside your system too. When thousands of requests hit the system at the same time, the system can’t handle it. That’s the &lt;strong&gt;Thundering Herd Problem&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This problem commonly occurs in three places in your system.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Load Balancer / Auto Scaling:&lt;/strong&gt; We use a &lt;strong&gt;Load Balancer&lt;/strong&gt; to distribute incoming traffic between all available servers. Let’s say you set up three servers; the load balancer will make sure traffic is routed properly to all of them without sending everything to any one server.&lt;/p&gt;

&lt;p&gt;But when traffic suddenly spikes, the available servers cannot handle all of it. The system tries to scale by adding more servers. But &lt;strong&gt;auto scaling&lt;/strong&gt; isn’t instant. Those new servers take time to spin up: sometimes seconds, sometimes minutes. In that gap, your existing servers are taking all the hits alone.&lt;/p&gt;

&lt;p&gt;This is exactly what happened with us. 25,000 students hit the system, auto scaling triggered, but the server went down before new servers could join.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cache:&lt;/strong&gt;  &lt;strong&gt;Cache&lt;/strong&gt; is used to store frequently accessed data in high-speed memory (e.g., RAM) to reduce database load and improve &lt;strong&gt;latency&lt;/strong&gt;. It allows the system to access data faster. Each cache entry has a &lt;strong&gt;TTL&lt;/strong&gt; , which defines when the entry expires and must be refreshed. When a cache entry expires, requests for that data hit the database directly.&lt;/p&gt;

&lt;p&gt;If cache keys expire one by one at different times, the database can handle it easily. But if all of them expire at the same time, all incoming requests find an &lt;strong&gt;empty cache&lt;/strong&gt; and hit the database, which spikes database load.&lt;/p&gt;

&lt;p&gt;That’s exactly what happened with us. Our &lt;strong&gt;cache TTL&lt;/strong&gt; was set to expire between 5 PM and 6 PM every Sunday, the same time our exam started.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Database:&lt;/strong&gt; A database contains a large amount of information, some data is frequently requested by clients. We already discussed that we use &lt;strong&gt;cache&lt;/strong&gt; to store frequently accessed data. When the cache is working properly, these requests never reach the database. Cache handles them directly. It keeps the database free for heavy or complex queries.&lt;/p&gt;

&lt;p&gt;But when cache is not available, all requests hit the &lt;strong&gt;database&lt;/strong&gt; directly, causing a sudden, massive spike in identical queries, leading to severe &lt;strong&gt;performance degradation&lt;/strong&gt; or total system slowdown/failure.&lt;/p&gt;

&lt;p&gt;This same thing happened with us on the exam day. &lt;strong&gt;Cache&lt;/strong&gt; expired simultaneously and all student requests hit the &lt;strong&gt;database&lt;/strong&gt; for the same exam data, which eventually slowed down the system.&lt;/p&gt;

&lt;h3&gt;
  
  
  Real-world Example
&lt;/h3&gt;

&lt;p&gt;We already discussed how the &lt;strong&gt;Thundering Herd Problem&lt;/strong&gt; affected our mock exam. Now let’s look at some more real world examples that will help you understand this problem better:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Live Streaming:&lt;/strong&gt; On an India vs Pak match day, hundreds of millions of users watch the match online. On these important match days, platforms receive months’ worth of traffic in a single day. So their systems must be ready to handle this &lt;strong&gt;simultaneous load&lt;/strong&gt; to prevent them from going down.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fantasy apps:&lt;/strong&gt; At the time of toss, millions of people are active on their fantasy apps to build their teams to win big rewards. This requires fantasy app servers to be ready to handle this &lt;strong&gt;simultaneous, high-volume traffic&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;UPI / Payment Gateway:&lt;/strong&gt; Nowadays, everyone likes to pay digitally. These payment applications handle billions of transactions every month. This makes them vulnerable to going down: any sudden &lt;strong&gt;traffic spike&lt;/strong&gt; can bring these systems down instantly. I am sure you have experienced this during peak hours, on salary day, or in the festival season.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tax Filing:&lt;/strong&gt; We all need to file taxes to the government. Also, we need to submit this information online using government portals. Now think about the traffic, when everyone is trying to file their income tax / GST returns before the deadline. This &lt;strong&gt;high-volume traffic&lt;/strong&gt; slows down systems and sometimes brings it down for some time.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  How Traffic Spikes Overload Systems?
&lt;/h3&gt;

&lt;p&gt;Before discussing this, we first need to understand that all system components, like &lt;strong&gt;CPU&lt;/strong&gt; , &lt;strong&gt;memory&lt;/strong&gt; , &lt;strong&gt;database&lt;/strong&gt; , &lt;strong&gt;disk&lt;/strong&gt; , and &lt;strong&gt;bandwidth&lt;/strong&gt; , have limited capacity. It cannot be infinite. More resources definitely increase system capacity, but every system eventually reaches a &lt;strong&gt;bottleneck&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In a normal scenario, a system is designed to handle a fixed number of requests. When a request arrives, the server first checks the &lt;strong&gt;cache&lt;/strong&gt;. If the data exists in cache, it returns the response immediately. Otherwise, the request goes to the &lt;strong&gt;database&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;When a sudden &lt;strong&gt;traffic spike&lt;/strong&gt; hits the server simultaneously, the system first tries to handle them with the available resources. The server also starts &lt;strong&gt;scaling&lt;/strong&gt; when necessary.&lt;/p&gt;

&lt;p&gt;Now if &lt;strong&gt;cache&lt;/strong&gt; expires during this high traffic, then all the requests start hitting the &lt;strong&gt;database&lt;/strong&gt; , which eventually overloads the &lt;strong&gt;connection pool&lt;/strong&gt; and query responses slow down because of excessive queries.&lt;/p&gt;

&lt;p&gt;During this time, users get delayed responses and sometimes it can lead to &lt;strong&gt;timeouts&lt;/strong&gt;. Clients start retrying because of failed or slow response. This leads to additional traffic to the system.&lt;/p&gt;

&lt;p&gt;This creates a dangerous cycle: more traffic, more retries, more load. It continues until the system slows down or crashes completely.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why is it Dangerous in Distributed Systems?
&lt;/h3&gt;

&lt;p&gt;In a single server system, the &lt;strong&gt;Thundering Herd Problem&lt;/strong&gt; is bad. But in &lt;strong&gt;distributed systems&lt;/strong&gt; , it becomes truly dangerous.&lt;/p&gt;

&lt;p&gt;In &lt;strong&gt;distributed systems&lt;/strong&gt; , multiple services depend on each other. When one service slows down, all dependent services start struggling too. &lt;strong&gt;Threads&lt;/strong&gt; get blocked, memory fills up, and &lt;strong&gt;connection pools&lt;/strong&gt; get exhausted. New requests find no available resources and start failing. This creates a chain reaction:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Failure → Retry → More Load → More Failure.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;What started as a simple &lt;strong&gt;cache expiry&lt;/strong&gt; now spreads across every service in your architecture. A small &lt;strong&gt;synchronization issue&lt;/strong&gt; becomes a full system breakdown. This is why distributed systems must be designed assuming failures will happen and must be contained before they spread.&lt;/p&gt;

&lt;h3&gt;
  
  
  Normal Spike vs Thundering Herd
&lt;/h3&gt;

&lt;p&gt;Not every traffic spike is a &lt;strong&gt;Thundering Herd&lt;/strong&gt;. Here is the difference:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Normal spike:&lt;/strong&gt; When traffic spikes gradually, the system can predict this and manage it accordingly. It also provides a breathing window for the system for &lt;strong&gt;auto-scaling&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Thundering Herd:&lt;/strong&gt; When a high-volume &lt;strong&gt;traffic spike&lt;/strong&gt; hits the server all at once, the system struggles to handle it. The spike forces the system to work at peak capacity with no breathing window for &lt;strong&gt;auto-scaling&lt;/strong&gt; , which leads to total failure.&lt;/p&gt;

&lt;h3&gt;
  
  
  Impact on System Components
&lt;/h3&gt;

&lt;p&gt;Let’s understand the impact of the &lt;strong&gt;Thundering Herd Problem&lt;/strong&gt; on different system components:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CPU:&lt;/strong&gt; When the herd hits, &lt;strong&gt;CPU utilization&lt;/strong&gt; spikes to 100%. The processor struggles with slow processing and heavy &lt;strong&gt;context switching&lt;/strong&gt; , which increases response time. The system tries to scale by adding more resources, but by then, the damage is already done.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Database:&lt;/strong&gt; The &lt;strong&gt;database&lt;/strong&gt; starts receiving a high volume of identical queries simultaneously which exhausts all available connections. This leads to &lt;strong&gt;lock contention&lt;/strong&gt; and possible &lt;strong&gt;deadlocks&lt;/strong&gt;. Queries wait in a long queue, resulting in slow query responses and sometimes complete database failure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cache:&lt;/strong&gt; When the &lt;strong&gt;Thundering Herd&lt;/strong&gt; hits, thousands of requests miss the &lt;strong&gt;cache&lt;/strong&gt; at the same time and every one of them tries to regenerate the same data simultaneously. This creates massive CPU and network pressure on the cache server, causing &lt;strong&gt;memory spikes&lt;/strong&gt; and &lt;strong&gt;rapid eviction cycles&lt;/strong&gt;. Instead of protecting your database, your cache becomes part of the problem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Load Balancer:&lt;/strong&gt; The &lt;strong&gt;load balancer&lt;/strong&gt; struggles to distribute requests evenly as servers become unresponsive or slow to reply.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Latency:&lt;/strong&gt; All of the above situations directly impact &lt;strong&gt;latency&lt;/strong&gt;. Slow cache, overloaded database, and exhausted CPU, everything adds delay to every single request. Users start seeing slow page loads, &lt;strong&gt;timeouts&lt;/strong&gt; , and failed requests. What normally takes milliseconds can take seconds. This is when users lose trust in your system.&lt;/p&gt;

&lt;p&gt;Each of these situations can contribute to a complete system breakdown.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prevention Techniques
&lt;/h3&gt;

&lt;p&gt;Once we understand its impact, now let’s look into the prevention techniques:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Request Coalescing:&lt;/strong&gt; Remember the notice board in school? When the principal wants to inform everyone about an upcoming exam, he just publishes one notice instead of informing each student individually. In system design, it’s called &lt;strong&gt;Request Coalescing&lt;/strong&gt;. Only the first request hits the database and all other identical requests wait. Once the response is ready, it’s shared with all of them. It’s simple, but because it coalesces requests within a single server, it is less effective on its own in &lt;strong&gt;distributed systems&lt;/strong&gt;.&lt;/p&gt;
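&lt;p&gt;This is the pattern behind Go’s &lt;code&gt;singleflight&lt;/code&gt; package. A minimal single-process Python sketch (names are illustrative): the first caller for a key runs the fetch, and concurrent callers for the same key block on the same lock and reuse the result.&lt;/p&gt;

```python
import threading

_results = {}
_locks = {}
_guard = threading.Lock()

def coalesced_get(key, fetch):
    """First caller fetches; concurrent callers for the same key wait and reuse."""
    with _guard:
        lock = _locks.setdefault(key, threading.Lock())
    with lock:
        if key not in _results:
            _results[key] = fetch()   # only the first caller runs the fetch
        return _results[key]
```

&lt;p&gt;This sketch keeps the result forever; a real implementation clears the entry once all waiters have read it, so the next miss triggers a fresh fetch.&lt;/p&gt;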

&lt;p&gt;&lt;strong&gt;Cache Locking / Mutex:&lt;/strong&gt; Think about the biometric machine in your office. When your colleague is punching attendance, others wait till it’s done. &lt;strong&gt;Cache Locking&lt;/strong&gt; works in the same way. When &lt;strong&gt;cache&lt;/strong&gt; expires, only the first request acquires the &lt;strong&gt;lock&lt;/strong&gt; and hits the database to regenerate the cache. All other requests wait till the cache is ready. The lock is released and all waiting requests get the data directly from the cache. It’s important to always set the expiry on the lock. If the request that acquired the lock crashes, the lock must &lt;strong&gt;auto-release&lt;/strong&gt;. Otherwise, all waiting requests will be stuck forever.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stale-While-Revalidate:&lt;/strong&gt; You got the salary credited message, but the banking app shows the old balance. Wondering what happened? It’s not a system issue; the balance is updating in the background, and the correct balance shows up after some time. Banks use this technique to ensure a smooth experience. In system design, this is called &lt;strong&gt;Stale-While-Revalidate&lt;/strong&gt;. It ensures all requests continue to get data without hitting the database. The &lt;strong&gt;cache&lt;/strong&gt; updates in the background, and once the &lt;strong&gt;cache refresh&lt;/strong&gt; completes, the new cache serves future requests. This approach reduces &lt;strong&gt;latency&lt;/strong&gt; , prevents spikes on the backend, and keeps traffic smooth even during cache regeneration.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Staggered Expiry:&lt;/strong&gt; Different roads get a green signal at different times to avoid all traffic moving simultaneously; if every road went green at once, the result would be a mess. Similarly, expiring all cache keys at once triggers a &lt;strong&gt;Thundering Herd&lt;/strong&gt;. With the &lt;strong&gt;Staggered Expiry&lt;/strong&gt; technique, the system adds a random factor to the expiry &lt;strong&gt;TTL&lt;/strong&gt; of each cache key. This ensures the keys don’t all expire at once and keeps the database safe from a burst of simultaneous queries. It reduces load spikes and helps maintain system stability.&lt;/p&gt;
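&lt;p&gt;The technique is a one-liner; a sketch in Python (the base TTL and ±20% jitter are assumed values for illustration):&lt;/p&gt;

```python
import random

def staggered_ttl(base_seconds=600, jitter=0.2):
    """Base TTL plus up to ±20% random jitter, so cache keys that were
    written together don't all expire in the same instant."""
    return base_seconds * (1 + random.uniform(-jitter, jitter))

# Every key gets a slightly different lifetime,
# e.g. cache.set(key, value, ttl=staggered_ttl())
ttls = [staggered_ttl() for _ in range(1000)]
```

&lt;p&gt;A thousand keys cached at the same moment now expire spread across roughly a four-minute window instead of in one spike.&lt;/p&gt;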

&lt;p&gt;&lt;strong&gt;Exponential Backoff:&lt;/strong&gt; You’re buying an iPhone and try to make a payment, but the bank server is down. You wait a minute before retrying. If it still doesn’t work, you wait a few more minutes before the next attempt. This growing delay gives the server time to recover. In system design, the same technique is known as &lt;strong&gt;Exponential Backoff&lt;/strong&gt;, and it’s used to space out &lt;strong&gt;retries&lt;/strong&gt; when a server is under load. Instead of retrying immediately after a failed request, each subsequent retry doubles the delay. A small random &lt;strong&gt;jitter&lt;/strong&gt; is also added to each retry so that all clients don’t retry at the exact same moment. Without this, failed requests retry simultaneously and pile more load onto an already struggling server.&lt;/p&gt;
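&lt;p&gt;A minimal retry helper in Python, assuming the defaults shown (cap, retry count, and jitter style are illustrative choices, not the only ones):&lt;/p&gt;

```python
import random
import time

def retry_with_backoff(call, max_retries=5, base_delay=1.0, max_delay=60.0):
    """Retry `call`, doubling the delay each attempt (capped at max_delay)
    and adding random jitter so clients don't retry in lockstep."""
    for attempt in range(max_retries):
        try:
            return call()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the error
            delay = min(base_delay * (2 ** attempt), max_delay)
            time.sleep(delay + random.uniform(0, delay))  # backoff + jitter
```

&lt;p&gt;With the default base delay of one second, the waits grow roughly as 1s, 2s, 4s, 8s (plus jitter), giving a struggling server progressively more room to recover.&lt;/p&gt;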

&lt;p&gt;&lt;strong&gt;Rate Limiting:&lt;/strong&gt; We’ve all experienced OTP delays on websites: sometimes we have to wait before requesting another OTP. That’s &lt;strong&gt;Rate Limiting&lt;/strong&gt; in action. In system design, &lt;strong&gt;Rate Limiting&lt;/strong&gt; caps the number of incoming requests to prevent system overload. During a thundering herd, it acts as the first line of defence, blocking excessive requests before they ever reach the database, because it’s always better to reject some requests than to crash the entire system.&lt;/p&gt;
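&lt;p&gt;One common way to implement this is a token bucket; a sketch in Python (the rate and capacity below are arbitrary illustrative numbers):&lt;/p&gt;

```python
import time

class TokenBucket:
    """Allow bursts up to `capacity` requests, refilling at `rate`
    tokens per second. Requests beyond the budget are rejected
    outright rather than queued."""
    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill based on time elapsed since the last check.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # over budget: reject, don't crash
```

&lt;p&gt;A bucket with capacity 5 admits a burst of five requests, rejects the sixth, and starts admitting again as tokens refill.&lt;/p&gt;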

&lt;h3&gt;
  
  
  What We Did to Fix It
&lt;/h3&gt;

&lt;p&gt;After the failed exam, we decided to analyse our system design and list all the bottlenecks. After reviewing everything carefully, we identified the issues and started working on solutions:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;First, we decided to &lt;strong&gt;pre-warm&lt;/strong&gt; our servers before any big event to at least 2x the expected load, ensuring the servers are ready to handle high-volume traffic. We also reconfigured the &lt;strong&gt;auto scaling&lt;/strong&gt; trigger to 70%, down from the previous 90%.&lt;/li&gt;
&lt;li&gt;We moved to multiple &lt;strong&gt;cache TTLs&lt;/strong&gt; and scheduled expiry around midnight, when traffic is low, to avoid such situations again.&lt;/li&gt;
&lt;li&gt;We also worked on our &lt;strong&gt;indexing&lt;/strong&gt; and &lt;strong&gt;cache optimization&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;After completing these changes, we load-tested the server with 1 lakh (100,000) users, recorded the results, and built a flow chart to handle such situations better in future.&lt;/li&gt;
&lt;li&gt;Finally, we set up Slack/email &lt;strong&gt;alerts&lt;/strong&gt;, so team members are notified before things get worse.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Things take time, but this time we were ready. The next Sunday, we conducted a successful exam for 30,000 students.&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;A Sunday in 2021 taught me some important lessons about system design and handed me a challenge. I took it on and solved it, and you can do the same just by knowing the bottlenecks in your system. Here’s what I learned that evening:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Thundering Herd&lt;/strong&gt; is not just a traffic problem, it’s a &lt;strong&gt;synchronization problem&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Every system has a breaking point; we just need to find it before it happens.&lt;/li&gt;
&lt;li&gt;Sometimes small misconfigurations (like a wrong &lt;strong&gt;TTL&lt;/strong&gt;) can bring down entire systems.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;So, what we need to consider while designing a system:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Always test the system at its &lt;strong&gt;peak limits&lt;/strong&gt;; that tells you where it breaks.&lt;/li&gt;
&lt;li&gt;Always track traffic to understand system load.&lt;/li&gt;
&lt;li&gt;Using a proper &lt;strong&gt;cache TTL&lt;/strong&gt; can save your system from going down.&lt;/li&gt;
&lt;li&gt;Before any big event, check if &lt;strong&gt;server pre-warming&lt;/strong&gt; is required. Don’t wait for the herd to arrive.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The herd will come. The question is: will your system be ready?&lt;/p&gt;

</description>
      <category>thunderingherdproble</category>
      <category>systemdesignintervie</category>
      <category>systemdesignconcepts</category>
    </item>
  </channel>
</rss>
