In 2020, COVID changed everything. Schools and colleges were shut down and education moved online. Classes, attendance, assignments, exams, fees: everything was online.
Our client had built their ERP years before COVID. It was working fine. They had onboarded many schools and institutions. But COVID pushed the usage to a different level. Teachers who were used to teaching in classrooms were now taking online classes. Students who were used to sitting in classrooms were now attending from home. Using the ERP was no longer a choice; it was mandatory.
The system started breaking.
Everyone was getting logged out randomly. Teachers lost work in the middle of sessions. Students could not access homework or online classes. Even the staff were struggling with their daily work.
The client’s tech team started working on it. They checked everything: load balancer config, server health, database. Everything looked correct.
They could not find the problem.
That is when they called us. For expert advice.
I remember looking at their architecture for the first time. Three servers with an Nginx load balancer on top. A perfect setup. I agreed with their team. The design was correct. Nothing was obviously wrong.
Everything was correct and still the system was failing.
That is the day I learned something I will never forget.
Most developers never think about small configuration details, but one of them can silently break an entire system.
That configuration was session storage, and that is the story I want to share with you today.
The Investigation
When we got access to their system, we started from scratch. No assumptions.
We checked the Nginx load balancer configuration first. Round robin was set up correctly. All three servers were getting equal traffic. We did not find any issue there.
Then we checked the servers. CPU, memory, logs. Everything was healthy. There were no crashes, no errors and no unusual spikes.
Then we checked the database. Queries were fine. No slow queries and no connection issues.
We spent almost a week on this and were still clueless. Load balancer, servers, database, deployment config. Nothing was wrong. The system was built correctly.
Then we moved to the code.
We started checking authentication and the endpoints with the most complaints, one by one. Login, dashboard, fee management, homework. Suddenly, something clicked in my mind.
Sessions.
The application was using Laravel session-based authentication. I checked and found that sessions were not stored in the database or Redis. They were stored locally on each server.
That was it. That was the problem.
The Root Cause
Now, let’s discuss what was actually happening.
When a user logs in to a Laravel application, the server creates a session. That session stores the user’s identity. Who they are, what they can access. Laravel’s default at the time was the file driver, which stores this session in files on the same server.
They had set up three servers behind a load balancer.
User visits the website. Request goes to Server A. Server A authenticates the user and creates a session. Session is stored locally on Server A. User is logged in.
Next request from the same user. Load balancer sends it to Server B. Server B has no session for this user. It does not know who this person is. So it throws them out.
That is it. That is why everyone was getting logged out randomly. It was not because of any bug in the code and not because of any misconfiguration in the load balancer. The load balancer was doing exactly what it was supposed to do. Distributing traffic across all three servers.
The problem was that each server was living in its own world. Servers had no shared memory and no shared session. Three servers working as three separate brains.
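The failure is easy to reproduce in miniature. The sketch below is a simulation in Python, not the client's Laravel code: the `Server` and `RoundRobinBalancer` classes and the user name are hypothetical, and a plain dictionary stands in for the session files on each machine.

```python
import itertools
import uuid


class Server:
    """A web server that keeps sessions in its own private memory."""

    def __init__(self, name):
        self.name = name
        self.sessions = {}  # session_id -> user, visible only to this server

    def login(self, user):
        session_id = uuid.uuid4().hex
        self.sessions[session_id] = user
        return session_id

    def handle(self, session_id):
        user = self.sessions.get(session_id)
        if user is None:
            return f"{self.name}: 401 unknown session, user logged out"
        return f"{self.name}: 200 hello {user}"


class RoundRobinBalancer:
    """Hands out servers in strict rotation, like a round-robin load balancer."""

    def __init__(self, servers):
        self._cycle = itertools.cycle(servers)

    def next_server(self):
        return next(self._cycle)


balancer = RoundRobinBalancer([Server("A"), Server("B"), Server("C")])

session_id = balancer.next_server().login("rahul")  # login lands on Server A
# The next three requests rotate through B, C and back to A.
responses = [balancer.next_server().handle(session_id) for _ in range(3)]
for line in responses:
    print(line)
```

Only the request that happens to return to Server A succeeds. With three servers and strict rotation, two out of every three requests after login hit a server that has never seen the session, which from the user's side looks like a random logout.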
This is what we call a stateful system.
The Analogy
Let me explain this with something we all have experienced.
Think about a guard standing at the school gate. He knows students by face and remembers everyone. When a student arrives, he allows entry without checking an ID card. No ID card required. The guard remembers.
Now that guard retires.
A new guard joins the next day. He does not know anyone. Rahul comes to school. The new guard stops him. Rahul says he is a student here. But the new guard has no memory of Rahul. No record. Nothing. So Rahul is not allowed inside even though he is a valid student.
This is exactly what was happening in their system. Server A was the old guard; it remembered the user. But the load balancer sent the next request to Server B, the new guard. It had no memory of the user, so the user got kicked out.
This is a stateful system. The server remembers you. But that memory is private. All other servers will treat you as a stranger.
Now think about a different school that has fingerprint machines at every gate. Gate 1, Gate 2, Gate 3. It does not matter which gate you use. The machine scans your fingerprint and verifies it instantly. The machine does not need to remember you. Your fingerprint is the proof.
This is a stateless system. The server does not remember you. But every request has enough information to verify you. No memory needed. Any server can handle your request.
That is the difference.
What is a Stateful System?
A stateful system remembers the client. Every time a request comes in, the server uses the previous information to handle it.
Session-based authentication is the most common example of this. When you log in, the server creates a session. That session stores who you are. On every subsequent request, the server checks that session and responds accordingly. The server is keeping your state.
This works perfectly fine for a single server. One server, one memory, no confusion.
But when you add more servers, the problem starts. Each server has its own memory. They do not share it. So if Server A created your session and Server B receives your next request, Server B has no idea who you are.
That is the stateful problem in distributed systems.
Some common examples of stateful systems:
Laravel and PHP session-based authentication. WebSocket connections, which track connection state. Online multiplayer games, which remember player position and score.
Stateful systems are not wrong. They are just not built for horizontal scaling.
What is a Stateless System?
A stateless system does not remember anything. It treats every request as new, and every request must carry all the information needed to verify and process it.
JWT is the most common example of this. When you log in, the server creates a token. That token contains your identity, your role, and its expiry time. Everything. The server does not store anything. With your next request, you attach this token. The server reads the token, verifies it and responds. No memory needed.
Any server can handle any request. Server A, Server B, Server C. It does not matter. Every request carries all the required information itself.
This is why stateless systems scale so well. You can add 10 servers or 100 servers. Every server can handle every request equally without having any shared memory or shared session sync.
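To make the idea concrete, here is a minimal sketch of a signed token using only the Python standard library. It is a simplified, JWT-like format (payload plus HMAC signature) and not a real JWT implementation; the `SECRET` value and the `issue_token` and `verify_token` names are assumptions made up for this illustration. A production system would normally rely on a maintained JWT library instead.

```python
import base64
import hashlib
import hmac
import json
import time

SECRET = b"demo-secret-shared-by-all-servers"  # hypothetical key, for illustration only


def _b64(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def _unb64(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _sign(payload: str) -> str:
    return _b64(hmac.new(SECRET, payload.encode(), hashlib.sha256).digest())


def issue_token(user, role, ttl_seconds=3600, now=None):
    """Called once at login: pack the claims and sign them."""
    now = time.time() if now is None else now
    claims = {"sub": user, "role": role, "exp": now + ttl_seconds}
    payload = _b64(json.dumps(claims).encode())
    return f"{payload}.{_sign(payload)}"


def verify_token(token, now=None):
    """Return the claims if the token is authentic and unexpired, else None."""
    now = time.time() if now is None else now
    try:
        payload, signature = token.split(".")
    except ValueError:
        return None  # not in payload.signature form
    # Check the signature before trusting anything inside the payload.
    if not hmac.compare_digest(signature.encode(), _sign(payload).encode()):
        return None
    claims = json.loads(_unb64(payload))
    return claims if claims["exp"] > now else None


token = issue_token("rahul", "student")
print(verify_token(token))  # any server holding SECRET can run this check
```

Nothing is stored on the server between `issue_token` and `verify_token`. Any server that holds the key can validate the request on its own, which is what lets the load balancer send each request anywhere.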
Some common examples of stateless systems:
REST APIs with JWT authentication. AWS Lambda functions. Microservices communicating over HTTP.
Stateless systems are not always better. But when you need to scale horizontally, stateless is the right choice.
The Fix
Once we identified the problem, we explained it to the client team.
At first they were surprised. Then they laughed. All this time, three servers, load balancer, proper architecture. And the problem was just session storage. That was it.
We decided to fix it in two steps.
First, we moved the sessions to the database. Laravel supports the database as a session driver out of the box. One configuration change. Now all three servers were reading and writing sessions from the same database. Server A creates a session, Server B can read it. Server C can read it. No more random logouts.
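The effect of that change can be sketched as a small Python simulation. The `SharedSessionStore` class below is a hypothetical stand-in for the shared sessions table (Redis would play the same role), and the servers are toy objects rather than real Laravel instances.

```python
import itertools
import uuid


class SharedSessionStore:
    """Stand-in for the one sessions table that every server can reach."""

    def __init__(self):
        self._rows = {}

    def put(self, session_id, user):
        self._rows[session_id] = user

    def get(self, session_id):
        return self._rows.get(session_id)


class Server:
    def __init__(self, name, store):
        self.name = name
        self.store = store  # the same store object is given to every server

    def login(self, user):
        session_id = uuid.uuid4().hex
        self.store.put(session_id, user)
        return session_id

    def handle(self, session_id):
        user = self.store.get(session_id)  # one store lookup per request
        if user is None:
            return f"{self.name}: 401 unknown session"
        return f"{self.name}: 200 hello {user}"


store = SharedSessionStore()
servers = itertools.cycle([Server(name, store) for name in "ABC"])

session_id = next(servers).login("rahul")  # session created via Server A
# The next requests rotate through B, C and A, and all of them find the session.
shared_responses = [next(servers).handle(session_id) for _ in range(3)]
for line in shared_responses:
    print(line)
```

Every server now answers with a 200 because the session lives outside any single machine. The cost is visible in `handle`: each request pays for one lookup in the shared store.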
This was a quick fix. It worked. But we knew it was not the final solution. Every request was now hitting the database to validate the session. That adds load on the database. Not a good idea at scale.
So we planned the second step.
We replaced session-based authentication with JWT. The server creates a JWT on login. The token goes to the client in the response, and the client sends the token with every request. The server verifies it and responds. No database call needed for validation. No session storage needed. Completely stateless.
But we did not stop here.
The client had multiple applications. Web application, student app, teacher app and many more. Each app had its own login with separate unique sessions and separate auth. We saw an opportunity to fix this properly.
We built an SSO. Single Sign-On. One central authentication server that authenticates users for all applications. Login once, access everything.
We deployed the SSO on a separate server with JWT. Now it was stateless, horizontally scalable and ready for any load.
Testing
Once SSO was ready, we did not go live directly.
We tested it with JMeter first and simulated thousands of concurrent users hitting the system.
We tested login, dashboard, fee management, homework and other endpoints. Everything worked. Not a single failure. No more random logouts and no session errors. Every request was going to the right place.
After this testing, we went to the real users.
We contacted the students, teachers and school staff who had complained earlier and asked them to use the beta system. We collected their feedback one by one and everyone was happy with the new system. No one faced any random logouts and the system was working fine.
SSO Integration
Once testing was done, we started integrating SSO with all client applications.
The web application was first, then the student app, then the teacher app and so on. One by one, we replaced their existing auth with our SSO. Every application was now authenticating through one central SSO server. Each system used the same JWT and the same stateless flow.
The client had many schools and institutions onboarded. All of them were now using the same SSO. One login for everything.
This also opened new possibilities for the client. Adding a new application in the future was easy now. Just integrate with the SSO and authentication is done. No need to build auth from scratch every time; just a few lines of code make authentication ready.
The system that was failing with 1,000 concurrent users was now ready to handle much more. We did not add more servers; we just fixed the root cause. Stateless architecture with central auth.
If you want to read about what happened next with this system, I have written about it in my article on the Thundering Herd Problem. That story starts exactly where this one ends and teaches you how a perfect system can fail.
Conclusion
That 2020 incident taught me more than any course or tutorial ever did.
We had a perfectly designed system with three servers and a load balancer. Everything was correct and still it was failing. It was not because of bad architecture, but because of one small configuration that nobody thought about.
Session storage.
That is the thing about system design. You can design a perfect system and still miss one small detail that breaks it. This is why every decision matters and needs careful consideration. Where you store sessions, how you authenticate users, and how your servers communicate. Every small thing has an impact at scale.
If you are building a system today with multiple servers, ask yourself one question. Is my authentication stateful or stateless? If it is stateful, make sure your sessions are shared. Database, Redis, anything. Just not local server storage.
And if you are planning to scale, move to stateless. JWT, SSO, whatever fits your use case. Your future self will thank you for this decision.
The guard will retire someday. Make sure your system does not depend on his memory.
Continue the Journey with more articles
Still on the platform and enjoyed this ride? Here are more system design trains to catch:
- Kafka Explained Like You’re 5: How I finally understood Kafka after years of avoiding it.
- The Thundering Herd Problem: What happens when 25,000 students hit my system at the same time.
- Message Queue in System Design: How my server crashed in 60 seconds and what I learned from it.
- Cache Strategies in Distributed Systems: The day our cache expired and 25,000 students lost their exam.
Looking for more? All articles are available on my Medium profile and many are coming soon. Follow or subscribe to get notified when the next one drops.